Proceedings of First International Workshop on Vietnamese Language and Speech Processing (VLSP 2012) In Conjunction with 9th IEEE-RIVF Conference on Computing and Communication Technologies (RIVF 2012)

An effective context-based method for Vietnamese word segmentation

Ngoc Anh, Tran (Dept. of Information Technology, Le Quy Don Technical University, Hanoi, Vietnam) [email protected]
Thanh Tinh, Dao (Dept. of Information Technology, Le Quy Don Technical University, Hanoi, Vietnam) [email protected]
Phuong Thai, Nguyen (Dept. of Information Technology, UET, Vietnam National University, Hanoi, Vietnam) [email protected]

languages (English, French, German, ...). Even in dictionaries of vocabulary and part-of-speech (POS) [2, 5, 6], the meaning and grammar of words are vague. Consequently, word boundaries are also vague: they are determined by context, that is, by a word's position relative to the words before and after it in the phrase or sentence [2]. Moreover, when a word is a compound, the conceptual distinction between "word" and "phrase", or between "word" and "chunk", is blurred. This vagueness is a source of ambiguity in the language and is common in Vietnamese and in other isolating, non-inflecting languages such as Chinese, Thai, Lao, and Khmer.

The Vietnamese word segmentation problem can be stated as follows. Given a sentence represented as a sequence of n syllables:
S = s1 s2 s3 ... sn-1 sn
find a correct word segmentation:
S = w1 w2 w3 ... wm-1 wm
For example, with the input:
Mọi người chuẩn bị đón tiếp tân Thủ tướng
the output will be:
| Mọi | người | chuẩn bị | đón tiếp | tân | Thủ tướng |

For years, this problem has been studied by experts in Vietnamese language processing. Broadly, the proposed segmentation methods include: ngram statistical language models [10, 12, 13, 14, 17, 19]; a bio-inspired approach combining genetic algorithms and ngram statistics [14]; machine learning with artificial neural networks (ANN) [10], hidden Markov models (HMM) [9], maximum entropy models (MEM) [11, 17], conditional random fields (CRF) [20], and support vector machines (SVM) [8, 20]; and dictionary-based methods such as longest matching (LM) [3] and FMM/BMM [8, 9, 10, 19], implemented with finite state automata (FSA) [3, 19] or weighted finite state transducers (WFST) [10].
By research approach, these works fall into two main directions. Independent methods use a single technique for word segmentation: ngram statistics [12] (F1-score of 51%-65%) and [13] (90%); the bio-inspired approach with genetic algorithms and ngram statistics [14] (50%-61%); machine learning methods MEM [11] (94.4%), CRF [20] (94%), and SVM [20] (94.23%); and dictionary lookup with FSA [3]. Combination methods employ several techniques together for better results: a dictionary with ngrams [19] (95.6%); a dictionary with machine learning [8, 9] (97%);

Abstract— Vietnamese word segmentation is one of the fundamental problems of Vietnamese language processing. Formally, a Vietnamese word is composed of one or more syllables. At the same time, word boundaries and meanings depend on location and context, that is, on the words to the left and right. Determining word boundaries is therefore a challenge, especially in the presence of ambiguity. In this paper, we first classify the ambiguous forms produced by the maximum matching method. We then propose a Vietnamese word segmentation method that combines an improved maximum matching method, word-bigram probabilities, and a syllable-ngram model for context-based word segmentation disambiguation. We also propose a combined measure for dynamic programming that integrates diverse knowledge sources, including a dictionary, the word bigram model, and the syllable ngram model, to find the best segmentation. On our test corpora the F-scores reach 98.71% to 98.94%, significantly higher than previous results.

Keywords: Vietnamese word segmentation; context-based word segmentation disambiguation; dictionary; improved maximum matching; word bigram model; syllable ngram model.

Abbreviations:

MM: Maximum Matching
FMM: Forward Maximum Matching
BMM: Backward Maximum Matching
NER: Named Entity Recognition
LE: Length of word (number of syllables)
MI: Mutual Information of syllables
Pbi: Probability of word bigram

I. INTRODUCTION

A word is the smallest meaningful unit of language. Identifying its boundaries correctly and quickly is thus a basic and important problem with direct effects on research and applications in natural language processing, such as spell checking, text summarization, machine translation, and automatic question answering. In Vietnamese, a word is formed from one or more syllables, so spaces do not mark word boundaries as they do in flexional


a dictionary with machine learning and bigram statistics [10] (97%); and MEM combining a dictionary and ngrams [17] (95.3%). Researchers have thus combined different methods to improve the accuracy and speed of word segmentation, and the combined approaches have proved very effective, with F1-scores from 95% to 97%. Vietnamese word segmentation remains a difficult problem with many real-world applications, so this research direction still attracts many experts. Dictionary-based (maximum matching) methods [3, 8, 9, 10, 19] face the challenge of ambiguity. However, previous authors have only addressed one ambiguous form over three syllables (si-1, si, si+1), where there are two choices: (si-1 si)(si+1) or (si-1)(si si+1). This is called overlap ambiguity. They used ngrams, namely the bigram probabilities of (si-1 si)(si+1) and (si-1)(si si+1), to make the selection, but did not consider the context of the ambiguity. Other authors did not identify specific ambiguities but applied a general remedy, combining maximum matching with machine learning such as SVM [8], HMM [9], or ANN [10]. From a survey of Vietnamese corpora and word segmentation methods, two open issues remain. First, the maximum matching method (FMM/BMM) itself generates ambiguity errors; this issue has not yet been solved and needs to be addressed by improving the maximum matching method (see Section II.A.1). Second, previous research does not exploit the Vietnamese feature that a word's boundary depends on its position and context, that is, on the words to its left and right.
(See Section II.A.2.) The rest of this paper is organized as follows: Section II presents the forms of ambiguity and our methods for resolving them, Section III presents the experimental results, and Section IV gives the evaluation and conclusions.

Unlike example 2.1, FMM here gives the correct result, | thi đấu | vòng tròn |, while BMM gives an invalid one: | thi | đấu vòng tròn |. From these examples we find that this form can be disambiguated by improved maximal matching: choose the word pair whose deviation in the number of syllables is smallest. In example 2.1:
LE("Mô hình hoá") = 3, LE("học") = 1
DeviationFMM("Mô hình hoá", "học") = 2
DeviationBMM("Mô hình", "hoá học") = 0
So we choose | Mô hình | hoá học |. Similarly, in example 2.2 we get | thi đấu | vòng tròn |.

2) The contextual ambiguities
A survey of Vietnamese corpora shows three forms of contextual ambiguity, as follows:
a) 1st context form
This is the case where, after applying improved maximal matching, the deviations in the number of syllables are equal (DeviationFMM = DeviationBMM). We handle it with word bigrams and syllable MI. Consider the ambiguous sequence "học sinh học": both | học | sinh học | and | học sinh | học | have the same deviation (= 1). [19] used bigram probabilities to choose, but without considering context, so the results are not always valid.
Example 2.3. Tôi học sinh học rất chăm chỉ
Example 2.4. Nhiều học sinh học rất chăm chỉ
For both examples, [19] produces the single result | học sinh | học |:
| Tôi | học sinh | học | rất | chăm chỉ | (incorrect)
| Nhiều | học sinh | học | rất | chăm chỉ | (correct)
As noted, a Vietnamese word boundary depends on position and context, on the left and right words. In example 2.3, "| Tôi | học |" is better than "| Tôi | học sinh |", whereas in example 2.4, "| Nhiều | học sinh |" is better than "| Nhiều | học |".
So the correct results are:
| Tôi | học | sinh học | rất | chăm chỉ |
| Nhiều | học sinh | học | rất | chăm chỉ |
Thus, overlap ambiguities with equal syllable-count deviations can be disambiguated by word-bigram probabilities over the left and right context words. In many cases the word-bigram probabilities are equal, or the training corpus is not large enough and the word bigram counts are 0. We then fall back on syllable ngram statistics, for example using mutual information (MI) [14] to compare and choose.

b) 2nd context form
These ambiguities arise from words in the form of idioms or terms. Such words have 4 or more syllables, with an overlap at the beginning or end of the word; in that case we have to break the idiom or term.
Example 2.5. đất nước đang phát triển
FMM gives: | đất nước | đang | phát triển |
BMM gives: | đất | nước đang phát triển |
Example 2.6. hai bàn tay trắng xoá
FMM gives: | hai bàn tay trắng | xoá |

II. AMBIGUOUS FORMS AND DISAMBIGUATION

A. Ambiguous forms
Ambiguity is the central issue in Vietnamese word segmentation. Previous research, however, has not given a full classification of ambiguity and so has not provided an appropriate treatment for each ambiguous form. Surveying several Vietnamese corpora, we found many forms of ambiguity that are easily identified by applying FMM/BMM, plus a number of special cases. They fall into two main categories: ambiguities caused by FMM/BMM, and ambiguities in context.
1) FMM/BMM generates ambiguities
Consider the following examples.
Example 2.1. Mô hình hoá học
FMM gives an invalid result, | Mô hình hoá | học |, whereas BMM gives the correct one: | Mô hình | hoá học |.
Example 2.2. thi đấu vòng tròn


BMM gives: | hai | bàn tay | trắng xoá |
Clearly, FMM is better than BMM for example 2.5, but BMM is better than FMM for example 2.6. For this ambiguous form we observe that the first syllable "nước" belongs to the first word "đất nước", and the last syllable "trắng" belongs to the last word "trắng xoá", so we have to break the idiom or term. Let a, b, c, d, e, ... be the syllables of the sequence. If (bcde) or (abcd) is an idiom or term, then: if the word (ab) exists in (a)(bcde), or the word (de) exists in (abcd)(e), we break (bcde) or (abcd), respectively.

c) 3rd context form
This ambiguity occurs for a two-syllable compound word when we must decide whether or not to break it, a form that neither FMM nor BMM can detect. Whether to break it depends on its location and context in the sentence.
Example 2.7. Tôi đến trường học thêm
Both FMM and BMM give an incorrect result:
| Tôi | đến | trường học | thêm |
The correct result should be:
| Tôi | đến | trường | học | thêm |
In this context, "| đến | trường |" and "| học | thêm |" are better than "| đến | trường học |" and "| trường học | thêm |".
Example 2.8. nàng gục đầu vào chàng
Both FMM and BMM give:
| nàng | gục | đầu vào | chàng |
which is incorrect. The correct result is:
| nàng | gục | đầu | vào | chàng |
The left word of "đầu vào" shows that "| gục | đầu |" is better than "| gục | đầu vào |", and "| vào | chàng |" is better than "| đầu vào | chàng |". Thus "đầu vào" is broken into "| đầu | vào |".
Example 2.9. xe quá tải, nhất là xe khách
The compound word "nhất là" at the beginning of a chunk (here, after the comma) should not be broken:
| xe | quá | tải | , | nhất là | xe khách |
But it must be broken when it is not at the beginning of a chunk, as in the following example.
Example 2.10.
xe quá tải nhất là xe khách
Here the compound word "nhất là" has to be broken:
| xe | quá | tải | nhất | là | xe khách |
In several other cases a compound word is broken when it is not at the end of a chunk.
Example 2.11. nó mà mặc áo này thì đẹp phải biết !
The compound word "phải biết" at the end of the chunk, used for emphasis, should not be broken:
| nó | mà | mặc | áo | này | thì | đẹp | phải biết | ! |
But when it is not at the end of a chunk, it must be broken, as in the following example.
Example 2.12. họ phải biết rằng họ làm cho ai
| Họ | phải | biết | rằng | họ | làm | cho | ai |

possibilities of word segmentation, which the MM method ranks from better to worse as follows:
1. (a)(b)(c)(d)
2. (a)(bc)(d) / (ab)(c)(d) / (a)(b)(cd)
3. (ab)(cd)                                 (*)
4. (a)(bcd) / (abc)(d)
5. (abcd)
The MM methods work this way; however, the ambiguities arise in cases 3 and 4, which is the fault of the maximum matching method illustrated by our examples above. We therefore rearrange the priority order of the selection schemes as follows:
1. (a)(b)(c)(d)
2. (a)(bc)(d) / (ab)(c)(d) / (a)(b)(cd)
3. (a)(bcd) / (abc)(d)                      (**)
4. (ab)(cd)
5. (abcd)
We now need a mathematical function that satisfies the maximum matching principle and also the priority order (**). Clearly, such a function depends on the number of syllables of the words. Let LE(wi) be the number of syllables of word wi. Either increasing or decreasing functions of LE(wi) could be used; for simplicity we choose the score function

score(wi) = 1 / LE(wi)      (1)

This is a decreasing function whose minimum value satisfies the MM principle:
score(a) = 1/LE(a) = 1.0
score(ab) = 1/LE(ab) = 1/2 = 0.5
score(abc) = 1/LE(abc) = 1/3 = 0.333
score(abcd) = 1/LE(abcd) = 1/4 = 0.25
Clearly, min{score(w)} corresponds to max{LE(w)}, satisfying MM methods such as FMM or BMM. It also satisfies the priority order (**) by total score:
1. (a)(b)(c)(d)                           score = 1+1+1+1 = 4.00
2. (a)(bc)(d) / (ab)(c)(d) / (a)(b)(cd)   score = 1+0.5+1 = 2.50
3. (a)(bcd) / (abc)(d)                    score = 1+0.33 = 1.33
4. (ab)(cd)                               score = 0.5+0.5 = 1.00
5. (abcd)                                 score = 0.25
So min{score} selects the best result. With this approach, we can build a dynamic programming formula to improve the MM method:

min{SCk(S)} = min Σ(i=1..mk) score(wki) = min Σ(i=1..mk) 1/LE(wki)      (2)

where S is the syllable sequence S = s1 s2 s3 ... sn-1 sn, SCk(S) is the sum of scores of the k-th segmentation scheme, and wki is the i-th word segmented by the k-th scheme.
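To make the ranking concrete, here is a small sketch (the helper names are illustrative, not the paper's implementation) that scores the eight schemes with formula (1) and sorts them; descending total score reproduces the priority list (**).

```python
def scheme_score(word_lengths):
    """Total score of a segmentation scheme under formula (1):
    each word contributes score(w) = 1/LE(w), its inverse syllable count."""
    return sum(1.0 / le for le in word_lengths)

# the eight segmentations of four syllables a b c d, as syllable counts
schemes = {
    "(a)(b)(c)(d)": (1, 1, 1, 1),
    "(a)(bc)(d)":   (1, 2, 1),
    "(ab)(c)(d)":   (2, 1, 1),
    "(a)(b)(cd)":   (1, 1, 2),
    "(a)(bcd)":     (1, 3),
    "(abc)(d)":     (3, 1),
    "(ab)(cd)":     (2, 2),
    "(abcd)":       (4,),
}

# sorting by descending total score reproduces the priority order (**)
ranked = sorted(schemes, key=lambda s: scheme_score(schemes[s]), reverse=True)
```

min{score} thus prefers (abcd) when it is a dictionary word, then (ab)(cd), then the 3+1 splits, which is exactly the order (**) read from the bottom up.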

B. Disambiguation
1) Improved maximum matching
This ambiguity relates to the number of syllables. Formally, a word can consist of one or many syllables, and the MM method chooses the words with the maximum number of syllables. Suppose four syllables a, b, c, d give eight

2) Disambiguation in context
a) Disambiguation in the 1st context
The first contextual ambiguity occurs when the deviations of the numbers of syllables are equal (the sums of scores are equal). We use the word bigram probability P(wk | wk-1) to




select. If the word bigram probabilities are also equal, we use syllable MI to choose.
- The word bigram probability P(wk | wk-1): to compute the bigram probability of a word wk given a previous word wk-1, we take the count of the bigram C(wk-1 wk) and normalize by the sum of counts of all bigrams that share the same first word wk-1. Since the sum of all bigram counts starting with a given word wk-1 equals the unigram count C(wk-1), we have:

P(wk | wk-1) = C(wk-1 wk) / Σw C(wk-1 w) = C(wk-1 wk) / C(wk-1)      (3)
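Formula (3) is maximum-likelihood estimation from counts; a minimal sketch (the corpus layout and function names are illustrative assumptions, not the paper's code):

```python
from collections import Counter

def make_bigram_prob(segmented_sentences):
    """Estimate P(wk | wk-1) = C(wk-1 wk) / C(wk-1), formula (3),
    from a pre-segmented corpus given as lists of words."""
    unigram, bigram = Counter(), Counter()
    for words in segmented_sentences:
        unigram.update(words)
        bigram.update(zip(words, words[1:]))
    def p(wk, given):
        # zero count for an unseen history yields probability 0
        return bigram[(given, wk)] / unigram[given] if unigram[given] else 0.0
    return p
```

For instance, in a toy corpus where "học sinh" is followed once by "học" and once by "giỏi", P(học | học sinh) = 0.5.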

Because the Vietnamese corpus is not large enough, many bigram counts are zero. We therefore consider the bigram pairs in order: left bigram, right bigram, ambiguous bigram; then take the geometric mean of the contextual bigrams by (6).
- Algorithm of disambiguation in the 1st context:
Step 1.
If (P(a|wL) != P(ab|wL)) && (P(a|wL)*P(ab|wL) == 0) {
    Pbi(bc) <- P(a|wL); Pbi(ab) <- P(ab|wL);
} ElseIf (P(wR|bc) != P(wR|c)) && (P(wR|bc)*P(wR|c) == 0) {
    Pbi(bc) <- P(wR|bc); Pbi(ab) <- P(wR|c);
} ElseIf (P(bc|a) != P(c|ab)) && (P(bc|a)*P(c|ab) == 0) {
    Pbi(bc) <- P(bc|a); Pbi(ab) <- P(c|ab);
} Else {
    Pbi(bc) <- [P(a|wL)*P(bc|a)*P(wR|bc)]^(1/3);
    Pbi(ab) <- [P(ab|wL)*P(c|ab)*P(wR|c)]^(1/3);
    If (Pbi(bc) == Pbi(ab)) {
        Pbi(bc) <- P(bc|a); Pbi(ab) <- P(c|ab);
    }
}
Step 2.
If (Pbi(bc) == Pbi(ab)) {
    Pbi(bc) <- MI(bc); Pbi(ab) <- MI(ab);
}
Step 3.
If (Pbi(bc) > Pbi(ab)) { Out[(a)(bc)]; } Else { Out[(ab)(c)]; }
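The three steps can be transcribed directly; this sketch assumes a bigram function P(prev, next) standing for P(next | prev) and a syllable-MI function, both hypothetical stand-ins for the models above.

```python
def resolve_overlap(P, MI, wL, a, b, c, wR):
    """Steps 1-3 above: choose (a)(bc) or (ab)(c) for the overlap a b c.
    P(x, y) stands for the bigram probability P(y | x); MI scores a word."""
    ab, bc = a + " " + b, b + " " + c
    # the three contextual bigram pairs, in the paper's order:
    # left bigram, right bigram, ambiguous bigram
    pairs = [(P(wL, a), P(wL, ab)),
             (P(bc, wR), P(c, wR)),
             (P(a, bc), P(ab, c))]
    p_bc = p_ab = None
    for x, y in pairs:              # Step 1: a pair where exactly one is zero
        if x != y and x * y == 0:
            p_bc, p_ab = x, y
            break
    if p_bc is None:                # otherwise the geometric means of (6)
        p_bc = (pairs[0][0] * pairs[2][0] * pairs[1][0]) ** (1.0 / 3)
        p_ab = (pairs[0][1] * pairs[2][1] * pairs[1][1]) ** (1.0 / 3)
        if p_bc == p_ab:            # still tied: ambiguous bigram alone
            p_bc, p_ab = pairs[2]
    if p_bc == p_ab:                # Step 2: fall back to syllable MI
        p_bc, p_ab = MI(bc), MI(ab)
    return [a, bc] if p_bc > p_ab else [ab, c]   # Step 3
```

With counts in which "Tôi" is followed by "học" but never by "học sinh", the left bigram alone settles example 2.3 as | Tôi | học | sinh học |.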

where C(wk-1 wk) is the count of the bigram (wk-1 wk) in the corpus, and C(wk-1) is the count of the unigram wk-1.
- The mutual information of syllables (MI): when the words do not appear in the corpus, C(wk-1 wk) = 0 and so P(wk | wk-1) = 0, and we fall back on a syllable ngram statistical model. In information theory, mutual information measures the link between adjacent syllables. Geometrically, let Isk-1 and Isk be the sets of occurrences of the syllables sk-1 and sk in the corpus; their overlap Isk-1 ∩ Isk corresponds to the occurrences of the bigram.

b) Disambiguation in the 2nd context
Facing an ambiguity of the 2nd context, we break the ambiguous word: if the word (ab) exists in (a)(bcde), or the word (de) exists in (abcd)(e), then we break (bcde) or (abcd), respectively. Word existence is defined through (1): for an existing word we always have score(w) ≤ 1, so we use the value +∞ to mark a word that does not exist. The ambiguity is resolved as follows:
+ if score(ab) < 1 and score(bcde...) < 1, set score(bcde...) = +∞
+ if score(abcd...) < 1 and score(de) < 1, set score(abcd...) = +∞

c) Disambiguation in the 3rd context
This ambiguity form occurs frequently, but only with a handful of special compound words, so we keep a list CW of these compound words. Surveying Vietnamese corpora, we find that two-syllable compound words are the common case in this form:
CW = {"nhất là", "phải biết", "trường học", "được việc", ...}
Assume the sequence S is segmented as S = w1 w2 w3 ... wk ... wm-1 wm, and the word wk = (ab) under consideration satisfies wk ∈ CW, 1 ≤ k ≤ m, LE(wk) = 2. Let wL and wR be the left and right words of wk. We consider the sequence (wL, ab, wR) to decide whether to break the compound word (ab) into the two single words (a) and (b). Let:
+ PN(ab) be the probability of not breaking (ab)
+ PB(ab) be the probability of breaking (ab) into (a)(b)
We have the formulas:
PN(ab) = P(ab | wL) * P(wR | ab)      (7)
PB(ab) = P(a | wL) * P(b | a) * P(wR | b)
We consider the bigrams in order: the left bigram if (ab) is at the end of the sequence S, the right bigram if (ab) is at the beginning of

Isk-1Isk Figure 1. Schema of mutual information of two syllables (sk-1sk)

The mutual information of syllables is then defined as a measure of the link between two adjacent syllables:

MI(sk-1 sk) = |Isk-1 ∩ Isk| / |Isk-1 ∪ Isk|      (4)

Here |Isk-1| = C(sk-1), |Isk| = C(sk), |Isk-1 ∩ Isk| = C(sk-1 sk), and
|Isk-1 ∪ Isk| = |Isk-1| + |Isk| - |Isk-1 ∩ Isk| = C(sk-1) + C(sk) - C(sk-1 sk).
Thus,

MI(sk-1 sk) = C(sk-1 sk) / (C(sk-1) + C(sk) - C(sk-1 sk))      (5)

where MI(sk-1 sk) measures the linking of the two syllables (sk-1 sk), C(sk-1 sk) is the count of the syllable bigram (sk-1 sk), and C(sk) is the count of the syllable unigram sk.
- Resolution of ambiguities in the 1st context: let
* Pbi(bc) be the probability of choosing (a)(bc)
* Pbi(ab) be the probability of choosing (ab)(c)
* wL, wR be the left and right words of (a b c).
The contextual word bigram probabilities are:
+ P(a | wL), P(bc | a), P(wR | bc), used to choose (a)(bc)
+ P(ab | wL), P(c | ab), P(wR | c), used to choose (ab)(c)
The selection probability is then computed from the contextual bigrams as:

Pbi(bc) = [P(a | wL) * P(bc | a) * P(wR | bc)]^(1/3)      (6)
Pbi(ab) = [P(ab | wL) * P(c | ab) * P(wR | c)]^(1/3)
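Formula (5) reduces to three counts; a minimal sketch:

```python
def syllable_mi(c1, c2, c12):
    """Mutual information of adjacent syllables, formula (5):
    MI(s1 s2) = C(s1 s2) / (C(s1) + C(s2) - C(s1 s2)),
    the overlap of the two occurrence sets divided by their union."""
    union = c1 + c2 - c12
    return c12 / union if union else 0.0
```

MI is 1 when the two syllables only ever occur together and falls toward 0 as they occur independently of each other.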


the sequence S, and the full bigrams by (7) otherwise. That is:

min{SCk(S)} = min Σ(i=1..mk) scorePM(wki)      (12)

To speed up the program and reduce memory use, we do the following:
- We store the dictionary as a minimal weighted finite state automaton (MWFSA) [1, 15, 19]; the value at a final state of the MWFSA is the sum of the weights along the path and gives the order of the word in the dictionary. We use two such automata: one for a dictionary of 7,000 syllables and one for a dictionary of 41,000 words. The syllable automaton is used for ngram statistics and computing syllable MI; the word automaton is used for maximum matching and computing word bigram probabilities.
- In TABLE I, words of 5 or more syllables make up only about 0.01%, so they do not significantly affect accuracy. We therefore use a 5-syllable window for word segmentation, which makes the time complexity of the dynamic programming algorithm (12) linear.
- The algorithm for formula (12) is as follows:
Step 1. a[0] <- 0;
Step 2.
For i <- 1 To n {
    a[i] <- +∞; first <- 0;
    If (i > WinSize) first <- i - WinSize;
    For j <- first To i - 1 {            // WinSize times
        w <- vw[j];
        If (a[i] > a[j] + score[j, i]) {
            a[i] <- a[j] + score[j, i];
            For k <- j + 1 To i - 1      // WinSize-1 times
                w <- w + " " + vw[k];
            q[i] <- w;                   // here is the result
        }
    }
}
Where:
+ n is the number of syllables in the input sentence
+ WinSize is the size of the syllable window
+ vw[j] is the j-th syllable
+ score[j, i] is the score of the word made of the j-th to i-th syllables
+ a[ ] is a temporary array
+ q[ ] holds the result of the word segmentation.
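The windowed dynamic programming above can be sketched as runnable code; score_fn is a placeholder for whichever combined measure of Section II.C is plugged in, returning +inf for non-words.

```python
def segment(syllables, score_fn, win_size=5):
    """Formula (12) with a syllable window: a[i] holds the best total
    score over syllables[:i]; each candidate word spans at most win_size
    syllables, so the running time is O(n * win_size) = O(n)."""
    n = len(syllables)
    inf = float("inf")
    a = [inf] * (n + 1)           # a[i]: minimal score sum for a prefix
    back = [0] * (n + 1)          # back[i]: start index of the last word
    a[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - win_size), i):
            s = a[j] + score_fn(" ".join(syllables[j:i]))
            if s < a[i]:
                a[i], back[i] = s, j
    words, i = [], n              # recover the segmentation backwards
    while i > 0:
        words.append(" ".join(syllables[back[i]:i]))
        i = back[i]
    return words[::-1]
```

With score_fn returning 1/LE for dictionary words and +inf otherwise (single syllables always allowed), the input of example 2.1 segments as | mô hình | hoá học |, since 0.5 + 0.5 beats 1/3 + 1.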

If PN(ab) < PB(ab), then break wk = (ab) by setting score(wk) = +∞.
- Algorithm of disambiguation in the 3rd context:
If (wk ∈ CW) && (LE(wk) == 2) {
    If (P(ab|wL) != P(a|wL)) && (P(ab|wL)*P(a|wL) == 0) {
        PN(ab) <- P(ab|wL); PB(ab) <- P(a|wL);
    } ElseIf (P(wR|ab) != P(wR|b)) && (P(wR|ab)*P(wR|b) == 0) {
        PN(ab) <- P(wR|ab); PB(ab) <- P(wR|b);
    } Else {
        PN(ab) <- P(ab|wL)*P(wR|ab);
        PB(ab) <- P(a|wL)*P(b|a)*P(wR|b);
        If (PN(ab) == PB(ab)) {
            PN(ab) <- P(ab|wL); PB(ab) <- P(a|wL)*P(b|a);
        }
        If (PN(ab) == PB(ab)) {
            PN(ab) <- P(wR|ab); PB(ab) <- P(b|a)*P(wR|b);
        }
    }
    If (PN(ab) < PB(ab)) { score(wk) <- +∞; }   /* break out */
}
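The core comparison of PN and PB can be sketched as follows, omitting the zero-count special cases of the full algorithm for brevity; P(x, y) again stands for P(y | x) and is an assumed model interface.

```python
def should_break(P, wL, a, b, wR):
    """3rd-context test based on (7): compare PN(ab), keeping the
    compound, against PB(ab), splitting it into (a)(b), in context wL _ wR."""
    ab = a + " " + b
    pn = P(wL, ab) * P(ab, wR)
    pb = P(wL, a) * P(a, b) * P(b, wR)
    if pn == pb:                    # tie: retry on the left context only
        pn, pb = P(wL, ab), P(wL, a) * P(a, b)
    if pn == pb:                    # tie: retry on the right context only
        pn, pb = P(ab, wR), P(a, b) * P(b, wR)
    return pn < pb                  # True means set score(ab) = +inf
```

For example 2.8, a corpus where "gục đầu", "đầu vào" (as successive single syllables) and "vào chàng" all occur, but "gục" is never followed by the compound "đầu vào", yields PB > PN, so the compound is broken.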

C. The integrated method
To reduce segmentation errors, we apply the disambiguation steps in the following priority order:
Step 1. Disambiguation by the improved MM.
Step 2. Disambiguation in the 2nd context.
Step 3. Disambiguation in the 1st context.
Step 4. Disambiguation in the 3rd context.
To fit formula (2), we convert each measure into one compatible with the minimum function. If the MI measure is only used to compare (ab)(c) with (a)(bc), then
max{MI(ab), MI(bc)} ⇔ min{1 - MI(ab), 1 - MI(bc)}
so MI can be integrated into formula (2) as:

score(wi) = [1 - MI(wi)] / LE(wi)      (8)

In this case the dynamic programming formula becomes:

min{SCk(S)} = min Σ(i=1..mk) [1 - MI(wki)] / LE(wki)      (9)

The ambiguities in the 1st context are resolved by (6), comparing Pbi(ab) with Pbi(bc). We convert these probabilities to the minimum function as well:

Clearly, for WinSize = 5, the time complexity of the above algorithm is O(n).

max{Pbi(ab), Pbi(bc)} ⇔ min{1 - Pbi(ab), 1 - Pbi(bc)}

Thus, integrating the probabilities Pbi into (2):

score(wi) = [1 - Pbi(wi)] / LE(wi)      (10)

which gives the dynamic programming formula:

min{SCk(S)} = min Σ(i=1..mk) [1 - Pbi(wki)] / LE(wki)      (11)

Finally, we integrate the measures LE, Pbi and MI:
+ if Pbi(ab) ≠ Pbi(bc): scorePM(wi) = [1 - Pbi(wi)] / LE(wi)
+ if Pbi(ab) = Pbi(bc): scorePM(wi) = [1 - MI(wi)] / LE(wi)
The dynamic programming formula is therefore:

III. EXPERIMENTS

A. The corpora for testing and assessment
1) The corpora for training and testing
- SP732 [7]: the training corpus comprises 2,639 text files with a total of 1,541,188 segmented words (10 MB; SP7.3, project KC.01.01/06-10). This corpus is used for ngram statistics, computing the word bigram probabilities and the mutual information of syllables. The statistical and training results are saved into a knowledge database file for later use.


TABLE I. THE STATISTICS OF THE WORDS BY NUMBER OF SYLLABLES

(TABLE III, continued: further integrated methods on SP731)

Integrated methods      Errors(R)   R (%)   P (%)   F (%)
NER+MM+LE               4786        97.84   98.68   98.26
NER+MM+LE+MI            4611        97.92   98.75   98.33
NER+MM+LE+Pbi           3069        98.61   98.78   98.70
NER+MM+LE+Pbi+MI        3043        98.62   98.81   98.71

So the errors (by recall) decrease as the methods are integrated from top to bottom. Clearly, our results (98.71%-98.94%) are much better than previous results (95%-97%).

In this paper we have analyzed, surveyed and presented two basic forms of ambiguity in Vietnamese word segmentation: ambiguity caused by the MM method and ambiguity in the word context. On this basis, we proposed integrated methods to resolve them. Using only the dictionary with the improved MM method (NER+MM+LE), our F-scores reach 98.26% to 98.86%. Integrating context disambiguation by the word bigram probabilities and the syllable mutual information, our combined method achieves F-scores from 98.71% to 98.94%, significantly higher than previous results. By integrating further information sources into formula (12), such as identifying coordinated compound words and reduplicative words, the accuracy of Vietnamese word segmentation could be increased further. Our approach can be applied to other languages that face similar difficulties with ambiguous word segmentation.

- QTAG: a testing corpus of 7 text files with a total of 74,756 words (455 KB), tested with a dictionary of 37,000 words.
- SP731 [7]: a testing corpus of about 10,000 sentences with a total of 221,215 words (1.35 MB; SP7.3, project KC.01.01/06-10), tested with a dictionary of 41,000 words.
Before use, we spell-check QTAG, SP731 and SP732, and convert all of them to Unicode following TCVN 6909:2001. We also pre-process them to recognize named entities (NER): person names, place names, organization names, abbreviations, dates, numbers, emails, URLs, ...
2) The assessment of accuracy
Accuracy is assessed as follows:
+ Recall (R): the number of correctly segmented words divided by the total number of words in the corpus.
+ Precision (P): the number of correctly segmented words divided by the total number of segmented words in the solution.
+ Balanced F-score (F):

F = 2PR / (P + R)

REFERENCES
In Vietnamese:

B. Experiment with QTAG and the dictionary of 37,000 words
Running QTAG and measuring the recall errors and the accuracies P, R and F for the integrated methods gives the following results:

TABLE II. EXPERIMENT WITH QTAG AND DIC. 37,000 WORDS

Integrated methods      Errors(R)   R (%)   P (%)   F (%)
FMM                     1588        97.87   97.46   97.67
BMM                     1548        97.93   97.52   97.72
NER+BMM                 1033        98.62   98.95   98.79
NER+MM+LE               957         98.72   99.00   98.86
NER+MM+LE+MI            914         98.78   99.06   98.92
NER+MM+LE+Pbi           831         98.89   98.97   98.93
NER+MM+LE+Pbi+MI        821         98.90   98.97   98.94

[9] Dang Duc Pham, Giang Binh Tran, Son Bao Pham (2007), "A Hybrid Approach to Vietnamese Word Segmentation using Part of Speech tags", KSE 2009, the 1st International Conference on Knowledge and Systems Engineering, pp. 154-161.
[10] Dien Dinh, Kiem Hoang, Toan Nguyen Van (2001), "Vietnamese Word Segmentation", the Sixth Natural Language Processing Pacific Rim Symposium, Tokyo, Japan, 11/2001, pp. 749-756.
[11] Dien Dinh, Thuy Vu (2006), "A maximum entropy approach for Vietnamese word segmentation", Proc. of the 4th IEEE International

TABLE III. EXPERIMENT WITH SP731 AND DIC. 41,000 WORDS

Integrated methods      Errors(R)   R (%)   P (%)   F (%)
FMM                     9375        95.76   94.48   95.12
BMM                     9168        95.86   94.57   95.21
NER+BMM                 4929        97.77   98.64   98.20

[1] Anh Tran Ngoc, Tinh Dao Thanh (2011), "Kỹ thuật mã hoá âm tiết tiếng Việt và các mô hình n-grams, ứng dụng kiểm lỗi cách dùng từ và cụm từ tiếng Việt", Journal on Information & Communications Technologies, Vol. 6(26), 9-2011, pp. 280-289.
[2] Ban Diep Quang, Thung Hoang Van (2006), Ngữ pháp tiếng Việt, Volumes 1 & 2, Education P.H., Hanoi.
[3] Huyen N.T.M., Luong V.X., Phuong L.H. (2003), "Tách từ bằng từ điển và Gán nhãn từ loại bằng xác suất", Proc. of ICT.RDA, 2003.
[4] Huyen N.T.M., Linh H.T.T., Luong V.X. (2009), "Hướng dẫn nhận diện đơn vị từ trong ngôn ngữ tiếng Việt", Report of SP8.2, Volume 2-VLSP, Project KC01.01/06-10.
[5] Lan Nguyen (2006), Từ điển Từ và Ngữ Việt Nam, HCM General P.H.
[6] Phe Hoang, Luong V.X., Linh H.T.T., Thuy P.T., Thu D.M., Hoa D.T. (2009), Từ điển tiếng Việt, VietFlex Centre, Da Nang P.H.
[7] Thai N.P., Luong V.X., Huyen N.T.M., Phuong L.H., Thu D.M., Ngoc N.T.M., Ngan L.K., Van N.M. (2009), Report of SP7.3 - VietTreeBank, Volume 1-VLSP, Project KC01.01/06-10.
[8] Vu H.C.D., Nguyen N.L., Dien D., Hung N.Q. (2006), "Ứng dụng thuật toán so khớp cực đại và cơ chế véctơ hỗ trợ trong bài toán tách từ tiếng Việt", Proc. of NCICT2006 (@'06).

In English:

C. Experiment with SP731 and the dictionary of 41,000 words
Running the same experiment on SP731 gives the results of TABLE III.

IV. CONCLUSIONS


Conference on Computer Science - Research, Innovation and Vision of the Future 2006, HCM City, Vietnam, pp. 247-252.
[12] Ha Le An (2003), "A method for word segmentation in Vietnamese", Proceedings of the Corpus Linguistics 2003 Conference, pp. 282-287.
[13] Hieu L.T., Vu L.A., Kien L.T. (2010), "An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation", Proc. of ACIIDS, 2010, pp. 195-204.
[14] Hung Nguyen, Thanh V. Nguyen, Hoang K. Tran, Thanh T.T. Nguyen (2006), "Word Segmentation for Vietnamese Text Categorization: An online corpus approach", RIVF2006, the 4th International Conference on Computer Sciences.
[15] Jan Daciuk, Stoyan Mihov, Bruce W. Watson and Richard E. Watson (2000), "Incremental Construction of Minimal Acyclic Finite-State Automata".
[16] Jurafsky and Martin (2009), Speech & Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing, Prentice Hall.

[17] Oanh Tran, Cuong Le, Thuy Ha (2010), "Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources", Journal of NLP 17(3), pp. 41-60.
[18] Phuong L.H., Huyen N.T.M., Roussanaly A. (2009), "Finite State Description of Vietnamese Reduplication", Proc. of the 7th Workshop on Asian Language Resources.
[19] Phuong L.H., Huyen N.T.M., Roussanaly A., Vinh H.T. (2008), "A Hybrid Approach to Word Segmentation of Vietnamese Texts", Proc. of the 2nd International Conference on Language and Automata Theory and Applications, Springer LNCS 5196, Tarragona, Spain.
[20] Tu Nguyen Cam, Kien Nguyen Trung, Hieu Phan Xuan, Minh Nguyen Le, Thuy Ha Quang (2008), "Vietnamese Word Segmentation with CRFs and SVMs: An Investigation", Proceedings of the 20th PACLIC, Wuhan, China, pp. 215-222.


2) use of cost effective excitation source with improved voltage regulation and ..... power electronic applications, renewable energy systems, energy efficiency.

An XFEM/Spectral element method for dynamic crack ...
In this work we develop a high-order extended finite element method using spectral elements. The objec- tive is to increase accuracy and decrease the numerical oscillations in the modeling of dynamic fracture. This method shows many advantages in the

DART: An Efficient Method for Direction-aware ... - ISLAB - kaist
DART: An Efficient Method for Direction-aware. Bichromatic Reverse k Nearest Neighbor. Queries. Kyoung-Won Lee1, Dong-Wan Choi2, and Chin-Wan Chung1,2. 1Division of Web Science Technology, Korea Advanced Institute of Science &. Technology, Korea. 2De

[PDF] Pitch Anything: An Innovative Method for ...
Online PDF Pitch Anything: An Innovative Method for Presenting, Persuading, and .... Gold Medal Winner--Tops Sales World s Best Sales and Marketing Book “Fast ... Pitch Anything reveals the next big thing in social dynamics: game for business.?? â€

Method and apparatus for controlling space conditioning in an ...
Aug 31, 2009 - systems, the capital investment is harder to justify. One issue ... Less sophisticated control systems tend to use energy where and when it is not ... served by it showing some alternative variations on the con ?guration of FIG. 1.

An XFEM method for modeling geometrically elaborate crack ...
may have been duplicated (see Figure 7 for a one-dimensional illustration). Take {˜φi} to be the usual ...... a state-of-the-art review. Computers and Structures ...

Mounting system and method therefor for mounting an alignment ...
Jul 10, 2002 - 33/203 18_ 33/562 instrument onto a vehicular Wheel Which is to be used to ..... sensing head 20 is mounted on a support bar 74. The support.

Towards An Efficient Method for Studying Collaborative ...
emergency care clinical settings imposes a number of challenges that are often difficult .... account for these activities, we added “memory recall and information ...

An Automated Method for Circular-Arc Metro Maps
Email: [email protected]. †. TU Eindhoven, The Netherlands. Email: 1a.i.v.goethem,w.meulemans,[email protected]. ‡. Universität Osnabrück, Germany. Email: [email protected]. I. INTRODUCTION ... places two arcs

an optimized method for scheduling process of ...
Handover of IEEE 802.16e broadband wireless network had been studied in ... unnecessary scanning and HO delay mostly deals with the request, response ...