Unsupervised Morphological Disambiguation using Statistical Language Models
Mehmet Ali Yatbaz and Deniz Yuret
Dept. of Computer Engineering, Koç Üniversitesi

Introduction

• Morphological disambiguation is the task of selecting the correct parse of a word in a given context from the candidate parses of that word.
• The main challenge for supervised morphological disambiguation is the difficulty of acquiring a sufficient amount of consistent, morphologically parsed training data.
• A further issue is that, unlike in English, the number of theoretically possible parses in agglutinative languages can be infinite even though the number of features is finite.

The stems and meanings of three possible morphological parses of the Turkish word “masalı” are shown below.

Stem     Meaning
masal    the story
masal    his story
masa     with tables

Algorithm

Unsupervised Morphological Disambiguator

1. Construct a morphological dictionary for all the words in V.
2. Construct Swi by simplifying Twi, where wi is the ith target word.
3. Calculate P(vij|ci), where vij is the jth replacement of wi.
4. Calculate P(t|ci) for all t in Swi using the probabilities calculated in Step 3.
5. Select the t that maximizes P(t|ci).

Test set: 446 sentences; 5365 tokens; 45.4% ambiguous tokens; 1.85 average parses

Model

• The main idea of our model is to assign parses to contexts rather than to the words themselves.
• Thus, the model selects the parse t of the target word w that is most likely in the context of the target word, cw.
• To achieve this, the model finds the t that maximizes P(t|cw), using replacement words from the vocabulary V.

$$
\arg\max_{t \in T_w} P(t \mid c_w), \qquad \text{where } P(t \mid c_w) = \sum_{v \in V} P(t \mid v, c_w)\, P(v \mid c_w)
$$
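The sum over replacement words can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the replacement words `v1` to `v3`, their probabilities, and the `toy_table` values are made up, and only the tag names are borrowed from the poster's “masalı” example.

```python
def parse_posterior(target_parses, replacement_probs, parse_given):
    """Return P(t | c_w) for every t in T_w.

    replacement_probs maps each replacement word v to P(v | c_w);
    parse_given(t, v) returns P(t | v, c_w).
    """
    posterior = {t: 0.0 for t in target_parses}
    for v, p_v in replacement_probs.items():
        for t in target_parses:
            posterior[t] += parse_given(t, v) * p_v
    return posterior


# Toy inputs (hypothetical values, chosen only to make the arithmetic visible).
T_w = ["Pnon+Acc", "P3sg+Nom", "With"]
p_v = {"v1": 0.5, "v2": 0.3, "v3": 0.2}
toy_table = {
    ("Pnon+Acc", "v1"): 0.5, ("P3sg+Nom", "v1"): 0.5,
    ("Pnon+Acc", "v2"): 1.0,
    ("With", "v3"): 1.0,
}

posterior = parse_posterior(T_w, p_v, lambda t, v: toy_table.get((t, v), 0.0))
best_parse = max(posterior, key=posterior.get)  # the arg max over T_w
```

With these toy numbers the posterior is roughly 0.55, 0.25 and 0.20 for the three tags, so `best_parse` is `Pnon+Acc`.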

Experimental Results

We define an unsupervised and a supervised baseline.
1. Unsupervised baseline: Randomly pick a parse of w from Tw. This correctly disambiguates 39.4% of the ambiguous words.
2. Supervised baseline: Select a parse of w from Tw by majority voting. This correctly disambiguates 71.0% of the ambiguous words.
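The two baselines might look like the sketch below. The poster does not spell out how majority voting is computed, so this assumes it means "the parse most frequently assigned to the same word form in the tagged training data", with a random fallback for unseen words; all function names are hypothetical.

```python
import random
from collections import Counter


def random_baseline(parses, rng):
    """Unsupervised baseline: pick any parse in T_w uniformly at random."""
    return rng.choice(sorted(parses))


def train_majority(tagged_tokens):
    """Map each word form to its most frequent parse in (word, parse) pairs."""
    counts = {}
    for word, parse in tagged_tokens:
        counts.setdefault(word, Counter())[parse] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def majority_baseline(word, parses, majority, rng):
    """Supervised baseline (assumed form): most frequent parse, else random."""
    guess = majority.get(word)
    return guess if guess in parses else random_baseline(parses, rng)


# Toy usage with made-up tags "A" and "B".
rng = random.Random(0)
majority = train_majority([("w", "A"), ("w", "B"), ("w", "A")])
seen_guess = majority_baseline("w", {"A", "B"}, majority, rng)
unseen_guess = majority_baseline("u", {"X"}, majority, rng)
```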

Effect of Corpus Size on our model: We trained 4-gram language models on three corpora of different sizes: the original training corpus and random samples of 1% and 10% of it.

Estimation

Corpus size (words): 4M, 40M, 400M

• P(v|cw) is estimated using an n-gram language model trained on a 400-million-word Turkish web corpus.
• cw is defined as the 2n−1 word window w−n+1 … w0 … wn−1.
• Finally,

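$$
\begin{aligned}
P(w_0 = v) &\propto P(w_{-n+1} \ldots w_0 \ldots w_{n-1}) \\
&= P(w_{-n+1})\, P(w_{-n+2} \mid w_{-n+1}) \cdots P(w_{n-1} \mid w_{-n+1}^{n-2}) \\
&\propto P(w_0 \mid w_{-n+1}^{-1})\, P(w_1 \mid w_{-n+2}^{0}) \cdots P(w_{n-1} \mid w_0^{n-2})
\end{aligned}
$$

Here w_i^j denotes the word sequence w_i … w_j. The last step applies the n-gram approximation and keeps only the terms that depend on w_0.

P(t|v,cw) is estimated using two assumptions:
1. Pruning assumption: Every w has a set of possible parses Tw. Parses that are not in Tw have zero probability in the context of w.
2. Uniformity assumption: The distribution of parses given a replacement word v and context cw is uniform on Tw.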

$$
P(t \mid v, c_w) =
\begin{cases}
\dfrac{1}{|T_w \cap T_v|} & t \in T_w \cap T_v \\
0 & \text{otherwise}
\end{cases}
$$
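The two estimates in this section can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: `ngram_prob(word, history)` stands in for a smoothed language model, the toy table uses a bigram model (n = 2) rather than the 4-gram model of the poster, sentence boundaries are ignored, and a real implementation would likely work with log-probabilities.

```python
def window_score(left, v, right, ngram_prob, n):
    """Unnormalised P(w_0 = v): product of the n n-gram terms containing w_0.

    left holds w_{-n+1}..w_{-1}; right holds w_1..w_{n-1}.
    """
    words = list(left) + [v] + list(right)   # w_0 sits at index n-1
    if len(words) != 2 * n - 1:
        raise ValueError("context must contain n-1 words on each side")
    score = 1.0
    for i in range(n - 1, 2 * n - 1):        # positions of w_0 .. w_{n-1}
        history = tuple(words[i - n + 1:i])  # the n-1 preceding words
        score *= ngram_prob(words[i], history)
    return score


def replacement_distribution(left, right, vocab, ngram_prob, n):
    """Normalise the window scores over the vocabulary to get P(v | c_w)."""
    scores = {v: window_score(left, v, right, ngram_prob, n) for v in vocab}
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()} if total > 0 else scores


def uniform_parse_prob(t, T_w, T_v):
    """P(t | v, c_w) under the pruning and uniformity assumptions."""
    shared = set(T_w) & set(T_v)
    return 1.0 / len(shared) if t in shared else 0.0


# Toy bigram table (made-up probabilities) with a small floor for unseen pairs.
toy_bigrams = {
    ("x", ("a",)): 0.2, ("c", ("x",)): 0.5,
    ("y", ("a",)): 0.6, ("c", ("y",)): 0.5,
}


def toy_prob(word, history):
    return toy_bigrams.get((word, history), 1e-6)


dist = replacement_distribution(["a"], ["c"], ["x", "y"], toy_prob, n=2)
```

In the toy context `a _ c`, the replacement `y` receives about three times the probability of `x`, because P(y|a) is three times P(x|a) while both predict `c` equally well.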

Parse Simplification

• The quality of the estimate of P(t|cw) depends strongly on the parse set Tw.
• Instead of using the parses directly, we construct a minimal discriminative set Sw by selecting the minimum number of rightmost features of each parse.

Stem     Simplified parse
masal    Pnon+Acc
masal    P3sg+Nom
masa     With

Tagged training set: 50673 sentences; 948404 tokens; 42.1% ambiguous tokens; 1.76 average parses

Accuracy (%) for the 4M, 40M and 400M word corpora: 60.4, 63.1, 64.5

As the corpus size decreases, the accuracy of the model drops significantly (based on 95% confidence intervals). Thus, the performance of the model could be improved by using a larger Turkish corpus.

Effect of Replacement Word Number on our model: We calculate P(v|cw) for each replacement word, select the 10, 100, 200 and 2000 replacement words with the highest P(v|cw), and use only these words to estimate P(t|cw).

Number of replacements    Accuracy (%)
Top 10                    63.4
Top 100                   64.3
Top 200                   64.4
Top 2000                  64.5

This experiment shows that, instead of calculating P(v|cw) for the whole vocabulary, only the top k P(v|cw) values can be used, since the results do not differ significantly (based on 95% confidence intervals).
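Restricting the sum to the k most probable replacement words could be done as in the sketch below. The function name and toy probabilities are hypothetical, and the renormalisation step is an assumption made here for convenience: the poster does not say whether the truncated probabilities are renormalised, and dividing by a constant does not change which parse maximises the sum.

```python
import heapq


def top_k_replacements(replacement_probs, k):
    """Keep the k replacement words with the highest P(v | c_w)."""
    top = heapq.nlargest(k, replacement_probs.items(), key=lambda kv: kv[1])
    total = sum(p for _, p in top)
    # Renormalise so the kept probabilities sum to one (assumed, optional).
    return {v: p / total for v, p in top}


# Toy distribution over four made-up replacement words.
p_v = {"v1": 0.5, "v2": 0.3, "v3": 0.15, "v4": 0.05}
top2 = top_k_replacements(p_v, 2)
```

Here `top2` keeps only `v1` and `v2`, with renormalised probabilities of about 0.625 and 0.375.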

Conclusion

• Our model assigns parses to contexts instead of assigning them to words.
• The probabilities of the morphological analyses are calculated using a language model. Therefore, the model can be applied to any language without predefining language-dependent rules.
• We achieved 64.5% accuracy. This accuracy might be improved by relaxing the uniformity assumption and letting the parse probabilities converge to their actual values.
