INFORMATION PAPER International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009
An SVM-based Approach to Telugu Parts of Speech Tagging using SVMTool
G. Sindhiya Binulal, P. Anand Goud, K. P. Soman
CEN, Amrita Vishwa Vidyapeetham, Coimbatore, India
{sindhiyabinulal.golla, puli.anand.goud}@gmail.com, [email protected]

Abstract- There are several approaches to the problem of assigning a part-of-speech (POS) tag to each word of a natural language sentence. POS tagging is one of the most well-studied problems in Natural Language Processing (NLP), and it is a sequence labeling problem. Assigning a POS tag to each word of an unannotated corpus by hand is very time-consuming, which motivates methods that automate the job. In this paper, SVMTool is applied to the problem of POS tagging for the Telugu language. POS tagging can be seen as a multiclass classification problem, and this paper explains how binary classifiers can be combined to solve such a problem. Telugu is written the way it is spoken. The tagset used in this paper consists of 10 tags, and the training corpus consists of 25,000 words. The obtained accuracy is around 95% for Telugu; better results can be achieved by increasing the corpus size.
Keywords: SVMTool, Tagged Corpus, Tagset

I. INTRODUCTION
Parts-of-speech tagging is the process of marking up the words in a natural language sentence as belonging to particular parts of speech (also called lexical classes or word classes), based on both each word's definition and its context. Support vector machines (SVMs) have become a popular tool for discriminative classification. Tagging is generally required to be as accurate and as efficient as possible. The SVMTool is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the SVM learning framework and by offering NLP researchers a highly customizable sequential tagger generator [1]. We have applied the SVMTool to the problem of part-of-speech (POS) tagging. POS tagging is used in text-to-speech (TTS) applications, information retrieval, parsing, information extraction, translation and many more.

This paper starts with the theory of support vector machines and later explains how SVMTool can be applied to the problem of POS tagging. Training and testing data were collected from the Eenadu Telugu newspaper.

II. THEORY OF SUPPORT VECTOR MACHINES

A. Two-class support vector machines

Let S = {x1, x2, ..., xm} be a training set with m samples and labels d_i ∈ {−1, +1}. Each x_i is an n-dimensional input vector and each d_i is the class label associated with x_i. In short, SVMs are hyperplanes that separate the training data by a maximal margin, as shown in Figure 1. All vectors lying on one side of the hyperplane are labeled −1 and all vectors lying on the other side are labeled +1. The training instances that lie closest to the hyperplane are called support vectors [5]. The task of a classification algorithm is to learn a mapping x_i → d_i using the data in S. SVMs do this by constructing a hyperplane w^T x − γ = 0, where w ∈ R^n is the normal vector of the hyperplane and γ is the bias, that maximally separates the positive and negative training samples. The margin is the distance from the separating hyperplane to the closest samples of each class; it is inversely proportional to ||w||. Thus, to obtain the maximal-margin hyperplane one minimizes the Euclidean norm of w [5].
Figure 1: Two-class support vector machine (separating hyperplane w^T x − γ = 0, with maximum margin)
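As a rough illustration of the maximal-margin idea above, a linear soft-margin SVM can be trained by subgradient descent on the hinge-loss form of the primal objective. This is only a sketch with made-up toy data, not SVMTool's solver (real SVM packages solve the dual QP instead):

```python
# Sketch: linear soft-margin SVM via subgradient descent on the primal
# objective  (1/2)||w||^2 + C * sum_i max(0, 1 - d_i (w.x_i - gamma)).
# Illustrative only; data and hyperparameters are invented.

def train_svm(xs, ds, C=1.0, lr=0.01, epochs=200):
    n = len(xs[0])
    w = [0.0] * n
    gamma = 0.0                      # bias, as in w.x - gamma = 0
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            margin = d * (sum(wi * xi for wi, xi in zip(w, x)) - gamma)
            grad_w = list(w)         # subgradient of the regularizer
            grad_g = 0.0
            if margin < 1:           # point inside the margin: hinge active
                grad_w = [wi - C * d * xi for wi, xi in zip(w, x)]
                grad_g = C * d
            w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
            gamma -= lr * grad_g
    return w, gamma

def predict(w, gamma, x):
    # Sign of the discriminant function f(x) = w.x - gamma
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - gamma >= 0 else -1

# Toy, linearly separable two-class data
xs = [[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]]
ds = [1, 1, -1, -1]
w, gamma = train_svm(xs, ds)
print([predict(w, gamma, x) for x in xs])
```

The subgradient step when a point violates the margin moves w toward d·x, exactly the trade-off between training error and margin that the parameter C controls.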
© 2009 ACADEMY PUBLISHER
Figure 2: Three-class SVM reduced to three binary SVM problems.
Formally, this goal translates into the following quadratic programming problem:

min_{w,γ,ξ} (1/2) w^T w + C Σ_{i=1..m} ξ_i    (1)

subject to d_i (w^T x_i − γ) + ξ_i − 1 ≥ 0,  ξ_i ≥ 0,  1 ≤ i ≤ m

where C is a tuning parameter that controls the compromise between training error and maximum margin, and the ξ_i are slack variables that permit dealing with non-linearly separable data. The problem in (1) is more easily solved by considering its dual formulation:

min_u L_D(u) = (1/2) Σ_{i=1..m} Σ_{j=1..m} d_i d_j x_i^T x_j u_i u_j − Σ_{i=1..m} u_i    (2)

subject to Σ_{i=1..m} d_i u_i = 0,  0 ≤ u_i ≤ C,  1 ≤ i ≤ m

where the u_i are called Lagrange multipliers. With the optimal values γ* and w* (indirectly) obtained by solving (2), the discriminant function can be expressed as:

f(x) = Σ_{i=1..nsv} u_i* d_i x^T x_i − γ*    (3)

where nsv is the number of samples associated with nonzero u_i, the so-called support vectors.

B. Multi-class SVM

Tagging a word in context is a multi-class classification problem, with as many classes as there are tags. The basic idea is to split an N-class problem into N binary problems. SVMTool follows a simple one-vs-rest binarization, i.e., an SVM is trained for every POS tag in order to distinguish between examples of that class and all the rest [1]. This is illustrated in Figure 2 with the example sentence:

Anand (NN) mariyu (CC) shrinu (NN) manchi (ADJ) snehithulu (NN).

When tagging a word, the most confident tag according to the predictions of all the binary SVMs is selected.

III. TAG SET USED

In this paper we use a very simplified tagset, developed for prosodic pause prediction in speech recognition applications [3].

TABLE I: TELUGU POS TAGSET

TAG     TYPE                                       EXAMPLES
NN      Proper nouns, compound nouns, pronouns     All names, Athadu, Ame, Adi, edi
VRB     Present, future and past forms of verbs    Vesi, poi, peVtti, ceVbuwU, wiMtU
ADJ     All adjective forms                        nallani, aMxamEna
ADV     All adverbs                                meVllagA, woVMxaragA
QW      Question words                             emiti, eVlA
NEG     Negative words                             kAxu, lexu
CC      Conjunction words                          mariya, leka, ani, kAni, gAni
QFNUM   All numbers                                Rendu, 12, mUgguru
PREP    Pre- and postposition words                koVraku, kosaM
SYM     All symbols                                , : ; ?
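The one-vs-rest reduction of Section II-B can be sketched with a few tags from the table above. The linear scorers here are trained with a toy perceptron as a stand-in for the binary SVMs that SVMTool would learn; the features are invented for illustration, and the words come from the example sentence:

```python
# Sketch of one-vs-rest tagging: one binary scorer per PoS tag; the tag
# whose scorer is most confident (highest score) wins.  The perceptron
# below is a stand-in for a binary SVM -- illustrative only.

TAGS = ["NN", "ADJ", "CC"]

def features(word):
    # Illustrative features: word suffix and word length
    return {"suffix:" + word[-2:]: 1.0, "len:" + str(len(word)): 1.0}

def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_one_vs_rest(examples, epochs=20):
    weights = {t: {} for t in TAGS}
    for _ in range(epochs):
        for word, gold in examples:
            feats = features(word)
            for tag in TAGS:            # one binary problem per tag
                d = 1.0 if tag == gold else -1.0
                if d * score(weights[tag], feats) <= 0:   # mistake: update
                    for f, v in feats.items():
                        weights[tag][f] = weights[tag].get(f, 0.0) + d * v
    return weights

def tag(weights, word):
    # Select the most confident tag across all binary scorers
    feats = features(word)
    return max(TAGS, key=lambda t: score(weights[t], feats))

train = [("Anand", "NN"), ("mariyu", "CC"), ("shrinu", "NN"),
         ("manchi", "ADJ"), ("snehithulu", "NN")]
w = train_one_vs_rest(train)
print([tag(w, word) for word, _ in train])
```

The final argmax over per-tag scores mirrors how the most confident tag is selected among the binary SVM predictions.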
[Figure 2 (content): NN v/s rest → Anand, shrinu, snehithulu (Class NN); ADJ v/s rest → manchi (Class ADJ); CC v/s rest → mariyu (Class CC)]

IV. SVMTOOL
The SVMTool software package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). Before tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component; different models are learned for the different tagging strategies. At tagging time, the SVMTagger component can be used with the tagging strategy most suitable for the purpose of the tagging. Finally, given a correctly tagged corpus and the corresponding SVMTool-predicted annotation, the SVMTeval component displays tagging results and reports.
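As a rough illustration of the kind of report the evaluator produces, the sketch below computes accuracy overall and split by known vs. unknown words (words seen or not seen in training). This is not SVMTool code; the tiny gold/predicted corpora and function names are invented:

```python
# Sketch of an SVMTeval-style breakdown: overall accuracy, plus accuracy
# over known words (in the training vocabulary) and unknown words.

def evaluate(gold, predicted, training_vocab):
    counts = {"all": [0, 0], "known": [0, 0], "unknown": [0, 0]}
    for (word, gtag), (_, ptag) in zip(gold, predicted):
        group = "known" if word in training_vocab else "unknown"
        for key in ("all", group):
            counts[key][0] += int(gtag == ptag)   # correct tags
            counts[key][1] += 1                   # total tags
    return {k: c / t if t else 0.0 for k, (c, t) in counts.items()}

gold      = [("Anand", "NN"), ("mariyu", "CC"), ("manchi", "ADJ"), ("ravi", "NN")]
predicted = [("Anand", "NN"), ("mariyu", "CC"), ("manchi", "NN"),  ("ravi", "ADJ")]
vocab = {"Anand", "mariyu", "manchi"}

print(evaluate(gold, predicted, vocab))
```

Such a breakdown is what makes the known-vs-unknown accuracy figures reported in Section V possible.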
A. Training data format

Training data must be in column format, i.e., a word-per-line corpus, sentence by sentence. The column separator is a single blank space. The word is expected to be the first column of the line, and the tag to predict takes the second column. Following is a sample of the training data:

Anand NN
mariyu CC
shrinu NN
manchi ADJ
snehithulu NN
. SYM

B. SVMTagger

Given a text corpus (one word per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs the POS tagging of a sequence of words. At the initial stages the output tag for some words may not be perfect; this has to be corrected manually.

C. SVMTeval

Given an SVMTool-predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. It is a very useful component for tuning system parameters such as the C parameter, the feature patterns and filtering, the model compression, et cetera. Based on a given morphological dictionary (e.g., the one automatically generated at training time), results may also be presented for different sets of words (known vs. unknown words, ambiguous vs. unambiguous words). The same results can also be viewed from the ambiguity-class perspective, i.e., words sharing the same kind of ambiguity may be considered together; likewise, words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can be grouped.

V. RESULTS

The experiments were conducted with our corpus of 25,000 words, divided into a training set (20,000 words) and a test set (5,000 words). We obtained an overall accuracy of 95%; for unknown words the accuracy was 86.25%.

VI. CONCLUSION

In this paper we have described an SVMTool-based approach to automatic tagging of a Telugu language corpus. We have found that automatic tagging of Telugu can be done very efficiently using SVMTool, and it provides good accuracy. As future work, the output from this tagger can be given to a morphological analyzer for further classification.

REFERENCES

[1] Jesús Giménez and Lluís Màrquez, SVMTool: Technical Manual v1.3, August 2006.
[2] T. Sree Ganesh, "Telugu Parts of Speech Tagging in WSD," Language in India, vol. 6, University of Hyderabad, 8 August 2006.
[3] Veera Raghavendra and Srinivas Desai, "Prosodic Pause Prediction Using Reduced Tag Set and Sub Word Units."
[4] PVS. Avinesh and Karthik Gali, "Part-of-Speech Tagging and Chunking Using Conditional Random Fields and Transformation Based Learning," Department of Computer Science, IIIT-Hyderabad.
[5] Bruno Feres de Souza and André Ponce de Leon F. de Carvalho, "Gene selection based on multi-class support vector machines and genetic algorithms," Genetics and Molecular Research 4(3): 599-607 (2005).