Developing a Tagset for Manipuri Part of Speech Tagging


JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 5, ISSUE 1, JANUARY 2011 25

Kh Raju Singha, Bipul Syam Purkayastha, Kh Dhiren Singha and Arindam Roy

Abstract: A tagset is an essential element of part of speech tagging in any natural language, and a fundamental component of an effective machine translation system. This paper proposes a tagset for Manipuri part of speech tagging in accordance with the EAGLES guidelines and the IL-POST framework for morphosyntactic annotation of corpora.

Index Terms: Tagset, Tagging, Computational Linguistics, Morphosyntactic, Lexical item, Affixes, Morphology, Word Class.


1 INTRODUCTION

A part of speech (POS) tagger is one of the important components in the development of any serious application in Computational Linguistics (CL) and Natural Language Processing (NLP). Part of speech tagging is a technique for the automatic annotation of lexical categories: it assigns to each word in a sentence the tag corresponding to its part of speech, based on both the word's definition and its context [1]. A POS tagger takes a sentence as input and assigns a unique part of speech tag to each lexical item of the sentence. POS tagging is used as an early stage of linguistic text analysis in many applications, including subcategory acquisition, text to speech synthesis and alignment of parallel corpora [1]. A POS tagger can also be used in other areas of NLP such as semantic analysis, information retrieval, shallow parsing, information extraction and machine translation.

Building a POS tagger requires a tagset for the language. Like English, Hindi and other languages, Manipuri has its own lexical categories. This paper describes the creation of a tagset for Manipuri in accordance with the EAGLES guidelines and the IL-POST (Indian Language POS Tagset) framework for morphosyntactic annotation of corpora.

The roadmap of the paper is as follows: Section 2 surveys the literature, Section 3 gives a brief description of the Manipuri language, Section 4 discusses the typological features of Manipuri, Section 5 discusses how word classes are formed in Manipuri, Section 6 describes the proposed morphosyntactic tagset for Manipuri and Section 7 concludes the paper.

2 LITERATURE SURVEY

UPenn, Brown and C5 are early tagsets for English. They were mostly simple lists of tags corresponding to morphosyntactic features, and varied greatly in granularity [2]. The CLAWS2 tagset [3] marked an important change in the structure of POS tagsets, from a flat structure with unitary tags to a hierarchical structure that allows decomposable tags. This made it possible to tag, in a systematic way, all the lexical items of a language that show distinct grammatical behaviour.

Several POS tagsets have been developed by research groups working on Indian languages. Very few of them are publicly available, e.g., the IIIT-tagset and the AUKBC Tamil tagset. Because these tagsets are motivated by specific research agendas, they differ considerably in morphosyntactic categories and features, tag definitions, level of granularity, annotation guidelines, etc. [4]. D.S. Thoudam and S. Bandyopadhyay [5] developed a morphology driven Manipuri POS tagger using a flat tagset of 13 tags. A language such as Manipuri, which is monosyllabic and agglutinative and has a large number of affixes, requires a hierarchical tagset with language-specific attribute values to tag its lexical items.
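The difference between a flat tag and a hierarchical, decomposable tag discussed in the literature survey can be illustrated with a minimal Python sketch. The class name `HierarchicalTag`, the attribute name `case` and the dotted label format are assumptions made for illustration only; they are not the notation of IL-POST or of any tagset cited here. The values NC (common noun) and GEN (genitive) are taken from the tagset proposed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict

# A flat tag is an opaque string: nothing can be read off it but the label.
flat_tag = "NC"


@dataclass(frozen=True)
class HierarchicalTag:
    """A decomposable tag: category, sub-category and attribute values."""
    category: str                 # major category, e.g. "noun"
    subtype: str                  # sub-category tag, e.g. "NC"
    attributes: Dict[str, str] = field(default_factory=dict)

    def label(self) -> str:
        """Render the tag as one string (hypothetical dotted format)."""
        values = [self.attributes[name] for name in sorted(self.attributes)]
        return ".".join([self.subtype] + values)


# Attribute name "case" is an assumption; GEN is the genitive value.
tag = HierarchicalTag("noun", "NC", {"case": "GEN"})
print(tag.label())             # NC.GEN
print(tag.attributes["case"])  # GEN
```

Unlike the flat string, the hierarchical tag can be queried for a single feature without parsing the label, which is what makes a large inventory of affix-driven distinctions manageable.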

• Kh Raju Singha is with the Department of Computer Science, Assam University, Silchar.
• Bipul Syam Purkayastha is with the Department of Computer Science, Assam University, Silchar.
• Kh Dhiren Singha is with the Department of Linguistics, Assam University, Silchar.
• Arindam Roy is with the Department of Computer Science, Assam University, Silchar.

3 MANIPURI LANGUAGE

Manipuri (Meiteilon or Meiteiron) is one of the oldest languages in South-East Asia and has its own script (Meitei Mayek) and literature. At present Manipuri is written in the Bengali script, a practice that dates from 1709 A.D., i.e., from the reign of King Pamheiba. Manipuri is widely spoken in Manipur, Assam, Tripura, Bangladesh and Myanmar, and has been included in the Eighth Schedule of the Indian Constitution since 1992. The total number of people who returned Manipuri as their mother tongue was 1,500,000, of whom 1,466,705 speakers reside in India (Census of India, 2001). It is the first Tibeto-Burman language to have obtained its due place and recognition in the Indian Constitution [6]. Linguistically, it belongs to the Kuki-Chin group of the Tibeto-Burman family of languages [7], influenced and enriched by the Indo-Aryan languages of Sanskrit origin and by English.

© 2011 JCSE http://sites.google.com/site/jcseuk/

4 TYPOLOGICAL FEATURES OF MANIPURI LANGUAGE

1. Manipuri has a total of 32 phonemes, consisting of 6 vowels, 24 consonants and two tones (high and low).
   e.g., í / ঈ ‘blood’, ì / ই ‘hay’.
2. Gender distinction in Manipuri follows the natural recognition of sex, i.e., gender is not grammatically marked in the language.
   e.g., nupamaca / nupimaca-du-na isei sak-lam-i
         boy / girl-dist-Nmz. song sing-Evi.-Asp.
         ‘That boy / girl was singing a song.’
3. Number is also not grammatically significant in Manipuri, i.e., there is no subject-predicate agreement as far as number is concerned.
   e.g., nupamaca / nupimaca-sing-na isei sak-lam-i
         boy / girl-Pl.-Nmz. song sing-Evi.-ASP.
         ‘Boys / girls were singing (a) song.’
4. Case relations in Manipuri are expressed by means of postpositions.
   e.g., ai-na ma-hak-pu keithel-da u-ram-i
         I-Nmz. he-hon-Acc. market-Loc. see-Evi.-ASP.
         ‘I saw him in the market.’
5. Words in Manipuri are highly monosyllabic, i.e., even a single vowel can be a syllable, a morpheme or a word in the language.
   e.g., ù ‘tree’, í ‘blood’.
6. Like many other Southeast Asian languages, Manipuri is an agglutinating language with a large number of affixes, particularly prefixes and suffixes.
   e.g., a-pamba-du tau-han-ning-i
         Prefix-like-Nmz. do-Caus.-wish-ASP.
         ‘(I) wish to make you do whatever you like.’
7. Words in Manipuri are mainly formed by morphological processes such as derivation, compounding and reduplication.
   e.g., derivation: ca-ba ‘eating’ (bound root-Nmz.)
         compounding: cak-sang ‘kitchen’ (rice-hut)
         reduplication: khun ‘village’, khun-khun ‘villages’
8. The normal order of words in an unmarked sentence is subject-object-verb (SOV).
   e.g., cauba-na laphoi ca-i
         Cauba-Nmz. banana eat-ASP.
         ‘Cauba eats a banana.’
9. There are no relative pronouns; the relative clause is expressed by means of a particle.
   e.g., Imphal-dagi lak-pa nupamaca-du ai-gi marup-ni
         Imphal-ablative come-participle boy-Dem. I-Gen. friend-Cop.
         ‘The boy who came from Imphal is my friend.’
10. Negation is mainly formed by affixation, specifically by suffixation.
   e.g., ai ca thak-te
         I tea drink-Neg.
         ‘I do not take tea.’
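The interlinear examples above pair each hyphen-separated morpheme with a gloss label. A small Python sketch can make that pairing explicit for the sentence in item 4. The function name `align_gloss` is hypothetical, and the gloss line is the one given in the paper with its punctuation and capitalisation normalised (an assumption about the intended alignment, not a linguistic analysis of its own).

```python
from typing import List, Tuple


def align_gloss(words: str, glosses: str) -> List[List[Tuple[str, str]]]:
    """Pair each morpheme with its gloss label, word by word."""
    word_list = words.split()
    gloss_list = glosses.split()
    if len(word_list) != len(gloss_list):
        raise ValueError("word and gloss lines differ in length")
    aligned = []
    for word, gloss in zip(word_list, gloss_list):
        morphemes = word.split("-")
        labels = gloss.split("-")
        if len(morphemes) != len(labels):
            raise ValueError(f"cannot align {word!r} with {gloss!r}")
        aligned.append(list(zip(morphemes, labels)))
    return aligned


# Sentence from item 4: 'I saw him in the market.'
example = align_gloss(
    "ai-na ma-hak-pu keithel-da u-ram-i",
    "I-Nmz he-hon-Acc market-Loc see-Evi-ASP",
)
for word in example:
    print(word)
```

Each verb or noun in the sentence decomposes into a root followed by one or more suffixes, which is the property that motivates attaching attribute values to individual morphemes in the tagset.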

5 WORD CLASSES IN MANIPURI

Word classes are harder to determine in Manipuri than in English, Hindi, Russian or Tamil, except for time-stable nouns such as চীঙ/ching/mountain, ঈশিং/ising/water and নুমি/numit/sun. Word class in Manipuri is mainly determined by the affixes attached to a root. A bound root can form a noun, verb, adjective or adverb through affixation, as the following examples in the underlying representation show.

Nouns
1. Formation of a noun by prefixation to the bound root: prefix + bound root → noun
   e.g., খু+চেন → খুচেন / ‘the way of running’
2. Formation of a noun by suffixation to the bound root: bound root + suffix → noun
   e.g., চেন+বা → / ‘running’
3. Formation of a noun by combination of two nouns: noun + noun → noun
   e.g., চাক/‘rice’ + শঙ/‘hut’ → / ‘kitchen’
4. Formation of a noun by combination of a noun and a bound root: noun + bound root → noun
   e.g., থোং+নাউ → / ‘window’

Verbs
Formation of a verb by suffixation to the bound root: bound root + suffix (tense/aspect/evidential marker) → verb
   e.g., চেন+লি → / ‘running’

Adjectives
Formation of an adjective by prefixation and suffixation to the bound root: prefix + bound root + suffix → adjective
   e.g., অ+চেন+বা → / ‘running (adj)’
         অ+ঙাং+বা → / ‘red (adj)’
Apart from the above, some adjectives, particularly colour terms, are free morphemes: হীগোক/‘blue’, নাপু/‘yellow’, লৈঙাং/‘saffron’, etc.

Adverbs
Formation of an adverb by suffixation to the bound root: bound root + suffix (case marker) → adverb
   e.g., তপ+না → / ‘slowly (adv)’

In addition to the above classes, there are other word classes, viz. pronouns, quantifiers, specifiers, demonstratives, participles, punctuations and residuals, which are morphosyntactically determined without any complexity in Manipuri. The root itself can be considered a word class in Manipuri. We therefore categorized 12 major word classes in Manipuri, viz. root, noun, pronoun, verb, adjective, quantifier, specifier, adverb, demonstrative, particle, participle, punctuation and residual.
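The affixation patterns of this section (for bound roots only, leaving compounds aside) can be written as a small decision function. This is a sketch under stated assumptions, not a tagger: the function name `word_class` and the three suffix-kind labels `"nominalizer"`, `"tam"` and `"case"` are hypothetical names chosen here, corresponding to the suffix বা, to tense/aspect/evidential markers such as লি, and to case markers such as না respectively.

```python
from typing import Optional


def word_class(has_prefix: bool, suffix_kind: Optional[str]) -> str:
    """Map an affixation pattern on a bound root to a word class."""
    if has_prefix and suffix_kind is not None:
        return "adjective"   # prefix + bound root + suffix
    if has_prefix:
        return "noun"        # prefix + bound root
    if suffix_kind == "nominalizer":
        return "noun"        # bound root + nominalizing suffix
    if suffix_kind == "tam":
        return "verb"        # bound root + tense/aspect/evidential marker
    if suffix_kind == "case":
        return "adverb"      # bound root + case marker
    raise ValueError("pattern not covered by the rules of this section")


# The paper's own examples, encoded by pattern (labels are assumptions).
print(word_class(True, None))            # খু+চেন     -> noun
print(word_class(False, "nominalizer"))  # চেন+বা     -> noun
print(word_class(False, "tam"))          # চেন+লি     -> verb
print(word_class(True, "nominalizer"))   # অ+চেন+বা   -> adjective
print(word_class(False, "case"))         # তপ+না      -> adverb
```

The same bound root চেন yields a noun, a verb or an adjective depending only on its affixes, which is why the tagset treats the root as a category of its own.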

6 PROPOSED MORPHOSYNTACTIC TAGSET FOR MANIPURI

This section presents the procedure used to design a tagset for the Manipuri language. The tagset contains 97 tags in total, covering generic attributes and language-specific attribute values. It is based on the IL-POST framework, customized to meet the morphosyntactic requirements of Manipuri and to follow the language-specific and writing conventions of the language. The tagset has two tables. Table 1 contains 29 sub-category tags for the 12 major categories: 2 tags for root, 5 for noun, 6 for pronoun, 1 for verb, 1 for adjective, 1 for quantifier, 1 for specifier, 1 for adverb, 2 for demonstrative, 3 for participle, 3 for particle, 1 for punctuation and 2 for residual. Table 2 contains the morphosyntactic features, or attributes, of the sub-categories: 32 attributes with 68 attribute value tags.

MANIPURI TAGSET : TABLE 1

Tag     Description              Tag     Description
NC      Common noun              DAB     Absolute demonstrative
NCP     Compound noun            DWH     Wh-demonstrative
NP      Proper noun              RB      Adverb
ND      Derived noun             PLRL    Relative participle
NST     Spatio-Temporal          PLV     Verbal participle
PPN     Personal pronoun         PLC     Conditional participle
PPS     Possessive pronoun       RPCD    Co-ordinating particle
PDM     Demonstrative pronoun    RPSB    Subordinating particle
PRF     Reflexive pronoun        RPINT   Interjection particle
PRC     Reciprocal pronoun       RDF     Foreign word residual
PIN     Interrogative pronoun    RDS     Symbol residual
V       Verb                     RTB     Bound Root
JJ      Adjective                RTF     Free Root
QT      Quantifier               PUN     Punctuation
SPEC    Specifier

MANIPURI TAGSET : TABLE 2 (attribute values ACC to NMN)

Tag     Description              Tag     Description
ACC     Accusative               REFL    Reflexive
INS     Instrumental             RECI    Reciprocal
DAT     Dative                   PURP    Purposive
GEN     Genitive                 COMM    Commutative
ABL     Ablative                 COP     Copula
SOC     Sociative                SURP    Surprise
LOC     Locative                 DUB     Dubitative
NMZ     Nominalizer              CONF    Confirmation
SIM     Similaritive             CMPL    Complaint
PRG     Progressive              INSIS   Insistent
PRF     Perfective               DTRB    Distributive
PROS    Prospective              DEF     Definiteness
INC     Inceptive                EMPH    Emphatic
HAB     Habitual                 NEG     Negative
DEC     Declarative              PRX     Proximal
SUP     Supplicative             DST     Distal
PROH    Prohibitive              INL     Inclusive
IMP     Imperative               EXL     Exclusive
PERM    Permissive               HON     Honorificity
OPT     Optative                 ORD     Ordinal
ALL     Allative                 EXAS    Exasperation
APP     Approximate              PERSU   Persuasion
INT     Interrogative            CRD     Cardinal
POT     Potential                NMN     Non-numeral
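The tag counts stated in Section 6 can be cross-checked with a few lines of Python. The per-category figures below are copied from the section text; the variable names are illustrative, and the script is offered only as an arithmetic check of the stated totals.

```python
# Tags per major category in Table 1, as stated in Section 6.
SUBCATEGORY_TAGS = {
    "root": 2, "noun": 5, "pronoun": 6, "verb": 1, "adjective": 1,
    "quantifier": 1, "specifier": 1, "adverb": 1, "demonstrative": 2,
    "participle": 3, "particle": 3, "punctuation": 1, "residual": 2,
}

# Attribute value tags in Table 2, as stated in Section 6.
ATTRIBUTE_VALUE_TAGS = 68

table1_total = sum(SUBCATEGORY_TAGS.values())
tagset_total = table1_total + ATTRIBUTE_VALUE_TAGS

print(table1_total)  # 29
print(tagset_total)  # 97
```

The sums agree with the 29 sub-category tags and the overall total of 97 tags reported in the section.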

6.1 Examples of Manipuri Text Using the Tagset

The following real text is tagged according to the proposed tagset for the Manipuri language.

Sample Text:
লাইরিক তমবা হায়বসি মীওইবগী পুন্সিগি য়াম্না মরু ওইবা থৌদাং অমনি। চাউখলবা ঙসিগি মতম অসিদা লাইরিক হৈত্রবদি মি তাংবগা চপ মান্নৈ হাইনরিবনি।

Tagged Text:
লাইরিক\NC তমবা\ND হায়ব\ND সি\PRX মীওইব\NC গী\GEN পুন্সি\NC গি\GEN য়ামনা\SPEC মরু\NC ওইবা\ND থৌদাং\NC অম\CRD নি\COP ।\T
চাউ\RTB খ\UP ল\PROS বা\NMZ ঙসি\NST গি\GEN মতম\NC অসি\PDM দা\LOC লাইরিক\NC হৈ\RTB ত্র\NEG ব\NMZ দি\DEF মি\NC তাংব\ND গা\SOC চপ\QNT মান্নৈ\V হাই\RTB ন\PURP রি\PRG ব\NMZ নি\COP ।\T
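The tagged text uses a simple `token\TAG` notation with items separated by spaces. A minimal Python sketch for reading that notation is given below, applied to the first tagged sentence. The function name `parse_tagged` is hypothetical, and the parser assumes only what the sample shows: whitespace between items and a single backslash before the tag.

```python
from typing import List, Tuple

# First sentence of the tagged sample text (raw string keeps the backslashes).
TAGGED = (
    r"লাইরিক\NC তমবা\ND হায়ব\ND সি\PRX মীওইব\NC গী\GEN পুন্সি\NC গি\GEN "
    r"য়ামনা\SPEC মরু\NC ওইবা\ND থৌদাং\NC অম\CRD নি\COP ।\T"
)


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'token\\TAG' items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        token, sep, tag = item.rpartition("\\")
        if not sep or not token or not tag:
            raise ValueError(f"malformed item: {item!r}")
        pairs.append((token, tag))
    return pairs


pairs = parse_tagged(TAGGED)
tags = [tag for _, tag in pairs]
print(len(pairs))        # 15
print(tags.count("NC"))  # 5
```

Once the text is in `(token, tag)` form, the pairs can be counted, checked against the tagset tables or used as training data for a tagger.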

7 CONCLUSION AND FUTURE WORK

This paper has proposed a morphosyntactic tagset for the Manipuri language as part of the larger goal of computer processing of Manipuri. Future work is to design and test an algorithm for part of speech tagging in Manipuri. This will involve manually tagging Manipuri text, creating a lexicon and then defining a set of hand-coded rules that yield a single part of speech for each word. The authors also intend to update the tagset to cover multi-word expressions, reduplication and named entity lexical items of the Manipuri language.

MANIPURI TAGSET : TABLE 2 (attribute values MAS to CAUS)

Tag     Description              Tag     Description
MAS     Masculine gender         OBL     Obligation
FEM     Feminine gender          VOL     Volition
SG      Singular number          EVI     Evidential
DU      Dual number              CERT    Certainty
PL      Plural number            UP      Upward
1       First person             DOWN    Downward
2       Second person            IN      Inward
3       Third person             NPOT    Non-potential
ERG     Ergative                 OUT     Outward
NOM     Nominative               CAUS    Causative

REFERENCES

[1] Sandipan Dandapat, Sudeshna Sarkar and Anupam Basu, “A Hybrid Model for Part-of-Speech Tagging and its Application to Bengali”, Transactions on Engineering, Computing and Technology, V1, December 2004, ISSN 1305-5313.
[2] Hardie, A., “The Computational Analysis of Morphosyntactic Categories in Urdu”, PhD thesis submitted to Lancaster University, 2004.
[3] Santorini, B., “Part-of-speech tagging guidelines for the Penn Treebank Project”, Technical report MS-CIS-90-47, Department of Computer and Information, 1990.
[4] Baskaran S. et al., “Framework for a Common Parts-of-Speech Tagset for Indic Languages (Draft)”, http://research.microsoft.com/~baskaran/POSTagset/, 2007.
[5] Thoudam Doren Singh and Sivaji Bandyopadhyay, “Morphology Driven Manipuri POS Tagger”, Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 91–98, Hyderabad, India, January 2008.
[6] Kh. Dhiren Singha, “Loan Words in Manipuri”, Bilingualism and North-East India, an Assam University Publication, 2008.
[7] Grierson, G.A. (ed.) (1903-28), Linguistic Survey of India, Vol. III, Pt. III (reprinted 1967-68), Delhi-Varanasi: Motilal Banarsidas.
[8] IIIT-tagset, A Parts-of-Speech tagset for Indian languages, http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf.
[9] S. Imoba, “Manipuri to English Dictionary”, S. Ibetombi Devi, Imphal, 2004.
[10] Ch. Yashawanta Singh, “Manipuri Grammar”, Rajesh Publications, New Delhi, 2000.
[11] Leech, G. and Wilson, A., “Recommendations for the Morphosyntactic Annotation of Corpora”, EAGLES Report EAG-TCWGMAC/R, 1996.
[12] Eric Brill, “A simple rule-based part of speech tagger”, in Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy, 1992.
[13] D. S. Thoudam and S. Bandyopadhyay, “Word Class Sentence Type Identification in Manipuri Morphological Analyzer”, in Proceedings of MSPIL, IIT Bombay, pp. 11-17, 2006.
[14] P.C. Thoudam, “Problems in the Analysis of Manipuri Language”, www.ciil-ebooks.net, CIIL, Mysore, 2006.
[15] HSK Corpus Linguistics, “Development of tag sets or part-of-speech tagging”, MILES, Release 18.02x on Tuesday January 22 18:53:50 BST, 2008.
[16] Ihsan Rabbi et al., “Developing a Tagset for Pashto Part of Speech Tagging”, Second International Conference on Electrical Engineering, 25-26 March 2008.
[17] John Fry, “Part-of-Speech Tagged Corpora”, Linguistics 115: Corpus Linguistics, Fall 2007, SJSU.

Kh Raju Singha is a Ph.D. student in the Department of Computer Science, Assam University, Silchar.

Bipul Syam Purkayastha is working as a Professor in the Department of Computer Science, Assam University, Silchar. He is a member of IEEE and ACM journal.

Kh Dhiren Singha is working as an Associate Professor in the Department of Linguistics, Assam University, Silchar. He is a member of the Linguistic Society of India and International Journal of Dravidian Linguistics.

Arindam Roy is working as an Assistant Professor in the Department of Computer Science, Assam University, Silchar.