JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 5, ISSUE 1, JANUARY 2011 25
Developing a Tagset for Manipuri Part of Speech Tagging
Kh Raju Singha, Bipul Syam Purkayastha, Kh Dhiren Singha and Arindam Roy
Abstract—A tagset is a very important element of part of speech tagging in any natural language. It is a fundamental component in developing an effective machine translation system for any natural language. This paper proposes a tagset for Manipuri part of speech tagging in accordance with the EAGLES guidelines and the IL-POST framework for morphosyntactic annotation of corpora.
Index Terms—Tagset, Tagging, Computational Linguistics, Morphosyntactic, Lexical item, Affixes, Morphology, Word Class.
1 INTRODUCTION
A part of speech (POS) tagger is one of the important components in the development of any serious application in Computational Linguistics (CL) and Natural Language Processing (NLP). Part of speech tagging is a technique for the automatic annotation of lexical categories. It assigns each word in a sentence an appropriate tag corresponding to its part of speech, based on both its definition and its context [1]. A POS tagger takes a sentence as input and assigns a unique part of speech tag to each lexical item of the sentence. POS tagging is used as an early stage of linguistic text analysis in many applications, including subcategory acquisition, text to speech synthesis and alignment of parallel corpora [1]. A POS tagger can also be used in other areas of Natural Language Processing such as semantic analysis, information retrieval, shallow parsing, information extraction and machine translation.
To create a POS tagger, it is necessary to build a tagset for the language. Like English, Hindi and other languages, Manipuri has its own lexical categories. This paper describes the creation of a tagset for Manipuri in accordance with the EAGLES guidelines and the IL-POST (Indian Language POS Tagset) framework for morphosyntactic annotation of corpora.
The roadmap of the paper is as follows: Section 2 surveys the literature, Section 3 gives a brief description of the Manipuri language, Section 4 discusses the typological features of Manipuri, Section 5 discusses how word classes are formed in Manipuri, Section 6 describes the proposed morphosyntactic tagset for Manipuri and Section 7 concludes the paper.
2 LITERATURE SURVEY
UPenn, Brown and C5 are tagsets for English designed in the early 1970s. They were mostly simple lists of tags corresponding to morphosyntactic features, and they varied greatly in granularity [2]. The CLAWS2 tagset [3] marked an important change in the structure of POS tagsets, from a flat structure with unitary tags to a hierarchical structure that allows decomposable tags. This made it possible to tag, in a systematic way, all the lexical items of a language that show distinct grammatical behaviour. Several POS tagsets have been developed by research groups working on Indian languages. Very few of them are publicly available, e.g. the IIIT-tagset and the AUKBC Tamil tagset. Because these tagsets are motivated by specific research agendas, they differ considerably in morphosyntactic categories and features, tag definitions, level of granularity, annotation guidelines, etc. [4]. D.S. Thoudam and S. Bandyopadhyay [5] developed a morphology driven Manipuri POS tagger using a flat tagset consisting of 13 tags. A hierarchical tagset with language-specific attribute values is required to tag the lexical items of a language such as Manipuri, which has a large number of affixes and monosyllabic, agglutinative typological features.
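The contrast between a flat tag and a hierarchical, decomposable tag can be illustrated with a small sketch. The class name HierarchicalTag, the dot-separated rendering and the sample labels below are illustrative assumptions; they are not a format prescribed by CLAWS2, EAGLES or IL-POST.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HierarchicalTag:
    """A decomposable tag: major category, subcategory and attribute values."""
    category: str                      # hypothetical label, e.g. "N" for noun
    subtype: str                       # e.g. "NC" for common noun
    attributes: Tuple[str, ...] = ()   # e.g. ("PL", "GEN")

    def render(self) -> str:
        """Join the levels into one label; the dot separator is an assumption."""
        return ".".join((self.category, self.subtype) + self.attributes)


flat_tag = "NN"  # a flat tag is an atomic label with no internal structure
hier_tag = HierarchicalTag("N", "NC", ("PL", "GEN"))

print(flat_tag)            # NN
print(hier_tag.render())   # N.NC.PL.GEN
print(hier_tag.subtype)    # NC
```

Because each level is stored separately, two words can be compared at the category level only or at the full attribute level, which an atomic tag does not allow.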
• Kh Raju Singha is with the Department of Computer Science, Assam University, Silchar.
• Bipul Syam Purkayastha is with the Department of Computer Science, Assam University, Silchar.
• Kh Dhiren Singha is with the Department of Linguistics, Assam University, Silchar.
• Arindam Roy is with the Department of Computer Science, Assam University, Silchar.
3 MANIPURI LANGUAGE
Manipuri (Meiteilon or Meiteiron) is one of the oldest languages in South-East Asia and has its own script (Meitei Mayek) and literature. At present, Manipuri is written in the Bengali script, a practice that dates from 1709 A.D., i.e., the reign of King Pamheiba. Manipuri is widely spoken in Manipur, Assam, Tripura, Bangladesh and Myanmar, and has been included in the Eighth Schedule of the Indian Constitution since 1992. The total number of people who returned Manipuri as their mother tongue was 1,500,000, of whom 1,466,705 reside in India (Census of India, 2001). It is the first Tibeto-Burman language to have obtained its due place and recognition in the Indian Constitution [6]. Linguistically, it belongs to the Kuki-Chin group of the Tibeto-Burman family of languages [7] and has been influenced and enriched by the Indo-Aryan languages of Sanskrit origin and by English.
© 2011 JCSE http://sites.google.com/site/jcseuk/
4 TYPOLOGICAL FEATURES OF MANIPURI LANGUAGE
1. Manipuri has a total of 32 phonemes, consisting of 6 vowels, 24 consonants and two tones: high and low.
   e.g., í / ঈ ‘blood’, ì / ই ‘hay’.
2. Gender distinction in Manipuri is determined by the natural recognition of sex, i.e., gender is not grammatically marked in this language.
   e.g., nupamaca / nupimaca-du-na isei sak-lam-i
         boy / girl-dist-Nmz. song sing-Evi.-Asp.
         ‘That boy / girl was singing a song.’
3. Number is also not grammatically significant in Manipuri, i.e., there is no subject-predicate agreement as far as number is concerned.
   e.g., nupamaca / nupimaca-sing-na isei sak-lam-i
         boy / girl-Pl.-Nmz. song sing-Evi.-ASP.
         ‘Boys / girls were singing (a) song.’
4. Case relations in Manipuri are expressed by means of postpositions.
   e.g., ai-na ma-hak-pu keithel-da u-ram-i
         I-Nmz. he-hon-Acc. market-Loc. see-Evi.-ASP.
         ‘I saw him in the market.’
5. The words in Manipuri are highly monosyllabic, i.e., even a single vowel can be a syllable, a morpheme or a word in the language.
   e.g., ù ‘tree’, í ‘blood’.
6. Like many other Southeast Asian languages, Manipuri is an agglutinating language with a large number of affixes, particularly prefixes and suffixes.
   e.g., a-pamba-du tau-han-ning-i
         Prefix-like-Nmz. do-Caus.-wish-ASP.
         ‘(I) wish to make you do whatever you like.’
7. In Manipuri, words are mainly formed by morphological processes such as derivation, compounding and reduplication.
   e.g., ca-ba ‘eating’ (bound root-Nmz.)
         cak-sang ‘kitchen’ (rice-hut)
         khun ‘village’, khun-khun ‘villages’
8. The normal order of words in an unmarked sentence is subject-object-verb (SOV).
   e.g., cauba-na laphoi ca-i
         Cauba-Nmz. banana eat-ASP.
         ‘Cauba eats a banana.’
9. Manipuri lacks relative pronouns; the relative clause is expressed by means of a particle.
   e.g., Imphal-dagi lak-pa nupamca-du ai-gi marup-ni.
         Imphal-ablative come-participle boy-Dem. I-Gen. friend-Cop.
         ‘The boy who came from Imphal is my friend.’
10. Negation is mainly formed by affixation, specifically by suffixation.
   e.g., ai ca thak-te
         I tea drink-Neg.
         ‘I do not drink tea.’
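The morpheme-by-morpheme glosses in the examples above pair each hyphen-separated morpheme with a gloss label. A minimal sketch of that alignment follows. The function name align_gloss is hypothetical, and the sketch assumes that words are separated by spaces and morphemes by hyphens, with the same number of pieces on both lines.

```python
from typing import List, Tuple


def align_gloss(form_line: str, gloss_line: str) -> List[List[Tuple[str, str]]]:
    """Pair each hyphen-separated morpheme with its gloss label, word by word."""
    forms = form_line.split()
    glosses = gloss_line.split()
    if len(forms) != len(glosses):
        raise ValueError("form and gloss lines have different word counts")
    aligned = []
    for form, gloss in zip(forms, glosses):
        morphemes = form.split("-")
        labels = gloss.split("-")
        if len(morphemes) != len(labels):
            raise ValueError(f"cannot align {form!r} with {gloss!r}")
        aligned.append(list(zip(morphemes, labels)))
    return aligned


# Example 4 from the list above, with the gloss spacing normalised.
pairs = align_gloss(
    "ai-na ma-hak-pu keithel-da u-ram-i",
    "I-Nmz he-hon-Acc market-Loc see-Evi-ASP",
)
for word in pairs:
    print(word)
```

Each inner list holds the (morpheme, gloss) pairs of one word, so the affix chain of an agglutinated word can be inspected one suffix at a time.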
5 WORD CLASSES IN MANIPURI
It is very difficult to determine word classes in Manipuri in the way they are determined in English, Hindi, Russian, Tamil, etc., except for time-stable nouns like চীঙ/ching/mountain, ঈশিং/ising/water, নুমি/numit/sun, etc. Word classes in Manipuri are mainly determined by the corresponding affixes. Some bound roots can be formed into nouns, verbs, adjectives and adverbs by morphological processes such as affixation, as shown below with examples in the underlying representation.
Nouns
1. Formation of a noun by prefixation to the bound root: prefix+bound root→noun
   e.g., খু+চেন→খুচেন /‘the way of running’
2. Formation of a noun by suffixation to the bound root: bound root+suffix→noun
   e.g., চেন+বা→ /‘running’
3. Formation of a noun by combination of two nouns: noun+noun→noun
   e.g., চাক/‘rice’+শঙ/‘hut’→ /‘kitchen’
4. Formation of a noun by combination of a noun and a bound root: noun+bound root→noun
   e.g., থোং+নাউ→ /‘window’
Verbs
Formation of a verb by suffixation to the bound root: bound root+suffix (tense/aspect/evidential marker)→verb
   e.g., চেন+লি→ /‘running’
Adjectives
Formation of an adjective by prefixation and suffixation to the bound root: prefix+bound root+suffix→adjective
   e.g., অ+চেন+বা→ /‘running (adj)’
         অ+ঙাং+বা→ /‘red (adj)’
Apart from the above, some adjectives, particularly colour terms, are free morphemes, as exemplified below: হীগোক/‘blue’, নাপু/‘yellow’, লৈঙাং/‘saffron’, etc.
Adverbs
Formation of an adverb by suffixation to the bound root: bound root+suffix (case marker)→adverb
   e.g., তপ+না→ /‘slowly (adv)’
In addition to the above classes of words, there are some other word classes, viz., pronouns, quantifiers, specifiers, demonstratives, participles, punctuation and residuals, which are morphosyntactically determined without any complexity in Manipuri. The root itself can be considered a word class in Manipuri. We therefore categorize 13 major word classes in Manipuri, viz., root, noun, pronoun, verb, adjective, quantifier, specifier, adverb, demonstrative, particle, participle, punctuation and residual.
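The affixation patterns listed in this section can be read as a small rule table. The sketch below encodes them under stated assumptions: the function name word_class and the suffix-kind labels ("nominalizer", "tam", "case") are hypothetical names introduced for illustration, and the rules cover only the bound-root patterns given above, not compounding.

```python
from typing import Optional


def word_class(has_prefix: bool, suffix_kind: Optional[str]) -> str:
    """Return the word class of a bound root from the affixes attached to it.

    suffix_kind is None (no suffix), "nominalizer", "tam"
    (tense/aspect/evidential marker) or "case" (case marker).
    These labels are assumptions made for this sketch.
    """
    if has_prefix and suffix_kind is None:
        return "noun"        # prefix + bound root, e.g. খু+চেন
    if has_prefix and suffix_kind == "nominalizer":
        return "adjective"   # prefix + bound root + suffix, e.g. অ+চেন+বা
    if not has_prefix and suffix_kind == "nominalizer":
        return "noun"        # bound root + suffix, e.g. চেন+বা
    if not has_prefix and suffix_kind == "tam":
        return "verb"        # bound root + tense/aspect/evidential marker, e.g. চেন+লি
    if not has_prefix and suffix_kind == "case":
        return "adverb"      # bound root + case marker, e.g. তপ+না
    return "root"            # no rule applies: treat the item as a bare root


print(word_class(True, None))
print(word_class(True, "nominalizer"))
print(word_class(False, "tam"))
```

The same bound root (চেন in the examples) ends up in different classes depending only on its affixes, which is why the affixes rather than the root drive the classification.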
6 PROPOSED MORPHOSYNTACTIC TAGSET FOR MANIPURI
This section presents the procedure used to design a tagset for the Manipuri language. The tagset contains 97 tags in total, covering generic attributes and language-specific attribute values. The proposed tagset is based on the IL-POST framework. It has been customized for Manipuri to meet the morphosyntactic requirements of the language, in accordance with the language-specific and writing conventions followed in Manipuri. The tagset has two tables. Table 1 contains 29 subcategory tags for 13 major categories: 2 tags for root, 5 for noun, 6 for pronoun, 1 for verb, 1 for adjective, 1 for quantifier, 1 for specifier, 1 for adverb, 2 for demonstrative, 3 for participle, 3 for particle, 1 for punctuation and 2 for residual. Table 2 contains the morphosyntactic features or attributes of the subcategories. There are 32 attributes having 68 attribute value tags.

MANIPURI TAGSET : TABLE 1
Tag     Description               Tag     Description
NC      Common noun               DAB     Absolute demonstrative
NCP     Compound noun             DWH     Wh-demonstrative
NP      Proper noun               RB      Adverb
ND      Derived noun              PLRL    Relative participle
NST     Spatio-Temporal           PLV     Verbal participle
PPN     Personal pronoun          PLC     Conditional participle
PPS     Possessive pronoun        RPCD    Co-ordinating particle
PDM     Demonstrative pronoun     RPSB    Subordinating particle
PRF     Reflexive pronoun         RPINT   Interjection particle
PRC     Reciprocal pronoun        RDF     Foreign word residual
PIN     Interrogative pronoun     RDS     Symbol residual
V       Verb                      RTB     Bound Root
JJ      Adjective                 RTF     Free Root
QT      Quantifier                PUN     Punctuation
SPEC    Specifier

MANIPURI TAGSET : TABLE 2
Tag     Description               Tag     Description
ACC     Accusative                REFL    Reflexive
INS     Instrumental              RECI    Reciprocal
DAT     Dative                    PURP    Purposive
GEN     Genitive                  COMM    Commutative
ABL     Ablative                  COP     Copula
SOC     Sociative                 SURP    Surprise
LOC     Locative                  DUB     Dubitative
NMZ     Nominalizer               CONF    Confirmation
SIM     Similaritive              CMPL    Complaint
PRG     Progressive               INSIS   Insistent
PRF     Perfective                DTRB    Distributive
PROS    Prospective               DEF     Definiteness
INC     Inceptive                 EMPH    Emphatic
HAB     Habitual                  NEG     Negative
DEC     Declarative               PRX     Proximal
SUP     Supplicative              DST     Distal
PROH    Prohibitive               INL     Inclusive
IMP     Imperative                EXL     Exclusive
PERM    Permissive                HON     Honorificity
OPT     Optative                  ORD     Ordinal
ALL     Allative                  EXAS    Exasperation
APP     Approximate               PERSU   Persuasion
INT     Interrogative             CRD     Cardinal
POT     Potential                 NMN     Non-numeral
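The tag counts stated in Section 6 can be checked with a few lines of arithmetic. The dictionary below simply restates the per-category figures given in the text; the variable names are illustrative and not part of the tagset.

```python
# Number of subcategory tags per major category, as stated in the text.
table1_counts = {
    "root": 2, "noun": 5, "pronoun": 6, "verb": 1, "adjective": 1,
    "quantifier": 1, "specifier": 1, "adverb": 1, "demonstrative": 2,
    "participle": 3, "particle": 3, "punctuation": 1, "residual": 2,
}
attribute_value_tags = 68  # attribute value tags in Table 2, as stated in the text

table1_total = sum(table1_counts.values())
tagset_total = table1_total + attribute_value_tags

print(len(table1_counts))  # number of major categories listed
print(table1_total)        # 29
print(tagset_total)        # 97
```

The two totals agree with the 29 subcategory tags and the 97 tags overall reported for the tagset.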
6.1 Examples of Manipuri Text Using the Tagset
The following real text is tagged according to the proposed tagset for the Manipuri language:
Sample Text: লাইরিক তমবা হায়বসি মীওইবগী পুন্সিগি য়াম্না মরু ওইবা থৌদাং অমনি। চাউখলবা ঙসিগি মতম অসিদা লাইরিক হৈত্রবদি মি তাংবগা চপ মান্নৈ হাইনরিবনি।
Tagged Text: লাইরিক\NC তমবা\ND হায়ব\ND সি\PRX মীওইব\NC গী\GEN পুন্সি\NC গি\GEN য়ামনা\SPEC মরু\NC ওইবা\ND থৌদাং\NC অম\CRD নি\COP ।\T চাউ\RTB খ\UP ল\PROS বা\NMZ ঙসি\NST গি\GEN মতম\NC অসি\PDM দা\LOC লাইরিক\NC হৈ\RTB ত্র\NEG ব\NMZ দি\DEF মি\NC তাংব\ND গা\SOC চপ\QNT মান্নৈ\V হাই\RTB ন\PURP রি\PRG ব\NMZ নি\COP ।\T
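The tagged text above writes each item as a token, a backslash and a tag, with items separated by spaces. A minimal sketch for reading that format back into (token, tag) pairs follows. The function name parse_tagged is hypothetical, and the sketch assumes that the last backslash in each item separates the token from its tag.

```python
from collections import Counter
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'token\\TAG' items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        token, sep, tag = item.rpartition("\\")
        if not sep:
            raise ValueError(f"no tag separator in {item!r}")
        pairs.append((token, tag))
    return pairs


# The first four items of the tagged text above (a raw string keeps the backslashes).
sample = r"লাইরিক\NC তমবা\ND হায়ব\ND সি\PRX"
pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

print([tag for _, tag in pairs])
print(tag_counts["ND"])
```

The resulting tags could then be compared with Tables 1 and 2, for example to flag any label in an annotated text that is not part of the tagset.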
MANIPURI TAGSET : TABLE 2 (continued)
Tag     Description               Tag     Description
MAS     Masculine gender          OBL     Obligation
FEM     Feminine gender           VOL     Volition
SG      Singular number           EVI     Evidential
DU      Dual number               CERT    Certainty
PL      Plural number             UP      Upward
1       First person              DOWN    Downward
2       Second person             IN      Inward
3       Third person              NPOT    Non-potential
ERG     Ergative                  OUT     Outward
NOM     Nominative                CAUS    Causative

7 CONCLUSION AND FUTURE WORK
In the present paper, a morphosyntactic tagset for the Manipuri language has been proposed as part of the larger goal of computer processing of the Manipuri language. Future work is to design and test an algorithm for part of speech tagging in Manipuri. This process will include manual tagging of Manipuri text, the creation of a lexicon and the definition of a set of hand-coded rules to obtain a single part of speech for each word. The authors are also interested in updating the tagset to include multi-word expressions, reduplication and named-entity lexical items of the Manipuri language.

REFERENCES
[1] Sandipan Dandapat, Sudeshna Sarkar and Anupam Basu, “A Hybrid Model for Part-of-Speech Tagging and its Application to Bengali”, Transactions on Engineering, Computing and Technology, V1, December 2004, ISSN 1305-5313.
[2] Hardie, A., “The Computational Analysis of Morphosyntactic Categories in Urdu”, PhD Thesis submitted to Lancaster University, 2004.
[3] Santorini, B., “Part-of-speech tagging guidelines for the Penn Treebank Project”, Technical report MS-CIS-90-47, Department of Computer and Information, 1990.
[4] Baskaran S. et al., “Framework for a Common Parts-of-Speech Tagset for Indic Languages (Draft)”, http://research.microsoft.com/~baskaran/POSTagset/, 2007.
[5] Thoudam Doren Singh and Sivaji Bandyopadhyay, “Morphology Driven Manipuri POS Tagger”, Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 91–98, Hyderabad, India, January 2008.
[6] Kh. Dhiren Singha, “Loan Words in Manipuri”, Bilingualism and North-East India, an Assam University Publication, 2008.
[7] Grierson, G.A. (ed.) (1903-28). Linguistic Survey of India. Vol. III, Pt. III (reprinted 1967-68). Delhi-Varanasi: Motilal Banarsidas.
[8] IIIT-tagset. A Parts-of-Speech tagset for Indian languages. http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf.
[9] S. Imoba, “Manipuri to English Dictionary”, S. Ibetombi Devi, Imphal, 2004.
[10] Ch. Yashawanta Singh, “Manipuri Grammar”, Rajesh Publications, New Delhi, 2000.
[11] Leech, G. and Wilson, A., “Recommendations for the Morphosyntactic Annotation of Corpora”, EAGLES Report EAG-TCWGMAC/R, 1996.
[12] Eric Brill, “A simple rule-based part of speech tagger”, in Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy, 1992.
[13] D. S. Thoudam and S. Bandyopadhyay, “Word Class Sentence Type Identification in Manipuri Morphological Analyzer”, in Proceedings of MSPIL, IIT Bombay, pp 11-17, 2006.
[14] P.C. Thoudam, “Problems in the Analysis of Manipuri Language”, www.ciil-ebooks.net, CIIL, Mysore, 2006.
[15] HSK _ Corpus Linguistics, “Development of tag sets or part-of-speech tagging”, MILES, Release 18.02x on Tuesday January 22 18:53:50 BST, 2008.
[16] Ihsan Rabbi et al., “Developing a Tagset for Pashto Part of Speech Tagging”, Second International Conference on Electrical Engineering, 25-26 March, 2008.
[17] John Fry, “PART-OF-SPEECH TAGGED CORPORA”, Linguistics 115: Corpus Linguistics, Fall 2007, SJSU.
Kh Raju Singha is a Ph.D. student in the Department of Computer Science, Assam University, Silchar.
Bipul Syam Purkayastha is working as a Professor in the Department of Computer Science, Assam University, Silchar. He is a member of IEEE and ACM journal.
Kh Dhiren Singha is working as an Associate Professor in the Department of Linguistics, Assam University, Silchar. He is a member of the Linguistic Society of India and International journal of Dravidian Linguistics.
Arindam Roy is working as an Assistant Professor in the Department of Computer Science, Assam University, Silchar.