A high recall error identification tool for Hindi Treebank Validation

Bharat Ram Ambati, Mridul Gupta, Samar Husain, Dipti Misra Sharma
Language Technologies Research Centre, International Institute of Information Technology, Hyderabad, INDIA 500 032
Email: [email protected], [email protected], {samar, dipti}@mail.iiit.ac.in

Introduction

Treebank: one of the most important linguistic resources.
Utility in various NLP tasks such as parsing, natural language understanding.
Linguistic information encoded at different levels such as morphological, syntactic, syntactico-semantic (dependency). Linguistic annotation is done at these levels in the treebank.

Motivation

Treebank annotation: a time-consuming and expensive task.
Error-free annotated data required.
Validation of data and error correction are important steps to achieve a high quality standard.
Hence, the manual task of data validation should be made expeditious without compromising quality.

Hindi Dependency Treebank (HDT)

A multi-layered and multi-representational treebank for Hindi currently under development.
Treebank to have dependency representation (DS), verb-argument (PropBank) and phrase structure (PS) representation.
Automatic conversion from DS to PS planned.
Hence, important to have a high quality version of the dependency treebank.

Approach

Error detection handled at three levels: Part-of-speech → Chunk → Dependency
Two approaches: Rule Based Approach and Hybrid Approach.

Rule Based Approach

Annotation guidelines for Indian languages used to formulate rules.
Analyze mismatches between annotated and manually validated data from the development data set.
Analysis of mismatches to frame additional rules.
Nature of the rules varies for different types of annotation.
POS level error detection: [content not preserved in transcript]
Dependency level error detection: [content not preserved in transcript]

Hybrid Approach

Statistical Module → Rule based post-processing step

Statistical Module:
Calculate frequencies of Pattern – Tag pair
if freq < threshold:
    Possible Error
else:
    if pattern has multiple tags and freq < threshold2% of total instances of pattern:
        Possible Error

Example:
eka ladake ne khaana khaayaa
one boy ‘ERG’ food ate
“A boy ate the food”
[NP:: eka/DEM ladake/NN ne/PSP] [NP:: khaana/NN] [VGF:: khaayaa/VM]

Tag | Original Pattern | Pattern after similarity criterion
NN (POS) | ladake | ladake
NP (Chunk) | NP::eka/DEM--ladake/NN--ne/PSP | NP::NOUN--ne/PSP
k1 (Inter-chunk dependencies) | NP::eka/DEM--ladake/NN--ne/PSP → VGF::khaayaa/VM | NP::NOUN--ne/PSP → VGF::VM

Results

Data set | No. of words
Training | 40k
Development | 10k
Testing | 15k
Table 1: Data set for experiments

Type of Error | Total instances | Total Errors | Recall
POS Errors | 13922 | 16 | 12/16 = 75%
Chunk Errors | 7113 | 24 | 15/24 = 62.5%
Dependency Errors | 7113 | 843 | 218/843 = 25.86%
Table 2: Error Detection using rule-based system at different levels

Approach | Total Errors | System output | Correct Errors | Recall
Rule Based Approach | 843 | 218 | 218 | 25.86%
Hybrid Approach | 843 | 2546 | 158 | 18.74%
Combining both Approaches | 843 | 2728 | 340 | 40.33%
Table 3: Recall of error detection at dependency level using different approaches

Conclusions

Proposed a tool which uses both rule-based and hybrid systems to detect annotation errors.
An attempt to account for problems arising due to data sparsity. Richer context required to handle data sparsity.
Tool also aimed at improving annotation guidelines, which subsequently improves annotation quality.
Data annotated based on improved guidelines to reduce the occurrence of errors. Also, improve consistency among annotators.
[Figure: the cycle of the whole treebanking process]

Future Work

The tool is constantly being improved.
More robust rules being devised.
Improving statistical module to handle data sparsity problems.
Need to explore a probability based hybrid approach to provide richer context.
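The thresholding step of the statistical module (Hybrid Approach) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `flag_possible_errors`, the default values of `threshold` and `threshold2`, and the toy data are all hypothetical, and "freq < threshold2% of total instances of pattern" is read here as `freq < threshold2 / 100 * total`. Patterns are assumed to be already generalized by the similarity criterion.

```python
from collections import Counter, defaultdict


def flag_possible_errors(instances, threshold=2, threshold2=10.0):
    """Return the set of (pattern, tag) pairs flagged as possible errors.

    instances  : iterable of (pattern, tag) pairs from annotated data
    threshold  : absolute frequency cut-off (assumed value)
    threshold2 : percentage of the pattern's total instances (assumed value)
    """
    # Frequency of every Pattern-Tag pair.
    pair_freq = Counter(instances)

    # Total instances and distinct tags seen for each pattern.
    pattern_total = Counter()
    tags_of_pattern = defaultdict(set)
    for (pattern, tag), freq in pair_freq.items():
        pattern_total[pattern] += freq
        tags_of_pattern[pattern].add(tag)

    flagged = set()
    for (pattern, tag), freq in pair_freq.items():
        if freq < threshold:
            # Rare pair: possible error.
            flagged.add((pattern, tag))
        elif (len(tags_of_pattern[pattern]) > 1
              and freq < threshold2 / 100.0 * pattern_total[pattern]):
            # Pattern has several tags and this tag is a small minority.
            flagged.add((pattern, tag))
    return flagged


# Toy data (invented for illustration only).
NE_PATTERN = "NP::NOUN--ne/PSP -> VGF::VM"
KO_PATTERN = "NP::NOUN--ko/PSP -> VGF::VM"
instances = (
    [(NE_PATTERN, "k1")] * 30
    + [(NE_PATTERN, "k2")] * 2
    + [(KO_PATTERN, "k4")] * 1
)
flagged = flag_possible_errors(instances)
```

With these toy counts, the pair seen only once is flagged by the first test, and the minority tag of the multi-tag pattern is flagged by the second test. Everything flagged is only a candidate for the rule based post-processing step and for manual validation.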