A high recall error identification tool for Hindi Treebank Validation

Bharat Ram Ambati, Mridul Gupta, Samar Husain, Dipti Misra Sharma
Language Technologies Research Centre, International Institute of Information Technology, Hyderabad, INDIA - 500 032
Email: [email protected], [email protected], {samar, dipti}@mail.iiit.ac.in

Introduction 

Treebank - One of the most important linguistic resources.

Rule Based Approach (Cont…) 

Results

POS level error detection:

No. of words







Utility in various NLP tasks such as parsing and natural language understanding.

Type of Error: POS Errors, Chunk Errors, Dependency Errors

Linguistic information encoded at different levels such as morphological, syntactic, syntactico-semantic (dependency). Linguistic annotation is done at these levels in the treebank.







Error-free annotated data required.











Dependency level error detection:



Hence, the manual task of data validation should be made expeditious without compromising quality.

Automatic conversion from DS to PS planned.





Approach



Statistical Module → Rule Based post-processing

 

Statistical Module
Calculate frequencies of each Pattern–Tag pair
if freq < threshold:



Hence, important to have a high quality version of the dependency treebank.

Rule Based Approach



Hybrid Approach

Statistical Module → Rule based post-processing step

Error detection handled at three levels: part-of-speech, chunk, and dependency.









Annotation guidelines for Indian languages used to formulate rules.

Mismatches between the annotated and the manually validated data in the development set are analyzed to frame additional rules. The nature of the rules varies with the type of annotation.
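A minimal sketch of how guideline-derived rules could be applied at the chunk level. The two rules below (`np_has_nominal`, `vgf_has_main_verb`) and the tag sets are hypothetical illustrations, not rules taken from the annotation guidelines or from the tool itself.

```python
# Hypothetical chunk-level rules; each returns True when the chunk passes.
NOMINAL_TAGS = {"NN", "NNP", "PRP"}  # assumed set of noun/pronoun POS tags


def np_has_nominal(chunk_label, pos_tags):
    """Illustrative rule: an NP chunk should contain a noun or pronoun."""
    return chunk_label != "NP" or any(t in NOMINAL_TAGS for t in pos_tags)


def vgf_has_main_verb(chunk_label, pos_tags):
    """Illustrative rule: a VGF chunk should contain a main verb (VM)."""
    return chunk_label != "VGF" or "VM" in pos_tags


RULES = [np_has_nominal, vgf_has_main_verb]


def check_chunk(chunk_label, pos_tags):
    """Return the names of all rules that the chunk violates."""
    return [rule.__name__ for rule in RULES if not rule(chunk_label, pos_tags)]


# A chunk with no violations yields an empty list; others name the failed rule.
print(check_chunk("NP", ["DEM", "NN", "PSP"]))
print(check_chunk("NP", ["PSP"]))
print(check_chunk("VGF", ["VAUX"]))
```

Keeping each rule as a separate predicate makes it easy to add rules framed from newly observed mismatches without touching the checking loop.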

Type of Error        Total instances   Total Errors   Recall
POS Errors           13922             16             12/16 = 75%
Chunk Errors         7113              24             15/24 = 62.5%
Dependency Errors    7113              843            218/843 = 25.86%

Approach                    Total Errors   System output   Correct Errors   Recall
Rule Based Approach         843            218             218              25.86%
Hybrid Approach             843            2546            158              18.74%
Combining both Approaches   843            2728            340              40.33%

Table 3: Recall of error detection at dependency level using different approaches
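The recall figures in Table 3 follow from the counts in the table as correct errors divided by total errors. The sketch below recomputes them; the function names are illustrative, and the precision values it prints are derived here from the same counts and are not reported on the poster.

```python
def recall(correct, total_errors):
    """Percentage of the actual errors that the system detected."""
    return 100.0 * correct / total_errors


def precision(correct, system_output):
    """Percentage of flagged instances that are actual errors (derived, not reported)."""
    return 100.0 * correct / system_output


# (total errors, system output, correct errors) as listed in Table 3
table3 = {
    "Rule Based Approach": (843, 218, 218),
    "Hybrid Approach": (843, 2546, 158),
    "Combining both Approaches": (843, 2728, 340),
}

for approach, (total, output, correct) in table3.items():
    print(f"{approach}: recall {recall(correct, total):.2f}%, "
          f"precision {precision(correct, output):.2f}%")
```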

Proposed a tool which uses both rule-based and hybrid systems to detect annotation errors.

An attempt to account for problems arising due to data sparsity. Richer context required to handle data sparsity. Tool also aimed at improving annotation guidelines, which subsequently improves annotation quality.



Data annotated based on improved guidelines to reduce the occurrence of errors.

Also improves consistency among annotators. The figure below shows the cycle of the whole treebanking process.

else:
    if pattern has multiple tags and freq < threshold2% of total instances of pattern:
        Possible Error
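The frequency test of the statistical module can be sketched as follows: a Pattern–Tag pair is a possible error if it is rarer than an absolute threshold, or if its pattern takes several tags and the pair accounts for less than threshold2 percent of that pattern's instances. The function name, the default threshold values, and the toy observations are assumptions made for illustration only.

```python
from collections import Counter


def flag_possible_errors(pairs, threshold=2, threshold2=10.0):
    """Return the set of (pattern, tag) pairs flagged as possible errors.

    threshold  : absolute frequency cutoff (assumed value)
    threshold2 : percentage of all instances of the pattern (assumed value)
    """
    pairs = list(pairs)
    pair_freq = Counter(pairs)                    # frequency of each pattern-tag pair
    pattern_freq = Counter(p for p, _ in pairs)   # total instances of each pattern
    tags_per_pattern = {}
    for pattern, tag in pair_freq:
        tags_per_pattern.setdefault(pattern, set()).add(tag)

    flagged = set()
    for (pattern, tag), freq in pair_freq.items():
        if freq < threshold:
            flagged.add((pattern, tag))
        elif (len(tags_per_pattern[pattern]) > 1
              and freq < threshold2 / 100.0 * pattern_freq[pattern]):
            flagged.add((pattern, tag))
    return flagged


# Invented toy data: one dominant tag, one rare competing tag, one singleton.
observations = (
    [("NP::NOUN--ne/PSP", "k1")] * 40
    + [("NP::NOUN--ne/PSP", "k2")] * 3
    + [("NP::NOUN--ko/PSP", "k4")] * 1
)
flagged = flag_possible_errors(observations)
print(sorted(flagged))
```

With these assumed thresholds the singleton pair fails the absolute test, and the rare competing tag fails the relative test because 3 is below 10% of the 43 instances of its pattern.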

Future Work

eka ladake ne khaana khaayaa
one boy ‘ERG’ food ate
“A boy ate the food”
[NP:: eka/DEM ladake/NN ne/PSP] [NP:: khaana/NN] [VGF:: khaayaa/VM]

Tag

Original Pattern



Rule Based Approach

15k

Possible Error

Approach 

10k

Hybrid Approach







Treebank annotation – A time-consuming and expensive task.

Treebank to have dependency representation (DS), verb-argument (PropBank) and phrase structure (PS) representation.

40k

Conclusions



A multi-layered and multi-representational treebank for Hindi currently under development.

Testing

Table 2: Error Detection using rule-based system at different levels

Validation of data and error correction important steps to achieve a high quality standard.

Hindi Dependency Treebank (HDT)

Development

Table 1: Data set for experiments

Motivation 

Training

Tag                             Original Pattern                                    Pattern after similarity criterion
NN (POS)                        ladake                                              ladake
NP (Chunk)                      NP::eka/DEM--ladake/NN--ne/PSP                      NP::NOUN--ne/PSP
k1 (Inter-chunk dependencies)   NP::eka/DEM--ladake/NN--ne/PSP → VGF::khaayaa/VM    NP::NOUN--ne/PSP → VGF::VM
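One way to read the similarity criterion from the example: modifiers are dropped, nouns are reduced to a coarse class, postpositions stay lexicalized, and other words keep only their tag. The sketch below reproduces the two generalized patterns of the example under that reading; the tag sets and function names are assumptions, and the actual criterion used by the tool may differ.

```python
# Assumed tag sets, inferred only from the single example on the poster.
MODIFIER_TAGS = {"DEM", "JJ", "QC"}    # dropped from the pattern
LEXICALIZED_TAGS = {"PSP"}             # kept together with the word form
COARSE_CLASS = {"NN": "NOUN", "NNP": "NOUN", "PRP": "NOUN"}


def generalize_chunk(chunk):
    """Generalize a pattern such as 'NP::eka/DEM--ladake/NN--ne/PSP'."""
    label, _, body = chunk.partition("::")
    parts = []
    for item in body.split("--"):
        word, _, tag = item.rpartition("/")
        if tag in MODIFIER_TAGS:
            continue                                 # drop modifiers
        if tag in LEXICALIZED_TAGS:
            parts.append(f"{word}/{tag}")            # keep postposition as is
        else:
            parts.append(COARSE_CLASS.get(tag, tag)) # keep class or bare tag
    return f"{label}::{'--'.join(parts)}"


def generalize_dependency(child, parent):
    """Generalize both ends of an inter-chunk dependency pattern."""
    return f"{generalize_chunk(child)} -> {generalize_chunk(parent)}"


print(generalize_chunk("NP::eka/DEM--ladake/NN--ne/PSP"))
print(generalize_dependency("NP::eka/DEM--ladake/NN--ne/PSP", "VGF::khaayaa/VM"))
```

Collapsing patterns this way pools the counts of many surface variants, which is how a generalized pattern can ease the data sparsity that the frequency test suffers from.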







The tool is constantly being improved. More robust rules being devised. Improving statistical module to handle data sparsity problems. Need to explore a probability based hybrid approach to provide richer context.
