A high recall error identification tool for Hindi Treebank Validation
Bharat Ram Ambati, Mridul Gupta, Samar Husain, Dipti Misra Sharma
Language Technologies Research Centre, International Institute of Information Technology, Hyderabad, INDIA - 500 032
Email: [email protected], [email protected], {samar, dipti}@mail.iiit.ac.in
Introduction
Treebank - One of the most important linguistic resources.
Utility in various NLP tasks such as parsing and natural language understanding.
Linguistic information encoded at different levels such as morphological, syntactic, syntactico-semantic (dependency). Linguistic annotation is done at these levels in the treebank.

Hindi Dependency Treebank (HDT)
A multi-layered and multi-representational treebank for Hindi currently under development.
Treebank to have dependency representation (DS), verb-argument (PropBank) and phrase structure (PS) representation.
Automatic conversion from DS to PS planned.
Hence, important to have a high quality version of the dependency treebank.

Motivation
Treebank annotation - A time-consuming and expensive task.
Error-free annotated data required.
Validation of data and error correction are important steps to achieve a high quality standard.
Hence, the manual task of data validation should be made expeditious without compromising quality.

Approach
Error detection handled at three levels: Part-of-speech, Chunk, Dependency.
POS level error detection: Rule Based Approach
Dependency level error detection: Rule Based Approach, Hybrid Approach

Rule Based Approach
Annotation guidelines for Indian languages used to formulate rules.
Analyze mismatches in annotated and manually validated data from the development data set.
Analysis of mismatches to frame additional rules.
Nature of the rules varies for different types of annotation.

Hybrid Approach
Statistical Module
Rule based post-processing step

Statistical Module
Calculate frequencies of Pattern-Tag pairs.
if freq < threshold:
    Possible Error
else:
    if pattern has multiple tags and freq < threshold2 % of total instances of pattern:
        Possible Error

Example:
eka ladake ne khaana khaayaa
one boy 'ERG' food ate
"A boy ate the food"
[NP:: eka/DEM ladake/NN ne/PSP] [NP:: khaana/NN] [VGF:: khaayaa/VM]

Tag | Original Pattern | Pattern after similarity criterion
NN (POS) | ladake | ladake
NP (Chunk) | NP::eka/DEM--ladake/NN--ne/PSP | NP::NOUN--ne/PSP
k1 (Inter-chunk dependencies) | NP::eka/DEM--ladake/NN--ne/PSP VGF::khaayaa/VM | NP::NOUN--ne/PSP VGF::VM

Results
Table 1: Data set for experiments
Data set | No. of words
Training | 40k
Development | 10k
Testing | 15k

Table 2: Error Detection using rule-based system at different levels
Type of Error | Total instances | Total Errors | Recall
POS Errors | 13922 | 16 | 12/16 = 75%
Chunk Errors | 7113 | 24 | 15/24 = 62.5%
Dependency Errors | 7113 | 843 | 218/843 = 25.86%

Table 3: Recall of error detection at dependency level using different approaches
Approach | Total Errors | System output | Correct Errors | Recall
Rule Based Approach | 843 | 218 | 218 | 25.86%
Hybrid Approach | 843 | 2546 | 158 | 18.74%
Combining both Approaches | 843 | 2728 | 340 | 40.33%

Conclusions
Proposed a tool which uses both rule-based and hybrid systems to detect annotation errors.
An attempt to account for problems arising due to data sparsity. Richer context required to handle data sparsity.
Tool also aimed at improving annotation guidelines, which subsequently improves annotation quality.
Data annotated based on improved guidelines to reduce the occurrence of errors.
Also improves consistency among annotators.
The figure below shows the cycle of the whole treebanking process.

Future Work
The tool is constantly being improved.
More robust rules being devised.
Improving statistical module to handle data sparsity problems.
Need to explore a probability based hybrid approach to provide richer context.
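
The frequency test used by the statistical module can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the function name `flag_possible_errors`, the parameter names, and the default values of `threshold` and `threshold2_pct` are assumptions chosen for illustration, since the poster does not state the thresholds. The patterns are assumed to have already been generalized by the similarity criterion, and the `k2` label and all counts are made-up toy data.

```python
from collections import Counter


def flag_possible_errors(pairs, threshold=2, threshold2_pct=5.0):
    """Flag (pattern, tag) pairs as possible annotation errors.

    pairs: list of (pattern, tag) tuples, one per annotated instance.
    threshold: absolute frequency below which a pair is flagged (assumed value).
    threshold2_pct: percentage of a pattern's total instances below which a
        pair is flagged when the pattern occurs with several tags (assumed value).
    """
    pair_freq = Counter(pairs)
    pattern_freq = Counter(pattern for pattern, _ in pairs)

    # Collect the distinct tags observed with each pattern.
    tags_per_pattern = {}
    for pattern, tag in pair_freq:
        tags_per_pattern.setdefault(pattern, set()).add(tag)

    flagged = set()
    for (pattern, tag), freq in pair_freq.items():
        if freq < threshold:
            # The pair is rare in absolute terms.
            flagged.add((pattern, tag))
        elif (len(tags_per_pattern[pattern]) > 1
              and freq * 100 < threshold2_pct * pattern_freq[pattern]):
            # The pattern is ambiguous and this tag is rare relative to it.
            flagged.add((pattern, tag))
    return flagged


# Toy data: counts are invented for illustration only.
sample = (
    [("NP::NOUN--ne/PSP VGF::VM", "k1")] * 97
    + [("NP::NOUN--ne/PSP VGF::VM", "k2")] * 3
    + [("NP::NOUN VGF::VM", "k2")] * 1
)
flagged = flag_possible_errors(sample)
print(sorted(flagged))
```

With these toy counts, the `k2` pair on the first pattern is flagged by the relative test (3 of 100 instances is below 5%), and the single-instance pair is flagged by the absolute test. Flagged pairs are only candidates; in the hybrid approach they would still pass through rule based post-processing and manual validation.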