SQAD: Simple Question Answering Database

Marek Medveď and Aleš Horák

Natural Language Processing Centre
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
{xmedved1,hales}@fi.muni.cz

Abstract. In this paper, we present a new free resource for comparable Czech question answering evaluation. The Simple Question Answering Database, SQAD, contains 3,301 questions and answers extracted and processed from the Czech Wikipedia. The SQAD database was prepared with the aim of precise evaluation of automatic question answering systems; such a resource was not previously available for the Czech language. We describe the process of the SQAD creation, the processing of the texts by automatic tokenization (Unitok) and morphological disambiguation (Desamb), and the successive semi-automatic cleaning and post-processing. We also show the results of the first version of a Czech question answering system named SBQA (syntax-based question answering).

Keywords: question answering, Simple Question Answering Database, SQAD, syntax-based question answering, SBQA

1 Introduction

The question answering (QA) field has significant potential nowadays. Systems that are capable of answering any possible question can be widely used by the general public to answer questions about the weather, public transport, etc. In specialized applications, question answering can be used in medicine to speed up the process of fitting a patient's symptoms to all possible illnesses. Recently, several promising question answering systems have appeared, such as IBM's Watson [1,2], which had great success in the popular television quiz show Jeopardy!¹ Many QA systems are oriented towards a specific domain and are able to answer only simple questions [3,4,5,6]. Question answering systems employ tools that process the input question, then go through a knowledge base and provide a reasonable answer to the question. The presented SQAD database will help to measure and improve the accuracy of QA tools, as it offers all the relevant processing parts, i.e. the source text, the question and the expected answer.

¹ Jeopardy! is an American television show that features a quiz competition in which contestants are presented with general knowledge clues in the form of answers, and must phrase their responses in question form.


Original text: Létající jaguár je novela spisovatele Josefa Formánka z roku 2004.
[Létající jaguár is a novel by the writer Josef Formánek from 2004.]
Question: Kdo je autorem novely Létající jaguár?
[Who is the author of the novel Létající jaguár (Flying Jaguar)?]
Answer: Josef Formánek
URL: http://cs.wikipedia.org/wiki/L%C3%A9taj%C3%ADc%C3%AD_jagu%C3%A1r
Author: chalupnikova

Fig. 1: Example of a SQAD record

2 The SQAD Database

The Simple Question Answering Database, SQAD, was created by students in a computational linguistics course. Their task was to choose sections from Czech Wikipedia articles as a source for different questions and the respective answers. SQAD contains 3,301 records with the following data fields (see Figure 1 for an example; a minimal sketch of one record as a data structure follows the list):

– the original sentence(s) from Wikipedia
– a question that is directly answered in the text
– the expected answer to the question as it appears in the original text
– the URL of the Wikipedia web page from which the original text was extracted
– the name of the author of this SQAD record
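For illustration, a SQAD record can be viewed as a simple structure with these five fields. The following minimal Python sketch mirrors the record from Figure 1; the class and field names are ours, not part of the released database format, which stores the items as plain-text files:

from dataclasses import dataclass

# Minimal sketch of one SQAD record (field names are illustrative).
@dataclass
class SqadRecord:
    text: str      # original sentence(s) from Wikipedia
    question: str  # question directly answered in the text
    answer: str    # expected answer as it appears in the text
    url: str       # URL of the source Wikipedia page
    author: str    # author of the record

record = SqadRecord(
    text="Létající jaguár je novela spisovatele Josefa Formánka z roku 2004.",
    question="Kdo je autorem novely Létající jaguár?",
    answer="Josef Formánek",
    url="http://cs.wikipedia.org/wiki/L%C3%A9taj%C3%ADc%C3%AD_jagu%C3%A1r",
    author="chalupnikova",
)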

In the first phase, the SQAD database consisted of plain texts. To support the comparison and development of question answering systems, including SBQA [7] which is presented further in the text, SQAD was supplemented with automatic morphological annotations. The texts were processed with two tools: Unitok [12] for text tokenization and the Desamb [8] morphological tagger, which provides unambiguous morphological annotation of tokenized texts (see Figure 2; a minimal sketch of reading this vertical annotation format is shown below). Both tools are automatic systems whose accuracy is not 100%, thus they occasionally make mistakes. To obtain high-quality data, the tagged texts were checked and corrected by semi-automatic and manual adjustments that are described in the following sections.
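The annotated texts use a vertical format with one token per line carrying its form, lemma and tag (cf. Figure 2). The following Python sketch shows our reading of this format; it is not code shipped with Unitok or Desamb:

def read_vertical(lines):
    """Parse vertical 'form lemma tag' lines into (form, lemma, tag) triples."""
    tokens = []
    for line in lines:
        if line.strip():                  # skip empty or structural lines
            form, lemma, tag = line.split()[:3]
            tokens.append((form, lemma, tag))
    return tokens

print(read_vertical(["Kdo\tkdo\tk3yRnSc1", "je\tbýt\tk5eAaImIp3nS"]))
# [('Kdo', 'kdo', 'k3yRnSc1'), ('je', 'být', 'k5eAaImIp3nS')]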

3 Adjustments of the SQAD Database

In this section, we describe amendments to the tokenization and tagging as obtained by the Unitok and Desamb processing. Some of the modifications were systematic and performed in bulk over all the records; others were part of the manual checking.

Kdo        kdo        k3yRnSc1
je         být        k5eAaImIp3nS
autorem    autor      k1gMnSc7
novely     novela     k1gFnSc2
Létající   létající   k2eAgMnSc1d1
jaguár     jaguár     k1gMnSc1
?          ?          kIx.

Fig. 2: Morphological annotation (consisting of form, lemma and tag) of the sentence “Kdo je autorem novely Létající jaguár? (Who is the author of the novel Létající jaguár / Flying Jaguar?)”

a)  1              b)  1 200 300
    200       −→
    300

Fig. 3: Unitok: a) wrong and b) adjusted tokenization of the number “1 200 300”

3.1 Tokenization Adjustments

During the manual checking of SQAD records, we found that large numbers were tokenized incorrectly if they contained whitespace as a thousands separator (see Figure 3). Therefore, we have adapted the matching patterns that Unitok uses for processing the text; a sketch of such a pattern is shown below.
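As an illustration, the required pattern can be expressed as a regular expression that keeps whitespace-separated digit groups together. This is only a sketch of the rule in Python, not Unitok's actual pattern configuration:

import re

# A number like "1 200 300": 1-3 digits followed by space-separated
# groups of 3 digits (a non-breaking space is also accepted).
NUMBER = re.compile(r'\d{1,3}(?:[ \u00A0]\d{3})+')

def tokenize_numbers(text):
    """Split text on whitespace but keep separator-delimited numbers whole."""
    tokens, pos = [], 0
    for m in NUMBER.finditer(text):
        tokens += text[pos:m.start()].split()
        tokens.append(m.group(0))        # "1 200 300" stays one token
        pos = m.end()
    tokens += text[pos:].split()
    return tokens

print(tokenize_numbers("Cena byla 1 200 300 korun."))
# ['Cena', 'byla', '1 200 300', 'korun.']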

3.2 Out-of-Vocabulary Words

The Desamb system performs morphological tag disambiguation according to the word context. In some cases, the context is too narrow or there is no context at all, so Desamb cannot determine the correct tag and returns “k?” (unknown kind) as the resulting tag. In SQAD, this is specifically true for the cases where the answer in a SQAD record contains only a number; the resulting Desamb tag is then “k?”. Therefore, we have systematically changed such tags to the “k4” tag for numerals (see Figure 4 for an example). The Desamb tool works over the attributive Czech tagset of the Majka system [9,10].

120  120  k?    −→    120  120  k4

Fig. 4: Desamb: unrecognized number

Los      Los      k?          Los      Los      k1
Angeles  Angeles  k?    −→    Angeles  Angeles  k1
LA       LA       k?          LA       LA       kA

Fig. 5: Desamb: unrecognized proper names and abbreviations

Majka is based on a deterministic acyclic finite state automaton (DAFSA) that is created from a large morphological database. Despite its coverage of more than 30 million Czech word forms, Majka does not recognize all existing words, especially proper names and abbreviations. The resulting tag of such words is thus usually “k?”, which means that the system does not contain the word in its database and the context is not long enough to guess the tag. For such words, we have systematically changed the unknown tag to: a) “k1” (nouns) for all words that start with an upper-case letter, and b) “kA” (abbreviations) for words that contain only upper-case letters, words ending with a dot, or words containing dots between upper-case letters (a sketch of these rules is given below). See Figure 5 for the resulting annotation of unknown proper names and abbreviations.

As we have mentioned above, the SQAD database is extracted from Wikipedia texts, which are multilingual, and the texts often contain the original forms of proper names; for example, the original writing of the name “Tokio” (Tokyo) is “東京”. These foreign words are not included in the Majka database, thus Desamb always assigns them the tag for unknown words (“k?”). To fix the remaining unknown word tags in SQAD, we have extracted all unknown words into one file, keeping the original file name, the word position and the unknown word with its lemma and tag from Desamb. This file was then manually annotated and programmatically applied back to the original annotated files (see Figure 6).
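The following Python sketch reconstructs these bulk retagging rules, together with the number rule from Figure 4. The rule order and the exact string tests are our reconstruction of the description above, not the actual scripts used on SQAD:

def retag_unknown(form, tag):
    """Bulk rules for tokens left with the unknown tag 'k?'."""
    if tag != "k?":
        return tag
    if form.isdigit():                            # "120" -> k4 (numeral)
        return "k4"
    if (form.isupper()                            # "LA"
            or form.endswith(".")                 # "jap."
            or form.replace(".", "").isupper()):  # "U.S.A."
        return "kA"                               # abbreviation
    if form[:1].isupper():                        # "Los", "Angeles"
        return "k1"                               # noun (proper name)
    return tag                                    # left for manual annotation

for form in ["120", "Los", "Angeles", "LA", "東京"]:
    print(form, retag_unknown(form, "k?"))
# 120 k4 / Los k1 / Angeles k1 / LA kA / 東京 k?  (東京 stays for manual annotation)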

Original Desamb output stored in file 03text.txt:

Tokio   Tokio   k1gInSc1
(       (       kIx(
jap.    jap.    kA
東京    東京    k?

Record of the unknown word extracted from the 03text.txt file:

./0000/03tex.txt|3|東京 東京 k?

Record from 03text.txt with manual changes:

./0000/03tex.txt|3|東京 東京 k1

File 03text.txt with the changes applied:

Tokio   Tokio   k1gInSc1
(       (       kIx(
jap.    jap.    kA
東京    東京    k1

Fig. 6: Desamb: unrecognized foreign words

3.3 Mistakes in Morphological Analysis

In the case of unknown words, the Desamb tool does not always leave the word tag undecided: if an unknown word has a suitable context, Desamb can “guess” its lemma and tag even though the word is not present in the Majka database. This works very well when Desamb processes Czech words; however, it may cause mistakes for foreign words, where not only the tag but also the lemma can be wrong. For example, for the word “Las” (from the proper name “Las Vegas”), the output of Desamb is “Las laso k1gInSc1” (the word laso means lasso). To repair all these mistakes, we checked all the SQAD database records, extracted a file similar to the one in Section 3.2, and corrected the tags in the original SQAD files; a sketch of applying such correction records back to the annotated files follows below.
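Applying the manually corrected records back can be sketched in Python as follows, assuming the pipe-delimited format from Figure 6 (“file|position|form lemma tag”) and a 0-based token position (in Figure 6, 東京 is the fourth token of the file and its record carries position 3). This is a reconstruction, not the actual script used on SQAD:

from pathlib import Path

def apply_corrections(correction_file):
    """Write manually corrected 'form lemma tag' lines back into the
    annotated vertical files they were extracted from."""
    for record in Path(correction_file).read_text(encoding="utf-8").splitlines():
        file_name, position, annotation = record.split("|", 2)
        lines = Path(file_name).read_text(encoding="utf-8").splitlines()
        # replace the token line at the given (assumed 0-based) position
        # with the corrected tab-separated form/lemma/tag triple
        lines[int(position)] = "\t".join(annotation.split())
        Path(file_name).write_text("\n".join(lines) + "\n", encoding="utf-8")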

4 Evaluation

We have used the SQAD database to evaluate the accuracy of the first version of SBQA, a syntax-based question answering system [7]. SBQA was developed as a master thesis by Matej Pavla at the Faculty of Informatics, Masaryk University. The input of the SBQA system is a plain text question, which is preprocessed by the Unitok and Desamb systems and passed to the SET parser [11] to identify dependencies and phrase relations within the question. SBQA then decides the rules for finding the answer in its knowledge base based on a match of the corresponding syntactic structures (hence the “syntax-based” in its title). The SBQA knowledge base is also built from plain text documents, which are automatically processed with the same tool chain as the questions (Unitok, Desamb and SET). As we have shown in this paper, the automatically preprocessed texts contain mistakes. For the purpose of evaluating the QA part of SBQA, we have therefore modified its workflow to build the knowledge base from the annotated SQAD database instead of using the automatic processors.


Table 1: Evaluation of the SBQA system

total questions      3,301   100%
correct                758    23%
partially correct       60     1%
incorrect            2,003    61%
not found              480    15%

Table 2: Classification of SBQA errors (on 200 examples)

total questions                              200    100%
error in SBQA system                         119   59.5%
error in tokenization or syntax analysis      43   21.5%
uncovered phenomena                           38     19%

For the original results, see [7]. The current results of SBQA as measured on the SQAD database are presented in Table 1. After the evaluation, we took 200 questions that were answered incorrectly by SBQA and manually checked the reason for each mistake. We have identified three categories of mistakes, caused by:

1. errors in the implementation of the SBQA system
2. errors in tokenization or syntactic analysis
3. phenomena not covered by the current implementation of the SBQA system

An overview of the percentages of these error categories is presented in Table 2. A detailed description of the particular error categories can be found in the following subsections.

4.1 Errors in the Implementation of the SBQA System

The SBQA system balances answer correctness against the ability to find an answer at all. Because of that, the SBQA implementation uses the concept of choosing answer candidates based on their probability values. On the other hand, this sometimes leads to a wrong evaluation that produces an incorrect answer. We have identified the following error types caused by the SBQA implementation:

– answer in brackets: if the answer to the question is placed in brackets, the SBQA system nearly always provides an incorrect answer
– part-of-speech requirement: even if the tokenization and syntactic analysis are correct, the SBQA system does not check the part of speech that is important for answering the question, and thus answers incorrectly
– comparison of dates or numbers: in the case of a question whose answer requires some date or numeric computation, the system does not provide the answer (only exact matches are found)

– wrong question type: as described in [7], SBQA determines a question type for each question, but the system sometimes answers with a yes/no question type even if the original question is not a yes/no question

4.2 Errors in Tokenization or Syntactic Analysis

The SBQA system answers incorrectly in cases where the tokenization or syntactic analysis contains mistakes. Three types of such errors appear in the current SQAD database:

– Desamb incorrectly detects sentence boundaries and splits one sentence into two or more sentences
– Desamb incorrectly tags a word, thus the syntactic analysis is incorrect and the SBQA system cannot derive the required answer
– SET incorrectly parses a sentence and creates an incorrect syntactic tree, which usually leads to an incorrect answer

4.3 Uncovered Phenomena

The SBQA system does not yet implement advanced NLP techniques such as anaphora resolution. This means that if the answer to a question lies in a sentence that refers back to the previous text, the answer cannot be found.

5 Conclusions

In this paper, we have presented a new Czech question answering database called SQAD. Each SQAD record consists of an annotated question, the annotated answer, the annotated sentence(s) containing the full answer, the Wikipedia URL as the source of the statement, and the author name of the question-answer pair. The morphological annotation was obtained automatically and then manually corrected. The corrections make the SQAD data more precise, and thus more suitable for the evaluation of Czech question answering systems. The SQAD database in its current version is available for download at http://nlp.fi.muni.cz/projects/sqad.

Acknowledgements. This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Ferrucci, D.A.: IBM’s Watson/DeepQA. SIGARCH Comput. Archit. News 39(3) (2011). http://doi.acm.org/10.1145/2024723.2019525


2. Ferrucci, D.A.: IBM’s Watson/DeepQA. In: Proceedings of the 38th Annual International Symposium on Computer Architecture. ISCA ’11, New York, NY, USA, ACM (2011). http://dl.acm.org/citation.cfm?id=2000064.2019525
3. Krishnamurthi, I., Shanthi, P.: Ontology based question answering on closed domain for e-learning applications. Australian Journal of Basic and Applied Sciences 8(13) (2014) 396–401
4. Chung, H., Song, Y.I., Han, K.S., Yoon, D.S., Lee, J.Y., Rim, H.C., Kim, S.H.: A practical QA system in restricted domains. In: Proceedings of the Workshop on Question Answering in Restricted Domains, within ACL (2004)
5. Vargas-Vera, M., Lytras, M.D.: AQUA: A closed-domain question answering system. Information Systems Management 27(3) (2010) 217–225
6. Wang, D.: A domain-specific question answering system based on ontology and question templates. In: Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2010 11th ACIS International Conference on, IEEE (2010) 151–156
7. Pavla, M.: Automatic question answering system. Master thesis, Masaryk University, Faculty of Informatics (2014). http://is.muni.cz/th/418138/fi_m/?lang=en
8. Šmerk, P.: Unsupervised learning of rules for morphological disambiguation. In: Text, Speech and Dialogue, Springer Berlin Heidelberg (2004). http://dx.doi.org/10.1007/978-3-540-30120-2_27
9. Šmerk, P.: Fast morphological analysis of Czech. In: Proceedings of Recent Advances in Slavonic Natural Language Processing (2009)
10. Jakubíček, M., Kovář, V., Šmerk, P.: Czech morphological tagset revisited. Proceedings of Recent Advances in Slavonic Natural Language Processing 20 (2011) 29–42
11. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis using finite patterns: A new parsing system for Czech. In: Human Language Technology. Challenges for Computer Science and Linguistics, Berlin/Heidelberg (2011) 161–171. http://dx.doi.org/10.1007/978-3-642-20095-3_15
12. Michelfeit, J., Pomikálek, J., Suchomel, V.: Text Tokenisation Using unitok. In Horák, A., Rychlý, P., eds.: 8th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2014) 71–75
