Linked Hypernyms Dataset - Generation Framework and Use Cases

Tomáš Kliegr1, Václav Zeman1, and Milan Dojchinovski1,2

1 Department of Information and Knowledge Engineering
Faculty of Informatics and Statistics
University of Economics, Prague, Czech Republic
[email protected]

2 Web Engineering Group
Faculty of Information Technology
Czech Technical University in Prague
[email protected]

Abstract. The Linked Hypernyms Dataset (LHD) provides entities described by Dutch, English and German Wikipedia articles with types taken from the DBpedia namespace. LHD contains 2.8 million entity-type assignments. Accuracy evaluation is provided for all languages. These types are generated based on one-word hypernyms extracted from the free text of Wikipedia articles; the dataset is thus to a large extent complementary to the DBpedia 3.8 and YAGO 2s ontologies. LHD is available at http://ner.vse.cz/datasets/linkedhypernyms.

1 Introduction

The Linked Hypernyms Dataset provides a source of types for entities described by Wikipedia articles. The dataset follows the same data modelling approach as the well-known DBpedia [1] and YAGO [6] knowledge bases. The types are extracted with hand-crafted lexico-syntactic patterns from the free text of the articles. The dataset can thus be used as an enrichment to DBpedia and YAGO, which are populated from the structured and semistructured information in Wikipedia. The dataset consists of two subdatasets:

◦ Hypernyms dataset contains only the raw plain-text hypernyms extracted from the articles. An example entry is: DiegoMaradona;manager. This dataset can be used as a gazetteer.
◦ Linked Hypernyms Dataset identifies both the entity and the hypernym by a DBpedia URI, either a DBpedia resource or a DBpedia ontology class (preferred). Example entries (n-triples format) are:

<http://dbpedia.org/resource/Diego_Maradona> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/resource/Manager> .
<http://dbpedia.org/resource/Diego_Maradona> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/SoccerManager> .
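The gazetteer use of the plain-text Hypernyms dataset can be sketched in a few lines of Python. This is an illustration only; the function name and the in-memory input are hypothetical, not part of the released tooling, which uses the "<article>;<hypernym>" entry format shown above:

```python
# Load Hypernyms dataset entries of the form "<article>;<hypernym>"
# into a dictionary usable as a gazetteer. The input here is an
# in-memory list standing in for the released file.
def load_gazetteer(lines):
    gaz = {}
    for line in lines:
        entity, hypernym = line.strip().split(";")
        gaz[entity] = hypernym
    return gaz

gaz = load_gazetteer(["DiegoMaradona;manager"])
print(gaz["DiegoMaradona"])  # -> manager
```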

The work presented here complements papers [7,8]; the former describes LHD 1.0 in detail and the latter describes a statistical type inference algorithm, which was used to extend the coverage of DBpedia Ontology classes in LHD 2.0 Draft. This paper has the following focus areas:

◦ Section 2: updated LHD generation framework,
◦ Section 3: LHD 1.0/2.0 comparison – size and accuracy,
◦ Section 4: use cases,
◦ Section 5: future work – extending LHD to other languages, and
◦ Section 6: dataset license and availability.

2 Dataset Generation

The dataset generation is a process in which the textual content of each Wikipedia page is processed, the word corresponding to the type is identified, and finally this word is disambiguated to a DBpedia concept. In this paper, we describe the updated LHD generation framework, which is available at http://ner.vse.cz/datasets/linkedhypernyms.

The first step in the process is the extraction of the hypernym from the first sentences of Wikipedia articles. To avoid parsing of Wikipedia pages from the XML dump, the updated framework performs hypernym extraction from the DBpedia RDF n-triples dump. The hypernym is extracted from the textual content of the DBpedia property dbo:abstract^3, which contains the introductory text of Wikipedia articles.^4

The hypernym extraction step was implemented as a pipeline in the GATE text engineering framework.^5 The pipeline consists of the following processing resources:

1. ANNIE English Tokenizer
2. ANNIE Regex Sentence Splitter
3. ANNIE Part-of-Speech Tagger (English), TreeTagger (other languages)
4. JAPE Transducer

The hypernym extraction is performed with hand-crafted lexico-syntactic patterns written as a JAPE grammar [2]. The JAPE grammars are designed to recognize several variations of Hearst patterns [5]:

“[to be] [article] [modifiers] [hypernym]”.

---
^3 dbo refers to the http://dbpedia.org/ontology/ namespace, and dbpedia to the http://dbpedia.org/resource/ namespace.
^4 The statistics reported in Section 3 relate to the original version of the dataset, where the Wikipedia dump is used as input.
^5 https://gate.ac.uk/

Only the first sentence in the dbo:abstract content is processed and only the first matched hypernym is considered. The manually tagged corpora used for the grammar development were made available on the dataset website. The three corpora (English, German and Dutch) contain more than 1,500 articles, which were used to develop the grammars.

Example. The first sentence in the dbo:abstract for the DBpedia instance dbpedia:Diego_Maradona is as follows: “Diego Armando Maradona Franco is an Argentine football manager.” The English JAPE grammar applied to this POS-tagged sentence will result in marking the word “manager” as a hypernym. The word “is” is matched with the [to be] part of the grammar, the word “an” with the [article], and “Argentine football” is captured by the [modifiers] group.

Next, the hypernym is mapped to a DBpedia Ontology class. The mapping process has two stages:

◦ The hypernym is mapped to a DBpedia instance using the Wikipedia Search API. This naive approach provided average performance in a recent entity linking contest [4].
◦ In order to improve interconnectedness, mapping to a DBpedia Ontology class is attempted.
  • In LHD 1.0 the mapping is performed based on a total textual match in order to maximize precision. A set of approximate matches (based on a substring match) is also generated.
  • In LHD 2.0 the mapping is performed using a statistical type inference algorithm.

At this point, the hypernym is represented with a Linked Open Data (LOD) identifier in the http://dbpedia.org/resource/ namespace. The result of the processing is an RDF triple.

Example. The output of the first stage is:

dbpedia:Diego_Maradona rdf:type dbpedia:Manager

Since the type is in the less desirable dbpedia namespace, the system tries to find a fitting DBpedia Ontology class. The total textual match fails in DBpedia 3.8. However, the statistical type inference algorithm is more successful, yielding the additional triple:

dbpedia:Diego_Maradona rdf:type dbo:SoccerManager
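The LHD 1.0 total-textual-match stage can be sketched as follows. The class set below is a tiny illustrative subset (the real check runs against the full DBpedia Ontology), and the function name is hypothetical:

```python
# Sketch of the LHD 1.0 "total textual match" mapping: a type in the
# dbpedia: resource namespace is promoted to a DBpedia Ontology class
# only if its local name exactly matches a class name. DBO_CLASSES is
# an illustrative subset, not the real ontology.
DBO_CLASSES = {"Person", "SoccerManager", "Settlement", "Company"}

def map_to_dbo(resource_uri):
    local = resource_uri.rsplit("/", 1)[-1]
    if local in DBO_CLASSES:
        return "http://dbpedia.org/ontology/" + local
    return None  # no exact match: the type stays in the resource namespace

# There is no dbo:Manager class, so the match fails, as in the
# Diego Maradona example above:
print(map_to_dbo("http://dbpedia.org/resource/Manager"))  # -> None
```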

3 Dataset Metrics

The size of the LHD dataset for individual languages is captured in Table 1.

Table 1. Hypernyms and Linked Hypernyms datasets - size statistics.

dataset                                        Dutch     English    German
Hypernyms dataset                              866,122   1,507,887  913,705
Linked Hypernyms Dataset                       664,045   1,305,111  825,111
- type is a DBpedia Ontology class (LHD 1.0)    78,778     513,538  171,847
- type is a DBpedia Ontology class (LHD 2.0)   283,626   1,268,857  615,801

Table 2. Hypernyms and Linked Hypernyms datasets - accuracy.

dataset                  Dutch  English  German
Hypernyms dataset        0.93   0.95     0.95
LHD 1.0                  0.88   0.86     0.77
LHD 2.0 inferred types   NA     0.65     NA

Human evaluation of the correctness of both datasets was performed separately for the entire English, German and Dutch datasets, each represented by a randomly drawn 1,000 articles. The evaluation for English was done by three annotators. The evaluations for German and Dutch were done by the best performing annotator from the English evaluation. The results are depicted in Table 2. The average accuracy for English, which is the largest dataset, is 0.95 for the plain-text types and 0.86 for types disambiguated to DBpedia concepts (a DBpedia ontology class or a DBpedia resource).

LHD 2.0 [8] increases the number of entities aligned to the DBpedia Ontology to more than 95% for English and to more than 50% for other languages. Since a statistical type inference algorithm is used, the increase in coverage comes at the cost of reduced accuracy. The new triples added in LHD 2.0 have an estimated accuracy of 0.65 (one annotator). LHD 2.0 Draft is thus an extension, rather than a replacement, for LHD 1.0. The reason is not only the reduced reliability of the inferred types, but also the fact that the types are complementary: for Diego Maradona, the LHD 1.0 type is dbpedia:Manager, while the LHD 2.0 type is dbo:SoccerManager. More information about the evaluation setup and additional results can be found in [7] and at http://ner.vse.cz/datasets/linkedhypernyms/.

4 Use Cases

The purpose of LHD is to provide an enrichment to type statements in the DBpedia and YAGO ontologies. We have identified the following types of complementarity:

◦ LHD allows choosing the most common type for an entity. According to our observation, the type in the first sentence (the hypernym) is the main type that people typically associate with the entity. Therefore, the LHD dataset can also be used as a dataset which provides “primary”, or “most common”, types. Note that the content in Wikipedia is constantly updated and the type can thus also be considered as temporally valid.

◦ LHD provides a more specific type than DBpedia or YAGO. This is typically the case for less prolific entities, for which the semistructured information in Wikipedia is limited.
◦ LHD provides a more precise type, giving an alternative to an erroneous type in DBpedia or YAGO.
◦ LHD is the only knowledge base providing any type information.

As a resource complementary to other knowledge bases, LHD can be used in common entity classification systems (wikifiers). Entityclassifier.eu is an example of a wikifier that uses LHD alongside DBpedia and YAGO [3].
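The last complementarity item can be sketched as a simple fallback across type sources; the dictionaries below are illustrative stand-ins for real knowledge base lookups:

```python
# Sketch of using LHD as a complementary type source: consult the
# knowledge bases in order and return the first type found. LHD is
# queried first here because its hypernym reflects the "primary" type
# stated in the article's first sentence. The data is illustrative.
def entity_type(entity, *kbs):
    for kb in kbs:
        if entity in kb:
            return kb[entity]
    return None

lhd_types = {"Diego_Maradona": "dbo:SoccerManager",
             "Some_Entity": "dbo:Settlement"}
dbpedia_types = {"Diego_Maradona": "dbo:Person"}

print(entity_type("Diego_Maradona", lhd_types, dbpedia_types))
# Some_Entity has no DBpedia type; LHD is the only source of one:
print(entity_type("Some_Entity", lhd_types, dbpedia_types))
```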

5 Future Work - LHD for Other Languages

Creating LHD for another language requires the availability of a POS tagger and a manually devised JAPE grammar. Currently, we are investigating a new workflow which could lead to fully automated LHD generation: generating a labeled set of articles by annotating as hypernyms those noun phrases that match any of the types assigned in DBpedia, and subsequently using this set to train a hypernym tagger, e.g. as proposed in [9]. The hypernyms output by the tagger could be used in the same way as hypernyms identified by the hand-crafted JAPE grammars, leaving the rest of the LHD generation framework unaffected.
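The proposed automatic labeling step can be sketched as follows; the function name, tag names, and the simple token-level matching are all assumptions made for illustration:

```python
# Hedged sketch of the proposed training-data generation: tokens in
# an article's first sentence that textually match one of the
# entity's DBpedia types are labeled as hypernyms, yielding data to
# train a sequence tagger. Tag names ("HYPERNYM"/"O") are illustrative.
def label_sentence(tokens, dbpedia_types):
    types = {t.lower() for t in dbpedia_types}
    return [(tok, "HYPERNYM" if tok.lower() in types else "O")
            for tok in tokens]

tokens = ["Diego", "Maradona", "is", "an", "Argentine",
          "football", "manager", "."]
print(label_sentence(tokens, ["Manager", "Person"]))
```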

6 Conclusions

LHD is downloadable from http://ner.vse.cz/datasets/linkedhypernyms/. The dataset is released under a Creative Commons License. In order to stimulate the generation of the dataset for other languages, we also provide the source code of the LHD extraction framework at http://ner.vse.cz/datasets/linkedhypernyms in the form of a Maven project.

Acknowledgements

This research was supported by the European Union’s 7th Framework Programme via the LinkedTV project (FP7-287911).

References

1. C. Bizer, et al. DBpedia - a crystallization point for the web of data. Web Semant., 7(3):154–165, Sep. 2009.
2. H. Cunningham, D. Maynard, and V. Tablan. JAPE - a Java Annotation Patterns Engine (second edition). Technical report, Department of Computer Science, University of Sheffield, 2000.
3. M. Dojchinovski and T. Kliegr. Entityclassifier.eu: real-time classification of entities in text with Wikipedia. In ECML’13, pp. 654–658. Springer, 2013.
4. M. Dojchinovski, T. Kliegr, I. Lašek, and O. Zamazal. Wikipedia search as effective entity linking algorithm. In Text Analysis Conference (TAC) 2013 Proceedings. NIST, 2013. To appear.
5. M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING ’92, pp. 539–545. ACL, Stroudsburg, PA, USA, 1992.
6. J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.
7. T. Kliegr. Linked hypernyms: Enriching DBpedia with targeted hypernym discovery. 2013. Under review.
8. T. Kliegr and O. Zamazal. Towards Linked Hypernyms Dataset 2.0: complementing DBpedia with hypernym discovery and statistical type inference. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014. To appear.
9. B. Litz, H. Langer, and R. Malaka. Sequential supervised learning for hypernym discovery from Wikipedia. In A. Fred, J. L. G. Dietz, K. Liu, and J. Filipe (eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management, vol. 128 of Communications in Computer and Information Science, pp. 68–80. Springer-Verlag, Berlin Heidelberg, 2011.
