The University of Tokyo Graduate School of Information Science and Technology

COLLECTIVE SEMANTIC ANNOTATION FOR WEB TEXT: TRIPLE TAGGING AND TRIPLE EXTRACTION

A Thesis in Department of Creative Informatics by Jie Yang

© 2008 Jie Yang

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

June 2008

Abstract

Semantic annotations are machine-understandable metadata attached to web resources. They represent the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, and the Semantic Web. Considerable research has been done in the field of semantic annotation. Judged by where the semantics of the annotations come from, existing studies fall into two categories: the “ontology-centric” class, which depends on a priori vocabularies (generally known as ontologies) to annotate web text, and the more recent “user-centric” class, which avoids predefined vocabularies and allows ordinary web users to annotate web text with few or no constraints.

This research on “collective semantic annotation” takes a user-centric approach. The goal of the work is to explore how semantic annotations for web text can be generated by exploiting the strengths of both ordinary web users and computers. Specifically, two questions are addressed. First, what user-centric support can be provided to encourage ordinary web users to annotate web text? Second, how can the annotation process be automated?

In answer to the first question, a user-centric annotation diagram, the triple tagging diagram, is proposed. I identify eight dimensions that help to describe annotation frameworks. Existing work is reviewed along these eight dimensions, and the features and novelties of the triple tagging diagram are discussed. The diagram consists of three parts: the concept model, which defines annotation primitives; the collaboration model, which addresses the possibilities for information collection and navigation; and the ontology model, which provides a common definition for triple annotations so that they can be exchanged, reused, and extended on the Web. A model evaluation comprising both qualitative and quantitative analysis is carried out. The evaluation demonstrates the expressive power of the triple tagging diagram and its advantages over existing work.

Regarding the second question, I propose an interactive approach that generates semantic annotations for web text automatically. In this approach, the annotation generation problem is defined as a binary relation extraction problem, and linguistic and machine learning techniques are used to solve it. Specifically, I propose the penalty tree similarity algorithm, an extension of the tree kernels that are widely used in the field of Information Extraction. A triple tagging corpus is created and used in experiments. The results show that the extended tree similarity algorithm achieves better performance.

As a result of this research, a triple tagging system, Triple-Note, is implemented with a client-server architecture. On the client side, an extension for the Firefox browser supports users’ annotating actions. On the server side, automatic extraction, annotation storage, and other service modules are implemented.
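The abstract does not spell out the annotation primitives, so the following is only a rough illustration of the general idea of collective triple annotation, not the model defined in the thesis. It assumes that a triple tag is a subject-predicate-object statement that a user attaches to a web page. The class name `TripleTag`, its fields, and the helpers `aggregate` and `query` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TripleTag:
    """One user-contributed annotation (hypothetical structure).

    Assumption: a triple tag is a (subject, predicate, object) statement
    attached to a web resource by a user.
    """
    subject: str
    predicate: str
    obj: str
    url: str   # the annotated web resource
    user: str  # the user who contributed the annotation


def aggregate(tags):
    """Count how many distinct users asserted each triple on each URL."""
    supporters = defaultdict(set)
    for t in tags:
        supporters[(t.url, t.subject, t.predicate, t.obj)].add(t.user)
    return {key: len(users) for key, users in supporters.items()}


def query(tags, subject=None, predicate=None, obj=None):
    """Return the tags matching a triple pattern; None acts as a wildcard."""
    return [
        t for t in tags
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    page = "http://example.org/article"
    tags = [
        TripleTag("Tokyo", "capital_of", "Japan", page, "alice"),
        TripleTag("Tokyo", "capital_of", "Japan", page, "bob"),
        TripleTag("Kyoto", "located_in", "Japan", page, "alice"),
    ]
    print(aggregate(tags))
    print(query(tags, subject="Kyoto"))
```

Counting distinct supporters per triple is one simple way to picture the "collective" aspect: a statement asserted independently by several users can be treated as better supported than one asserted once.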


Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1 Introduction
  1.1 Context
    1.1.1 Annotation and Semantic Annotation
    1.1.2 Ontology-centric Semantic Annotation
    1.1.3 User-centric Semantic Annotation
  1.2 Research Questions
  1.3 Approach and Contribution
  1.4 Structure of the Thesis

Chapter 2 Review of Annotation Systems
  2.1 Eight Dimensions
  2.2 Review of Literature Work
    2.2.1 Group I: Ontological Approach
    2.2.2 Group II: Social Approach
    2.2.3 Group III: Bridging Approach
  2.3 Our Diagram

Chapter 3 Triple Tagging Model
  3.1 Conceptual Model
    3.1.1 Triple Tagging Primitives
    3.1.2 Tag Graph
    3.1.3 Mapping to RDF
  3.2 Exploiting Triple Tagging Model
    3.2.1 Collaboration
    3.2.2 Triple Query
    3.2.3 Augmented Navigation
  3.3 Triple Tagging Ontology
    3.3.1 Triple Tagging Ontology
    3.3.2 An Example
  3.4 Discussion

Chapter 4 Model Evaluation
  4.1 Model Expressiveness
    4.1.1 Triple Tagging and Semantic Wikis
    4.1.2 Triple Tagging and Social Tagging
    4.1.3 Triple Tagging and Google Base
  4.2 User Evaluation

Chapter 5 Sentence-Based Triple Extraction
  5.1 Definition of the Problem
    5.1.1 Motivation
    5.1.2 Binary Relation Extraction
  5.2 Related Work
  5.3 Syntactic Representation of Sentence
  5.4 Tree Kernels
    5.4.1 Kernel Methods
    5.4.2 Dependency Tree Kernel
  5.5 Penalty Tree Similarity
  5.6 Discussion

Chapter 6 An Interactive Approach for Triple Extraction
  6.1 Definition of the Problem
  6.2 Related Work
  6.3 Overview of the Process
  6.4 Pre-processing
    6.4.1 Dependency Parsing
    6.4.2 Semantic Tagging
  6.5 Word Pair Detecting
    6.5.1 POS Filtering
    6.5.2 Word Pair Filtering
  6.6 Relation Labeling and Triple Filtering
  6.7 Discussion

Chapter 7 Experiments
  7.1 Create a Corpus
  7.2 Does Hypothesis One Hold? Semantic Convergence
  7.3 Experiments
    7.3.1 Setup
    7.3.2 Precision and Recall
    7.3.3 Results

Chapter 8 Implementation System: Triple-Note
  8.1 User Interface
    8.1.1 Triple tagging web contents
    8.1.2 Triple recommendation
    8.1.3 Triple graph browsing
  8.2 Architecture

Chapter 9 Conclusions and Future Research Directions
  9.1 Research Justification
  9.2 Future Work
  9.3 Future Research Directions

Appendix A Triple Tagging Guideline
  A.1 Process
  A.2 Examples and Explanations

Appendix B OWL File of the Triple Tagging Model

Bibliography

Index

List of Abbreviations

Publications
