From One Tree to a Forest


From One Tree to a Forest – A Unified Solution for Structured Web Data Extraction
Qiang Hao†‡, Rui Cai†, Yanwei Pang‡, Lei Zhang†
† Microsoft Research Asia ‡ Tianjin University

Outline
• Motivation & Challenges
• Our Solution
– Main Idea & Framework Overview
– Feature Extraction
– Vertical Knowledge Learning
– Vertical Knowledge Adaptation
• Experimental Results
• Summary

2011/7/27

What’s Structured Data Extraction
• Extracting structured data records from web pages = identifying values of attributes

Attributes:
Title              Author             Publish Date     …
The Kite Runner    Khaled Hosseini    April 2004       …
Mercy              Toni Morrison      2008             …
The Time Machine   H. G. Wells        June 30, 2004    …

• A data record = attribute values of an entity

We Need Structured Data
• A vertical is a category of entities associated with similar attributes (e.g., each book has title/author/…)
[Figure: websites grouped into verticals]
– Books: Title, Author, Publisher, Publish Date
– Restaurants: Name, Cuisine, Address, Phone
– Autos: Model, Price, Engine, Fuel Economy
– ...

Challenges
• Example: Vertical = Book, Attribute = Pub. Date
[Figure: a page from site 1 and a page from site 2 showing the same entity with different value formats]
• Attribute value variations across sites
• Noisy page contents

Challenges (contd.)
• Page layout variations across sites
• Various verticals & attributes
– Autos: Model, Price, Engine, Fuel Economy
– Jobs: Title, Company, Location, Date
– Restaurants: Name, Cuisine, Address, Phone
– Books: Title, Author, Publisher, Publish Date
– Movies: Title, Director, Genre, Rating
– Universities: Name, Phone, Website, Type

Existing Solutions
• Manual solutions: Kushmerick (PhD thesis ’97), Muslea et al. (AGENTS’99), Soderland (Mach. Learn. ’99), Zheng et al. (KDD’07), …
– Pros: highly accurate
– Cons: labor-intensive; difficult to scale up
• Semi-automatic solutions: Crescenzi et al. (VLDB’01), Arasu et al. (SIGMOD’03), Liu et al. (KDD’03), Zhai et al. (WWW’05), …
– Pros: automatically locate data in templates
– Cons: need to annotate semantics manually
• Automatic solutions: Zhu et al. (ICML’05, KDD’06), Carlson et al. (ECML’08), Wong et al. (SIGIR’08 & ’09), Yang et al. (WWW’09), …
– Pros: extract data with specified semantics
– Cons: need strong features and/or abundant training data

Our Goal
• A unified solution for extracting structured data with:
– Minimal human effort: label one seed site for each vertical → many unseen sites
– Flexibility for verticals: handle various verticals & attributes without redesign
[Figure: example verticals and attributes: Books (Title, Author, Publisher, Publish Date), Autos (Model, Price, Engine, Fuel Economy), Jobs (Title, Company, Location, Date), Restaurants (Name, Cuisine, Address, Phone), ...]


Our Solution: Main Idea
• Goals: flexible for various verticals & attributes; robust to variations across websites
• General features + loose classifiers: Recall↑
• Site-level constraints: Precision↑
– Web pages are generated by site-level templates
• Combine the two

Framework Overview
(a) Feature Extraction
– Layout
– Content
– Context
(b) Vertical Knowledge Learning (input: a labeled seed site)
– Attribute-Specific Semantics Learning
– Inter-Attribute Layout Learning
(c) Vertical Knowledge Adaptation (input: a new unseen site)
– Page-Level Semantic Prediction
– Inter-Page Aggregation
– Inter-Attribute Re-ranking
Output: Wrappers, Data Extraction, Structured Data

(a) Feature Extraction
• Layout
• Content
• Context

General Features of Web Pages
• Extract features from text nodes in DOM trees of web pages
[Figure: a web page and its DOM tree (... div, h1, em, a ...); the text leaves “The Kite Runner”, “by”, and “Khaled Hosseini” are text nodes]
• Three types of features: layout, content, context

Layout Features
• Goal: characterize the position of a text node
• Visual Position
– position in a rendered page
– = coordinates relative to the top-left corner
– Example: Visual Position = (24, 798)
• DOM Path
– position in a DOM tree
– = root-to-leaf tag path
– Example: DOM Path = /html/body/div/div/div/div/h1/em/a/text
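Note: the DOM Path feature can be sketched with Python's standard html.parser. This is a minimal illustration, not the authors' implementation; the names TextNodePaths and text_node_paths, the sample markup, and the void-tag handling are assumptions made for this sketch. Visual Position is omitted because it needs a rendering engine.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextNodePaths(HTMLParser):
    """Collect (dom_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.nodes = []   # collected (dom_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/" + "/".join(self.stack + ["text"])
            self.nodes.append((path, text))


def text_node_paths(html):
    parser = TextNodePaths()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical page fragment modeled on the slide's book example.
sample_html = ("<html><body><div><h1>The Kite Runner "
               "<em>by <a href='#'>Khaled Hosseini</a></em></h1></div></body></html>")
for path, text in text_node_paths(sample_html):
    print(path, "|", text)
```

Text nodes that share a DOM path across pages of one site are natural candidates for the same attribute, which is what makes the path useful as a layout feature.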

Content Features
• Derived from the value contained in a text node
• Extracted at page level
– Unigram: set of unique tokens
– Length: number of tokens/characters
– Character Type: proportion of letters/digits/symbols
– Compared with specific features such as Contains(‘$’, ‘.’, ‘0-9’) for Price, Character Type is more flexible for any mixture of characters
• Site-level statistics
– Page Redundancy: proportion of pages containing a text node with the same value
– Example: in the Restaurant vertical, Cuisines are much more redundant than Names
• General enough to characterize various attributes
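Note: the content features listed on this slide could be computed roughly as follows. This is a sketch under assumptions: the slides do not specify the tokenizer, so a simple word-character regex is used here, and the function and key names are hypothetical.

```python
import re
from collections import Counter


def page_level_content_features(value):
    """Unigram, length, and character-type features of one text node value."""
    tokens = re.findall(r"\w+", value.lower())        # assumed tokenizer
    chars = [c for c in value if not c.isspace()]
    n = len(chars) or 1                               # avoid division by zero
    letters = sum(c.isalpha() for c in chars)
    digits = sum(c.isdigit() for c in chars)
    return {
        "unigrams": set(tokens),
        "num_tokens": len(tokens),
        "num_chars": len(chars),
        "letter_ratio": letters / n,
        "digit_ratio": digits / n,
        "symbol_ratio": (len(chars) - letters - digits) / n,
    }


def page_redundancy(pages):
    """pages: one list of text-node values per page of a site.

    Returns value -> proportion of pages containing that value.
    """
    counts = Counter()
    for values in pages:
        counts.update(set(values))    # count each value once per page
    return {value: c / len(pages) for value, c in counts.items()}


# Hypothetical restaurant pages: the cuisine repeats, the names do not.
restaurant_pages = [
    ["Italian", "Luigi's"],
    ["Italian", "Chez Ann"],
    ["Thai", "Bangkok House"],
    ["Italian", "Roma"],
]
print(page_redundancy(restaurant_pages)["Italian"])
print(page_level_content_features("$25.86")["digit_ratio"])
```

The character-type ratios stand in for hand-written rules such as Contains(‘$’): a price-like value simply shows up as mostly digits with a few symbols.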

Context Features
• Motivation
– Surrounding text indicates semantics of text nodes
– Text nodes with identical context → similar semantics
• Extracted at page level
– Preceding Text: values of visually preceding text nodes
• Site-level statistics
– Prefix & Suffix: static (across pages) sub-strings of text node values

Context Features (contd.)
• Illustrative example of prefix/suffix extraction
• Text node values sharing the pattern {letter}: {digit}.{digit} {letter}:
Price: 25.86 USD, Price: 19.99 USD, Price: 9.76 EUR, Weight: 1.25 ounces, Length: 22.4 mm, Price: 18.54 EUR, Length: 15.8 mm, Weight: 6.9 ounces, Weight: 10.4 ounces, Price: 45.99 USD, Price: 0.99 EUR, Length: 23.8 mm
• Sub-groups
– Price: {dig}.{dig} USD: Price: 45.99 USD, Price: 25.86 USD, Price: 19.99 USD
– Length: {dig}.{dig} mm: Length: 15.8 mm, Length: 22.4 mm, Length: 23.8 mm
– Weight: {dig}.{dig} ounces: Weight: 10.4 ounces, Weight: 6.9 ounces, Weight: 1.25 ounces
– Price: {dig}.{dig} EUR: Price: 9.76 EUR, Price: 18.54 EUR, Price: 0.99 EUR
• Prefixes: Price:, Weight:, Length:
• Suffixes: USD, EUR, ounces, mm
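Note: the prefix/suffix example can be reproduced with a small heuristic. This is only a sketch, assuming that a prefix or suffix is a single leading or trailing whitespace-separated token repeated in at least min_support values; the slides do not give the actual extraction rule, and the names static_affixes, sub_groups, and min_support are hypothetical.

```python
from collections import Counter, defaultdict


def static_affixes(values, min_support=3):
    """Return (prefixes, suffixes): first/last tokens seen in >= min_support values."""
    token_lists = [v.split() for v in values if v.split()]
    first = Counter(tokens[0] for tokens in token_lists)
    last = Counter(tokens[-1] for tokens in token_lists)
    prefixes = {t for t, c in first.items() if c >= min_support}
    suffixes = {t for t, c in last.items() if c >= min_support}
    return prefixes, suffixes


def sub_groups(values, prefixes, suffixes):
    """Group values by their (prefix, suffix) pair; None marks a missing affix."""
    groups = defaultdict(list)
    for value in values:
        tokens = value.split()
        if not tokens:
            continue
        key = (tokens[0] if tokens[0] in prefixes else None,
               tokens[-1] if tokens[-1] in suffixes else None)
        groups[key].append(value)
    return dict(groups)


# The twelve text node values from the slide.
slide_values = [
    "Price: 25.86 USD", "Price: 19.99 USD", "Price: 9.76 EUR",
    "Weight: 1.25 ounces", "Length: 22.4 mm", "Price: 18.54 EUR",
    "Length: 15.8 mm", "Weight: 6.9 ounces", "Weight: 10.4 ounces",
    "Price: 45.99 USD", "Price: 0.99 EUR", "Length: 23.8 mm",
]
prefixes, suffixes = static_affixes(slide_values)
for key, members in sub_groups(slide_values, prefixes, suffixes).items():
    print(key, members)
```

On the slide's values this yields the same three prefixes, four suffixes, and four sub-groups shown above; real pages would need a more careful notion of "static across pages".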

(b) Vertical Knowledge Learning
• Attribute-Specific Semantics
• Inter-Attribute Layout

Features to Vertical Knowledge
• Goal
– Learn knowledge from a labeled seed site based on features extracted from text nodes
– Guide data extraction from unseen sites
• Two types of vertical knowledge
– Content & Context features → Attribute-Specific Semantics
– Layout features → Inter-Attribute Layout

Attribute-Specific Semantics
• Content features → classifiers (e.g., SVMs)
• Context features → (token-score) lookup tables
• For the text nodes of attribute a_i:
– Content Features → train → Classifier 𝒞_{a_i}
– Context Features → build → Lookup Table 𝒯_{a_i}
• Semantics of attribute a_i: predict semantic relevance to a_i for new text nodes
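Note: a (token-score) lookup table might look like the sketch below. The scoring rule is an assumption, since the slides only name the structure: here a token's score is the fraction of seed text nodes of the attribute whose context contains it, and a new node takes the best score among its context tokens. The function names are hypothetical, and the content-feature classifier (e.g., an SVM) is not shown.

```python
from collections import Counter


def build_lookup_table(contexts):
    """contexts: one list of context tokens per seed text node labeled a_i.

    Returns token -> score in [0, 1] (assumed: share of nodes with that token).
    """
    counts = Counter()
    for tokens in contexts:
        counts.update({t.lower() for t in tokens})   # once per node
    total = len(contexts)
    return {tok: c / total for tok, c in counts.items()} if total else {}


def context_score(table, tokens):
    """Score a new text node by its best-matching context token (0.0 if none)."""
    return max((table.get(t.lower(), 0.0) for t in tokens), default=0.0)


# Hypothetical preceding-text tokens of "author" nodes on a seed site.
author_contexts = [["Author:"], ["by"], ["by"], ["Written", "by"]]
author_table = build_lookup_table(author_contexts)
print(context_score(author_table, ["By"]))
print(context_score(author_table, ["Price:"]))
```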

Inter-Attribute Layout
• Construct a K×K layout matrix from layout features
• Encode pairwise distances between K attributes
• Help verify combinations of attributes
[Figure: layout features of the text nodes for attributes a1 … aK → K×K layout matrix (darker → closer); label: “Tend to be close”]
• Example: layout matrices from 5 websites in Book vertical (title, author, ISBN-13, publisher, publish-date)
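Note: a K×K layout matrix can be sketched from visual positions alone. The distance measure is an assumption (plain Euclidean distance between one representative position per attribute); the slides do not state which layout distance is used, and the positions below are made up.

```python
import math


def layout_matrix(positions, attributes):
    """positions: attribute -> (x, y) visual position of its text node.

    Returns a K x K list of lists of pairwise Euclidean distances,
    with rows and columns ordered as in `attributes`.
    """
    return [[math.dist(positions[a], positions[b]) for b in attributes]
            for a in attributes]


# Hypothetical positions on one book page.
book_positions = {"title": (0, 0), "author": (3, 4), "publisher": (0, 10)}
book_attributes = ["title", "author", "publisher"]
for row in layout_matrix(book_positions, book_attributes):
    print(row)
```

The matrix is symmetric with a zero diagonal, so only the K(K-1)/2 entries above the diagonal carry information.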

(c) Vertical Knowledge Adaptation
• Page-Level Semantic Prediction
• Inter-Page Aggregation
• Inter-Attribute Re-ranking

Page-Level Semantic Prediction
• Attribute-specific semantics → page-level candidates
[Figure: Pages 1–3, each showing candidate text nodes for attributes A, B, and C annotated with relevance scores between 0.1 and 0.9]

Inter-Page Aggregation
• For each attribute: multiple candidates per page → align & aggregate into a site-level “page”
[Figure: candidates on Pages 1–3 with their scores, including the ground truth and false alarms]
– Count=3, Mean=0.8 → true attribute occurrences
– Count=1, Mean=0.9 → infrequent noise
– Count=3, Mean=0.4 → occasional false prediction
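Note: the align & aggregate step can be sketched as grouping candidate scores by an alignment key and computing count and mean. The alignment key (for instance a DOM path) and the final ranking by count × mean are assumptions for illustration; the scores mirror the three cases on the slide.

```python
from collections import defaultdict


def aggregate(pages):
    """pages: one {alignment_key: score} dict per page, for a single attribute.

    Returns alignment_key -> (count, mean score) over all pages.
    """
    scores = defaultdict(list)
    for candidates in pages:
        for key, score in candidates.items():
            scores[key].append(score)
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in scores.items()}


# Hypothetical keys: "true" is the real attribute position, "noise" appears on
# one page only, "false" is aligned on every page but is mostly scored low.
candidate_pages = [
    {"true": 0.7, "noise": 0.9, "false": 0.9},
    {"true": 0.8, "false": 0.1},
    {"true": 0.9, "false": 0.2},
]
site_level = aggregate(candidate_pages)
best = max(site_level, key=lambda k: site_level[k][0] * site_level[k][1])
print(site_level)
print(best)
```

A single high score (the noise case) no longer wins once frequency across pages is taken into account, which is the point of aggregating at site level.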

Inter-Attribute Re-ranking
• Multiple possible solutions (attribute combinations)
• Re-rank candidates by measuring similarity to the inter-attribute layout learnt from the seed site
[Figure: Solutions 1–3 compared with the seed-site layout; similarities shown: Sim=0.6, Sim=0.5, Sim=0.9]
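Note: re-ranking by layout similarity can be sketched as follows. The similarity measure is an assumption (cosine similarity between flattened layout matrices, which ignores overall page scale); the slides only state that a similarity is measured. All names and matrices are hypothetical.

```python
import math


def matrix_similarity(m1, m2):
    """Cosine similarity between two K x K layout matrices (assumed measure)."""
    a = [x for row in m1 for x in row]
    b = [x for row in m2 for x in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(solutions, seed_matrix):
    """solutions: name -> layout matrix of a candidate attribute combination.

    Returns the names sorted from most to least similar to the seed layout.
    """
    return sorted(solutions,
                  key=lambda name: matrix_similarity(solutions[name], seed_matrix),
                  reverse=True)


# Hypothetical 3-attribute layouts.
seed_layout = [[0, 1, 5], [1, 0, 4], [5, 4, 0]]
candidate_solutions = {
    "solution_a": [[0, 2, 10], [2, 0, 8], [10, 8, 0]],   # same shape, larger page
    "solution_b": [[0, 5, 1], [5, 0, 4], [1, 4, 0]],     # attributes swapped around
}
print(rerank(candidate_solutions, seed_layout))
```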

Summary: Flowchart of Features
[Flowchart; boxes listed row by row as they appear on the slide]
Content Features | Attribute-Specific Semantics | Page-Level Semantic Prediction
Context Features | Inter-Page Alignment | Inter-Page Aggregation
Layout Features | Inter-Attribute Layout | Inter-Attribute Re-ranking


A Large-Scale Dataset (publicly available at http://swde.codeplex.com)
• 8 verticals with diverse semantics
• 80 websites (10 per vertical)
• 124,291 pages (200~2,000 per website)
• 32 attributes (3~5 per vertical) with labeled ground-truth

Vertical       Attributes
Autos          model, price, engine, fuel_economy
Books          title, author, ISBN, publisher, pub_date
Cameras        model, price, manufacturer
Jobs           title, company, location, date
Movies         title, director, genre, rating
NBA players    name, team, height, weight
Restaurants    name, address, phone, cuisine
Universities   name, phone, website, type

Experimental Settings
• Methods
1. SSM (Stacked Skews Model), Carlson et al., ECML’08
2. PL (page-level semantic prediction)
3. PL + IP (inter-page aggregation)
4. PL + IP + IA (inter-attribute re-ranking)
• One seed site (by turns), test on other sites
• Performance metrics: precision & recall
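Note: the evaluation protocol can be sketched as below. The exact matching criterion is an assumption (an extracted value counts as correct only if the same page, attribute, and value triple is in the ground truth); the function names and sample triples are hypothetical.

```python
def precision_recall(predicted, truth):
    """predicted, truth: sets of (page_id, attribute, value) triples."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def seed_rotations(sites):
    """Yield (seed_site, test_sites), letting each site be the seed by turns."""
    for i, seed in enumerate(sites):
        yield seed, sites[:i] + sites[i + 1:]


# Hypothetical extraction result for the "title" attribute on four pages.
predicted_titles = {(1, "title", "A"), (2, "title", "B"), (3, "title", "X")}
true_titles = {(1, "title", "A"), (2, "title", "B"), (3, "title", "C"), (4, "title", "D")}
print(precision_recall(predicted_titles, true_titles))
print(list(seed_rotations(["site1", "site2", "site3"])))
```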

Performance

Performance (contd.)

Performance: Multiple Seed Sites
• Our solution with multiple seed sites
– Take the solution with highest confidence score
• Our solution with (one seed + bootstrapping seeds)
• SSM with multiple seed sites

Summary
• A unified solution for structured data extraction
– Minimal human effort: labeling one site per vertical
– Flexible for various verticals & attributes
• A large-scale dataset (has been published online)
– 124K pages from 80 websites in 8 verticals
• Promising performance
– Precision ≥ 80%, Recall ≥ 80% for most verticals
• Future work
– Bootstrapping: accumulate vertical knowledge incrementally

Dataset available at http://swde.codeplex.com
