From One Tree to a Forest: A Unified Solution for Structured Web Data Extraction
Qiang Hao†‡, Rui Cai†, Yanwei Pang‡, Lei Zhang†
† Microsoft Research Asia
‡ Tianjin University

Outline
• Motivation & Challenges
• Our Solution
  – Main Idea & Framework Overview
  – Feature Extraction
  – Vertical Knowledge Learning
  – Vertical Knowledge Adaptation
• Experimental Results
• Summary

2011/7/27


What’s Structured Data Extraction
• Extracting structured data records from web pages = identifying values of attributes
• A data record = attribute values of an entity

Attributes:
Title             Author            Publish Date
The Kite Runner   Khaled Hosseini   April 2004
Mercy             Toni Morrison     2008
The Time Machine  H. G. Wells       June 30, 2004

We Need Structured Data
• A vertical is a category of entities associated with similar attributes (e.g., each book has title/author/…)
• Websites grouped into verticals:
  – Books: Title, Author, Publisher, Publish Date
  – Restaurants: Name, Cuisine, Address, Phone
  – Autos: Model, Price, Engine, Fuel Economy
  – ...

Challenges
• Example: Vertical = Book, Attribute = Pub. Date
  – A page from site 1 and a page from site 2 show the same entity with different value formats
• Attribute value variations across sites
• Noisy page contents

Challenges (contd.)
• Page layout variations across sites
• Various verticals & attributes
  – Autos: Model, Price, Engine, Fuel Economy
  – Jobs: Title, Company, Location, Date
  – Restaurants: Name, Cuisine, Address, Phone
  – Books: Title, Author, Publisher, Publish Date
  – Movies: Title, Director, Genre, Rating
  – Universities: Name, Phone, Website, Type

Existing Solutions
• Manual solutions: Kushmerick (PhD thesis ’97), Muslea et al. (AGENTS’99), Soderland (Mach. Learn. ’99), Zheng et al. (KDD’07), …
  – Pros: highly accurate
  – Cons: labor-intensive; difficult to scale up
• Semi-automatic solutions: Crescenzi et al. (VLDB’01), Arasu et al. (SIGMOD’03), Liu et al. (KDD’03), Zhai et al. (WWW’05), …
  – Pros: automatically locate data in templates
  – Cons: need to annotate semantics manually
• Automatic solutions: Zhu et al. (ICML’05, KDD’06), Carlson et al. (ECML’08), Wong et al. (SIGIR’08 & ’09), Yang et al. (WWW’09), …
  – Pros: extract data with specified semantics
  – Cons: need strong features and/or abundant training data

Our Goal
• A unified solution for extracting structured data with:
  – Minimal human effort: label one seed site for each vertical → many unseen sites
  – Flexibility for verticals: handle various verticals & attributes without redesign
• Example verticals:
  – Books: Title, Author, Publisher, Publish Date
  – Autos: Model, Price, Engine, Fuel Economy
  – Jobs: Title, Company, Location, Date
  – Restaurants: Name, Cuisine, Address, Phone
  – ...


Our Solution: Main Idea
• Goals: flexible for various verticals & attributes; robust to variations across websites
• General Features + Loose Classifiers → Recall↑
• Site-Level Constraints (web pages are generated by site-level templates) → Precision↑
• Combine the two

Framework Overview
• (a) Feature Extraction: Layout, Content, Context
• (b) Vertical Knowledge Learning, from a labeled seed site
  – Attribute-Specific Semantics Learning
  – Inter-Attribute Layout Learning
• (c) Vertical Knowledge Adaptation, on a new unseen site
  – Page-Level Semantic Prediction
  – Inter-Page Aggregation
  – Inter-Attribute Re-ranking
• Data Extraction: Wrappers → Structured Data

(a) Feature Extraction
• Layout
• Content
• Context

General Features of Web Pages
• Extract features from text nodes in DOM trees of web pages
• Three types of features: layout, content, context
[Figure: a web page and its DOM tree; under a div, h1 holds the text node "The Kite Runner", em holds the text node "by", and a holds the text node "Khaled Hosseini"]

Layout Features
• Goal: characterize the position of a text node
• Visual Position
  – Position in a rendered page = coordinates relative to the top left
  – Example: Visual Position = (24, 798)
• DOM Path
  – Position in a DOM tree = root-to-leaf tag path
  – Example: DOM Path = /html/body/div/div/div/div/h1/em/a/text
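The DOM-path feature can be illustrated with a minimal sketch that walks a page and records the root-to-leaf tag path of every non-empty text node. This is an assumption-laden illustration built on Python's standard `html.parser`, not the authors' implementation; the class name `TextNodeCollector`, the helper `text_nodes`, and the simplified sample page are hypothetical. Visual position is omitted because it needs a rendering engine.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipping them keeps the stack accurate.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TextNodeCollector(HTMLParser):
    """Collect (DOM path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # (root-to-leaf tag path, text value)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/" + "/".join(self.stack + ["text"])
            self.nodes.append((path, text))

def text_nodes(html):
    """Return all (DOM path, text) pairs found in an HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes

# Simplified stand-in for the book page on the slide (fewer nested divs).
page = ("<html><body><div><h1>The Kite Runner <em>by <a>Khaled Hosseini</a>"
        "</em></h1></div></body></html>")
nodes = text_nodes(page)
for path, text in nodes:
    print(path, "=>", text)
```

Because pages of one site share a template, the same attribute tends to reappear under the same path, which is what makes the path usable as a layout feature.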

Content Features
• Derived from the value contained in a text node
• Extracted at page level
  – Unigram: set of unique tokens
  – Length: number of tokens/characters
  – Character Type: proportion of letters/digits/symbols. Compared with specific features such as (for Price) Contains(‘$’, ‘.’, ‘0-9’), this is more flexible for any mixture of characters
• Site-level statistics
  – Page Redundancy: proportion of pages containing a text node with the same value. Example: in the Restaurant vertical, Cuisines are much more redundant than Names
• General enough to characterize various attributes
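A rough sketch of these content features follows. The function names `content_features` and `page_redundancy`, the whitespace tokenizer, and the choice to exclude spaces from the character-type proportions are assumptions for illustration, not details given on the slide.

```python
def content_features(value):
    """Page-level content features of one text node value."""
    tokens = value.split()
    letters = sum(c.isalpha() for c in value)
    digits = sum(c.isdigit() for c in value)
    spaces = sum(c.isspace() for c in value)
    symbols = len(value) - letters - digits - spaces
    visible = max(len(value) - spaces, 1)  # avoid division by zero
    return {
        "unigrams": {t.lower() for t in tokens},   # set of unique tokens
        "num_tokens": len(tokens),
        "num_chars": len(value),
        "letter_ratio": letters / visible,
        "digit_ratio": digits / visible,
        "symbol_ratio": symbols / visible,
    }

def page_redundancy(value, pages):
    """Share of pages that contain a text node with exactly this value.

    pages: list of sets, each holding the text node values of one page.
    """
    if not pages:
        return 0.0
    return sum(value in page for page in pages) / len(pages)

price = content_features("$25.86")

# Toy restaurant site: cuisines repeat across pages, names do not.
site = [{"Luigi's", "Italian"}, {"Roma", "Italian"}, {"Wok Inn", "Chinese"}]
cuisine_redundancy = page_redundancy("Italian", site)
name_redundancy = page_redundancy("Roma", site)
```

On the toy site, the cuisine value appears on two of three pages while each name appears on one, mirroring the Restaurant example above.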

Context Features
• Motivation
  – Surrounding text indicates semantics of text nodes
  – Text nodes with identical context → similar semantics
• Extracted at page level
  – Preceding Text: values of visually preceding text nodes
• Site-level statistics
  – Prefix & Suffix: static (across pages) sub-strings of text node values

Context Features: illustrative example of prefix/suffix extraction
• Text node values sharing the pattern {letter}: {digit}.{digit} {letter}:
  Price: 25.86 USD, Price: 19.99 USD, Price: 9.76 EUR, Weight: 1.25 ounces, Length: 22.4 mm, Price: 18.54 EUR, Length: 15.8 mm, Weight: 6.9 ounces, Weight: 10.4 ounces, Price: 45.99 USD, Price: 0.99 EUR, Length: 23.8 mm
• Sub-groups
  – Price: {dig}.{dig} USD: Price: 45.99 USD, Price: 25.86 USD, Price: 19.99 USD
  – Length: {dig}.{dig} mm: Length: 15.8 mm, Length: 22.4 mm, Length: 23.8 mm
  – Weight: {dig}.{dig} ounces: Weight: 10.4 ounces, Weight: 6.9 ounces, Weight: 1.25 ounces
  – Price: {dig}.{dig} EUR: Price: 9.76 EUR, Price: 18.54 EUR, Price: 0.99 EUR
• Prefixes: Price:, Weight:, Length:
• Suffixes: USD, EUR, ounces, mm
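The grouping in this example can be approximated with a short sketch. It assumes, as a simplification, that the only varying parts of a value are runs of digits; the static text before the first run is then treated as the prefix and the static text after the last run as the suffix. The function names are hypothetical, and the deck does not say how the authors detect static sub-strings.

```python
import re
from collections import defaultdict

DIGITS = re.compile(r"\d+")
PLACEHOLDER = "{dig}"

def subgroup_pattern(value):
    """Replace every run of digits with a placeholder, keeping static text."""
    return DIGITS.sub(PLACEHOLDER, value)

def prefixes_suffixes(values):
    """Group values by pattern and collect the static prefixes and suffixes."""
    groups = defaultdict(list)
    for value in values:
        groups[subgroup_pattern(value)].append(value)

    prefixes, suffixes = set(), set()
    for pattern in groups:
        if PLACEHOLDER not in pattern:
            continue  # nothing varies, so no prefix/suffix can be isolated
        head = pattern.partition(PLACEHOLDER)[0].strip()
        tail = pattern.rpartition(PLACEHOLDER)[2].strip()
        if head:
            prefixes.add(head)
        if tail:
            suffixes.add(tail)
    return dict(groups), prefixes, suffixes

values = [
    "Price: 25.86 USD", "Price: 19.99 USD", "Price: 9.76 EUR",
    "Weight: 1.25 ounces", "Length: 22.4 mm", "Price: 18.54 EUR",
    "Length: 15.8 mm", "Weight: 6.9 ounces", "Weight: 10.4 ounces",
    "Price: 45.99 USD", "Price: 0.99 EUR", "Length: 23.8 mm",
]
groups, prefixes, suffixes = prefixes_suffixes(values)
```

Run on the twelve values from the slide, this yields the same four sub-groups, three prefixes, and four suffixes listed above.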

(b) Vertical Knowledge Learning
• Attribute-Specific Semantics
• Inter-Attribute Layout

Features to Vertical Knowledge
• Goal
  – Learn knowledge from a labeled seed site based on features extracted from text nodes
  – Guide data extraction from unseen sites
• Two types of vertical knowledge
  – Content and Context features → Attribute-Specific Semantics
  – Layout features → Inter-Attribute Layout

Attribute-Specific Semantics
• Content features → classifiers (e.g., SVMs)
• Context features → (token-score) lookup tables
• For the text nodes of attribute a_i:
  – Content features → train Classifier C_{a_i}
  – Context features → build Lookup Table T_{a_i}
• Together they form the semantics of attribute a_i: predict semantic relevance to a_i for new text nodes
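A token-score lookup table might look like the sketch below. The scoring rule is an assumption chosen for illustration: a token's score is the fraction of its occurrences in the seed site's context strings that belong to the attribute. The slide does not specify how scores are computed, and the names `build_lookup_table` and `context_score` are hypothetical.

```python
from collections import Counter

def _token_counts(contexts):
    """Count, per token, how many context strings contain it."""
    counts = Counter()
    for context in contexts:
        counts.update(set(context.lower().split()))
    return counts

def build_lookup_table(attr_contexts, all_contexts):
    """Map each token to an assumed score in [0, 1] for one attribute.

    attr_contexts: context strings of text nodes labeled with the attribute.
    all_contexts:  context strings of all text nodes, including the above.
    """
    attr_counts = _token_counts(attr_contexts)
    all_counts = _token_counts(all_contexts)
    return {tok: attr_counts[tok] / all_counts[tok]
            for tok in attr_counts if all_counts[tok]}

def context_score(table, context):
    """Average score of the tokens in a new text node's context."""
    tokens = set(context.lower().split())
    if not tokens:
        return 0.0
    return sum(table.get(tok, 0.0) for tok in tokens) / len(tokens)

# Toy seed site: "by" precedes the author once and an unrelated node once.
author_table = build_lookup_table(
    attr_contexts=["Author:", "by"],
    all_contexts=["Author:", "by", "Price:", "by"],
)
score_by = context_score(author_table, "by")
score_price = context_score(author_table, "Price:")
```

A table like this stays loose on purpose: an ambiguous token such as "by" still gives partial evidence, in line with the recall-oriented classifiers described in the main idea.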

Inter-Attribute Layout
• Construct a K×K layout matrix from layout features
• Encode pairwise distances between K attributes, using the layout features of the text nodes for attributes a_1 … a_K
• Help verify combinations of attributes
• Example: layout matrices from 5 websites in Book vertical, over title, author, ISBN-13, publisher, publish-date
[Figure: K×K layout matrices (darker → closer); some attribute pairs tend to be close]
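One way to picture the K×K matrix is the sketch below, which assumes each attribute is summarized by the mean visual position of its text nodes and that "distance" means Euclidean distance between those means. Both choices are illustrative assumptions, as is the function name `layout_matrix`. It uses `math.dist`, available from Python 3.8.

```python
import math

def layout_matrix(positions):
    """Build a K x K matrix of distances between attribute positions.

    positions: dict mapping attribute name -> list of (x, y) visual
               positions of its labeled text nodes on the seed site.
    Returns the attribute order and the symmetric distance matrix.
    """
    attrs = list(positions)
    centers = {}
    for attr in attrs:
        xs = [x for x, _ in positions[attr]]
        ys = [y for _, y in positions[attr]]
        centers[attr] = (sum(xs) / len(xs), sum(ys) / len(ys))
    matrix = [[math.dist(centers[a], centers[b]) for b in attrs]
              for a in attrs]
    return attrs, matrix

# Toy seed site: title and author sit close together, price further down.
attrs, matrix = layout_matrix({
    "title": [(0, 0), (0, 0)],
    "author": [(3, 4)],
    "price": [(0, 10)],
})
```

Smaller entries correspond to the darker cells in the example matrices.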

(c) Vertical Knowledge Adaptation
• Page-Level Semantic Prediction
• Inter-Page Aggregation
• Inter-Attribute Re-ranking

Page-Level Semantic Prediction
• Attribute-specific semantics → page-level candidates
[Figure: text nodes on Page 1, Page 2, and Page 3 annotated with predicted relevance scores between 0.1 and 0.9 for attributes A, B, and C]
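Candidate selection on a single page can be sketched as a loose filter over predicted scores. The threshold and the cap on candidates (`threshold`, `top_k`) are hypothetical parameters introduced for illustration; the slide only says that attribute-specific semantics produce page-level candidates.

```python
def page_level_candidates(node_scores, threshold=0.5, top_k=3):
    """Keep the best-scoring text nodes of one page for one attribute.

    node_scores: dict mapping a text node id -> predicted relevance in [0, 1].
    A loose threshold keeps recall high; later steps prune false alarms.
    """
    kept = [(node, score) for node, score in node_scores.items()
            if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

page_scores = {"n1": 0.7, "n2": 0.9, "n3": 0.1, "n4": 0.5}
candidates = page_level_candidates(page_scores)
```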

Inter-Page Aggregation
• For each attribute: multiple candidates per page, mixing the ground truth with false alarms
• Align & aggregate candidates across pages into a site-level “page”
  – Count=3, Mean=0.8 → true attribute occurrences
  – Count=1, Mean=0.9 → infrequent noise
  – Count=3, Mean=0.4 → occasional false prediction
[Figure: candidates on Page 1, Page 2, and Page 3 with scores 0.7, 0.8, 0.9 for the ground truth and 0.9, 0.1, 0.2, 0.9 for false alarms]
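The align-and-aggregate step can be sketched as follows, assuming candidates on different pages are aligned when they share a DOM path and that aligned groups are ranked by the sum of their scores (count × mean). Both the alignment key and the ranking rule are assumptions for illustration, and the paths below are invented.

```python
from collections import defaultdict

def aggregate(pages):
    """Aggregate one attribute's candidates across the pages of a site.

    pages: list of dicts, one per page, mapping DOM path -> score.
    Returns {path: (count, mean score)}.
    """
    scores = defaultdict(list)
    for page in pages:
        for path, score in page.items():
            scores[path].append(score)
    return {path: (len(vals), sum(vals) / len(vals))
            for path, vals in scores.items()}

def rank(stats):
    """Order aligned groups by count * mean, highest first."""
    return sorted(stats, key=lambda p: stats[p][0] * stats[p][1], reverse=True)

TRUE_PATH = "/html/body/div/h1/text"
FALSE_PATH = "/html/body/div/span/text"
NOISE_PATH = "/html/body/div/p/text"

pages = [
    {TRUE_PATH: 0.7, FALSE_PATH: 0.9},
    {TRUE_PATH: 0.8, FALSE_PATH: 0.1},
    {TRUE_PATH: 0.9, FALSE_PATH: 0.2, NOISE_PATH: 0.9},
]
stats = aggregate(pages)
ranking = rank(stats)
```

With the scores from the slide, the consistently predicted path (Count=3, Mean=0.8) outranks both the one-off high score and the path that scored well on a single page only.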

Inter-Attribute Re-ranking
• Multiple possible solutions (attribute combinations)
• Re-rank candidates by measuring similarity to the inter-attribute layout learnt from the seed site
[Figure: Solution 1, Solution 2, and Solution 3 compared with the seed-site layout, with Sim=0.6, Sim=0.5, and Sim=0.9]
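Re-ranking can be sketched by comparing each candidate solution's layout matrix with the seed site's matrix. Cosine similarity over the flattened matrices is used here only as a stand-in, since the slide does not name the similarity measure; it has the convenient property of ignoring overall page scale. The function names and toy matrices are hypothetical.

```python
import math

def cosine(matrix_a, matrix_b):
    """Cosine similarity between two equally sized matrices."""
    flat_a = [v for row in matrix_a for v in row]
    flat_b = [v for row in matrix_b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = (math.sqrt(sum(x * x for x in flat_a))
            * math.sqrt(sum(y * y for y in flat_b)))
    return dot / norm if norm else 0.0

def rerank(seed_matrix, solutions):
    """Sort candidate solutions by similarity to the seed layout.

    solutions: dict mapping a solution name -> its layout matrix.
    Returns a list of (name, similarity), most similar first.
    """
    scored = [(cosine(seed_matrix, matrix), name)
              for name, matrix in solutions.items()]
    scored.sort(reverse=True)
    return [(name, sim) for sim, name in scored]

seed = [[0, 1], [1, 0]]
solutions = {
    "same layout, larger page": [[0, 5], [5, 0]],
    "distorted layout": [[0, 1], [3, 0]],
}
ranked = rerank(seed, solutions)
```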

Summary: Flowchart of Features
[Flowchart nodes, row by row as they appear on the slide:]
• Content Features, Attribute-Specific Semantics, Page-Level Semantic Prediction
• Context Features, Inter-Page Alignment, Inter-Page Aggregation
• Layout Features, Inter-Attribute Layout, Inter-Attribute Re-ranking


A Large-Scale Dataset
(Publicly available at http://swde.codeplex.com)
• 8 verticals with diverse semantics
• 80 websites (10 per vertical)
• 124,291 pages (200~2,000 per website)
• 32 attributes (3~5 per vertical) with labeled ground-truth

Vertical      Attributes
Autos         model, price, engine, fuel_economy
Books         title, author, ISBN, publisher, pub_date
Cameras       model, price, manufacturer
Jobs          title, company, location, date
Movies        title, director, genre, rating
NBA players   name, team, height, weight
Restaurants   name, address, phone, cuisine
Universities  name, phone, website, type

Experimental Settings
• Methods
  1. SSM (Stacked Skews Model), Carlson et al., ECML’08
  2. PL (page-level semantic prediction)
  3. PL + IP (inter-page aggregation)
  4. PL + IP + IA (inter-attribute re-ranking)
• One seed site (by turns), test on other sites
• Performance metrics: precision & recall
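For reference, precision and recall for one attribute on one test site can be computed as in the sketch below. Counting an extraction as correct only on an exact match with the ground-truth value is an assumption; the slides do not state the matching criterion, and the function name is hypothetical.

```python
def precision_recall(extracted, truth):
    """Precision and recall of extracted values for one attribute.

    extracted: dict mapping page id -> extracted value (pages with no
               extraction are simply absent).
    truth:     dict mapping page id -> ground-truth value.
    """
    correct = sum(1 for page, value in extracted.items()
                  if truth.get(page) == value)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

truth = {"p1": "A", "p2": "B", "p3": "C", "p4": "D"}
extracted = {"p1": "A", "p2": "X", "p3": "C"}
precision, recall = precision_recall(extracted, truth)
```

Here two of three extractions are correct and two of four true values are found, giving precision 2/3 and recall 1/2.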

Performance


Performance (contd.)


Performance: Multiple Seed Sites
• Our solution with multiple seed sites
  – Take the solution with highest confidence score
• Our solution with (one seed + bootstrapping seeds)
• SSM with multiple seed sites

Summary
• A unified solution for structured data extraction
  – Minimal human effort: labeling one site per vertical
  – Flexible for various verticals & attributes
• A large-scale dataset (has been published online)
  – 124K pages from 80 websites in 8 verticals
• Promising performance
  – Precision ≥ 80%, Recall ≥ 80% for most verticals
• Future work
  – Bootstrapping: accumulate vertical knowledge incrementally

Dataset available at http://swde.codeplex.com

