IM-barton-big'14.pdf


Building an Analytic Platform for The Web

Stanislav Barton, Julien Masanès, Philippe Rigaux
Internet Memory
Big Data Innovators Gathering, BIG 2014

Big Data on the Web
•  Web: largest corpus of societal data
•  People are putting everything (too much) on the web today
•  Sometimes fenced
•  Holistic view?

http://www.worldwidewebsize.com/

Who can do research on Webscale data?


Data Market Places

Web Archives
•  Some are large (web-scale)
•  Two types of use
   –  human browsing and searching
   –  machine processing
•  Well-organized
•  But
   –  Connecting via a data link doesn’t scale
   –  Data overwhelms users if disks are shipped
   –  Raw web resources are hard to use in most cases

Analytic Platform for The Web
•  Mignify started as a platform for managing large-scale web archives (Crawl, Reorganize, Index, Store) based on HBase and Hadoop
•  Main mission was to create a sustainable data platform
   –  Cheap
   –  Scaling out
   –  Extendable

Web Archive Management Platform

Content Access Server

Data organization
•  Resources (URL) and their versions (snapshot timestamp), R=
•  Resource version has features V={F1, … ,Fm}
•  Features organized in sets (content:*, meta:*); sets are physically stored together
•  Reversed URL as primary index
•  Column families as “secondary indices”
•  Values stored are typed
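The layout above can be illustrated with a minimal sketch. It assumes a reversed-URL key that flips the host labels and keeps the path, and it models one resource as nested mappings (row key, then snapshot timestamp, then "set:feature" names). The function names, the timestamp format, and the feature names other than the content:* and meta:* prefixes are hypothetical, not taken from Mignify.

```python
from typing import Any, Dict
from urllib.parse import urlsplit


def reversed_url_key(url: str) -> str:
    """Build a row key with host labels reversed, so pages of one domain sort together."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    key = ".".join(reversed(host.split(".")))
    if parts.port:
        key += f":{parts.port}"
    key += parts.path or "/"
    if parts.query:
        key += "?" + parts.query
    return key


def features_in_set(version: Dict[str, Any], feature_set: str) -> Dict[str, Any]:
    """Return only the features of one set, e.g. 'meta' for all meta:* features."""
    prefix = feature_set + ":"
    return {name: value for name, value in version.items() if name.startswith(prefix)}


# Hypothetical example data: one resource with a single snapshot version.
key = reversed_url_key("http://www.example.org/news/index.html")
table = {
    key: {
        20140401120000: {                     # snapshot timestamp (assumed format)
            "content:raw": b"<html>...</html>",
            "meta:mime": "text/html",
            "meta:language": "en",
        }
    }
}
meta = features_in_set(table[key][20140401120000], "meta")
```

With such keys, a lexicographic scan over a prefix like `org.example.` touches all hosts of one domain in a single contiguous range, which is the usual motivation for reversing URLs in a sorted key-value store.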

Data preparation - Workflow Example
Read Crawler output:
1.  Take Resource:content and detect mime type, encoding, language (Characterize)
2.  Extract plaintext and drop boilerplate (from PDFs, MS docs, HTML, RSS – Transform)
3.  Reorder & Reorganize
4.  Fill auxiliary DBs, Indices (Store - Preserve)
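Steps 1 and 2 of the workflow can be sketched for the HTML case only. This is a simplified stand-in, not the platform's code: the mime check is a naive prefix test, the encoding is assumed to be UTF-8 rather than detected, and "boilerplate removal" is reduced to skipping script and style bodies. All function and feature names are hypothetical. Steps 3 and 4 are not shown.

```python
from html.parser import HTMLParser
from typing import Dict


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def characterize(raw: bytes) -> Dict[str, str]:
    """Step 1 (Characterize): naive mime sniffing on the first bytes."""
    head = raw[:64].lstrip().lower()
    is_html = head.startswith((b"<!doctype html", b"<html"))
    return {"meta:mime": "text/html" if is_html else "application/octet-stream"}


def transform(raw: bytes, mime: str) -> Dict[str, str]:
    """Step 2 (Transform): extract plaintext from HTML; other types are left alone."""
    if mime != "text/html":
        return {}
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))  # UTF-8 is an assumption
    parser.close()
    return {"content:plaintext": " ".join(parser.chunks)}


def prepare(raw: bytes) -> Dict[str, object]:
    """Run Characterize then Transform and return the features of one version."""
    features: Dict[str, object] = {"content:raw": raw}
    features.update(characterize(raw))
    features.update(transform(raw, str(features["meta:mime"])))
    return features


page = b"<html><body><script>x=1</script><p>Hello web</p></body></html>"
prepared = prepare(page)
```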

Extraction Views

Processor – an algorithm (3rd party) that takes values as input and produces values as output; inputs and outputs are annotated
Extractor – Processor + Mapping of ResourceVersions Features
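One way to read these two definitions is sketched below. It assumes that "annotated" means the inputs and outputs are named, and that the mapping binds those names to feature names of a resource version. The class layout, the `apply` method, and the word-count processor are hypothetical illustrations, not Mignify's actual interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class Processor:
    """A third-party algorithm with named (annotated) inputs and outputs."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    fn: Callable[..., Dict[str, Any]]


@dataclass
class Extractor:
    """A Processor plus a mapping between its input/output names and features."""
    processor: Processor
    input_map: Dict[str, str]   # processor input name -> feature to read
    output_map: Dict[str, str]  # processor output name -> feature to write

    def apply(self, version: Dict[str, Any]) -> Dict[str, Any]:
        """Read the mapped features, run the processor, return new features."""
        args = {name: version[self.input_map[name]] for name in self.processor.inputs}
        result = self.processor.fn(**args)
        return {self.output_map[name]: result[name] for name in self.processor.outputs}


# Hypothetical processor: counts whitespace-separated words.
word_counter = Processor(
    name="word_count",
    inputs=("text",),
    outputs=("count",),
    fn=lambda text: {"count": len(text.split())},
)
extractor = Extractor(
    processor=word_counter,
    input_map={"text": "content:plaintext"},
    output_map={"count": "meta:word_count"},
)
derived = extractor.apply({"content:plaintext": "big data on the web"})
```

Keeping the mapping outside the processor means the same algorithm can be reused against different features without changing its code.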

Processing Engine
[Diagram; recoverable labels: Archive File, Data File, HBase, Map, Reduce, Views, Oozie, Co-scanner]

Analytic Platform for The Web II
•  Mignify started as a platform for managing large-scale web archives (Crawl, Reorganize, Index, Store) based on HBase and Hadoop
•  Once the pages were at hand, it was tempting to do more than just organize them
•  Further preprocessing (data cleansing beyond metadata extraction: microformats, OpenGraph, etc.), with data stored near the original data in separate column families
•  Ready to post-process and derive data, but always on the “whole collection”? (CF as a secondary index)

Analytic Platform for The Web - Mignify
•  Build secondary indices on the features (Elasticsearch)
•  Alter the processing engine to accept a list of Resources (Versions) rather than the resources themselves
•  Allow third parties to introduce their own Processors (algorithm marketplace)
•  Keep the costs down
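The first two points describe a select-then-process pattern: query a secondary index for matching resource IDs, then run the processing only on that list instead of the whole collection. The sketch below illustrates the idea with an in-memory dictionary standing in for both the store and the Elasticsearch index. All names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

Store = Dict[str, Dict[str, Any]]  # row key -> features of one resource version


def build_secondary_index(store: Store, feature: str) -> Dict[Any, List[str]]:
    """Map each value of one feature to the row keys that carry it."""
    index: Dict[Any, List[str]] = defaultdict(list)
    for row_key, features in store.items():
        if feature in features:
            index[features[feature]].append(row_key)
    return index


def process_selected(
    store: Store,
    resource_ids: Iterable[str],
    processor: Callable[[Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Run a processor only on the listed resources, not on the whole store."""
    return {rid: processor(store[rid]) for rid in resource_ids}


# Hypothetical sample collection.
store: Store = {
    "org.example.www/a": {"meta:language": "en", "content:plaintext": "big data on the web"},
    "org.example.www/b": {"meta:language": "fr", "content:plaintext": "archives du web"},
    "com.example.shop/c": {"meta:language": "en", "content:plaintext": "web archives"},
}
language_index = build_secondary_index(store, "meta:language")
english_ids = language_index["en"]
word_counts = process_selected(
    store, english_ids, lambda features: len(features["content:plaintext"].split())
)
```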

Architecture
[Diagram; recoverable labels: Resource IDs, Resources, Extraction Platform, Data push, HDFS, Elasticsearch]

Mignify – Hardware Platform
•  Large number of small nodes (ATX i3-i7, 16-32GB RAM, 4x4TB disks)
•  Nodes have no enclosures
•  1 column of 60 nodes – theoretically 960TB
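The 960TB figure follows from the node specification: 60 nodes, each with 4 disks of 4TB. The short check below reproduces it. The replication factor of 3 in the second function is an assumption (it is the HDFS default), since the slides do not state the setting actually used.

```python
def raw_capacity_tb(nodes: int, disks_per_node: int, tb_per_disk: int) -> int:
    """Raw disk capacity of one column of nodes, in TB."""
    return nodes * disks_per_node * tb_per_disk


def usable_capacity_tb(raw_tb: int, replication: int) -> float:
    """Capacity left once every block is stored `replication` times."""
    return raw_tb / replication


column_raw = raw_capacity_tb(nodes=60, disks_per_node=4, tb_per_disk=4)
# Assumption: replication factor 3; the deployed value is not given in the slides.
column_usable = usable_capacity_tb(column_raw, replication=3)
```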

Example Use Cases
•  Select a subset of the web archive -> run analytics/processing
•  Addresses from pages via microformats
   –  Get all pages from given area
•  Product/Company names via keywords + language, mime type, dates
   –  Sentiment analysis on result
•  Source classification
   –  Limit search by type of content holder (forums, blogs, news, etc.)
•  Entity recognition
   –  Focus on
•  …

[Diagram; recoverable labels: Final Users, Technology Bricks, Mignify, extractors, data]

Data for Data
•  Even small services can benefit from the large-numbers effect
   –  Language models
   –  N-gram statistics
   –  Semantic query expansion

Conclusions & Future Work
•  Allowing small players to play with big data
•  Most of our clients are start-ups
•  Will MR suffice?
   –  Stratosphere integration (e.g., Map, Map, Reduce)
•  Beyond Resource-oriented computations
   –  Aggregation Queries on Custom data

Q & A
•  Thank you for your attention.
•  Any questions?

Acknowledgements
•  This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
