Building an Analytic Platform for the Web
Stanislav Barton, Julien Masanès, Philippe Rigaux
Internet Memory
Big Data Innovators Gathering, BIG 2014
Big Data on the Web
• Web: largest corpus of societal data
• People are putting everything (too much) on the web today
• Sometimes fenced
• Holistic view?
http://www.worldwidewebsize.com/
Who can do research on web-scale data?
Data Market Places
Web Archives
• Some are large (web-scale)
• Two types of use
  – human browsing and searching
  – machine processing
• Well-organized
• But
  – Connecting via a data link doesn't scale
  – Data overwhelms if disks are shipped
  – Raw web resources are hard to use in most cases
Analytic Platform for the Web
• Mignify started as a platform for managing large-scale web archives (Crawl, Reorganize, Index, Store) based on HBase and Hadoop
• The main mission was to create a sustainable data platform
  – Cheap
  – Scaling out
  – Extendable
Web Archive Management Platform
Content Access Server
Data organization
• Resources (URL) and their versions (snapshot timestamp), R=
• A resource version has features V={F1, … ,Fm}
• Features are organized in sets (content:*, meta:*); sets are physically stored together
• Reversed URL as primary index
• Column families as “secondary indices”
• Stored values are typed
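To illustrate why a reversed URL works as a primary index, here is a minimal sketch of a row-key builder. The exact key layout used by Mignify is not given on the slide, so the function name `reversed_url_key` and the choice to reverse only the host labels are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def reversed_url_key(url: str) -> str:
    """Build a row key whose host labels are reversed (hypothetical layout).

    Reversing the host puts the top-level domain first, so all pages of one
    domain (and its subdomains) become neighbours in a sorted key space.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lower-cased by urlsplit
    reversed_host = ".".join(reversed(host.split(".")))
    key = reversed_host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def cell_address(url: str, timestamp: int, family: str, qualifier: str):
    """Address of one typed value: (row key, feature set:feature, version)."""
    return (reversed_url_key(url), f"{family}:{qualifier}", timestamp)
```

Sorting such keys groups resources by domain, which is what makes range scans over a single site cheap in a sorted store such as HBase.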
Data preparation: workflow example
Read crawler output:
1. Take Resource:content and detect MIME type, encoding, language (Characterize)
2. Extract plaintext and drop boilerplate from PDFs, MS docs, HTML, RSS (Transform)
3. Reorder & reorganize
4. Fill auxiliary DBs and indices (Store, Preserve)
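The first two workflow steps can be sketched as a chain of functions over one resource record. This is only an illustration: the feature names (`content:raw`, `meta:mime`, `meta:encoding`, `content:plaintext`), the naive HTML sniffing, and the fixed UTF-8 encoding are all assumptions, and tag stripping stands in for real boilerplate removal.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def characterize(record: dict) -> dict:
    """Step 1 (Characterize): naive MIME sniffing on the first bytes."""
    head = record["content:raw"][:512].lower()
    is_html = b"<html" in head or b"<!doctype html" in head
    record["meta:mime"] = "text/html" if is_html else "application/octet-stream"
    record["meta:encoding"] = "utf-8"  # assumption: no real charset detection
    return record


def transform(record: dict) -> dict:
    """Step 2 (Transform): extract plaintext from HTML resources only."""
    if record["meta:mime"] == "text/html":
        parser = _TextExtractor()
        text = record["content:raw"].decode(record["meta:encoding"], errors="replace")
        parser.feed(text)
        parser.close()
        record["content:plaintext"] = " ".join(parser.chunks)
    return record


def run_workflow(record: dict, steps=(characterize, transform)) -> dict:
    """Apply the steps in order; each step adds features to the record."""
    for step in steps:
        record = step(record)
    return record
```

Steps 3 and 4 (reorganizing and filling auxiliary stores) depend on the storage layer and are left out of the sketch.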
Extraction Views
Processor: an algorithm (3rd party) that takes values as input and produces values as output; inputs and outputs are annotated
Extractor: a Processor plus a mapping of ResourceVersion features
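One way to read these two definitions is sketched below. The class layout, the use of Python types as annotations, and the two mapping dictionaries (feature to processor input, processor output to feature) are assumptions for illustration, not the platform's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Processor:
    """A third-party algorithm with annotated inputs and outputs."""
    name: str
    inputs: Dict[str, type]    # argument name -> expected type
    outputs: Dict[str, type]   # result name -> produced type
    fn: Callable[..., Dict[str, object]]


@dataclass
class Extractor:
    """A Processor bound to the features of a resource version."""
    processor: Processor
    input_map: Dict[str, str]   # processor argument -> feature name
    output_map: Dict[str, str]  # processor result -> feature name

    def apply(self, version: Dict[str, object]) -> Dict[str, object]:
        # Pull the mapped features out of the resource version.
        kwargs = {arg: version[feat] for arg, feat in self.input_map.items()}
        # Check the values against the processor's input annotations.
        for arg, expected in self.processor.inputs.items():
            if not isinstance(kwargs[arg], expected):
                raise TypeError(f"{arg} must be {expected.__name__}")
        result = self.processor.fn(**kwargs)
        # Rename results to the features they should be stored under.
        return {self.output_map[out]: val for out, val in result.items()}


# Hypothetical example: a word counter mapped onto plaintext.
word_counter = Processor(
    name="word_count",
    inputs={"text": str},
    outputs={"count": int},
    fn=lambda text: {"count": len(text.split())},
)
extractor = Extractor(
    word_counter,
    input_map={"text": "content:plaintext"},
    output_map={"count": "meta:word_count"},
)
```

Keeping the mapping outside the algorithm means the same Processor can be reused against differently named features.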
Processing Engine
[Figure: processing engine diagram. Labels: Archive File, Data File, HBase, Map, Reduce, Views, Oozie, Co-scanner]
Analytic Platform for the Web II
• Mignify started as a platform for managing large-scale web archives (Crawl, Reorganize, Index, Store) based on HBase and Hadoop
• With the pages at hand, it was tempting to do more than just organize them
• Further preprocessing (data cleansing beyond metadata extraction: microformats, OpenGraph, etc.), with data stored near the original data in separate column families
• Ready to post-process and derive data, but always on the “whole collection”? (CF as a secondary index)
Analytic Platform for the Web: Mignify
• Build secondary indices on the features (Elasticsearch)
• Alter the processing engine to accept a list of Resources (Versions) rather than the resources themselves
• Allow third parties to introduce their own Processors (algorithm marketplace)
• Keep the costs down
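The shift from whole-collection scans to processing a selected list of resources can be sketched as follows. An in-memory dictionary stands in for the secondary index (no Elasticsearch API is used here), and the function names and feature names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def build_feature_index(store: Dict[str, dict], feature: str) -> Dict[object, Set[str]]:
    """Map each value of one feature to the IDs of resources carrying it."""
    index = defaultdict(set)
    for resource_id, features in store.items():
        if feature in features:
            index[features[feature]].add(resource_id)
    return index


def process_resources(
    store: Dict[str, dict],
    resource_ids: Iterable[str],
    processor: Callable[[dict], object],
) -> Dict[str, object]:
    """Run a processor only on the listed resources, not the whole store."""
    return {rid: processor(store[rid]) for rid in sorted(resource_ids)}
```

A query against the index yields resource IDs, and only those rows are handed to the processing engine, so the cost follows the size of the selection rather than the size of the archive.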
Architecture
[Figure: architecture diagram. Labels: Resources, Resource IDs, Extraction Platform, Data push, HDFS, Elasticsearch]
Mignify: Hardware Platform
• Large number of small nodes (ATX, i3-i7, 16-32 GB RAM, 4x4 TB disks)
• Nodes have no enclosures
• 1 column of 60 nodes: theoretically 960 TB
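The 960 TB figure follows from the node count and the disk layout. The short calculation below reproduces it; the `replication` parameter is an added assumption (the slide does not say how much of the raw capacity is usable), shown with HDFS's default replication factor of 3 as an example.

```python
def raw_capacity_tb(nodes: int, disks_per_node: int, disk_tb: int) -> int:
    """Raw disk capacity of one column of nodes, in TB."""
    return nodes * disks_per_node * disk_tb


def usable_capacity_tb(raw_tb: int, replication: int = 3) -> float:
    """Capacity left if every block is stored `replication` times (assumption)."""
    return raw_tb / replication


raw = raw_capacity_tb(nodes=60, disks_per_node=4, disk_tb=4)  # 60 * 4 * 4 = 960
```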
Example Use Cases
• Select a subset of the web archive -> run analytics/processing
• Addresses from pages via microformats
  – Get all pages from a given area
• Product/company names via keywords + language, MIME type, dates
  – Sentiment analysis on the result
• Source classification
  – Limit search by type of content holder (forums, blogs, news, etc.)
• Entity recognition
  – Focus on
• …
[Figure: diagram. Labels: Final Users, Technology Bricks, Mignify, extractors, data]
Data for Data
• Even small services can benefit from the large-numbers effect
  – Language models
  – N-gram statistics
  – Semantic query expansion
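As a small illustration of the N-gram statistics mentioned above, the sketch below counts word n-grams over a handful of texts. Whitespace tokenization and lower-casing are simplifying assumptions; at archive scale the same counting would run as a distributed job rather than in one process.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(texts: Iterable[str], n: int = 2) -> Counter:
    """Count word n-grams across all texts (naive whitespace tokens)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n shifted views of the token list yields each n-gram.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

The counts only become useful as a language model or for query expansion once the corpus is large, which is the point of the slide.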
Conclusions & Future Work
• Allowing small players to play with big data
• Most of our clients are start-ups
• Will MapReduce (MR) suffice?
  – Stratosphere integration (e.g., Map, Map, Reduce)
• Beyond resource-oriented computations
  – Aggregation queries on custom data
Q & A
• Thank you for your attention.
• Any questions?
Acknowledgements
• This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).