Building an Analytic Platform for The Web

Stanislav Barton, Julien Masanès, Philippe Rigaux
Internet Memory
Big Data Innovators Gathering, BIG 2014

Big Data on the Web
• Web: largest corpus of societal data
• People are putting everything (too much) on the web today
• Sometimes fenced
• Holistic view?

http://www.worldwidewebsize.com/

Who can do research on Web-scale data?


Data Market Places

Web Archives
• Some are large (web-scale)
• Two types of use
  – human browsing and searching
  – machine processing
• Well-organized
• But
  – Connecting via a data link doesn't scale
  – The data volume is overwhelming if disks are shipped
  – Raw web resources are in most cases hard to use

Analytic Platform for The Web
• Mignify started as a platform for managing large-scale web archives (Crawl, Reorganize, Index, Store) based on HBase and Hadoop
• Main mission was to create a sustainable data platform
  – Cheap
  – Scaling out
  – Extendable

Web Archive Management Platform

[Diagram label: Content Access Server]

Data organization
• Resources (URL) and their versions (snapshot timestamp), R=
• Each resource version has features V={F1, …, Fm}
• Features are organized in sets (content:*, meta:*); sets are physically stored together
• Reversed URL as primary index
• Column families as “secondary indices”
• Stored values are typed
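The reversed-URL primary index can be sketched as follows. This is a minimal illustration only: the function name `reversed_url_key`, the exact key layout (reversed host labels followed by path and query), and the sample feature names are assumptions, not Mignify's documented schema.

```python
from urllib.parse import urlsplit

def reversed_url_key(url: str) -> str:
    """Build a row key with the host labels reversed (hypothetical layout)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path

# Illustrative resource: versions keyed by snapshot timestamp, each holding
# features grouped into sets (content:*, meta:*) as on the slide.
resource = {
    "20140301120000": {"content:raw": b"<html>v1</html>", "meta:mime": "text/html"},
    "20140401120000": {"content:raw": b"<html>v2</html>", "meta:mime": "text/html"},
}

# Sorting by reversed key places pages of the same domain next to each other.
keys = sorted(reversed_url_key(u) for u in [
    "http://www.example.com/a",
    "http://example.org/c",
    "http://blog.example.com/b",
])
```

With keys ordered this way, a range scan over a prefix such as `com.example.` would touch all hosts of one domain in a single pass, which is the usual motivation for reversing the host part.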

Data preparation - Workflow Example
Read Crawler output:
1. Take Resource:content and detect MIME type, encoding, language (Characterize)
2. Extract plaintext, dropping boilerplate (from PDFs, MS docs, HTML pages, RSS feeds) (Transform)
3. Reorder & Reorganize
4. Fill auxiliary DBs, Indices (Store - Preserve)
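The workflow steps above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: MIME detection is reduced to a magic-byte check, language detection and the reorder step are omitted, only HTML is transformed, and the `store` dictionary stands in for the real database. All function and feature names are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def characterize(content: bytes) -> dict:
    """Step 1: crude MIME and encoding detection from the raw bytes."""
    if content.startswith(b"%PDF"):
        mime = "application/pdf"
    elif b"<html" in content[:1024].lower():
        mime = "text/html"
    else:
        mime = "application/octet-stream"
    try:
        content.decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        encoding = "latin-1"  # fallback that can decode any byte sequence
    return {"meta:mime": mime, "meta:encoding": encoding}

def transform(content: bytes, meta: dict) -> dict:
    """Step 2: extract plain text (HTML only in this sketch)."""
    if meta["meta:mime"] != "text/html":
        return {}
    parser = _TextExtractor()
    parser.feed(content.decode(meta["meta:encoding"]))
    parser.close()  # flush any buffered trailing text
    return {"content:text": " ".join(parser.chunks)}

def prepare(url: str, content: bytes, store: dict) -> None:
    """Characterize, transform and store one crawled resource."""
    features = {"content:raw": content}
    features.update(characterize(content))
    features.update(transform(content, features))
    store[url] = features

store = {}
prepare("http://example.com/",
        b"<html><body><script>x=1</script><p>Hello web</p></body></html>",
        store)
```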

Extraction Views

Processor – an algorithm (3rd party) that takes values on input and produces values on output; inputs and outputs are annotated
Extractor – a Processor plus a mapping of ResourceVersion features
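The two definitions above can be illustrated with a small sketch. The class layout, the use of Python types as annotations, and the `count_words` example are assumptions made for illustration; the slides do not describe the actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Processor:
    """An algorithm with annotated (named, typed) inputs and outputs."""
    func: Callable[..., dict]
    inputs: Dict[str, type]
    outputs: Dict[str, type]

    def run(self, **values):
        # Check each input against its annotation before calling the algorithm.
        for name, typ in self.inputs.items():
            if not isinstance(values[name], typ):
                raise TypeError(f"input {name!r} must be {typ.__name__}")
        return self.func(**values)

@dataclass
class Extractor:
    """A Processor plus a mapping between version features and its parameters."""
    processor: Processor
    input_map: Dict[str, str]   # processor input name -> feature name
    output_map: Dict[str, str]  # processor output name -> feature name

    def apply(self, version: dict) -> dict:
        args = {param: version[feat] for param, feat in self.input_map.items()}
        result = self.processor.run(**args)
        return {self.output_map[name]: value for name, value in result.items()}

def count_words(text: str) -> dict:
    """Hypothetical third-party algorithm."""
    return {"n_words": len(text.split())}

word_counter = Processor(count_words, inputs={"text": str}, outputs={"n_words": int})
extractor = Extractor(word_counter,
                      input_map={"text": "content:text"},
                      output_map={"n_words": "meta:word_count"})

version = {"content:text": "largest corpus of societal data"}
new_features = extractor.apply(version)
```

Keeping the mapping outside the Processor means the same algorithm could be bound to different features without modification.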

Processing Engine

[Diagram labels: Archive File, Data File, HBase, Map, Reduce, Views, Oozie, Co-scanner]

Analytic Platform for The Web II
• Mignify started as a platform for managing large-scale web archives (Crawl, Reorganize, Index, Store) based on HBase and Hadoop
• Once having pages at hand, it was tempting to do more than just organize
• Further preprocessing (data cleansing beyond metadata extraction: microformats, OpenGraph, etc.), data stored near the original data in separate column families
• Ready to post-process and derive data, but always on the “whole collection”? (CF as a secondary index)

Analytic Platform for The Web - Mignify
• Build secondary indices on the features (Elasticsearch)
• Alter the processing engine to accept a list of Resources (Versions) rather than the resources themselves
• Allow third parties to introduce their own Processors (algorithm marketplace)
• Keep the costs down
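The idea of selecting resources through a secondary index and handing only that list to the processing engine can be sketched as below. A plain dictionary stands in for Elasticsearch and another for the HBase table; the names `build_index` and `process` and the sample data are hypothetical.

```python
from collections import defaultdict

# Toy archive: resource ID -> features (stand-in for the HBase table).
archive = {
    "com.example.www/a": {"meta:lang": "en", "content:text": "cheap flights to Paris"},
    "com.example.www/b": {"meta:lang": "fr", "content:text": "vols pas chers"},
    "org.example/c": {"meta:lang": "en", "content:text": "Paris travel guide"},
}

def build_index(archive, feature):
    """Secondary index: feature value -> set of resource IDs."""
    index = defaultdict(set)
    for rid, features in archive.items():
        index[features[feature]].add(rid)
    return index

def process(archive, resource_ids, processor):
    """Run `processor` only on the listed resources, not the whole collection."""
    return {rid: processor(archive[rid]) for rid in sorted(resource_ids)}

lang_index = build_index(archive, "meta:lang")
selected = lang_index["en"]  # the list of resource IDs passed to the engine
results = process(archive, selected, lambda f: len(f["content:text"].split()))
```

Only the two English pages are read here, which is the point of the change: processing cost follows the size of the selection rather than the size of the archive.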

Architecture

[Diagram labels: Resource IDs, Resources, Extraction Platform, Data push, HDFS, Elasticsearch]

Mignify – Hardware Platform
• Large number of small nodes (ATX, i3-i7, 16-32 GB RAM, 4x4 TB disks)
• Nodes have no enclosures
• 1 column of 60 nodes – theoretically 960 TB
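The 960 TB figure follows directly from the node specification. The short calculation below reproduces it and, as an assumption only, shows what would remain for distinct data if blocks were kept at HDFS's default replication factor of 3; the slides do not state which factor the cluster uses.

```python
NODES_PER_COLUMN = 60
DISKS_PER_NODE = 4
TB_PER_DISK = 4

# Raw capacity of one column: 60 nodes x 4 disks x 4 TB.
raw_tb = NODES_PER_COLUMN * DISKS_PER_NODE * TB_PER_DISK

# Assumption: three copies of every block (HDFS default replication factor).
REPLICATION = 3
usable_tb = raw_tb / REPLICATION
```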

Example Use Cases
• Select a subset of the web archive -> run analytics/processing
• Addresses from pages via microformats
  – Get all pages from a given area
• Product/Company names via keywords + language, MIME type, dates
  – Sentiment analysis on the result
• Source classification
  – Limit search by type of content holder (forums, blogs, news, etc.)
• Entity recognition
  – Focus on
• …
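The product/company use case combines several criteria into one selection. A minimal sketch, assuming in-memory records with hypothetical field names (`lang`, `mime`, `crawled`, `text`) in place of the real indexed features:

```python
from datetime import date

pages = [
    {"id": "com.example.shop/p1", "lang": "en", "mime": "text/html",
     "crawled": date(2014, 1, 10), "text": "The Acme phone is great"},
    {"id": "com.example.shop/p2", "lang": "de", "mime": "text/html",
     "crawled": date(2014, 1, 12), "text": "Das Acme Telefon"},
    {"id": "com.example.docs/m.pdf", "lang": "en", "mime": "application/pdf",
     "crawled": date(2014, 2, 1), "text": "Acme manual"},
    {"id": "org.example.blog/x", "lang": "en", "mime": "text/html",
     "crawled": date(2013, 6, 1), "text": "Acme rumours"},
]

def select(pages, keyword, lang, mime, start, end):
    """Return IDs of pages matching all criteria (keyword is case-insensitive)."""
    return [
        p["id"] for p in pages
        if keyword.lower() in p["text"].lower()
        and p["lang"] == lang
        and p["mime"] == mime
        and start <= p["crawled"] <= end
    ]

subset = select(pages, "acme", "en", "text/html",
                date(2014, 1, 1), date(2014, 12, 31))
```

The resulting ID list is what a follow-up step such as sentiment analysis would then consume.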

[Diagram labels: Final Users, Technology Bricks, Mignify, extractors, data]

Data for Data
• Even small services can benefit from the large-numbers effect
  – Language models
  – N-gram statistics
  – Semantic query expansion
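As an illustration of the kind of statistic meant here, the sketch below counts word n-grams over a tiny corpus. Whitespace tokenization and the two sample sentences are simplifying assumptions; at web scale the same counting would be distributed rather than done in memory.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def ngram_counts(documents, n=2):
    """Count n-grams across all documents (lowercased, whitespace-tokenized)."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.lower().split(), n))
    return counts

docs = ["big data on the web", "data on the web is big"]
bigrams = ngram_counts(docs, 2)
```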

Conclusions & Future Work
• Allowing small players to play with big data
• Most of our clients are start-ups
• Will MR suffice?
  – Stratosphere integration (e.g., Map, Map, Reduce)
• Beyond Resource-oriented computations
  – Aggregation Queries on Custom data

Q & A
• Thank you for your attention.
• Any questions?

Acknowledgements
• This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
