Introduction to Big Data
Dr. Putchong Uthayopas
Department of Computer Engineering, Faculty of Engineering, Kasetsart University
Email: [email protected]

PART I: INTRODUCTION TO BIG DATA

During the first day of a baby’s life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress. | Photo Credit: ©Catherine Balet “Strangers in the light” (Steidl) 2012 / from The Human Face of Big Data  

By signing up with the personal genetics company 23andMe, producer of the documentary We Came Home, Yasmine Delawari Johnson was able to get a glimpse into the future. | Photo Credit: © Douglas Kirkland 2012 / from The Human Face of Big Data

Information as an Asset
• The cloud will enable ever larger data sets to be easily collected and used
• People will deposit information into the cloud
  – Banks, personal warehouses
• New technology will emerge
  – Larger, scalable storage technology
  – Innovative and complex data analysis/visualization for multimedia data
  – Security technology to ensure privacy
• The cloud will become mankind's intelligence and memory!

WHAT IS BIG DATA?

Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (Gartner, Inc.)

Why Big Data?
The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence.
• Improve products and services
• Increase customer satisfaction and understand customer behavior
• Improve operational efficiency
• Understand emerging market trends
Source: http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf

Properties of Big Data: Volume, Velocity, and Variety (the three Vs)

Volume
• Big data must be huge
  – Beyond the capability of a single computer server to process
  – Possible to store the data but difficult to process it

Velocity
• Big data accumulates at very high speed
  – Stock market data
  – Internet access logs
  – Social media data (Twitter, Facebook, Instagram)
• We need to extract as much meaning as we can, as fast as we can, before throwing the data away

Variety
• Data comes in many forms
  – Traditional databases
  – Documents
  – Web pages
  – Social media data
  – Images
  – Video/audio
  – Location data

Considerations for Applying Big Data

Source: http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/

BIG DATA ECOSYSTEM

Big Data Ecosystem: Infrastructure
• Hadoop
  – Technologies designed for storing, processing, and analysing data by breaking data up, distributing the parts, and analysing those parts concurrently, rather than tackling one monolithic block of data all in one go.
• NoSQL
  – Stands for "Not Only SQL"
  – Used for processing large volumes of multi-structured data. Most NoSQL databases are most adept at handling discrete data stored among multi-structured data.
• Massively Parallel Processing (MPP) Databases
  – MPP databases segment data across multiple nodes, process these segments in parallel, and use SQL.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/

Big Data Ecosystem: Analytics
• Analytics Platforms
  – Integrate and analyse data to uncover new insights and help companies make better-informed decisions.
• Visualization Platforms
  – Take raw data and present it in complex, multi-dimensional visual formats to illuminate the information.
• Business Intelligence (BI) Platforms
  – Analyze data from multiple sources to deliver services such as business intelligence reports, dashboards, and visualizations.
• Machine Learning
  – The input to machine learning is data the algorithm 'learns from'; the output depends on the use case. One of the most famous examples is IBM's supercomputer Watson, which has 'learned' to scan vast amounts of information to find specific answers, and can comb through 200 million pages of structured and unstructured data in minutes.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/

NoSQL (Not Only SQL)
• A NoSQL (often interpreted as "Not Only SQL") database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.
  – Typically non-relational, distributed, open-source, and horizontally scalable
  – Used to handle huge amounts of data
  – The original intention was modern web-scale databases
Reference: http://nosql-database.org/

• MongoDB is a general-purpose, open-source database.
• MongoDB features:
  – Document data model with dynamic schemas
  – Full, flexible index support and rich queries
  – Auto-sharding for horizontal scalability
  – Built-in replication for high availability
  – Text search
  – Advanced security
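A minimal sketch of a few of these features using the official Python driver, assuming a MongoDB server is running locally on the default port; the database and collection names ("demo", "articles") are made up for illustration:

  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost:27017")    # assumes a local mongod
  db = client["demo"]                                  # hypothetical database name

  # Dynamic schema: documents in the same collection need not share fields
  db.articles.insert_one({"title": "Big Data 101",
                          "tags": ["hadoop", "nosql"], "views": 42})

  db.articles.create_index([("title", "text")])        # text search support
  for doc in db.articles.find({"views": {"$gt": 10}}): # rich queries
      print(doc["title"])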

• Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
• The base Apache Hadoop framework is composed of the following modules:
  – Hadoop Common: contains libraries and utilities needed by other Hadoop modules
  – Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  – Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and scheduling users' applications
  – Hadoop MapReduce: a programming model for large-scale data processing

Magic behind Hadoop and HDFS
• The problem is divided into two phases
  – Map: apply some action to each (key, value) pair of the data and emit intermediate results
  – Reduce: summarize the intermediate results and return them to the main program

Ricky Ho, How Hadoop Map/Reduce works, http://architects.dzone.com/articles/how-hadoop-mapreduce-works
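To make the two phases concrete, here is a minimal single-machine sketch in Python; the sorted() call stands in for the shuffle/sort step that the Hadoop framework performs between map and reduce:

  from itertools import groupby

  def map_phase(records):
      for record in records:
          for word in record.split():
              yield (word, 1)              # emit intermediate (key, value) pairs

  def reduce_phase(pairs):
      pairs = sorted(pairs)                # stand-in for the framework's shuffle/sort
      for key, group in groupby(pairs, key=lambda kv: kv[0]):
          yield (key, sum(v for _, v in group))  # summarize the values per key

  print(list(reduce_phase(map_phase(["tring tring the phone rings"]))))
  # [('phone', 1), ('rings', 1), ('the', 1), ('tring', 2)]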

Example: Word Count
• Counting the words in an input text file.
  – How many times does the word "love" appear in a novel? ^_^
• In the map phase each sentence is split into words, and each word forms an initial key-value pair:
  • "tring tring the phone rings" becomes (tring,1), (tring,1), (the,1), (phone,1), (rings,1)
• In the reduce phase the keys are grouped together and the values for identical keys are added:
  • There is only one pair of identical keys, 'tring', so the values for these keys are added and the output key-value pairs become (tring,2), (the,1), (phone,1), (rings,1)
  • Reduce forms an aggregation phase over keys
  – This gives the number of occurrences of each word in the input.
Source: http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
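On a real cluster, the same word count is commonly run with Hadoop Streaming, which lets the mapper and reducer be ordinary scripts reading stdin and writing stdout; Hadoop sorts the mapper output by key before it reaches the reducer. A sketch in Python (the file names are arbitrary):

  # mapper.py: emit (word, 1) for every word on stdin
  import sys
  for line in sys.stdin:
      for word in line.split():
          print("%s\t1" % word)

  # reducer.py: sum the counts for each word (input arrives sorted by key)
  import sys
  current, count = None, 0
  for line in sys.stdin:
      word, n = line.rsplit("\t", 1)
      if word != current:
          if current is not None:
              print("%s\t%d" % (current, count))
          current, count = word, 0
      count += int(n)
  if current is not None:
      print("%s\t%d" % (current, count))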

Hadoop and Its Ecosystem
• Hadoop is not a single piece of software but an ecosystem for Big Data processing
• Many tools have been built that share many of the Hadoop components, especially HDFS

Hadoop Ecosystem
• Pig (http://pig.apache.org)
  – High-level language for data analysis
• HBase (http://hbase.apache.org)
  – Table storage for semi-structured data (very large tables: billions of rows × millions of columns)
• Zookeeper (https://zookeeper.apache.org)
  – Coordinating distributed applications
• Hive (https://hive.apache.org)
  – SQL-like query language and metastore
• Mahout (http://www.tutorialspoint.com/mahout/)
  – Machine learning

Apache Pig
• Apache Pig is a software framework which offers a run-time environment for executing MapReduce jobs on a Hadoop cluster via a high-level scripting language called Pig Latin. A few highlights of this project:
  – Pig is an abstraction (a high-level programming language) on top of a Hadoop cluster.
  – Pig Latin queries/commands are compiled into one or more MapReduce jobs and then executed on a Hadoop cluster.
  – Just like a real pig can eat almost anything, Apache Pig can operate on almost any kind of data.
  – Hadoop offers a shell called the Grunt shell for executing Pig commands.
  – DUMP and STORE are two of the most common commands in Pig. DUMP displays results to the screen; STORE stores results to HDFS.
  – Pig offers various built-in operators, functions, and other constructs for performing many common operations.
Source: http://www.kalyanhadooptraining.com/2014/07/big-data-basics-part-6-related-apache.html

An Example Problem
• Input: user data in one file, website visit data ("User A visited Page X" records) in another
• Requirement: find the top 5 most visited pages by users aged 18-25
Dataflow:
  – Load the users file and filter by age (18-25)
  – Load the web pages file
  – Join on user name
  – Group on URL
  – Count the number of clicks
  – Order by clicks and take the top 5

In MapReduce (figure omitted: the equivalent hand-written MapReduce implementation in Java is far longer and harder to read)

In Pig Latin:
  Users    = load 'users' as (name, age);
  Filtered = filter Users by age >= 18 and age <= 25;
  Pages    = load 'pages' as (user, url);
  Joined   = join Filtered by name, Pages by user;
  Grouped  = group Joined by url;
  Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
  Sorted   = order Summed by clicks desc;
  Top5     = limit Sorted 5;
  store Top5 into 'top5sites';

Apache HBase
• Apache HBase is a distributed, versioned, column-oriented, scalable big data store on top of Hadoop/HDFS. A few highlights of this project:
  – Runs on top of Hadoop and HDFS in a distributed fashion.
  – Supports billions of rows and millions of columns.
  – Runs on a cluster of commodity hardware and scales linearly.
  – Offers consistent reads and writes.
  – Offers easy-to-use Java APIs for client access.
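While the native client APIs are Java, a quick sketch of the same ideas is possible from Python via the happybase library (a Thrift-based client), assuming an HBase Thrift server is running locally; the table and column-family names are hypothetical:

  import happybase

  connection = happybase.Connection("localhost")        # assumes HBase Thrift server
  connection.create_table("pages", {"cf": dict()})      # one column family, "cf"
  table = connection.table("pages")

  # Column-oriented storage: cells are addressed by (row key, family:qualifier)
  table.put(b"row-001", {b"cf:url": b"http://example.com", b"cf:clicks": b"42"})

  print(table.row(b"row-001"))                          # random, real-time read by key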

HBase Table Structure

HBase vs. HDFS

Apache HBase
• Where to use HBase
  – Apache HBase is used for random, real-time read/write access to Big Data.
  – It hosts very large tables on top of clusters of commodity hardware.
• Applications of HBase
  – HBase is used whenever we need to provide fast random access to available data.
  – Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

Hive
• Developed at Facebook
• Used for the majority of Facebook jobs
• A "relational database" built on Hadoop
  – Maintains a list of table schemas
  – SQL-like query language (HiveQL)
  – Can call Hadoop Streaming scripts from HiveQL
  – Supports table partitioning, clustering, complex data types, some optimizations
Original slides by Matei Zaharia, UC Berkeley RAD Lab

What Is Hive?
• Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data.
  – Hive provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization, and data analysis easily.
  – HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial

What Hive Is NOT!
• Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling.
  – Latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes).
  – Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.
• Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.
  – It is best used for batch jobs over large sets of immutable data (such as web logs).

Creating a Hive Table

  CREATE TABLE page_views(viewTime INT, userid BIGINT,
                          page_url STRING, referrer_url STRING,
                          ip STRING COMMENT 'User IP address')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair, e.g.:
  /hive/page_view/dt=2008-06-08,country=USA
  /hive/page_view/dt=2008-06-08,country=CA

A Simple Query
• Find all page views coming from xyz.com during March 2008:

  SELECT page_views.*
  FROM page_views
  WHERE page_views.date >= '2008-03-01'
    AND page_views.date <= '2008-03-31'
    AND page_views.referrer_url LIKE '%xyz.com';

• Hive reads only the partitions for March 2008 instead of scanning the entire table.

Apache Mahout
• Apache Mahout is a scalable machine learning and data mining library. A few highlights of this project:
  – Mahout implements machine learning and data mining algorithms using MapReduce.
  – Mahout has 4 major categories of algorithms: Collaborative Filtering, Classification, Clustering, and Dimensionality Reduction.
  – The Mahout library contains two types of algorithms: ones that run in local mode and others that run in a distributed fashion.
• The list of supported algorithms is here: http://mahout.apache.org/users/basics/algorithms.html

Who Uses Mahout?
• Foursquare uses Mahout for its recommendation engine.
• Twitter uses Mahout's LDA implementation for user interest modeling.
• Yahoo! Mail uses Mahout's Frequent Pattern Set Mining (see slides).

Apache Flume
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
  – Based on streaming data flows
  – Tunable reliability mechanisms for failover and recovery

Source: http://hortonworks.com/hadoop/flume/

Flume Example Use Case
• Flume can be used to log manufacturing operations.
  – When one run of product comes off the line, it generates a log file about that run.
    • This occurs hundreds or thousands of times per day.
  – Large volumes of log file data can stream through Flume.
  – Years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.

Hortonworks Commercial Hadoop Ecosystem
• Hortonworks Data Platform (HDP) is an open-source distribution powered by Apache Hadoop.
  – Contains the actual Apache-released versions of the components, with all the necessary bug fixes to make the components interoperable in your production environments.
  – Packaged with an easy-to-use installer (HDP Installer) that deploys the complete Apache Hadoop stack to your entire cluster and provides the necessary monitoring capabilities using Ganglia and Nagios.
• The HDP distribution consists of the following components:
  – Core Hadoop platform (Hadoop HDFS and Hadoop MapReduce)
  – Non-relational database (Apache HBase)
  – Metadata services (Apache HCatalog)
  – Scripting platform (Apache Pig)
  – Data access and query (Apache Hive)
  – Workflow scheduler (Apache Oozie)
  – Cluster coordination (Apache Zookeeper)
  – Data integration services (HCatalog APIs, WebHDFS, and Apache Sqoop)
  – Distributed log management services (Apache Flume)
  – Machine learning library (Mahout)
• Try using the Sandbox: http://hortonworks.com/products/hortonworks-sandbox/#install

BIG DATA BENEFITS AND USE CASES

Big Data Use Cases
• A 360-degree view of the customer
  – This use is the most popular, according to Gallivan. Online retailers want to find out what shoppers are doing on their sites: what pages they visit, where they linger, how long they stay, and when they leave.
• Internet of Things
  – The second most popular use case involves IoT-connected devices managed by hardware, sensor, and information security companies.
• Data warehouse optimization
  – A large company hoping to boost the efficiency of its enterprise data warehouse.
• Big data service refinery
  – This means using big-data technologies to break down silos across data stores and sources to increase corporate efficiency.
• Information security
  – This last use case involves large enterprises with sophisticated information security architectures, as well as security vendors looking for more efficient ways to store petabytes of event or machine data.
Source: http://www.informationweek.com/big-data/big-data-analytics/5-big-data-use-cases-to-watch/d/d-id/1251031

Social Media Analytics
• Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities.
Source: "What is social media analytics?" - Definition from WhatIs.com

Sentiment Analysis
• Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information in source materials.
• Sentiment analysis aims to determine the attitude of a speaker or writer with respect to some topic, or the overall contextual polarityity of a document. The attitude may be:
  – a judgment or evaluation (see appraisal theory)
  – an affective state (that is, the emotional state of the author when writing)
  – an intended emotional communication (that is, the emotional effect the author wishes to have on the reader)
• Applications (a minimal scoring sketch follows below)
  – Book recommendation
  – Product reviews
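As flagged in the applications list above, here is a deliberately tiny lexicon-based polarity scorer in Python. Real sentiment analysis systems use far richer NLP than this; the word lists below are invented purely for illustration:

  POSITIVE = {"love", "great", "excellent", "happy"}   # toy lexicons, not real resources
  NEGATIVE = {"hate", "terrible", "awful", "sad"}

  def polarity(text):
      words = text.lower().split()
      score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
      return "positive" if score > 0 else "negative" if score < 0 else "neutral"

  print(polarity("I love this book, the ending is great"))  # -> positive
  print(polarity("What a terrible, awful product"))         # -> negative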

Google Flu
• A pattern emerges when all the flu-related search queries are added together.
• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
Source: http://www.google.org/flutrends/about/how.html

WHAT FACEBOOK KNOWS

http://www.facebook.com/data

Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.

Study of Human Society
• Facebook, in collaboration with the University of Milan, conducted an experiment that involved
  – the entire social network as of May 2011
  – more than 10 percent of the world's population
• Analyzing the 69 billion friend connections among those 721 million people showed that
  – four intermediary friends are usually enough to introduce anyone to a random stranger

Cupid in Your Network
• Study of matchmakers
  – Surveyed approximately 1,500 English speakers around the world who had listed a relationship on their profile at least one year ago but no more than two years ago
  – Asked them how they met their partner and who introduced them (if anyone)
  – Analyzed network properties of couples and their matchmakers using de-identified, aggregated data
• Matchmaker characteristics
  – Matchmakers have far more friends than the people they're setting up.
  – Matchmakers' networks have a different structure
    • their networks are less dense: their friends are less likely to know each other
  – Matchmakers were more likely to be close friends, rather than acquaintances.
Source: https://research.facebook.com/blog/448802398605370/cupid-in-your-network/

Why?
• Facebook can improve the user experience
  – make useful predictions about users' behavior
  – make better guesses about which ads you might be more or less open to at any given time
• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.

How Does Facebook Handle Big Data?
• Facebook built its data storage system using open-source software called Hadoop.
  – Hadoop spreads the data across many machines inside a data center.
  – Facebook uses Hive, open-source software that acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code.
• Much of Facebook's data resides in one Hadoop store more than 100 petabytes (a petabyte is a million gigabytes) in size, says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."

PART II: DATA SCIENCE

What is Data Science?
• Data science is the extraction of knowledge from large volumes of data that are structured or unstructured.
  – "Unstructured data" can include emails, videos, photos, social media, and other user-generated content.
  – Data science often requires sorting through a great amount of information and writing algorithms to extract insights from this data.
Source: Wikipedia

Data Science Process

"Data visualization process v1" by Farcaster at English Wikipedia. Licensed under CC BY-SA 3.0 via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Data_visualization_process_v1.png#/media/File:Data_visualization_process_v1.png

Source: The Field Guide to Data Science, Booz Allen Hamilton

Data Product
• A data product provides actionable information without exposing the decision maker to the underlying data or analytics.
  – Movie recommendations
  – Weather forecasts
  – Stock market predictions
  – Operations improvement
  – Health diagnosis
  – Targeted advertising
Source: The Field Guide to Data Science, Booz Allen Hamilton

4 Important Components of Data Science
• Data Type
  – What is your input?
• Analytics Class
  – How do you get insight?
• Learning Model
  – How does it learn?
• Execution Model
  – How does it work?
Source: The Field Guide to Data Science, Booz Allen Hamilton

Data Type
• Structured
  – Transaction data from traditional databases
• Unstructured
  – Text, speech, video, multimedia
• Semi-structured
  – Social data: Twitter, Facebook
  – User activity logs, geo-location, web logs

Analytics Class
• An approach to getting knowledge from your data, for example:
  – Grouping people
  – Trends of events
Source: The Field Guide to Data Science, Booz Allen Hamilton

Learning Model
• How does a computer algorithm learn to understand your problem?
Source: The Field Guide to Data Science, Booz Allen Hamilton

Learning Style (see the sketch below)
• Supervised
  – Some guidance (labeled examples) has been given to the algorithm
• Unsupervised
  – Knowledge is derived from the data directly
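A compact illustration of the two styles using scikit-learn (assuming it is installed); the toy data points and labels below are invented:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.cluster import KMeans

  X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
  y = np.array([0, 0, 1, 1])        # labels: the "guidance" given to the algorithm

  clf = LogisticRegression().fit(X, y)   # supervised: learns from labeled examples
  print(clf.predict([[1.2, 1.9]]))

  km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: structure from data alone
  print(km.labels_)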

Training Style
• Offline
  – Data has already been stored
  – The algorithm goes through the information and extracts knowledge
• Online
  – Information arrives as a stream
  – Knowledge is extracted as the data comes in

Execution Model
• How does your infrastructure do its job?

Starting the Initiative

Top Down vs. Bottom Up
The big data stack consists of (from top to bottom): Visualization, Analytics Software, Big Data Tools, Infrastructure, and Data. A top-down initiative starts from the business and visualization end; a bottom-up initiative starts from the data.

Bottom-Up Approach
• What data do we have?
• How can we collect and store it?
• What infrastructure and tools can process this big data?
• What analytics methods can be applied?
• What insight can we gain from this data and analysis?

Top-Down Approach
• What business challenge can create value and impact for the organization?
• What data do we need?
• What tools and analytics approach should be used?
• What infrastructure is needed?

PART III: TRENDS

Gartner's Top 10 Trends

Information Tsunami
• Rapid expansion of smartphone usage, social computing, mobile applications, gaming
• Rapid increases in network bandwidth and coverage
  – WiFi, 4G
• Rapid move toward the Internet of Things (IoT)
  – Sensors everywhere, multimedia information

Diya Soubra, The 3Vs that define Big Data, 2012, http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

In-Memory Database
• An in-memory database is
  – a database management system that primarily relies on main memory for computer data storage
  – faster than disk-optimized databases, since the internal optimization algorithms are simpler and execute fewer CPU instructions
  – Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
Source: http://en.wikipedia.org/wiki/In-memory_database
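Python's standard library makes the idea easy to try: sqlite3 can keep an entire database in main memory, so queries never touch disk. The table and data below are invented:

  import sqlite3

  conn = sqlite3.connect(":memory:")   # the whole database lives in main memory
  conn.execute("CREATE TABLE quotes (symbol TEXT, price REAL)")
  conn.executemany("INSERT INTO quotes VALUES (?, ?)",
                   [("AAA", 10.5), ("BBB", 20.1)])
  for row in conn.execute("SELECT symbol, price FROM quotes WHERE price > 15"):
      print(row)                       # no seek time: data is read from RAM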

Big Data Infrastructure Goes to the Cloud
• Data is already on the cloud
  – Virtual organizations
  – Cloud-based SaaS services
• Big Data as a Service on the cloud
  – Private cloud
  – Public cloud
• IBM Bluemix, Amazon AWS (EMR), and many more
(diagram omitted: applications built on big data services running in the cloud)

Amazon
• Amazon EC2
  – Computation service using VMs
• Amazon DynamoDB
  – Large, scalable NoSQL database
  – Fully distributed, shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR)
  – Hadoop-based analysis engine
  – Can be used to analyse big data without the need to build the infrastructure (see the sketch below)
Source: http://aws.amazon.com/big-data/
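A hedged sketch of launching an EMR cluster from Python with boto3 (assuming AWS credentials are configured); the cluster name, region, release label, and instance sizes are placeholder choices, not recommendations:

  import boto3

  emr = boto3.client("emr", region_name="us-east-1")
  response = emr.run_job_flow(
      Name="wordcount-demo",                       # hypothetical cluster name
      ReleaseLabel="emr-5.30.0",
      Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
      Instances={
          "MasterInstanceType": "m5.xlarge",
          "SlaveInstanceType": "m5.xlarge",
          "InstanceCount": 3,
          "KeepJobFlowAliveWhenNoSteps": False,    # tear down when the job finishes
      },
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
  )
  print(response["JobFlowId"])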

Deep Learning
• A subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing.
  – Unsupervised learning of abstract concepts

Applying Deep Learning
• Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
• Voice recognition like Google Now and Apple's Siri now uses deep learning.
  – According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, is 25% lower than in previous versions of the software.
Sources: http://www.wired.com/2014/08/deep-learning-yann-lecun/; http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning

IBM Watson and Cognitive Technology
• Watson is a cognitive technology that processes information more like a human than a computer
  – understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does.
• Watson "gets smarter" in three ways:
  – being taught by its users
  – learning from prior interactions
  – being presented with new information
• This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.

Applying Watson in Healthcare
• WellPoint, Inc. is an Indianapolis-based health benefits company.
  – approximately 37 million health plan members
  – processes more than 550 million claims per year
• Using IBM Watson to improve the quality and efficiency of healthcare decisions.
  – WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management. Natural language processing leverages unstructured data, such as text-based treatment requests.
• Benefits
  – Helps utilization management (UM) nurses make faster decisions about treatment requests
  – Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive
  – Includes unstructured data in the streamlined decision process

Challenges
• Developing big data applications is not simple
  – New algorithms, new software development tools
• Proper policy about data security and ownership is needed
• Lack of data scientists
  – A different role from software developer

Support Slides

Source: The Field Guide to Data Science, Booz Allen Hamilton

Flume Components
• Event: a singular unit of data that is transported by Flume (typically a single log entry).
• Source: the entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
• Sink: the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
• Channel: the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
• Agent: any physical Java virtual machine running Flume. It is a collection of sources, sinks, and channels.
• Client: the entity that produces and transmits the Event to the Source operating within the Agent.
Source: http://hortonworks.com/hadoop/flume/

How Flume Works!
• A flow in Flume starts from the Client.
• The Client transmits the Event to a Source operating within the Agent.
• The Source receiving this Event then delivers it to one or more Channels.
• One or more Sinks operating within the same Agent drain these Channels.
• Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
• When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows sources to continue normal operation for the duration of the spike.
• The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
Source: http://hortonworks.com/hadoop/flume/
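The producer-consumer decoupling that Channels provide can be illustrated in a few lines of Python; this is a sketch of the model only, not of Flume itself:

  import queue, threading, time

  channel = queue.Queue(maxsize=1000)      # plays the role of a Flume Channel

  def sink():                              # drains the channel at its own pace
      while True:
          event = channel.get()
          time.sleep(0.01)                 # simulate a slow destination (e.g. HDFS)
          channel.task_done()

  threading.Thread(target=sink, daemon=True).start()

  for i in range(100):                     # the source ingests events in a burst
      channel.put("log event %d" % i)      # blocks only if the channel fills up
  channel.join()                           # wait until the channel is drained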

eBay
• eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site.
  – 97 million active buyers and sellers
  – over 200 million items for sale in 50,000 categories
  – the site handles close to 2 billion page views, 250 million search queries, and tens of billions of database calls daily
• The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly, he said.
• 100 eBay engineers are working on the Cassini project. The new engine is expected to respond to user queries with results that are context-based and more accurate than those provided by the current system.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html

• JPMorgan Chase still relies heavily on relational database systems for transaction processing.
• Hadoop technology is used for a growing number of purposes, including fraud detection, IT risk management, and self service.
  – With over 150 petabytes of data stored online, 30,000 databases, and 3.5 billion log-ins to user accounts.
• Hadoop's ability to store vast volumes of unstructured data allows the company to collect and store Web logs, transaction data, and social media data.
• The data is aggregated into a common platform for use in a range of customer-focused data mining and data analytics tools.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html

Rizzoli Orthopedic Institute in Bologna, Italy
• Uses advanced analytics to gain a more "granular understanding" of the clinical variations within families whereby individual patients display extreme differences in the severity of their symptoms.
• The insight is reported to have reduced annual hospitalizations by 30% and the number of imaging tests by 60%.

Premier
• Premier is the U.S. healthcare alliance network, with more than 2,700 member hospitals and health systems, 90,000 non-acute facilities, and 400,000 physicians.
  – a large database of clinical, financial, patient, and supply chain data
  – has generated comprehensive and comparable clinical outcome measures, resource utilization reports, and transaction-level cost data
• Big data is used to improve the healthcare processes at approximately 330 hospitals, saving an estimated 29,000 lives and reducing healthcare spending by nearly $7 billion.
Reference: IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. http://www03.ibm.com/industries/ca/en/healthcare/documents/Data_driven_healthcare_organizations_use_big_data_analytics_for_big_gains.pdf

Some Thoughts
• The bottom-up approach may be good when you do not know how to start
• Pick an easy question and start a pilot
  – Learn the infrastructure technology, analytics technology, and tools
  – Use data you already have
• Top-down, focusing on business value, is better but challenging
  – It is hard to ask a good question; management is needed to identify the need
  – You may have to ask many questions and pick the right one based on impact and value

Example: What Is / Is Not a Big Data Problem?
• I want to classify legal documents to make them easy to process
• I want to learn how our customers react to our new T-shirt
• I want to understand how our students use Facebook

Trend: Big data infrastructure becomes even more powerful and easy to use

Google Cloud Platform
• App Engine
  – mobile and web apps
• Cloud SQL
  – MySQL on the cloud
• Cloud Storage
  – data storage
• BigQuery
  – data analysis
• Google Compute Engine
  – processing of large data sets

Smarter data analytics is coming

Big Data Analytics
• A set of advanced technologies designed to work with large volumes of heterogeneous data.
• Explores the data to discover interrelationships and patterns, using sophisticated quantitative methods such as:
  – machine learning
  – neural networks
  – robotics algorithms
  – computational mathematics
  – artificial intelligence

What is Spark?
• A fast and expressive cluster computing engine, compatible with Apache Hadoop
  – 10× faster than Hadoop MapReduce on disk, up to 100× in memory
  – Up to 2-5× less code
• Efficient
  – General execution graphs
  – In-memory storage
• Usable
  – Rich APIs in Java, Scala, Python
  – Interactive shell
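For a feel of the "less code" claim, here is the word count from earlier written in PySpark; the HDFS input and output paths are hypothetical:

  from pyspark import SparkContext

  sc = SparkContext(appName="WordCount")
  counts = (sc.textFile("hdfs:///data/novel.txt")      # hypothetical input path
              .flatMap(lambda line: line.split())      # map: line -> words
              .map(lambda word: (word, 1))             # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word
  counts.saveAsTextFile("hdfs:///data/novel-counts")   # hypothetical output path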

Spark at Yahoo
• Personalizing news pages for Web visitors, and running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise, to figure out what types of users would be interested in reading them.
  – Yahoo wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)
  – With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
• The second use case shows off Hive on Spark's (Shark's) interactive capability.
  – Yahoo uses existing BI tools to view and query advertising analytics data collected in Hadoop.
Source: http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/

Source: The Field Guide to Data Science, Booz Allen Hamilton
