Introduction to Big Data Dr. Putchong Uthayopas Department of Computer Engineering, Faculty of Engineering, Kasetsart University Email:
[email protected]
PART I: INTRODUCTION TO BIG DATA
During the first day of a baby’s life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress. | Photo Credit: ©Catherine Balet “Strangers in the light” (Steidl) 2012 / from The Human Face of Big Data
By signing up with the personal genetics company 23andMe, Yasmine Delawari Johnson, producer of the documentary We Came Home, was able to get a glimpse into the future. | Photo Credit: © Douglas Kirkland 2012 / from The Human Face of Big Data
Information as an Asset • The cloud will enable larger and larger data sets to be easily collected and used • People will deposit information into the cloud – a bank or personal warehouse for data
• New technology will emerge
– Larger and scalable storage technology – Innovative and complex data analysis/visualization for multimedia data – Security technology to ensure privacy
• The cloud will become mankind's intelligence and memory!
WHAT IS BIG DATA?
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. "Gartner Inc."
Why Big Data? The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence.
• Improve products and services • Increase customer satisfaction / understand customer behavior • Improve operational efficiency • Understand emerging market trends
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf
Properties of Big Data: Volume, Velocity, Variety (the three Vs of Big Data)
Volume • Big data must be huge – Beyond the capability of a single server to process – It may be possible to store the data but difficult to process it
Velocity • Big data accumulates at a very fast speed – Stock market data – Internet access logs – Social media data • Twitter, Facebook, Instagram
• We need to – Extract meaning as fast and as much as we can before throwing away the data
Variety • Data come in a variety of forms
– Traditional databases – Documents – Web pages – Social media data – Images – Video/Audio – Location data
Considerations for Applying Big Data
http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/
BIG DATA ECOSYSTEM
Big Data Ecosystem – Infrastructure • Hadoop
– technologies designed for storing, processing and analysing data by breaking the data up into parts, distributing those parts, and analysing them concurrently, rather than tackling one monolithic block of data all in one go.
• NoSQL
– Stands for Not Only SQL – involved in processing large volumes of multi-structured data. Most NoSQL databases are most adept at handling discrete data stored among multi-structured data.
• Massively Parallel Processing (MPP) Databases
– MPP databases work by segmenting data across multiple nodes and processing those segments in parallel, using SQL. Reference: http://dataconomy.com/understanding-big-data-ecosystem/
Big Data Ecosystem – Analytics • Analytics Platforms – Integrate and analyse data to uncover new insights, and help companies make better-informed decisions.
• Visualization Platforms – visualize data: take the raw data and present it in complex, multi-dimensional visual formats to illuminate the information
• Business Intelligence (BI) Platforms – analyze data from multiple sources to deliver services such as business intelligence reports, dashboards and visualizations
• Machine Learning – in machine learning, the algorithm 'learns from' data, and the output depends on the use case. One of the most famous examples is IBM's supercomputer Watson, which has 'learned' to scan vast amounts of information to find specific answers, and can comb through 200 million pages of structured and unstructured data in minutes.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
NoSQL (Not Only SQL) • A NoSQL (often interpreted as Not Only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. – non-relational, distributed, open-source and horizontally scalable – Used to handle huge amounts of data – The original intention was modern web-scale databases. Reference: http://nosql-database.org/
• MongoDB is a general-purpose, open-source database. • MongoDB features:
– Document data model with dynamic schemas – Full, flexible index support and rich queries – Auto-sharding for horizontal scalability – Built-in replication for high availability – Text search – Advanced security
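The "dynamic schemas" point can be illustrated without a running MongoDB server: documents in one collection need not share the same fields. The sketch below is plain Python (a list stands in for a collection, and `find` is a hypothetical helper, not the real MongoDB driver API):

```python
# Two documents in the same "collection" with different fields:
# a document store imposes no fixed schema, unlike a relational table.
collection = [
    {"_id": 1, "name": "Ann", "email": "ann@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["vip"], "age": 41},  # extra fields are fine
]

def find(collection, query):
    """Return documents whose fields match every key/value in `query`
    (a tiny stand-in for a document store's find())."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print(find(collection, {"name": "Bob"})[0]["age"])  # 41
```

The second document carries fields the first one lacks, yet both live in the same collection and are queried the same way.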
• Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. • Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
• The base Apache Hadoop framework is composed of the following modules:
– Hadoop Common – contains libraries and utilities needed by other Hadoop modules; – Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; – Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications; and – Hadoop MapReduce – a programming model for large-scale data processing.
Magic behind Hadoop and HDFS • The problem is divided into two phases
– Map: apply some action to each <key, value> pair of the input data and emit intermediate results – Reduce: summarize the intermediate results and return them to the main program
Ricky Ho, How Hadoop Map/Reduce works, http://architects.dzone.com/articles/how-hadoop-mapreduce-works
Example: Word count • Counting words in an input text file.
– How many times does the word "love" appear in a novel? ^_^
• In the map phase the sentence is split into words, each forming an initial key-value pair • "tring tring the phone rings" becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>
– In the reduce phase the keys are grouped together and the values for similar keys are added. • Here only the key 'tring' repeats, so its values are added and the output key-value pairs become • <tring,2>, <the,1>, <phone,1>, <rings,1> • Reduce forms an aggregation phase over keys
– This gives the number of occurrences of each word in the input. http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
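The word-count flow above can be sketched in plain Python. `map_phase` and `reduce_phase` are illustrative names; real Hadoop runs many mappers and reducers in parallel across HDFS blocks and handles the shuffle/sort between them:

```python
from itertools import groupby

def map_phase(line):
    # Emit an initial <word, 1> pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Group pairs by key and sum the values for each key.
    pairs = sorted(pairs)  # stands in for Hadoop's shuffle/sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

pairs = map_phase("tring tring the phone rings")
# pairs == [('tring', 1), ('tring', 1), ('the', 1), ('phone', 1), ('rings', 1)]
counts = reduce_phase(pairs)
print(counts)  # {'phone': 1, 'rings': 1, 'the': 1, 'tring': 2}
```

On a cluster, each mapper would process one block of the file and each reducer would receive all pairs for a subset of the keys.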
Hadoop and its Ecosystem • Hadoop is not a single piece of software but an ecosystem for Big Data processing • Many tools have been built that share Hadoop components, especially HDFS
Hadoop Eco System • Pig (http://pig.apache.org) – High-level language for data analysis
• HBase (http://hbase.apache.org) – Table storage for semi-structured data (very large tables: billions of rows × millions of columns)
• Zookeeper (https://zookeeper.apache.org) – Coordinating distributed applications
• Hive (https://hive.apache.org) – SQL-like query language and metastore
• Mahout (http://www.tutorialspoint.com/mahout/) – Machine learning
Apache Pig • Apache Pig is a software framework which offers a run-time environment for execution of MapReduce jobs on a Hadoop cluster via a high-level scripting language called Pig Latin. The following are a few highlights of this project: – Pig is an abstraction (a high-level programming language) on top of a Hadoop cluster. – Pig Latin queries/commands are compiled into one or more MapReduce jobs and then executed on a Hadoop cluster. – Just like a real pig can eat almost anything, Apache Pig can operate on almost any kind of data. – Hadoop offers a shell called the Grunt Shell for executing Pig commands. – DUMP and STORE are two of the most common commands in Pig. DUMP displays the results to the screen; STORE stores the results to HDFS. – Pig offers various built-in operators, functions and other constructs for performing many common operations. Source: http://www.kalyanhadooptraining.com/2014/07/big-data-basics-part-6-related-apache.html
An Example Problem • Input: user data in a file, website data in another • Requirement: find the top 5 most visited pages by users aged 18-25
Pipeline: Load Users data file → Filter by age (18-25) → Load Web Pages ("User A visited Page X") → Join on name → Group on url → Count number of clicks → Order by clicks → Take top 5
In MapReduce
In Pig Latin:
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
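For comparison, the same pipeline can be sketched in plain Python. The in-memory lists below are toy stand-ins for the 'users' and 'pages' files, and the MapReduce jobs Pig would actually generate are elided:

```python
from collections import Counter

# Toy stand-ins for the 'users' and 'pages' input files.
users = [("alice", 20), ("bob", 30), ("carol", 22)]
pages = [("alice", "x.com"), ("carol", "x.com"), ("carol", "y.com"), ("bob", "x.com")]

filtered = [name for name, age in users if 18 <= age <= 25]        # filter by age
joined = [(user, url) for user, url in pages if user in filtered]  # join on name
clicks = Counter(url for _, url in joined)                         # group on url + count
top5 = clicks.most_common(5)                                       # order by clicks, take top 5
print(top5)  # [('x.com', 2), ('y.com', 1)]
```

Each line maps onto one Pig Latin statement; on a cluster, each step would run as a distributed MapReduce stage rather than over in-memory lists.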
Apache HBase • Apache HBase is a distributed, versioned, column-oriented, scalable big data store on top of Hadoop/HDFS. The following are a few highlights of this project:
– Runs on top of Hadoop and HDFS in a distributed fashion. – Supports billions of rows and millions of columns. – Runs on a cluster of commodity hardware and scales linearly. – Offers consistent reads and writes. – Offers easy-to-use Java APIs for client access.
HBASE Table Structure
HBase vs. HDFS
Apache HBase • Where to use HBase
– Apache HBase is used to have random, real-time read/write access to Big Data. – It hosts very large tables on top of clusters of commodity hardware.
• Applications of HBase
– HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Hive • Developed at Facebook • Used for majority of Facebook jobs • “Relational database” built on Hadoop – Maintains list of table schemas – SQL-like query language (HiveQL) – Can call Hadoop Streaming scripts from HiveQL – Supports table partitioning, clustering, complex data types, some optimizations Original Slides by Matei Zaharia UC Berkeley RAD Lab
What Is Hive? – Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. • Hive provides a simple query language called HiveQL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. • HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language. https://cwiki.apache.org/confluence/display/Hive/Tutorial
What Hive Is NOT! • Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. – Latency for Hive queries is generally very high (minutes) even when the data sets involved are very small (say a few hundred megabytes). – Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.
• Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates.
– It is best used for batch jobs over large sets of immutable data (like web logs).
Creating a Hive Table CREATE TABLE page_views(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'User IP address') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) STORED AS SEQUENCEFILE;
• Partitioning breaks table into separate files for each (dt, country) pair Ex: /hive/page_view/dt=2008-06-08,country=USA /hive/page_view/dt=2008-06-08,country=CA
A Simple Query • Find all page views coming from xyz.com during March 2008: SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND page_views.referrer_url like '%xyz.com';
• Hive only reads the March 2008 partitions instead of scanning the entire table
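Partition pruning can be illustrated in plain Python: when data is laid out by (dt, country) as in the /hive/page_view example above, a query that constrains dt only has to touch the matching partitions. The data and the `query_march` helper below are illustrative, not Hive internals:

```python
# Partitioned layout: one "file" of rows per (dt, country) pair.
partitions = {
    ("2008-03-15", "USA"): [("a.com", "1.2.3.4")],
    ("2008-03-20", "USA"): [("xyz.com", "5.6.7.8")],
    ("2008-06-08", "USA"): [("xyz.com", "9.9.9.9")],  # outside March: never read
}

def query_march(partitions):
    """Scan only partitions whose dt falls in March 2008 (partition pruning)."""
    scanned = 0
    rows = []
    for (dt, country), data in partitions.items():
        if "2008-03-01" <= dt <= "2008-03-31":  # prune on the partition key
            scanned += 1
            rows += [r for r in data if "xyz.com" in r[0]]
    return rows, scanned

rows, scanned = query_march(partitions)
print(scanned)  # 2 of 3 partitions read; the June partition is skipped
```

The referrer filter still runs row by row, but only inside the partitions the dt predicate selects; that is the saving Hive's partitioning buys.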
• Apache Mahout is a scalable machine learning and data mining library. The following are a few highlights of this project:
– Mahout implements machine learning and data mining algorithms using MapReduce. – Mahout has 4 major categories of algorithms: Collaborative Filtering, Classification, Clustering, and Dimensionality Reduction. – The Mahout library contains two types of algorithms: ones that can run in local mode and others that can run in a distributed fashion.
• The list of supported algorithms is here: http://mahout.apache.org/users/basics/algorithms.html
Who uses Mahout? • Foursquare uses Mahout for its recommendation engine. • Twitter uses Mahout's LDA implementation for user interest modeling. • Yahoo! Mail uses Mahout's Frequent Pattern Set Mining.
Apache Flume • Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). – based on streaming data flows – tunable reliability mechanisms for failover and recovery
Source: http://hortonworks.com/hadoop/flume/
Flume Example Use Case • Flume can be used to log manufacturing operations.
– When one run of product comes off the line, it generates a log file about that run. • This occurs hundreds or thousands of times per day.
– Large volumes of log file data can stream through Flume. – Years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.
HortonWorks Commercial Hadoop Ecosystem
• Hortonworks Data Platform (HDP) is an open source distribution powered by Apache Hadoop: the actual Apache-released versions of the components, with all the necessary bug fixes to make the components interoperable in your production environments.
– Packaged with an easy-to-use installer (HDP Installer) that deploys the complete Apache Hadoop stack to your entire cluster and provides the necessary monitoring capabilities using Ganglia and Nagios.
• The HDP distribution consists of the following components:
– Core Hadoop platform (Hadoop HDFS and Hadoop MapReduce)
– Non-relational database (Apache HBase)
– Metadata services (Apache HCatalog)
– Scripting platform (Apache Pig)
– Data access and query (Apache Hive)
– Workflow scheduler (Apache Oozie)
– Cluster coordination (Apache Zookeeper)
– Data integration services (HCatalog APIs, WebHDFS, and Apache Sqoop)
– Distributed log management services (Apache Flume)
– Machine learning library (Mahout)
• Try using the Sandbox: http://hortonworks.com/products/hortonworks-sandbox/#install
BIG DATA BENEFITS AND USE CASES
Big Data use cases
• A 360-degree view of the customer
– This use is most popular, according to Gallivan. Online retailers want to find out what shoppers are doing on their sites -- what pages they visit, where they linger, how long they stay, and when they leave.
• Internet of Things
– The second most popular use case involves IoT-connected devices managed by hardware, sensor, and information security companies.
• Big data service refinery
– This means using big-data technologies to break down silos across data stores and sources to increase corporate efficiency.
• Data warehouse optimization
– A large company, hoping to boost the efficiency of its enterprise data warehouse.
• Information security
– This last use case involves large enterprises with sophisticated information security architectures, as well as security vendors looking for more efficient ways to store petabytes of event or machine data.
Source: http://www.informationweek.com/big-data/big-data-analytics/5-big-data-use-cases-to-watch/d/d-id/1251031
Social Media Analytics • Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities. (What is social media analytics? - Definition from WhatIs.com)
Sentiment Analysis • Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. • Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document.
• The attitude may be a judgment or evaluation (see appraisal theory) • an affective state (that is to say, the emotional state of the author when writing) • or an intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).
• Applications
– Book recommendations – Product reviews
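A minimal lexicon-based sketch shows the basic idea of scoring polarity. The word lists are toy assumptions; production systems use large weighted lexicons or trained models:

```python
# Toy sentiment lexicon; real systems use large, weighted lexicons or ML models.
POSITIVE = {"love", "great", "excellent", "good"}
NEGATIVE = {"hate", "bad", "terrible", "poor"}

def polarity(text):
    """Return >0 for positive, <0 for negative, 0 for neutral text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(polarity("I love this book great plot"))   # positive score
print(polarity("terrible product I hate it"))    # negative score
```

Word counting like this ignores negation ("not good") and context, which is exactly why the field leans on NLP and computational linguistics rather than raw lookups.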
Google Flu
• A pattern emerges when all the flu-related search queries are added together.
• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
http://www.google.org/flutrends/about/how.html
WHAT FACEBOOK KNOWS
http://www.facebook.com/data
Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.
Study of Human Society • Facebook, in collaboration with the University of Milan, conducted an experiment that involved – the entire social network as of May 2011 – more than 10 percent of the world's population.
• Analyzing the 69 billion friend connections among those 721 million people showed that – four intermediary friends are usually enough to introduce anyone to a random stranger.
Cupid in Your Network • The matchmaker study
– surveyed approximately 1500 English speakers around the world who had listed a relationship on their profile at least one year ago but no more than two years ago – asking them how they met their partner and who introduced them (if anyone) – analyzed network properties of couples and their matchmakers using de-identified, aggregated data.
• Matchmaker characteristics
– Matchmakers have far more friends than the people they're setting up. – Matchmakers' networks have a different structure • their networks are less dense: their friends are less likely to know each other
– Matchmakers were more likely to be close friends, rather than acquaintances.
https://research.facebook.com/blog/448802398605370/cupid-in-your-network/
Why? • Facebook can improve the user experience – make useful predictions about users' behavior – make better guesses about which ads you might be more or less open to at any given time
• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.
How does Facebook handle Big Data? • Facebook built its data storage system using open-source software called Hadoop.
– Hadoop spreads the data across many machines inside a data center. – Facebook uses Hive, open-source software that acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code.
• Much of Facebook's data resides in one Hadoop store more than 100 petabytes (a million gigabytes) in size, says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."
PART II: DATA SCIENCE
What is Data Science? • Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured.
– "Unstructured data" can include emails, videos, photos, social media, and other user-generated content. – Data science often requires sorting through a great amount of information and writing algorithms to extract insights from this data.
• Source:Wikipedia
Data Science Process
"Data visualization process v1" by Farcaster at English Wikipedia. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Data_visualization_process_v1.png
Source: The Field Guide to Data Science, Booz Allen Hamilton
Data Product • A data product provides actionable information without exposing the decision maker to the underlying data or analytics – Movie recommendations – Weather forecasts – Stock market predictions – Operations improvement – Health diagnosis – Targeted advertising Source: The Field Guide to Data Science, Booz Allen Hamilton
4 Important Data Science Components • Data Type – What is your input?
• Analytics Class – How do you get insight?
• Learning Model – How do you learn things?
• Execution Model – How does it work? Source: The Field Guide to Data Science, Booz Allen Hamilton
Data Type • Structured – Transaction data from traditional databases
• Unstructured – Text, speech, video, multimedia
• Semi-structured – Social data: Twitter, Facebook – User activity logs, geo-location, web logs
Analytics Class • An approach to getting knowledge from your data – Grouping people – Trends of events
Source: The Field Guide to Data Science, Booz Allen Hamilton
Learning Model • How does a computer algorithm learn to understand your problem?
Source: The Field Guide to Data Science, Booz Allen Hamilton
Learning Style • Supervised – Some guidance has been given to the algorithm
• Unsupervised – Derive the knowledge from the data directly
Training Style • Offline – The data has been stored – The algorithm goes through the information and extracts knowledge
• Online – Information is coming in as a stream – Extract knowledge as the data arrives
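The offline/online distinction can be made concrete with a running-average sketch: the offline version needs the whole data set stored first, while the online version updates its estimate as each value streams in. Both functions are illustrative, not tied to any particular library:

```python
def offline_mean(stored_data):
    # Offline: the whole data set is available before learning starts.
    return sum(stored_data) / len(stored_data)

def online_mean(stream):
    # Online: update the estimate one value at a time, never storing the stream.
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # incremental update
    return mean

data = [2.0, 4.0, 6.0, 8.0]
print(offline_mean(data), online_mean(iter(data)))  # both 5.0
```

The online version uses constant memory no matter how long the stream runs, which is exactly what makes it suitable for data that arrives faster than it can be stored.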
Execution Model • How does your infrastructure do its job?
Starting the Initiative
Top Down vs. Bottom Up (stack diagram: Visualization / Analytics Software / Big Data Tools / Infrastructure / Data)
Bottom-up approach • What is the data that we have? • How can we collect and store it? • What infrastructure and tools are needed to process this big data? • What analytics methods can be applied? • What insights can we gain from this data and analysis?
Top-down approach • What is the business challenge that can create value and impact for the organization? • What is the data that we need? • What tools and analytics approaches should be used? • What infrastructure is needed?
PART III: TRENDS
Gartner’s Top 10 Trends
Information Tsunami • Rapid expansion of smartphone usage, social computing, mobile applications, gaming • Rapid increases in network bandwidth and coverage – Wi-Fi, 4G
• Rapid move toward the Internet of Things (IoT) – Sensors everywhere, multimedia information
Diya Soubra, The 3Vs that define Big Data, 2012, http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
In-memory Database • An in-memory database is
– a database management system that primarily relies on main memory for computer data storage. – faster than disk-optimized databases, since the internal optimization algorithms are simpler and execute fewer CPU instructions. – Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
Source: http://en.wikipedia.org/wiki/In-memory_database
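A toy key-value store makes the point: with data living in a Python dict (main memory), every read is a hash lookup with no disk seek. This is illustrative only, nothing like a real in-memory DBMS such as Redis or SAP HANA:

```python
class InMemoryStore:
    """Minimal in-memory key-value store: every read is a hash lookup in RAM."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # No seek time: main-memory access, not a disk read.
        return self._data.get(key, default)

db = InMemoryStore()
db.put("user:1", {"name": "Ann"})
print(db.get("user:1")["name"])  # Ann
```

A real in-memory DBMS adds durability (snapshots, logs), indexes, and a query language on top of this basic idea; the performance argument in the bullets above comes from the memory-resident data structure itself.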
Big Data Infrastructure Goes to the Cloud • Data is already on the cloud – Virtual organizations – Cloud-based SaaS services
• Big Data as a Service on the Cloud – Private cloud – Public cloud
• IBM Bluemix, Amazon AWS (EMR), and many more big data services and apps
Amazon • Amazon EC2 – Computation service using VMs
• Amazon DynamoDB – Large, scalable NoSQL database – Fully distributed, shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR) – Hadoop-based analysis engine – Can be used to analyse big data without the need to build the infrastructure http://aws.amazon.com/big-data/
Deep Learning • A subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing. – Unsupervised learning of abstract concepts
Applying Deep Learning • Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day. • Voice recognition such as Google Now and Apple's Siri now uses deep learning.
– According to Google researchers, the voice error rate in the new version of Android -- after adding insights from deep learning -- stands at 25% lower than in previous versions of the software.
http://www.wired.com/2014/08/deep-learning-yann-lecun/ Source: http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning
IBM Watson and Cognitive Technology • Watson is a cognitive technology that processes information more like a human than a computer – understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does.
• Watson "gets smarter" in three ways: – being taught by its users – learning from prior interactions – being presented with new information.
• This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.
Applying Watson in Healthcare
• WellPoint, Inc. is an Indianapolis-based health benefits company. – approximately 37 million health plan members – processes more than 550 million claims per year
• Using IBM Watson™ to improve the quality and efficiency of healthcare decisions. – WellPoint trained Watson with 25,000 historical cases. Watson now uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management. Natural language processing leverages unstructured data, such as text-based treatment requests.
• Benefits
– Helps UM nurses make faster utilization management decisions about treatment requests – Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive – Includes unstructured data in the streamlined decision process
Challenges • Developing big data applications is not simple – New algorithms, new software development tools
• Proper policy on data security and ownership is needed • Lack of data scientists – A different skill set from that of a software developer
Support Slide
Source: The Field Guide to Data Science
Flume Components
• Event – A singular unit of data that is transported by Flume (typically a single log entry)
• Source – The entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
• Sink – The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
• Channel – The conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
• Agent – Any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
• Client – The entity that produces and transmits the Event to the Source operating within the Agent.
Source: http://hortonworks.com/hadoop/flume/
How Flume Works!
• A flow in Flume starts from the Client.
• The Client transmits the Event to a Source operating within the Agent.
• The Source receiving this Event then delivers it to one or more Channels.
• One or more Sinks operating within the same Agent drain these Channels.
• Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
• When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows sources to continue normal operation for the duration of the spike.
• The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
Source: http://hortonworks.com/hadoop/flume/
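The Source/Channel/Sink decoupling described above is the classic producer-consumer pattern. A minimal single-agent sketch using Python's queue module (the component names mirror Flume's, but this is not the Flume API):

```python
import queue
import threading

channel = queue.Queue()  # the Channel buffers events between source and sink
SHUTDOWN = object()      # sentinel to tell the sink the flow is finished

def source(events):
    # Source: ingests events into the channel (may burst faster than the sink drains).
    for e in events:
        channel.put(e)
    channel.put(SHUTDOWN)

delivered = []

def sink():
    # Sink: drains the channel and delivers events to the destination.
    while True:
        e = channel.get()
        if e is SHUTDOWN:
            break
        delivered.append(e)

t = threading.Thread(target=sink)
t.start()
source(["log line 1", "log line 2", "log line 3"])
t.join()
print(delivered)  # ['log line 1', 'log line 2', 'log line 3']
```

Because the queue sits between producer and consumer, a burst from the source simply grows the queue instead of overwhelming the sink, which is the behavior the Flume bullets describe for channel size during spikes.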
eBay
• eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site.
– 97 million active buyers and sellers – over 200 million items for sale in 50,000 categories – The site handles close to 2 billion page views, 250 million search queries and tens of billions of database calls daily.
• The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly, he said.
• 100 eBay engineers are working on the Cassini project. The new engine is expected to respond to user queries with results that are context-based and more accurate than those provided by the current system.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
• JPMorgan Chase still relies heavily on relational database systems for transaction processing. • Hadoop technology is used for a growing number of purposes, including fraud detection, IT risk management and self service. – With over 150 petabytes of data stored online, 30,000 databases and 3.5 billion logins to user accounts.
• Hadoop's ability to store vast volumes of unstructured data allows the company to collect and store web logs, transaction data and social media data. • The data is aggregated into a common platform for use in a range of customer-focused data mining and data analytics tools. Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
Rizzoli Orthopedic Institute in Bologna, Italy • Uses advanced analytics to gain a more "granular understanding" of the clinical variations within families whereby individual patients display extreme differences in the severity of their symptoms. • The insight is reported to have reduced annual hospitalizations by 30% and the number of imaging tests by 60%.
Premier • Premier, the U.S. healthcare alliance network, has more than 2,700 members, hospitals and health systems, 90,000 non-acute facilities and 400,000 physicians.
– a large database of clinical, financial, patient, and supply chain data – generated comprehensive and comparable clinical outcome measures, resource utilization reports and transaction-level cost data.
• Big data is used to improve the healthcare processes at approximately 330 hospitals, saving an estimated 29,000 lives and reducing healthcare spending by nearly $7 billion. Reference: IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. http://www-03.ibm.com/industries/ca/en/healthcare/documents/Data_driven_healthcare_organizations_use_big_data_analytics_for_big_gains.pdf
Some Thoughts • A bottom-up approach may be good when you do not know how to start – Pick an easy question and start a pilot – Learn the infrastructure technology, analytics technology and tools – Use data you already have
• A top-down approach that focuses on business value is better, but challenging
– It is hard to ask a good question; management needs to identify the need – You may have to ask many questions and pick the right one based on impact and value
Example: What is / is not a big data problem? • I want to classify legal documents to make them easier to process • I want to learn how our customers react to our new T-shirt • I want to understand how our students use Facebook
Trend: Big data infrastructure becomes even more powerful and easier to use
Google Cloud Platform • App Engine
– mobile and web apps
• Cloud SQL
– MySQL on the cloud
• Cloud Storage
– Data storage
• Big Query
– Data analysis
• Google Compute Engine – Processing of large data
Smarter data analytics is coming
Big Data Analytics • a set of advanced technologies designed to work with large volumes of heterogeneous data • explores the data and discovers interrelationships and patterns using sophisticated quantitative methods such as:
– machine learning – neural networks – robotics algorithms – computational mathematics – artificial intelligence
What is Spark? A fast and expressive cluster computing engine compatible with Apache Hadoop
• Efficient – up to 10× faster on disk, 100× in memory – general execution graphs – in-memory storage
• Usable – 2-5× less code – rich APIs in Java, Scala, Python – interactive shell
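The "less code" claim comes from Spark's chainable transformations. A pure-Python imitation of the RDD idiom gives the flavor; `MiniRDD` is a toy wrapper invented here, not the actual PySpark API (and unlike real RDDs it is neither lazy nor distributed):

```python
from functools import reduce as functools_reduce

class MiniRDD:
    """Toy stand-in for a Spark RDD: chainable map/filter/reduce over a list."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, f):
        return MiniRDD(x for x in self.data if f(x))

    def reduce(self, f):
        return functools_reduce(f, self.data)

total = (MiniRDD(range(10))
         .filter(lambda x: x % 2 == 0)   # keep evens: 0, 2, 4, 6, 8
         .map(lambda x: x * x)           # square them: 0, 4, 16, 36, 64
         .reduce(lambda a, b: a + b))    # sum the results
print(total)  # 120
```

In real Spark the same chain would run lazily across a cluster, with intermediate results kept in memory, which is where the speedups over disk-based MapReduce come from.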
Spark at Yahoo
• Personalizing news pages for Web visitors, and running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.
– Yahoo wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.) – With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
• The second use case shows off Hive on Spark's (Shark's) interactive capability.
– Use existing BI tools to view and query advertising analytics data collected in Hadoop.
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
Source: The Field Guide to Data Science, Booz Allen Hamilton