Evolution from Apache Hadoop to the Enterprise Data Hub
Dr. Amr Awadallah (Twitter: @awadallah)
Co-Founder & CTO of Cloudera
SMDB 2014
©2014 Cloudera, Inc. All rights reserved.
Why is Big Data Happening Now?
It Isn’t Just About Web 2.0 / Social
AUTOMOTIVE: Auto sensors reporting location, problems
COMMUNICATIONS: Location-based advertising
CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, customer service
FINANCIAL SERVICES: Risk & portfolio analysis, new products
EDUCATION & RESEARCH: Experiment sensor analysis
HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality, warranty analysis
LIFE SCIENCES: Clinical trials, genomics
MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
ON-LINE SERVICES / SOCIAL MEDIA: People & career matching, website optimization
HEALTH CARE: Patient sensors, monitoring, EHRs, quality of care
OIL & GAS: Drilling exploration sensor analysis
RETAIL: Consumer sentiment, optimized marketing
TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows, customer sentiment
UTILITIES: Smart meter analysis for network capacity
LAW ENFORCEMENT & DEFENSE: Threat analysis (social media monitoring, photo analysis)
10TB to 10PB
IT’S ALL (BIG) DATA
Apache Hadoop: Storage and Compute on One Platform

The Traditional Way
Compute (RDBMS, EDW)
Network
Data Storage (SAN, NAS)
Expensive, Special purpose, “Reliable” Servers
Expensive Licensed Software
• Hard to scale • Network is a bottleneck • Only handles relational data • Difficult to add new fields & data types
Expensive & Unattainable
$30,000+ per TB

The Hadoop Way
Compute (CPU)
Memory
Storage (Disk)
Commodity “Unreliable” Servers
Hybrid Open Source Software
• Scales out forever • No bottlenecks • Easy to ingest any data • Agile data access
Affordable & Attainable
$300-$1,000 per TB
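To make the per-TB figures concrete, here is a rough sketch that applies them across the "10TB to 10PB" range mentioned earlier. The function name and the decimal conversion (1 PB = 1,000 TB) are assumptions made for illustration; the slide's "$30,000+" is treated as a lower bound, and the result ignores operations, staffing, and networking costs.

```python
# Per-TB figures taken from the slide; everything else is illustrative.
TRADITIONAL_PER_TB = 30_000        # "$30,000+ per TB", used as a lower bound
HADOOP_PER_TB = (300, 1_000)       # "$300-$1,000 per TB"

def storage_cost(terabytes):
    """Return rough storage cost estimates in dollars for a given volume."""
    return {
        "traditional_min": terabytes * TRADITIONAL_PER_TB,
        "hadoop_low": terabytes * HADOOP_PER_TB[0],
        "hadoop_high": terabytes * HADOOP_PER_TB[1],
    }

# 10 TB, 1 PB, and 10 PB, assuming 1 PB = 1,000 TB.
for tb in (10, 1_000, 10_000):
    cost = storage_cost(tb)
    print(
        f"{tb:>6} TB: traditional >= ${cost['traditional_min']:,}, "
        f"Hadoop ${cost['hadoop_low']:,} to ${cost['hadoop_high']:,}"
    )
```

Under these figures the gap is a factor of 30 to 100 at every volume, since both models are linear in the number of terabytes.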
Expanding Data Requires A New Approach

What we do: Copy Data to Applications
Process-centric businesses use:
• Structured data mainly • Internal data only • “Important” data only • Multiple copies of data

What we should do: Bring Applications to Data
Information-centric businesses use all Data: Multi-structured, Internal & external data of all types
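The difference between copying data to applications and bringing applications to data can be sketched with a toy example. The three-node layout, the node names, and both function names are hypothetical; the sketch only counts how many values cross the network in each style and is not a model of how Hadoop schedules work.

```python
# Hypothetical cluster: each node holds one local partition of the data.
PARTITIONS = {
    "node1": [3, 8, 15],
    "node2": [4, 16, 23, 42],
    "node3": [1, 2],
}

def copy_data_to_application(partitions):
    """Ship every record to a central application, then compute there."""
    moved = [value for part in partitions.values() for value in part]
    return sum(moved), len(moved)          # (result, values sent over the network)

def bring_application_to_data(partitions):
    """Run the computation next to each partition; ship only partial results."""
    partials = [sum(part) for part in partitions.values()]
    return sum(partials), len(partials)    # (result, values sent over the network)

print(copy_data_to_application(PARTITIONS))
print(bring_application_to_data(PARTITIONS))
```

Both styles produce the same total, but the second moves one partial result per node instead of every record, which is why the network stops being the limiting factor as data grows.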
A Typical Journey of Hadoop Adoption
Operational Efficiency (Faster, Bigger, Cheaper)
Transformative Applications (New Business Value)
Cheap Storage
ETL Acceleration
EDW Optimization
Agile Exploration
Data Science
Converged Analytics
IT
Business
The Typical Enterprise Data Analytics Stack
Business Intelligence / Applications
RDBMS
ETL Processing
Staging / Storage
Collection
Step 1: EDH for Storage/Staging/Active Archive
Business Intelligence / Applications
RDBMS
ETL Processing
EDH for Storage / Active Archive
Collection
Step 2: EDH for Data Collection (Flume/Sqoop)
Business Intelligence / Applications
RDBMS
ETL Processing
EDH for Collection & Storage
Step 3: EDH for ETL Processing Acceleration
Business Intelligence / Applications
RDBMS
EDH for Collection, Storage & ETL Processing Acceleration
ETL / Data Integration Tools
Step 4: EDH for EDW Optimization (Impala)
Business Intelligence / Applications
RDBMS
Rarely Used Data
EDH for Collection, Storage, ETL Acceleration & Historical RDBMS Data/Queries
Step 5: EDH for Agile Exploration
BI / Applications
Agile Exploration
RDBMS
EDH for Collection, Storage, ETL Acceleration, Historical Queries & Agile Exploration
Step 6: EDH for Data Science (Not Only SQL)
BI / Applications
Agile Exploration
Data Science
RDBMS
EDH for Collection, Storage, ETL Acceleration, Historical Queries, Exploration & Data Science
Step 7: Converged Analytics - Apps Come to Data
BI
Explore
Data Science
SAS, R, Spark
Informatica, SyncSort, Pentaho
Hunk ...
RDBMS
EDH for Collection, Storage, ETL Acceleration, Historical Queries, Exploration, Data Science & Multiple Applications/Workloads
The Traditional Way: Bringing Data to Compute
4 Complex Architecture
• Many special-purpose systems • Moving data around • No complete views
3 Cost of Analytics
• Existing systems strained • No agility • “BI backlog”
2 Time to Data
• Up-front modeling • Transforms slow • Transforms lose data
1 Missing Data
• Leaving data behind • Risk and compliance • High cost of storage
EDWS
MARTS
SERVERS
DOCUMENTS
STORAGE
SEARCH
ARCHIVE
ERP, CRM, RDBMS, MACHINES
FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS
EXTERNAL DATA SOURCES
The New Way: Bringing Compute to Data
4 Diverse Analytic Platform
• Bring applications to data • Combine different workloads on common data (e.g. SQL + Search) • True analytic agility
3 Self-Service Exploratory BI
• Simple search + BI tools • “Schema on read” agility • Reduce BI user backlog requests
2 Persistent Staging
• One source of data for all analytics • Persist state of transformed data • Significantly faster & cheaper
1 Active Compliance Archive
• Full fidelity original data • Indefinite time, any source • Lowest cost storage
SERVERS
MARTS
EDWS
DOCUMENTS
STORAGE
SEARCH
ARCHIVE
ERP, CRM, RDBMS, MACHINES
FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS
EXTERNAL DATA SOURCES
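The “schema on read” idea from this slide can be illustrated with a small sketch: raw records are stored untouched, and each consumer applies its own schema only when reading. The sample events, field names, and the `read_with_schema` helper are all hypothetical and stand in for whatever raw files and query engine a real deployment would use.

```python
import json

# Raw events kept at full fidelity; fields vary from record to record.
RAW_EVENTS = [
    '{"ts": "2014-03-01T10:00:00", "user": "a", "page": "/home", "ms": 120}',
    '{"ts": "2014-03-01T10:00:05", "user": "b", "page": "/cart", "ms": 340, "ref": "ad-17"}',
    '{"ts": "2014-03-01T10:00:09", "user": "a", "page": "/cart"}',
]

def read_with_schema(raw_lines, schema):
    """Project raw records onto a schema at read time.

    schema maps field name -> (cast function, default used when the field is absent).
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }

# Two consumers read the same raw data with different schemas, without reloading it.
latency_schema = {"page": (str, None), "ms": (int, 0)}
referral_schema = {"user": (str, None), "ref": (str, "direct")}

latency_rows = list(read_with_schema(RAW_EVENTS, latency_schema))
referral_rows = list(read_with_schema(RAW_EVENTS, referral_schema))
```

Because no schema was imposed at load time, the `ref` field that appears in only one record is still available to the consumer that cares about it, and adding a third view later requires no change to the stored data.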
Evolution of The Enterprise Data Hub
CLOUDERA’S ENTERPRISE DATA HUB
BATCH PROCESSING: MAPREDUCE
ANALYTIC SQL: IMPALA
SEARCH ENGINE: SOLR
MACHINE LEARNING: SPARK
STREAM PROCESSING: SPARK STREAMING
3RD PARTY APPS
WORKLOAD MANAGEMENT: YARN
STORAGE FOR ANY TYPE OF DATA: UNIFIED, ELASTIC, RESILIENT, SECURE (SENTRY)
FILESYSTEM: HDFS
ONLINE NOSQL: HBASE
DATA MANAGEMENT: CLOUDERA NAVIGATOR
SYSTEM MANAGEMENT: CLOUDERA MANAGER
Open Source, Scalable, Flexible, Cost-Effective ✔
Managed ✖ ✔
Open Architecture ✖ ✔
Secure and Governed ✖ ✔
The Modern Information Architecture
Data Architects
System Operators
Engineers
Data Scientists
Analysts
Business Users
META DATA / ETL TOOLS
CLOUDERA MANAGER
CONVERGED ANALYTICS
DATA MODELING
BI / ANALYTICS
ENTERPRISE REPORTING
ENTERPRISE DATA WAREHOUSE
ENTERPRISE DATA HUB
SYS LOGS
WEB LOGS
FILES
ONLINE SERVING SYSTEM
RDBMS
WEB/MOBILE APPLICATION
Customers & End Users
The Power of the EDH is?
EDH
RDBMS
Enabling The App Store of Big Data
BI and Analytics Partners
SI, Cloud, MSP Partners
Database Partners
Resellers
Data Integration Partners
Hardware Partners
Thank You!
Twitter: @awadallah