Big Data, Machine Learning & Data Mining, Computational Science and Engineering
Ho Tu Bao
Japan Advanced Institute of Science and Technology (JAIST)
John von Neumann Institute, VNU-HCM

Outline
BIG DATA TORRENT
Machine Learning and Data mining
Computational Science and Engineering

Three emerging IT technologies
Smart devices, cloud computing, big data
Many IT companies planned for the future based on these three technologies.

BIG DATA VALUE
Source: McKinsey Global Institute

What is big data?
Big data refers to data sets that are too large and complex to manage and analyze with traditional IT techniques.
Volume: Scale from terabytes to petabytes (10^15 bytes) to zettabytes (10^21 bytes).
Variety: Complexity of data in many different structures, ranging from relational, to logs, to raw text.
Velocity: Streaming data and large-volume data movement.

Where does big data come from?
Social media data: Insights to companies on consumer behavior and sentiment.
Machine data: Industrial equipment, sensors that monitor machinery, web logs that track user behavior online.
Transactional data: Product IDs, prices, payment, manufacturer and distributor data, and much more.
And others.
Each day: 230M tweets, 2.7B comments to FB, 86,400 hours of video to YouTube
Large Hadron Collider generates 40 terabytes/sec
Amazon.com: $10B in sales in Q3 2011; US pizza chain Domino's: 1 million customers per day

Big data chases election 2012 undecided voters
From data mining to online organizing. Through Facebook, Twitter and other online sources, the campaign is working tirelessly to create a formidable database collecting specific profiles of potential voters.
More than 150 techies are quietly peeling back the layers of your life. They know what you read and where you shop, what kind of work you do and who you count as friends. They also know who your mother voted for in the last election.
Obama has 16 million Twitter followers compared to Romney’s 500,000. On Facebook, Obama has nearly 27 million followers to Romney’s 1.8 million.

Big data can be very small. Not all large datasets are big.
Big refers to big complexity rather than big volume.
Big data that is very small: Power stations, planes, etc. have hundreds of thousands of sensors that produce complex combinations of sensor readings. The data streaming from all sensors is big data even though the dataset itself is not that large (an hour of flying: 100,000 sensors x 60 minutes x 60 seconds x 8 bytes, less than 3 GB).
Large datasets that aren’t big: An increasing number of systems generate very large quantities of very simple data.
(MIKE2.0)
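The one-hour flight estimate quoted above (100,000 sensors x 60 minutes x 60 seconds x 8 bytes) can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; it assumes one 8-byte reading per sensor per second, which is how the product on the slide reads, and the function name is made up for illustration.

```python
# Rough check of the "one hour of flying" data volume from the slide.
SENSORS = 100_000
SECONDS_PER_HOUR = 60 * 60
BYTES_PER_READING = 8  # assumption: one 8-byte reading per sensor per second


def hourly_volume_bytes(sensors=SENSORS,
                        seconds=SECONDS_PER_HOUR,
                        bytes_per_reading=BYTES_PER_READING):
    """Return the number of bytes produced in one hour."""
    return sensors * seconds * bytes_per_reading


total = hourly_volume_bytes()
print(f"{total:,} bytes = {total / 1e9:.2f} GB")  # 2,880,000,000 bytes = 2.88 GB
```

The result stays under 3 GB, which supports the point that the difficulty lies in the complexity of the readings rather than in their volume.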

Big data across the federal government
29 March 2012 (retrieved 26 Sep 2012): 84 different big data programs, 6 departments
Defense: Autonomous systems ($250M/year)
Homeland Security: COE on visualization and data analytics (from natural disasters to terrorist incidents), Rutgers & Purdue Univ.
Energy: High-performance storage system to manage petabytes of data, mathematics for analysis of petascale data (machine learning, statistics, ...)
Health and Human Services: Disease Control & Prevention
Food and Drug Administration (FDA)
National Aeronautics & Space Administration (NASA)
National Institutes of Health (NIH)
National Science Foundation (NSF): Core techniques and technologies for advancing big data S&E.
www.WhiteHouse.gov/OSTP

International collaboration on big data

Big data, big analytics, big opportunity
You certainly have heard of scientific researchers using supercomputers to analyze massive amounts of data. The difference now is that big data is accessible to regular business intelligence users and is applicable to the enterprise.
Example: Walmart analyzes real-time social media data for trends, then uses that information to guide online ad purchases.
IDC determined that the big data technology and services market was worth $3.2B USD in 2010 and is going to skyrocket to $16.9B by 2015.

Big data, big analytics, big opportunity
Some very large manufacturing firms, known in the past mostly for hardware engineering, are now evolving into firms delivering services such as business analytics.
IBM’s past: Producing servers, desktop computers, laptops, and other supporting infrastructure.
IBM today: Divested from several hardware initiatives, such as manufacturing laptops, and has instead spent billions on acquisitions to build its analytic credentials, trying to rebrand itself as a leader in business analytics.
IBM acquired SPSS for over a billion dollars to capture the retail side of the business analytics market. For large commercial ventures, IBM acquired Cognos to offer full-service analytics.
http://dawn.com/2012/07/25/big-data-big-analytics-big-opportunity/ 25 July 2012

Google’s Cloud Storage and BigQuery
Google understands how to process and manage large volumes of data, at a scale much bigger than most companies. Google built its own technology for fast, interactive analysis of massive data: BigQuery (connected to Tableau) and Cloud Storage.
http://www.wired.com/insights/2012/11/visual-analytics-brings-big-data-in-googles-cloud-to-life/
[Figure: Google Data Center]

Turning big data into value
Big data analytics enable your organization to tackle complex problems that previously could not be solved, leading to better decisions and actions.
Competitive advantages.
Provide insights about the complex behavior of human societies.
Breakthroughs in science.
etc.

Data driven XYZ
Knowledge-driven approach to science: some knowledge of the domain; synthesis; hypotheses to be tested; experiment; observations.
Data-driven approach to science: carefully designed data-generating experiment; inductive reasoning; generation of hypotheses; analyze and test hypotheses by computation.

Big data inquiries
October 19, 2011 - October 10, 2012
[Figure: inquiries by industry, by enterprise, and by region]

Gartner prediction on big data
IT to spend $232B on Big Data over 5 years
Source: Forbes and Gartner, Oct. 15, 2012

Big data landscape

A framework of big data
[Figure: layered framework, from top to bottom]
PUBLICATION ACCESS: directed actions to humans (browser, mobile devices, custom handheld); directed actions to machines (web services, FTP and SFTP, MQ, JMS, sockets)
VISUALIZATION: tag cloud, clustergram, history flow, spatial information flow
DATA ANALYTICS: data mining (classification, cluster analysis, association rules, trend detection, predictive modeling, ...); machine learning (supervised learning, unsupervised learning, reinforcement learning, graphical models, learning to rank, ...); statistics; network analysis; spatial analysis; time series analysis
DATA MANAGEMENT: distributed file system (HDFS/GFS/...); parallel computing (MapReduce/...); data security; data cleaning; data storage (RDBMS, NoSQL DB; key/value, column; master-slave, P2P; ...)
EXTRACT: semi-structured/unstructured data extraction
DATA SOURCES: commercial and open source; structured and unstructured; enterprise (Oracle, SAP, customer systems, etc.); sensors; mobiles; web/unstructured; crowdsource; ...
Source: WAMDM, Web group

Challenges to big data
Big data requires new systems for centralizing, aggregating, analyzing, and visualizing these enormous data sets. In particular, analyzing and understanding petabytes of structured and unstructured data poses the following unique challenges:
Scalability
Robustness
Diversity
Analytics
Visualization of the data
Source: Cisco

Machine learning and Data mining
Machine learning
To build computer systems that learn as well as a human does.
ICML since 1982 (23rd ICML in 2006), ECML since 1989
ECML/PKDD since 2001
ACML started Nov. 2009
Data mining
To find new and useful knowledge from large datasets.
ACM SIGKDD since 1995, PKDD and PAKDD since 1997
IEEE ICDM and SIAM DM since 2000, etc.

Two main factors: data & learning tasks
Types and size of data
Flat data tables, relational databases, temporal & spatial data, transactional databases, multimedia data, materials science data, textual data, Web data, biological data, complexly structured data, social networks, etc.
Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18
Immense text. A portion of a DNA sequence with a length of 1.6 million characters:
TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAA AGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTT TTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTT AGAATATTTAACTTAATCAAATTATTTGAATTAAATTAG GATTAATTAGGTAAGCTAACAAATAAGTTAAATTTTTAA ATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGG AAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAA GTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAA ATATTAAGAGTGATGAAGTATATTATGT…
Learning tasks & methods
Supervised learning: decision trees, neural networks, rule induction, support vector machines, etc.
Unsupervised learning: clustering, modeling and density estimation, etc.
Reinforcement learning: Q-learning, adaptive dynamic programming, etc.

About machine learning
“A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)

Development of machine learning
[Figure: timeline from 1950 to 2010 with the periods enthusiasm, dark age, renaissance, maturity, and fast development. Marked years: 1941, 1949, 1956, 1958, 1968, 1970, 1972, 1982, 1986, 1990, 1997. Conferences: ICML (1982), ECML (1989), KDD (1995), PAKDD (1997), ACML (2009). Topics shown: neural modeling, rote learning, pattern recognition emerged, Minsky criticism, symbolic concept induction, math discovery AM, supervised learning, unsupervised learning, PAC learning, ILP, abduction and analogy, NN, GA, EBL, CBL, revival of non-symbolic learning, experimental comparisons, multi-strategy learning, reinforcement learning, Bayesian methods, successful applications, data mining, IR & ranking, Web linkage, statistical learning, ensemble methods, kernel methods, probabilistic graphical models, dimensionality reduction, semi-supervised learning, active & online learning, transfer learning, structured prediction, sparse learning, nonparametric Bayesian, MIML, deep learning.]

Dimensionality reduction
The process of reducing the number of random variables under consideration. It can be divided into feature selection and feature extraction.

Sparse modeling
Selection and construction of a small set of highly predictive variables in high-dimensional datasets.
A rapidly developing area at the intersection of statistics, machine learning and signal processing, typically applied when data are high-dimensional with small samples.
Sparse SVMs, sparse Gaussian processes, sparse Bayesian methods, sparse regression, sparse Q-learning, sparse topic models, etc.
Example: find a small number of most relevant voxels (brain areas)?
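As a minimal illustration of feature selection (keeping a small set of predictive variables), the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top k. This is a simple filter method chosen for brevity, not one of the sparse methods named on the slide; the toy data and the names `pearson` and `select_features` are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, target, k):
    """Return the indices of the k features most correlated with the target."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    # Highest score first; ties broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])


# Toy data: feature 0 tracks the target closely, features 1 and 2 do not.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 3.0, 0.1],
    [3.0, 6.0, 0.4],
    [4.0, 2.0, 0.2],
]
target = [1.1, 1.9, 3.2, 3.9]
selected = select_features(rows, target, k=1)
print(selected)  # [0]
```

Feature extraction, the other branch mentioned above, would instead build new variables (for example linear combinations of the originals) rather than pick a subset.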

Probabilistic graphical models
Graphical models = Probability Theory + Graph Theory
A way of describing/representing a reality by probabilistic relationships between random variables (observed and unobserved ones).
Two key tasks
Learning: The structure and parameters of the model.
Inference: Use observed variables to compute the posterior distributions of other variables.
[Figure: graphical model for monitoring intensive-care patients. Node labels: MINVOLSET, PULMEMBOLUS, PAP, SHUNT, VENTMACH, VENTLUNG, DISCONNECT, VENITUBE, PRESS, MINOVL, KINKEDTUBE, INTUBATION, ANAPHYLAXIS, SAO2, TPR, HYPOVOLEMIA, LVEDVOLUME, CVP, PCWP, LVFAILURE, STROEVOLUME, FIO2, VENTALV, PVSAT, ARTCO2, EXPCO2, INSUFFANESTH, CATECHOL, HISTORY, ERRBLOWOUTPUT, CO, HR, HREKG, ERRCAUTER, HRSAT, HRBP, BP.]

Instances of graphical models
[Figure: probabilistic models contain graphical models, which are directed or undirected. Instances shown: Naïve Bayes classifier, LDA, Bayes nets, mixture models, DBNs, conditional random fields, Kalman filter model, MRFs, Hidden Markov Model (HMM), MaxEnt.]
Murphy, ML for life sciences
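The inference task (using an observed variable to compute the posterior distribution of an unobserved one) can be sketched on the smallest possible directed model: one hidden cause and one observed symptom. The variable names and all probabilities below are made up for illustration and are not taken from the intensive-care network in the figure.

```python
def posterior(prior, likelihood, observed):
    """Compute P(hidden | observed) by enumeration using Bayes' rule.

    prior:      {hidden_state: P(hidden_state)}
    likelihood: {hidden_state: {observation: P(observation | hidden_state)}}
    """
    joint = {h: prior[h] * likelihood[h][observed] for h in prior}
    evidence = sum(joint.values())  # P(observed)
    return {h: p / evidence for h, p in joint.items()}


# Hypothetical two-node model: heart state -> blood pressure reading.
prior = {"failure": 0.01, "normal": 0.99}
likelihood = {
    "failure": {"low_bp": 0.90, "ok_bp": 0.10},
    "normal": {"low_bp": 0.05, "ok_bp": 0.95},
}

post = posterior(prior, likelihood, "low_bp")
print(round(post["failure"], 4))  # 0.1538
```

In larger networks the same computation is organized along the graph structure so that the full joint distribution never has to be enumerated; learning, the other key task, would estimate the numbers in `prior` and `likelihood` from data.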

Probabilistic graphical models
Topic models: Roadmap to text meaning
Key idea: documents are mixtures of latent topics, where a topic is a probability distribution over words.
Hidden variables, generative processes, and statistical inference are the foundation of probabilistic modeling of topics.
[Figure: normalized co-occurrence matrix C (documents, words) expressed through Φ (words, topics) and Θ (topics, documents).]
[Figure: plate diagram with variables w, Z, θ, π, β and plates N, D, K.]
Blei, D., Ng, A., Jordan, M., Latent Dirichlet Allocation, JMLR, 2003

Fully sparse topic model
Topic modeling is the key approach to automating the extraction of text meaning (idea: a topic is a set of words with a probability distribution, and a document is a mixture of latent topics).
Our sparse topic model (FSTM) allows dealing with big text data (millions of documents and thousands of topics), which current dense topic models cannot do (reducing the storage from 23.3 GB to 33.3 MB for 350,000 documents).
Topic model: sparse vs. dense
Number of topics: thousands vs. hundreds
Inference time: linear vs. nonlinear
Sparse topic representation: 100 times smaller
Sparse document representation: 350 times smaller
Storage: 700 times smaller
How fast can the models learn?
[Figure: learning time (s) vs. number of topics (0 to 100) on the AP, KOS and Grolier datasets for FSTM, PLSA, LDA and STC.]
How fast can the models infer?
[Figure: inference time (s) vs. number of topics (0 to 100) on the AP, KOS and Grolier datasets.]
Khoat Than and Tu Bao Ho, papers in ECML 2012 and ACML 2012.
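The key idea above (a document is a mixture of latent topics, and a topic is a probability distribution over words) means the probability of a word in a document is the topic-weighted sum of its per-topic probabilities. The sketch below shows that sum with a sparse document representation in which only nonzero topic proportions are stored. The vocabulary, topics and numbers are invented for illustration; this is not the FSTM algorithm itself.

```python
# Each topic is a probability distribution over a tiny, made-up vocabulary.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.8},
}

# Sparse document representation: topics with zero proportion are simply absent.
doc_topics = {"politics": 0.75, "sports": 0.25}


def word_probability(word, doc_topics, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())


p_vote = word_probability("vote", doc_topics, topics)
print(p_vote)
```

With thousands of topics, storing only the few nonzero proportions per document is what makes the storage reduction quoted on the slide plausible.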

Previous work on clinical data
HCC CLINICAL DATA
[Figure: pie chart with shares HBV 53%, HCV 25%, Other 22%.]
[Figure: ZTT over time for HBV and HCV, relative to the normal region. Pattern ZTT: H>N-S, meaning ZTT was first increasingly high, then changed to the normal region and remained stable.]
Methods: APE (abstraction pattern extraction), TRE (temporal relation extraction)
Ho et al., New Generation Computing, V25, N3; IEICE, Inf. Sys. V.E90D, N10, KDD2003, etc.

Hepatitis C virus (HCV)
Prevalence of Hepatitis C carriers (~200M)
IFN/RBV therapy (interferon/ribavirin)
SVR: Sustained Viral Response
[Figure: fibrosis stage (F0 to F4) over time, with SVR and non-SVR outcomes (~50%).]
Murray et al., Medical Microbiology, 5th edition, 2005, Chapter 66 (left), Chapter 62 (right), published by Mosby, Philadelphia

OMICS DATA
Learning by topic model
[Figure: a protein sequence is mapped to a vector in the original space (a vector of subsequence frequencies), then to a vector in topical space (interpretation with 5 topics), followed by visualization.]

RNA interference and hepatitis
RNAi (siRNA and miRNA) is a post-transcriptional gene silencing (PTGS) mechanism.
Chemically synthesized siRNAs can mimic the native siRNAs produced by RNAi but have different abilities.
General target: selection of potent siRNAs for gene silencing.
Fire, A., Mello, C., Nature 391, 1998. Nobel Prize 2006

RNA interference and hepatitis
Which siRNAs have high knockdown efficacy among the 274,877,906,944 (4^19) siRNA sequences of 19 characters from {A, C, G, U}?
Machine learning approach
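The size of the search space is the number of strings of length 19 over a four-letter alphabet, that is 4^19. A quick check (the constant names are arbitrary):

```python
ALPHABET = "ACGU"   # the four RNA nucleotides
LENGTH = 19         # siRNA length used on the slide

n_sequences = len(ALPHABET) ** LENGTH
print(f"{n_sequences:,}")  # 274,877,906,944
```

At roughly 2.7 x 10^11 candidates, exhaustive laboratory screening is not practical, which motivates learning a scoring function from labeled examples.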
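Empirical siRNA design rules
(Qiu, 2009; Takasaki 2009; Alistair 2008, etc.)
[Table: preferred nucleotides by position, with columns Position/Nucleotide, A, C, G, U. Rows include positions 17 and 12; entries are preference orderings such as C > A > G, A > U > C, A > C = G, A > U > G, C > G > U, U > C > G, ...]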

Design rules for siRNAs
siRNA sequences
Sequence              Label
GUGGGAGCGCGUGAUGAAC   VH
CACUCUACUGCAGCAAAGC   VH
AUGUUCUUCUCCAAGGUGC   L
AACAUGUAAGGACUUUGAU   L
CUGCUUGUACCAAUUGCUA   L
……
Transformed data
Transaction                                               Label
3 8 11 15 19 21 27 30 35 38 43 48 51 53 60 63 65 69 74    VH
2 5 10 16 18 24 25 30 36 39 42 45 51 54 57 61 65 71 74    VH
1 8 11 16 20 22 28 32 34 40 42 46 49 53 59 63 68 71 74    L
1 5 10 13 20 23 28 29 33 39 43 45 50 56 60 64 67 69 76    L
2 8 11 14 20 24 27 32 33 38 42 45 49 56 60 63 66 72 73    L
…….
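The transformed rows above are consistent with a simple positional encoding: the nucleotide at 0-based position i becomes the item 4*i + code, with A=1, C=2, G=3, U=4. The slide does not state the rule; it is inferred here from the five example rows, and the function names are hypothetical.

```python
NUCLEOTIDE_CODE = {"A": 1, "C": 2, "G": 3, "U": 4}


def to_transaction(sequence):
    """Encode an siRNA sequence as a list of (position, nucleotide) item ids."""
    return [4 * position + NUCLEOTIDE_CODE[base]
            for position, base in enumerate(sequence)]


def to_sequence(transaction):
    """Decode a transaction back to the nucleotide string."""
    bases = "ACGU"
    return "".join(bases[(item - 1) % 4] for item in transaction)


print(to_transaction("GUGGGAGCGCGUGAUGAAC"))
```

Encoding each (position, nucleotide) pair as its own item lets standard transaction-mining methods, such as association rule mining, be applied directly to the sequences.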

Learn a function scoring knockdown efficacy of siRNAs?
Learn a function scoring knockdown efficacy of siRNAs for a disease?
Generate siRNA with highest knockdown efficacy?

Descriptive rules for siRNAs
Rule: C-A--A-----------UGA---------G-----UGA-------------U-UG-A------U---G----G-A---------AG-----…..
Label: VH VH VH VH VH

New paradigm of science and big data
Jim Gray (1944-2007)
CACM, Dec. 2010

Computational science (CS) and computational science and engineering (CSE)
Computational science (using math and computation to do work in other sciences) vs. computer science (making hardware and software for computation)
CACM, Sep. 2010

Components of computational science:
Models and simulation
Computer science: network, data analysis
Computer infrastructure (supercomputers)
CSE: development and application of computational models, often coupled with high-performance computing (HPC), to solve complex problems arising in engineering analysis and design (CE) as well as natural phenomena (CS). Source: PITAC report and SIAM
[Figure: CSE at the intersection of Mathematics, Computer Science, and Science & Engineering]

Competition on supercomputers
Nov. 2010: China’s Tianhe-1A, 2.56 petaflops, 23552 processors
June 2012: Japan’s K computer, 10.51 petaflops, 88128 processors
June 2012: SuperMUC, Europe’s fastest, 2.9 petaflops, 18432 processors
Nov. 2012: Cray’s Titan computer, 17.59 petaflops, 560640 processors

Some national-level problems
To deal with climate change and environmental disasters (river flow, flood forecasting, ocean simulation, soil erosion...)
Prediction of risks of big construction projects such as nuclear plants, hydroelectric power plants, bank systems, etc.
CSE in defense, society...

Lessons learned from Japan’s K computer
Japan national key project, 1 billion USD (2007-2012)
Started 21 application programs at the beginning of the K computer project.

Scientific breakthroughs
Life science, biomedicine: prediction of disease diffusion, resistance of malaria, etc.
Materials science and nano-technologies: development of multi-scale materials modelling, from quantum mechanical understanding of nanostructures to engineering applications.
Computational finance: models and simulation for decision making in trading, hedging, investment and their risk management.
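As one concrete instance of a computational finance model, the Black-Scholes formula for a European call (the model named in the deck's pricing-surface figure) can be evaluated over a small grid of maturities and volatilities. This is a minimal sketch using the standard closed-form formula; the parameter values and the grid are arbitrary illustrations, not data from the talk.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, maturity, rate, volatility):
    """Black-Scholes price of a European call option (no dividends)."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


# A coarse "pricing surface": rows are maturities (years), columns are volatilities.
maturities = (0.5, 1.0, 2.0)
volatilities = (0.1, 0.2, 0.3)
surface = [[black_scholes_call(100.0, 100.0, t, 0.05, s) for s in volatilities]
           for t in maturities]

for t, row in zip(maturities, surface):
    print(t, [round(price, 2) for price in row])
```

Models without a closed form are typically priced by simulation instead, which is where high-performance computing enters.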
[Figure: Black–Scholes European Call Option Pricing Surface]

Relationship between three domains
• Big data needs views, methods, and high performance of CSE and data mining.
• Solutions could be diverse: powerful models, smart programs, supercomputers, or all of them.
• Key technology for big data.
• Be motivated to develop more powerful methods or formulate new problems.
• Big data and data mining require better models, analytic tools and supercomputers.
• CSE enriches the merit of big data and data mining.

Future work
SHIFT IN MEDICINE RESEARCH
Molecular medicine is essentially based on learning from omics data

Cyber-physical service

Relationship of human and computer
[Figure: scope of ICT usage from 1990 to 2020. Stages: computer centric, network centric, human centric. Uses: productivity improvement, business process innovation, creating knowledge, supporting human activities. Technologies: PC, Internet, network, mobile network, ubiquitous terminals, cloud computing, sensor technology.]
Source: Fujitsu. Copyright 2011 FUJITSU LIMITED

One size does not fit all
Big data and computational science and engineering (CSE) are emerging technologies and fields that will impact the future.
Machine learning & data mining have been changing fast together with statistics, and are the key technology for big data & CSE.
No universal powerful method. Each different context of big data, CSE and data mining needs its most appropriate solution.
Big opportunities but challenging.
Why and how these in Viet Nam?

Thanks

Hadoop Background
Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by the Google Map/Reduce and Google File System papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
Hadoop is a paradigm that says that you send your application to the data rather than sending the data to the application.
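The Map/Reduce pattern that inspired Hadoop can be imitated in a few lines of plain Python. The sketch below is an in-memory word count that follows the map, shuffle and reduce steps; it is not Hadoop's API, and the function names and the `partitions` data are invented. Each inner list stands in for the records stored on one node, so the map function is applied where the data lives.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(partitions):
    mapped = []
    for partition in partitions:      # each partition stands for one node's data
        for record in partition:
            mapped.extend(map_phase(record))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())


partitions = [["big data big analytics"], ["big opportunity", "data mining"]]
counts = word_count(partitions)
print(counts)
```

In a real cluster the map calls run in parallel on the nodes holding each partition, and only the much smaller (key, value) pairs travel over the network.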

What Hadoop Is Not
It is not a replacement for your Database & Warehouse strategy.
Customers need hybrid database/warehouse & Hadoop models.
It is not a replacement for your ETL strategy.
Existing data flows aren’t typically changed, they are extended.
It is not designed for real-time complex event processing like Streams.
Customers are asking for Streams & BigInsights integration.