Outline Three emerging IT technologies BIG DATA ...


Outline Big Data, Machine Learning & Data mining, Computational Science and Engineering

Machine Learning and Data mining

Ho Tu Bao Japan Advanced Institute of Science and Technology (JAIST) John von Neumann Institute, VNU-HCM

Computational Science and Engineering


BIG DATA TORRENT

Three emerging IT technologies

BIG DATA VALUE

Smart devices, cloud computing, big data

Many IT companies planned for the future based on these three technologies.


Source: McKinsey Global Institute


What is big data?

Big data refers to data sets that are too large and complex to manage and analyze with traditional IT techniques.

•  Volume: Scale from terabytes to petabytes (10^15 bytes) to zettabytes (10^21 bytes)
•  Variety: Complexity of data in many different structures, ranging from relational, to logs, to raw text
•  Velocity: Streaming data and large-volume data movement

Where does big data come from?

•  Social media data: Insights to companies on consumer behavior and sentiment.
•  Machine data: Industrial equipment, sensors that monitor machinery, web logs that track user behavior online.
•  Transactional data: Product IDs, prices, payment, manufacturer and distributor data, and much more.
•  And others

Each day: 230M tweets, 2.7B comments to FB, 86400 hours of video to YouTube

Large Hadron Collider generates 40 terabytes/sec

Amazon.com: $10B in sales in Q3 2011; US pizza chain Domino's: 1 million customers per day


Big data can be very small. Not all large datasets are big.

•  Big refers to big complexity rather than big volume.
•  Big data that is very small: power stations, planes… have hundreds of thousands of sensors, producing complex combinations of sensor readings. Data streaming from all sensors is big data even if the size of the dataset is not that large (an hour of flying: 100,000 sensors x 60 minutes x 60 seconds x 8 bytes, less than 3 GB).
•  Large datasets that aren’t big: an increasing number of systems generate very large quantities of very simple data.

MIKE2.0

Big data chases election 2012 undecided voters

From data mining to online organizing. Through Facebook, Twitter and other online sources, the campaign is working tirelessly to create a formidable database collecting specific profiles of potential voters.

More than 150 techies are quietly peeling back the layers of your life. They know what you read and where you shop, what kind of work you do and who you count as friends. They also know who your mother voted for in the last election. Obama has 16 million Twitter followers compared to Romney’s 500,000. On Facebook, Obama has nearly 27 million followers to Romney’s 1.8 million.
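The flight-sensor estimate on the "Big data can be very small" slide (100,000 sensors x 60 minutes x 60 seconds x 8 bytes) can be checked with a few lines of arithmetic. The sketch below assumes one 8-byte reading per sensor per second, which is how the slide's product reads; the function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the flight-sensor example.
# Assumption (read off the slide): one 8-byte reading per sensor per second.
SENSORS = 100_000
SECONDS_PER_HOUR = 60 * 60
BYTES_PER_READING = 8

def hourly_volume_bytes(sensors=SENSORS, bytes_per_reading=BYTES_PER_READING):
    """Raw bytes produced in one hour at one reading per sensor per second."""
    return sensors * SECONDS_PER_HOUR * bytes_per_reading

total = hourly_volume_bytes()
print(f"{total:,} bytes = {total / 10**9:.2f} GB")  # 2,880,000,000 bytes = 2.88 GB
```

The volume is modest; the difficulty lies in the number of possible combinations of readings, not in storage.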


Big data across the federal government

International collaboration on big data

29 March 2012 (retrieved 26 Sep 2012): 84 different big data programs, 6 departments

•  Defense: Autonomous systems ($250M/year)
•  Homeland security: COE on visualization and data analytics (from natural disasters to terrorist incidents), Rutgers & Purdue Univ.
•  Energy: High performance storage system to manage petabytes of data, mathematics for analysis of petascale data (machine learning, statistics,…)
•  Health and Human Services: Disease Control & Prevention
•  Food and Drug Administration (FDA)
•  National Aeronautics & Space Administration (NASA)
•  National Institutes of Health (NIH)
•  National Science Foundation (NSF): Core techniques and technologies for advancing big data S&E.

www.WhiteHouse.gov/OSTP


Big data, big analytics, big opportunity

Some very large manufacturing firms, known in the past mostly for hardware engineering, are now evolving into firms delivering services such as business analytics.

You certainly have heard of scientific researchers using supercomputers to analyze massive amounts of data. The difference now is that big data is accessible to regular business intelligence users and is applicable to the enterprise.

•  Example: Walmart analyzing real-time social media data for trends, then using that information to guide online ad purchases.
•  IDC determined that the big data technology and services market was worth $3.2B in 2010 and is going to skyrocket to $16.9B by 2015.


•  IBM’s past: Producing servers, desktop computers, laptops, and other supporting infrastructure.
•  IBM today: Has divested from several hardware initiatives, such as manufacturing laptops, and has instead spent billions on acquisitions to build its analytics credentials, trying to rebrand itself as a leader in business analytics.
•  IBM acquired SPSS for over a billion dollars to capture the retail side of the business analytics market. For large commercial ventures, IBM acquired Cognos to offer full-service analytics.

http://dawn.com/2012/07/25/big-data-big-analytics-big-opportunity/ 25 July 2012


Google’s Cloud Storage and BigQuery

Google understands how to process and manage large volumes of data, at a scale much bigger than most companies. Google built its own technology for fast, interactive analysis of massive data: BigQuery (connected to Tableau) and Cloud Storage. http://www.wired.com/insights/2012/11/visual-analytics-brings-big-data-ingoogles-cloud-to-life/

Figure: Google Data Center

Turning big data into value

•  Big data analytics enable your organization to tackle complex problems that previously could not be solved, leading to better decisions and actions.
•  Competitive advantages.
•  Provide insights about the complex behavior of human societies.
•  Breakthroughs in science.
•  etc.

Knowledge-driven approach to science: some knowledge of the domain; synthesis; hypotheses to be tested; experiment and observations.

Data-driven approach to science: carefully designed data-generating experiment; inductive reasoning by computation; generation of hypotheses; analyze and test hypotheses.

Data driven XYZ

Big data inquiries

Gartner prediction on big data

October 19, 2011-October 10, 2012

IT to spend $232B on Big Data over 5 years

by industry

by enterprise

by region Source: Forbes and Gartner, Oct. 15, 2012


Big data landscape

A framework of big data

Figure (Source: WAMDM, Web group): layered framework, from top to bottom.

•  PUBLICATION ACCESS: directed actions to humans (browser, mobile devices, custom handheld); directed actions to machines (web services, FTP and SFTP, MQ, JMS, sockets)
•  VISUALIZATION: tag cloud, clustergram, history flow, spatial information flow
•  DATA ANALYTICS
   o  Data mining: classification, cluster analysis, association rules, trend detection, predictive modeling, …
   o  Machine learning: supervised learning, unsupervised learning, reinforcement learning, graphical models, learning to rank, …
   o  Statistics, network analysis, spatial analysis, time series analysis
   o  Crowdsourcing
•  DATA MANAGEMENT
   o  Distributed file systems (HDFS/GFS/…), parallel computing (MapReduce/…)
   o  Data cleaning, data security
   o  Data storage: key/value, column; master-slave, P2P; structured (RDBMS) and unstructured (NoSQL DB); commercial and open source
•  EXTRACT: semi-structured/unstructured data extraction
•  DATA SOURCES: enterprise systems (Oracle, SAP, customer systems, etc.), sensors, mobiles, Web/unstructured


Challenges to big data

Big data requires new systems for centralizing, aggregating, analyzing, and visualizing these enormous data sets. In particular, analyzing and understanding petabytes of structured and unstructured data poses the following unique challenges:

•  Scalability
•  Robustness
•  Diversity
•  Analytics
•  Visualization of the data

Source: Cisco

Outline

Machine Learning and Data mining

Computational Science and Engineering


Machine learning and Data mining

Machine learning
•  To build computer systems that learn as well as humans do.
•  ICML since 1982 (23rd ICML in 2006), ECML since 1989
•  ECML/PKDD since 2001
•  ACML starts Nov. 2009

Data mining
•  To find new and useful knowledge from large datasets.
•  ACM SIGKDD since 1995
•  PKDD and PAKDD since 1997
•  IEEE ICDM and SIAM DM since 2000, etc.

Two main factors: data & learning tasks

Types and size of data
•  Flat data tables
•  Relational databases
•  Temporal & spatial data
•  Transactional databases
•  Multimedia data
•  Materials science data
•  Textual data
•  Immense text
•  Web data
•  Biological data
•  Complexly structured data
•  Social network
•  etc.

Size scale: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18

A portion of the DNA sequence with length of 1.6 million characters:

TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAA AGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTT TTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTT AGAATATTTAACTTAATCAAATTATTTGAATTAAATTAG GATTAATTAGGTAAGCTAACAAATAAGTTAAATTTTTAA ATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGG AAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAA GTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAA ATATTAAGAGTGATGAAGTATATTATGT…

Learning tasks & methods
•  Supervised learning
   o  Decision trees
   o  Neural networks
   o  Rule induction
   o  Support vector machines
   o  etc.
•  Unsupervised learning
   o  Clustering
   o  Modeling and density estimation
   o  etc.
•  Reinforcement learning
   o  Q-learning
   o  Adaptive dynamic programming
   o  etc.

About machine learning

“A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)

Development of machine learning

Figure: timeline from 1950 to 2010, with the periods enthusiasm, dark age, renaissance, maturity, and fast development. Years marked: 1941, 1949, 1956, 1958, 1968, 1970, 1972, 1982, 1986, 1990, 1997.

Conferences: ICML (1982), ECML (1989), KDD (1995), PAKDD (1997), ACML (2009)

Topics placed along the timeline include: Neural modeling; Rote learning; Pattern Recognition emerged; Minsky criticism; Symbolic concept induction; Math discovery AM; Supervised learning; Unsupervised learning; PAC learning; NN, GA, EBL, CBL; Experimental comparisons; Revival of non-symbolic learning; ILP; Multi strategy learning; Abduction, Analogy; Reinforcement learning; Bayesian methods; Kernel methods; Statistical learning; Ensemble methods; Probabilistic graphical models; Data mining; IR & ranking; Web linkage; Successful applications; Active & online learning; Semi-supervised learning; Transfer learning; Structured prediction; Sparse learning; Dimensionality reduction; Deep learning; Nonparametric Bayesian; MIML.

Dimensionality reduction

The process of reducing the number of random variables under consideration; it can be divided into feature selection and feature extraction.

Sparse modeling

Selection and construction of a small set of highly predictive variables in high-dimensional datasets.

•  Rapidly developing area at the intersection of statistics, machine learning and signal processing.
•  Typically used when data are high-dimensional with small samples.
•  Sparse SVMs, sparse Gaussian processes, sparse Bayesian methods, sparse regression, sparse Q-learning, sparse topic models, etc.

Example: find a small number of most relevant voxels (brain areas).
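As one concrete instance of sparse regression, the sketch below fits a lasso model by cyclic coordinate descent in pure Python. It is a minimal illustration, not the method used in any work cited on these slides; the toy data, the penalty value `lam=0.5` and all function names are assumptions made for the example.

```python
# Minimal lasso by cyclic coordinate descent (pure Python, illustrative only).
# Objective: (1 / (2n)) * ||y - Xw||^2 + lam * ||w||_1

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            partial = [y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j)
                       for i in range(n)]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Toy data (hypothetical): y depends strongly on feature 0, weakly on feature 2,
# and not at all on feature 1. The three columns are mutually orthogonal.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2 * row[0] + 0.1 * row[2] for row in X]
weights = lasso_coordinate_descent(X, y, lam=0.5)
print(weights)
```

The L1 penalty sets the weak and irrelevant coefficients exactly to zero, which is the "small set of highly predictive variables" behavior described above, at the cost of shrinking the remaining coefficient.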


Probabilistic graphical models


Graphical models

A way of describing/representing a reality by probabilistic relationships between random variables (observed and unobserved ones).

Probability Theory + Graph Theory

Two key tasks
•  Learning: the structure and parameters of the model
•  Inference: use observed variables to compute the posterior distributions of other variables

Instances of graphical models

Taxonomy: probabilistic models; graphical models; directed and undirected.

Instances shown: Naïve Bayes classifier, LDA, mixture models, Bayes nets, DBNs, Hidden Markov Model (HMM), Kalman filter model, conditional random fields, MaxEnt, MRFs.

Figure: Bayes net for Monitoring Intensive-Care Patients, with nodes such as MINVOLSET, PULMEMBOLUS, PAP, SHUNT, VENTMACH, VENTLUNG, DISCONNECT, KINKEDTUBE, INTUBATION, ANAPHYLAXIS, HYPOVOLEMIA, LVFAILURE, CATECHOL, CO, HR and BP.


Murphy, ML for life sciences
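The inference task named on the graphical-models slide (use observed variables to compute the posterior of the others) can be sketched on the smallest directed model, a naive Bayes structure with one hidden cause and two observed symptoms. All probabilities and variable names below are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_CAUSE = {1: 0.1, 0: 0.9}        # P(C)
P_S1_GIVEN_C = {1: 0.8, 0: 0.2}   # P(S1 = 1 | C)
P_S2_GIVEN_C = {1: 0.7, 0: 0.1}   # P(S2 = 1 | C)

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(c, s1, s2):
    """Joint probability factorized along the graph C -> S1, C -> S2."""
    return P_CAUSE[c] * bernoulli(P_S1_GIVEN_C[c], s1) * bernoulli(P_S2_GIVEN_C[c], s2)

def posterior_cause(s1, s2):
    """P(C = 1 | S1 = s1, S2 = s2) by enumerating the hidden variable."""
    evidence = sum(joint(c, s1, s2) for c in (0, 1))
    return joint(1, s1, s2) / evidence

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(c, s1, s2) for c, s1, s2 in product((0, 1), repeat=3))
print(round(posterior_cause(1, 1), 4))  # 0.7568
```

Enumeration is exponential in the number of hidden variables, which is why larger networks such as the intensive-care example need dedicated inference algorithms.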


Probabilistic graphical models

Topic models: Roadmap to text meaning

•  Key idea: documents are mixtures of latent topics, where a topic is a probability distribution over words.
•  Hidden variables, generative processes, and statistical inference are the foundation of probabilistic modeling of topics.

Figure: normalized co-occurrence matrix C (words, documents) expressed through Φ (words, topics) and Θ (topics, documents); plate diagram of LDA (symbols w, Z, θ, π, β; plate sizes N, D, K).

Blei, D., Ng, A., Jordan, M., Latent Dirichlet Allocation, JMLR, 2003

Fully sparse topic model (FSTM)

•  Topic modeling is the key approach to automating the understanding of text meaning (idea: a topic is a set of words with a probability distribution, and a document is a mixture of latent topics).
•  Our sparse topic model allows dealing with big text data (millions of documents and thousands of topics), which current dense topic models cannot do (reducing the storage from 23.3 GB to 33.3 MB for 350,000 documents).

Topic model: sparse vs. dense
•  #topics: thousands vs. hundreds
•  Inference time: linear vs. non-linear
•  Sparse topic representation: 100 times smaller
•  Sparse document representation: 350 times smaller
•  Storage: 700 times smaller

Figure: How fast can the models learn? Learning time (s) vs. number of topics on AP, KOS and Grolier, comparing FSTM, PLSA, LDA and STC.

Figure: How fast can the models infer? Inference time (s) vs. number of topics on AP, KOS and Grolier.

Khoat Than and Tu Bao Ho, papers in ECML 2012 and ACML 2012.
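The key idea on the topic-model slides (a document is a mixture of latent topics, each topic a probability distribution over words) amounts to p(w | d) = Σ_k θ_k φ_k(w). The sketch below evaluates that mixture with sparse storage, keeping only nonzero entries in dictionaries. It illustrates why sparsity saves memory and work; it is not the FSTM learning or inference algorithm, and every topic, word and number in it is hypothetical.

```python
# Sparse storage: only nonzero probabilities are kept (dicts instead of dense arrays).
# All topics, words and numbers below are hypothetical.
topics = {
    0: {"gene": 0.5, "dna": 0.5},
    1: {"market": 0.6, "stock": 0.4},
}
doc_topic_mix = {0: 0.25, 1: 0.75}

def word_distribution(theta, phi):
    """p(w | d) = sum_k theta[k] * phi[k][w], touching only nonzero entries."""
    p = {}
    for k, weight in theta.items():
        for word, prob in phi[k].items():
            p[word] = p.get(word, 0.0) + weight * prob
    return p

p_words = word_distribution(doc_topic_mix, topics)
print(p_words)
```

With dense arrays the cost would scale with (number of topics) x (vocabulary size) per document; with sparse dictionaries it scales with the number of nonzero entries actually present.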


Previous work on clinical data

HCC CLINICAL DATA

Figure: pie chart with shares HBV 53%, HCV 25%, Other 22%.

Prevalence of Hepatitis C carriers (~200M)

Hepatitis C virus (HCV): IFN/RBV therapy (interferon/ribavirin), with outcomes SVR and non-SVR (~50%). SVR: Sustained Viral Response

Figure: fibrosis stage (F0 to F4) and ZTT over time, with the normal region marked. ZTT: H>N-S, that is, ZTT was first increasingly high, then changed to the normal region and stayed stable.

Methods: APE (abstraction pattern extraction), TRE (temporal relation extraction). Ho et al., New Generation Computing, V25, N3; IEICE, Inf. Sys. V.E90D, N10, KDD2003, etc.

OMICS DATA

Figure source: Murray et al., Medical Microbiology, 5th edition, 2005, Chapter 66 (left), Chapter 62 (right), published by Mosby, Philadelphia.

Learning by topic model

Pipeline: a protein sequence; a vector in original space (a vector of subsequence frequencies); a vector in topical space (with 5 topics); interpretation and visualization.

RNA interference and hepatitis

•  RNAi (siRNA and miRNA) is a post-transcriptional gene silencing (PTGS) mechanism.
•  Chemically synthesized siRNAs can mimic the native siRNAs produced by RNAi but have different abilities.
•  General target: selection of potent siRNAs for gene silencing.

Fire, A., Mello, C., Nature 391, 1998. Nobel Prize 2006
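The first step of the "Learning by topic model" pipeline turns a sequence into a vector of subsequence frequencies. A minimal sketch of that step follows, assuming fixed-length subsequences (k-mers) with k = 2; the slides do not state the subsequence length, and the example fragment is hypothetical.

```python
from collections import Counter

def subsequence_frequencies(sequence, k=2):
    """Relative frequency of every length-k subsequence (k-mer) in the sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: count / total for kmer, count in counts.items()}

# Hypothetical short protein fragment, for illustration only.
freqs = subsequence_frequencies("MKVLAAKVL", k=2)
print(freqs)
```

Such a frequency vector plays the role that a bag-of-words vector plays for a text document, which is what lets a topic model be applied to sequences.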


RNA interference and hepatitis

Which siRNAs have high knockdown efficacy among the 274,877,906,944 (4^19) possible siRNA sequences of 19 characters from {A, C, G, U}?

Machine learning approach

Empirical siRNA design rules

(Qiu, 2009; Takasaki 2009; Alistair 2008, etc.)

Table: nucleotide preference by position (columns A, C, G, U; rows for positions such as 17 and 12), with entries such as C>A>G, A>U>C, A>C=G, A>U>G, C>G>U and U>C>G.

Design rules for siRNAs

siRNA sequences

Sequence               Label
GUGGGAGCGCGUGAUGAAC    VH
CACUCUACUGCAGCAAAGC    VH
AUGUUCUUCUCCAAGGUGC    L
AACAUGUAAGGACUUUGAU    L
CUGCUUGUACCAAUUGCUA    L
……

Transformed data

Transaction                                                Label
3 8 11 15 19 21 27 30 35 38 43 48 51 53 60 63 65 69 74     VH
2 5 10 16 18 24 25 30 36 39 42 45 51 54 57 61 65 71 74     VH
1 8 11 16 20 22 28 32 34 40 42 46 49 53 59 63 68 71 74     L
1 5 10 13 20 23 28 29 33 39 43 45 50 56 60 64 67 69 76     L
2 8 11 14 20 24 27 32 33 38 42 45 49 56 60 63 66 72 73     L
…….

Questions:
•  Learn a function scoring knockdown efficacy of siRNAs?
•  Learn a function scoring knockdown efficacy of siRNAs for a disease?
•  Generate siRNA with highest knockdown efficacy?

Descriptive rules for siRNAs

Rules (each labeled VH):
C-A--A-----------UGA---------G-----UGA-------------U-UG-A------U---G----G-A---------AG-----…..
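The transformed data above appear to encode each (position, nucleotide) pair as one item. The sketch below uses an encoding inferred from the sample rows, item = 4 x (position - 1) + code with A=1, C=2, G=3, U=4; this is an assumption that reproduces the listed transactions, not a description taken from the authors. It also computes the size of the 19-character search space.

```python
# Assumed encoding, inferred from the sample rows: item = 4 * (position - 1) + code,
# with positions counted from 1 and codes A=1, C=2, G=3, U=4.
NUCLEOTIDE_CODE = {"A": 1, "C": 2, "G": 3, "U": 4}

def to_transaction(sirna):
    """Map a siRNA string to a list of (position, nucleotide) item ids."""
    # enumerate() is zero-based, so pos here equals (position - 1).
    return [4 * pos + NUCLEOTIDE_CODE[nt] for pos, nt in enumerate(sirna)]

def search_space_size(length=19, alphabet_size=4):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length

print(to_transaction("GUGGGAGCGCGUGAUGAAC"))
print(f"{search_space_size():,}")  # 274,877,906,944
```

Turning sequences into transactions makes standard itemset and rule-mining methods applicable, which fits the descriptive rules listed on the slide.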

Outline

New paradigm of science and big data

Machine Learning and Data mining

Jim Gray (1944-2007)

CACM, Dec. 2010

Computational Science and Engineering


Computational science (CS) Computational science and engineering (CSE)

CACM, Sep. 2010

Computational science (using math and computation to do work in other sciences) vs. Computer science (making hardware and software for computation)


Components of computational science:
•  Models and simulation
•  Computer science: network, data analysis
•  Computer infrastructure (supercomputers)

CSE: development and application of computational models, often coupled with high-performance computing (HPC), to solve complex problems arising in engineering analysis and design (CE) as well as natural phenomena (CS). Source: PITAC report and SIAM

Figure: CSE in relation to Mathematics, Computer Science, and Science & Engineering.

Competition on supercomputers

•  Nov. 2010: China’s Tianhe-1A, 2.56 petaflops, 23552 processors
•  June 2012: Japan’s K computer, 10.51 petaflops, 88128 processors
•  June 2012: SuperMUC, Europe’s fastest, 2.9 petaflops, 18432 processors
•  Nov. 2012: Cray’s Titan computer, 17.59 petaflops, 560640 processors


Some national-level problems

•  Dealing with climate change and environmental disasters (river flow, flood forecasting, ocean simulation, soil erosion...)
•  Prediction of risks of big construction projects such as nuclear plants, hydroelectric power plants, bank systems, etc.
•  CSE in defense, society...

Lessons learned from Japan’s K computer

Japan national key project, 1 billion USD (2007-2012)

Started 21 application programs at the beginning of the K computer project.

Scientific breakthroughs

•  Life science, biomedicine: prediction of disease diffusion, resistance of malaria, etc.
•  Materials science and nano-technologies: development of multi-scale materials modelling, from quantum mechanical understanding of nanostructures to engineering applications.
•  Computational finance: models and simulation for decision making in trading, hedging, investment and their risk management.

Relationship between three domains Future work

SHIFT IN MEDICINE RESEARCH

Molecular medicine is essentially based on learning from omics data

SHIFT IN MEDICINE RESEARCH

•  Big data needs views, methods, and high performance of CSE and data mining. •  Solutions could be diverse: powerful models, smart programs, supercomputer, and all of them

•  Key technology for big data. •  Be motivated to develop more powerful methods or formulate new problems.

•  Big data and data mining require better models, analytic tools and supercomputers. •  CSE enriches the merit of big data and data mining. Black–Scholes European Call Option Pricing Surface
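The Black–Scholes European call option pricing surface named above is a standard example of a computational finance model. The sketch below evaluates the closed-form Black–Scholes call price for a non-dividend-paying asset and prints a small slice of the surface over spot price and time to maturity. The parameter values (strike 100, rate 5%, volatility 20%) are assumptions picked for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, maturity):
    """Price of a European call on a non-dividend-paying asset."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * maturity) / (
        volatility * math.sqrt(maturity))
    d2 = d1 - volatility * math.sqrt(maturity)
    return spot * normal_cdf(d1) - strike * math.exp(-rate * maturity) * normal_cdf(d2)

# A small slice of the pricing surface over spot price and time to maturity.
for maturity in (0.25, 0.5, 1.0):
    row = [round(black_scholes_call(s, 100, 0.05, 0.2, maturity), 2) for s in (90, 100, 110)]
    print(maturity, row)
```

When no closed form exists, the same surface has to be produced by simulation or numerical PDE solvers, which is where high-performance computing enters.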


Cyber-physical service

Relationship of human and computer

Figure (Source: Fujitsu; Copyright 2011 FUJITSU LIMITED): scope of ICT usage from 1990 to 2020.

•  Stages: computer centric, network centric, human centric
•  Aims: productivity improvement; business process innovation; creating knowledge and supporting human activities
•  Technologies: network, Internet, PC (1990 to 2000); ubiquitous terminals, mobile network (2010); cloud computing, sensor technology (2020)

One size does not fit all

•  Big data and computational science and engineering (CSE) are an emerging technology and an emerging field that will impact the future.
•  Machine learning & data mining have been changing fast together with statistics, and are the key technology for big data & CSE.
•  There is no universally powerful method. Each context of big data, CSE and data mining needs its own most appropriate solution.
•  Big opportunities, but challenging.
•  Why and how these in Viet Nam?

Thanks

Hadoop Background

•  Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by the Google Map/Reduce and Google File System papers.
•  Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
•  Hadoop is a paradigm that says that you send your application to the data rather than sending the data to the application.
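The Map/Reduce model that inspired Hadoop can be sketched as a word count. The code below is a single-process simulation in plain Python, not the Hadoop API: the `map_phase`, `shuffle` and `reduce_phase` names and the sample chunks are hypothetical, and the lists stand in for data blocks that would live on different nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Runs where the chunk is stored; emits (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical data chunks standing in for blocks stored on different nodes.
chunks = ["big data big analytics", "big opportunity"]
emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(emitted))
print(counts)
```

Only the small map function travels to each chunk and only compact (key, value) pairs travel back, which is the sense in which the application is sent to the data.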


What Hadoop Is Not

•  It is not a replacement for your Database & Warehouse strategy
   o  Customers need hybrid database/warehouse & Hadoop models
•  It is not a replacement for your ETL strategy
   o  Existing data flows aren’t typically changed, they are extended
•  It is not designed for real-time complex event processing like Streams
   o  Customers are asking for Streams & BigInsights integration