Overview of Machine Learning and H2O.ai

Machine Learning Overview

What is machine learning?

-- Arthur Samuel, 1959

Why now?

• Data, computers, and algorithms are commodities
• Unstructured data
• Increasing competition in business

Estimating a model for inference | Training a model for prediction
What happened? Why? | What will happen?
Assumptions, parsimony, interpretation | Predictive accuracy, production deployment
Linear models, statistics | Machine learning
Models tend to be static | Many models can evolve elegantly

[Figure: diagram with regions labeled Machine Learning, Data Science, Danger Zone?, and Traditional Research]

1. There is no perfect language.

If someone claims to have the perfect programming language, he is either a fool or a salesman or both. -- Bjarne Stroustrup

2. There is no perfect algorithm.

Algorithms that search for an extremum of a cost function perform exactly the same when averaged over all possible cost functions. -- D.H. Wolpert

3. Doing things right is always hard.

Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive. -- Google, Hidden Technical Debt in Machine Learning Systems

Copyright © 2014, SAS Institute Inc. All rights reserved.

H2O.ai Overview

Company Overview

Founded: 2011. Venture-backed, debuted in 2012.

Products:
• H2O: In-Memory AI Prediction Engine
• Sparkling Water: Spark Integration
• Steam: Deployment engine
• Deep Water: Deep Learning

Mission: Operationalize Data Science, and provide a platform for users to build beautiful data products

Team: 70 employees
• Distributed Systems Engineers doing Machine Learning
• World-class visualization designers

Headquarters: Mountain View, CA

H2O.ai Offers AI Open Source Platform: Product Suite to Operationalize Data Science (100% Open Source)

• H2O: In-Memory, Distributed Machine Learning Algorithms with Speed and Accuracy
• Deep Water: State-of-the-art Deep Learning on GPUs with TensorFlow, MXNet or Caffe with the ease of use of H2O
• Sparkling Water: H2O Integration with Spark. Best Machine Learning on Spark.
• Steam: Operationalize and Streamline Model Building, Training and Deployment Automatically and Elastically

H2O.ai Now Focused On Experience Beyond Algorithms and Data

• H2O Flow: Single web-based document for code execution, text, mathematics, plots and rich media
• R, Python, Spark APIs: Advanced, scalable ML in the language of your choice
• H2O Steam: Elastic ML & Auto ML; Operationalize Data Science
• Deep Water

High Level Architecture

Data sources:
• HDFS
• S3

H2O Compute Engine:
• Load Data: Distributed In-Memory, Loss-less Compression
• Exploratory & Descriptive Analysis
• Feature Engineering & Selection
• Supervised & Unsupervised Modeling
• Model Evaluation & Selection
• Data & Model Storage

Exports:
• Data Prep Export: Plain Old Java Object
• Model Export: Plain Old Java Object

Production Scoring Environment: Your Imagination

Intro to Machine Learning Algos

Algorithms on H2O

Supervised Learning

Statistical Analysis
• Penalized Linear Models: Super-fast, super-scalable, and interpretable
• Naïve Bayes: Straightforward linear classifier

Decision Tree Ensembles
• Distributed Random Forest: Easy-to-use tree-bagging ensembles
• Gradient Boosting Machine: Highly tunable tree-boosting ensembles

Stacking
• Stacked Ensemble: Combine multiple types of models for better predictions

Neural Networks (Multilayer Perceptron, Deep Learning)
• Deep neural networks: Multi-layer feed-forward neural networks for standard data mining tasks
• Convolutional neural networks: Sophisticated architectures for pattern recognition in images, sound, and text

Unsupervised Learning

Clustering
• K-means: Partitions observations into similar groups; automatically detects number of groups

Dimensionality Reduction
• Principal Component Analysis: Transforms correlated variables to uncorrelated components
• Generalized Low Rank Models: Extends the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

Aggregator
• Aggregator: Efficient, advanced sampling that creates smaller data sets from larger data sets

Anomaly Detection
• Autoencoders: Find outliers using a nonlinear dimensionality reduction technique

Term Embeddings
• Word2vec: Generate context-sensitive numerical representations of a large text corpus
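To make the K-means entry above concrete, here is a minimal sketch of the classic assign-then-update loop (Lloyd's algorithm) in plain Python. It is not H2O's distributed implementation: the function names, the toy customer rows, and the choice to pass the starting centers explicitly are all assumptions made for illustration, and the number of groups is fixed by the caller rather than detected automatically.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, iterations=10):
    """Alternate between assigning rows to the nearest center and
    moving each center to the mean of its assigned rows."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every row.
        labels = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical rows: two numeric features per customer.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

# Start from one row in each visible group; other starting points
# can lead to different cluster labels.
centers, labels = kmeans(points, [points[0], points[3]])
print(labels)
```

Passing the starting centers explicitly keeps the run deterministic; a production implementation would normally choose them with a seeded random or k-means++ style procedure.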

Supervised Learning

Regression: How much will a customer spend?

H2O algos:
• Penalized Linear Models
• Random Forest
• Gradient Boosting
• Neural Networks
• Stacked Ensembles

Classification: Will a customer make a purchase? Yes or no.

H2O algos:
• Penalized Linear Models
• Naïve Bayes
• Random Forest
• Gradient Boosting
• Neural Networks
• Stacked Ensembles
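As a rough illustration of the regression question above, the sketch below fits a one-feature penalized linear model in plain Python. It uses a ridge (L2) penalty only, which is a simplification: it is not H2O's GLM, and the function name, the penalty values, and the visit and spend figures are hypothetical.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Least squares with an L2 penalty on the slope (intercept unpenalized).

    Minimizes sum((y - a - b*x)^2) + penalty * b^2, which gives
    b = Sxy / (Sxx + penalty) and a = mean(y) - b * mean(x).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: store visits per month and amount spent.
visits = [1, 2, 3, 4, 5]
spend = [12, 19, 31, 42, 48]

plain = fit_ridge_1d(visits, spend, penalty=0.0)    # ordinary least squares
shrunk = fit_ridge_1d(visits, spend, penalty=10.0)  # slope pulled toward zero

print("unpenalized (intercept, slope):", plain)
print("penalized   (intercept, slope):", shrunk)
```

The coefficients stay directly readable, which is the interpretability the slides attribute to penalized linear models; the penalty trades some fit on the training rows for a smaller, more stable slope.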

Unsupervised Learning

Clustering: Grouping rows, e.g. creating groups of similar customers

[Figure: example customer clusters labeled HINRY and Soccer mom]

H2O algos:
• k-means

Feature extraction: Grouping columns, i.e. creating a small number of new representative dimensions

[Figure: a first principal component, PC1, written as a weighted sum of input columns with weights -0.3 and -0.4]

H2O algos:
• Principal components
• Generalized low rank models
• Autoencoders
• Word2Vec

Anomaly detection: Detecting outlying rows, e.g. finding high-value, fraudulent, or weird customers

[Figure: an outlying point labeled Weirdo]

H2O algos:
• Principal components
• Generalized low rank models
• Autoencoders
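The feature-extraction idea above, one new dimension built as a weighted sum of the original columns, can be sketched with a few lines of plain Python. This version finds the first principal component by power iteration on the covariance matrix; the function name and the toy rows are assumptions for illustration, and it is not how H2O computes PCA.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (loadings, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each row's score is its weighted sum of centered columns.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical rows with two strongly correlated columns.
rows = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, 5.9)]
loadings, scores = first_principal_component(rows)
print("PC1 loadings:", loadings)
print("PC1 scores:", scores)
```

The loadings play the role of the weights in the PC1 expression on the slide: two correlated columns collapse into a single score per row.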


Penalized Linear Models

Usage:
• Regression
• Classification

Recommendations:
• Creates interpretable models with super-fast training time
• Nonlinear and interaction terms must be specified manually
• Can extrapolate beyond training data domain
• Select the correct target distribution
• Few hyperparameters to tune

Problems:
• NAs
• Outliers/influential points
• Strongly correlated inputs
• Rare categorical levels in new data

Naïve Bayes

Usage:
• Classification

Recommendations:
• Nonlinear and interaction terms should be specified by users

Problems:
• Linear independence assumption
• Often less accurate than more sophisticated classifiers
• Rare categorical levels in new data

Random Forest

Usage:
• Regression
• Classification

Recommendations:
• Builds accurate models without overfitting
• Few hyperparameters to tune
• Requires less data prep
• Great for implicitly modeling interactions

Problems:
• Difficulty extrapolating beyond training data domain
• Can be difficult to interpret
• Rare categorical levels in new data

Gradient Boosting Machines

Usage:
• Regression
• Classification

Recommendations:
• Builds accurate models without overfitting (often more accurate than random forest)
• Requires less data prep
• Great for implicitly modeling interactions

Problems:
• Many hyperparameters
• Difficulty extrapolating beyond training data domain
• Can be difficult to interpret
• Rare categorical levels in new data

Neural Networks (Deep learning & MLP)

Usage:
• Regression
• Classification

Recommendations:
• Great for modeling interactions in fully connected topologies
• Can extrapolate beyond training data domain
• Deep learning architectures best-suited for pattern recognition in images, videos, and sound

Problems:
• NAs
• Overfitting
• Outliers/influential points
• Long training times
• Difficult to interpret
• Many hyperparameters
• Strongly correlated inputs
• Rare categorical levels in new data
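The contrast drawn above, linear models can extrapolate while tree ensembles have difficulty beyond the training domain, can be seen in a small experiment. The sketch below is a toy gradient boosting loop over one-split regression stumps with squared-error loss, written in plain Python; it is not H2O's GBM, and the function names, learning rate, tree count, and data are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split stump for squared error.
    Assumes xs contains at least two distinct values."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (l if x <= t else r) for t, l, r in stumps)


# Hypothetical training data on a straight line, y = 2x, for x in 1..8.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x for x in xs]
model = fit_gbm(xs, ys)

# Ordinary least-squares line for comparison.
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
linear_at_100 = y_mean + slope * (100 - x_mean)

gbm_at_8 = predict_gbm(model, 8)
gbm_at_100 = predict_gbm(model, 100)
print("linear model at x=100:", linear_at_100)
print("boosted stumps at x=8 and x=100:", gbm_at_8, gbm_at_100)
```

Every split threshold lies inside the training range, so any input beyond the largest training value falls into the same leaves as that largest value: the boosted prediction goes flat while the fitted line keeps following the trend.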


k-means

Usage:
• Clustering

Recommendations:
• Great for creating Gaussian, non-overlapping, roughly equally sized clusters
• The number of clusters can be unknown

Problems:
• NAs
• Outliers/influential points
• Strongly correlated inputs
• Cluster labels sensitive to initialization
• Curse of dimensionality

Principal Components Analysis

Usage:
• Feature extraction
• Dimension reduction
• Anomaly detection

Recommendations:
• Great for extracting a number <= N of linear, orthogonal features from i.i.d. numeric data
• Great for plotting extracted features in a reduced-dimensional space to analyze data structure, e.g. clusters, hierarchy, sparsity, outliers

Problems:
• NAs
• Outliers/influential points
• Categorical inputs

Generalized Low Rank Models

Usage:
• Feature extraction
• Dimension reduction
• Anomaly detection
• Matrix completion

Recommendations:
• Great for extracting linear features from mixed data
• Great for plotting extracted features in a reduced-dimensional space to analyze data structure, e.g. clusters, hierarchy, sparsity, outliers
• Great for imputing NAs

Problems:
• Outliers/influential points

Autoencoders (Neural Networks)

Usage:
• Feature extraction
• Dimension reduction
• Anomaly detection

Recommendations:
• Great for extracting a number of nonlinear features from mixed data
• Great for plotting extracted features in a reduced-dimensional space to analyze structure, e.g. clusters, hierarchy, sparsity, outliers

Problems:
• NAs
• Overtraining
• Outliers/influential points
• Long training times
• Many hyperparameters
• Strongly correlated inputs
• Rare categorical levels in new data

Word2vec

Usage:
• Highly representative feature extraction from text

Recommendations:
• Great for extracting highly representative, context-sensitive term embeddings (e.g. numerical vectors) from text
• Great for text preprocessing prior to further supervised or unsupervised analysis

Problems:
• Many hyperparameters
• Long training times
• Overtraining
• Specifying term weightings prior to training
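The summary above lists anomaly detection as a use for the dimension-reduction methods without saying how. One common approach, assumed here rather than taken from the slides, is to project each row onto the leading component and score it by how much is lost in the reconstruction. The sketch below does this for two-column data in plain Python, using the closed-form principal axis of a 2x2 covariance matrix; the function name and the data are hypothetical.

```python
import math


def pca_reconstruction_errors(points):
    """Squared distance from each 2-D point to the first principal axis."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    # Direction of maximum variance for a symmetric 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    errors = []
    for x, y in centered:
        score = x * ux + y * uy                  # coordinate along the axis
        rx, ry = x - score * ux, y - score * uy  # part the axis cannot explain
        errors.append(rx * rx + ry * ry)
    return errors


# Hypothetical rows: six points near the line y = x, plus one odd row.
points = [
    (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8),
    (5.0, 5.1), (6.0, 6.0), (3.0, 9.0),
]
errors = pca_reconstruction_errors(points)
most_anomalous = errors.index(max(errors))
print("reconstruction errors:", errors)
print("most anomalous row:", points[most_anomalous])
```

The odd row also drags the fitted axis toward itself, which is the outliers/influential points problem listed for Principal Components Analysis; with noisier data that pull can mask the anomaly.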
