Andrea Marcelli Ph.D. Student - DAUIN

Practical intro to Machine Learning in Python with Scikit-learn and AutoML strategies

Outline

- ML in practice
- Tools
- Examples with sklearn
- About AutoML

ML in practice

Types of learning

source: https://upxacademy.com/introduction-machine-learning/

Steps to predictive modeling

source: https://upxacademy.com/introduction-machine-learning/

Get the data

- Download a dataset or create your own
- Web scraping could be necessary
- CSV is the most common format
- Managing large quantities of data can be challenging (e.g., data transfer (API limits), storage, preprocessing)
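As a minimal sketch of loading CSV data with Pandas: the CSV content and column names below are made up for illustration, and an in-memory buffer stands in for a downloaded file. Reading in chunks is one possible way to handle files that do not fit in memory.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file.
csv_text = """city,age,income
New York,34,72000
Tehran,29,41000
New Jersey,41,65000
"""

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# For large files, iterate over fixed-size chunks instead of loading everything.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(df.dtypes)
print("rows read in chunks:", total_rows)
```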

Explore your data

- Extract useful knowledge from your data
- Visualize your data
- Plot all your variables against the target variable being predicted
- Compute summary statistics
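A short sketch of these exploration steps with Pandas, on synthetic data. The column names (`tv`, `radio`, `sales`) and the way the target is generated are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data: two features and a target that depends on both.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tv": rng.uniform(0, 300, n),
    "radio": rng.uniform(0, 50, n),
})
df["sales"] = 0.05 * df["tv"] + 0.2 * df["radio"] + rng.normal(0, 1, n)

# Summary statistics (count, mean, std, min, quartiles, max) per column.
summary = df.describe()

# Linear correlation of every feature with the target variable.
corr_with_target = df.corr()["sales"].drop("sales").sort_values(ascending=False)

print(summary)
print(corr_with_target)

# With matplotlib installed, each feature can be plotted against the target:
# df.plot.scatter(x="tv", y="sales")
```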

Clean, prepare, manipulate data

- Convert each column to a fixed type (e.g., int, float, ASCII or Unicode strings)
- Manage missing data (e.g., remove incomplete data or assign default values)
- Feature selection and normalization
- There are several ways to encode categorical variables, sequences and text
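The type conversion and missing-data steps could look like the following sketch. The toy table and the choice of the median and of "unknown" as default values are assumptions for illustration, not a recommendation.

```python
import pandas as pd

# Toy table with numbers stored as strings and some missing entries.
df = pd.DataFrame({
    "age": ["34", "29", None, "41"],
    "income": ["72000", "n/a", "65000", "58000"],
    "city": ["New York", "Tehran", "New Jersey", None],
})

# Convert columns to a fixed numeric type; unparseable values become NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Option 1: remove incomplete rows.
complete_rows = df.dropna()

# Option 2: assign default values (here: column median, or a placeholder label).
filled = df.fillna({
    "age": df["age"].median(),
    "income": df["income"].median(),
    "city": "unknown",
})

print(complete_rows)
print(filled)
```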

Feature extraction

Some encodings for categorical data:
- Ordinal (integer) encoding: each category is mapped to an integer (e.g., New York as 1, Tehran as 2 and New Jersey as 3)
  * beware: the integers imply an order and a distance between categories that may be meaningless
- One-hot encoding: each category becomes a binary vector
  * can produce very high dimensionality
  * rare values can be collapsed into one category
- Feature hashing: (e.g., Hash(New York) mod 5 = 3 -> (0,0,1,0,0)) represents categories in a "one-hot encoding style" as a sparse matrix, but with far fewer dimensions
  * not interpretable
  * hashes can collide
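The three encodings above can be sketched as follows. `OrdinalEncoder` and `OneHotEncoder` come from scikit-learn; `hash_encode` is a hypothetical helper written for this example (using MD5 and 5 buckets as arbitrary choices), not scikit-learn's own hashing implementation.

```python
import hashlib

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cities = np.array([["New York"], ["Tehran"], ["New Jersey"], ["Tehran"]])

# Ordinal encoding: one integer per category (assigned in sorted order).
ordinal = OrdinalEncoder().fit_transform(cities)

# One-hot encoding: one binary column per category.
one_hot = OneHotEncoder().fit_transform(cities).toarray()


def hash_encode(value, n_buckets=5):
    """Hypothetical helper: map a string to a one-hot vector over n_buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    vec = np.zeros(n_buckets)
    vec[int(digest, 16) % n_buckets] = 1.0
    return vec


# Feature hashing: fixed width regardless of the number of categories;
# distinct categories may collide in the same bucket.
hashed = np.array([hash_encode(c) for c in cities[:, 0]])

print(ordinal.ravel())
print(one_hot)
print(hashed)
```

scikit-learn also ships a `FeatureHasher` class in `sklearn.feature_extraction`, which is the more practical choice for real datasets.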

Feature extraction - part 2

- Encoding from dataset statistics (e.g., number of occurrences in the dataset, or within the same sample)
- Encoding from domain knowledge (e.g., replace URLs with Alexa rankings)
- Extract categories from Word2Vec: categories are in a "one-hot encoding style" in a sparse matrix, but with far fewer dimensions
  * leverages an unsupervised method
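A sketch of the first two ideas with Pandas. The domain names and the `hypothetical_rank` lookup table are invented for illustration; a real ranking would come from an external source.

```python
import pandas as pd

df = pd.DataFrame(
    {"domain": ["a.com", "b.org", "a.com", "c.net", "a.com", "b.org"]}
)

# Encoding from dataset statistics: replace each category by its
# number of occurrences (or relative frequency) in the dataset.
counts = df["domain"].value_counts()
df["domain_count"] = df["domain"].map(counts)
df["domain_freq"] = df["domain_count"] / len(df)

# Encoding from domain knowledge: replace each category by an external
# score. This lookup table is made up for the example.
hypothetical_rank = {"a.com": 10, "b.org": 250, "c.net": 9000}
df["domain_rank"] = df["domain"].map(hypothetical_rank)

print(df)
```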

Feature normalization

If features have very different scales and contain some very large outliers, they can degrade the predictive performance of many machine learning algorithms.

Example: StandardScaler removes the mean and scales the data to unit variance.

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
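A minimal sketch of `StandardScaler` on a toy matrix whose second column has a much larger scale and one outlier. `RobustScaler`, which centers on the median and scales by the interquartile range, is shown as one possible alternative when outliers dominate; the data values are made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Two features on very different scales; the last row is an outlier.
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 100000.0],
])

# StandardScaler: subtract the column mean, divide by the column std.
X_scaled = StandardScaler().fit_transform(X)

# RobustScaler: subtract the median, divide by the interquartile range.
X_robust = RobustScaler().fit_transform(X)

print(X_scaled)
print(X_robust)
```

In a real workflow the scaler would be fitted on the training data only and then applied to validation and test data.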

Supervised learning phases

- Training phase: you present the data from your "gold standard" and train your model by pairing each input with its expected output
- Validation phase: compare your models and select the best-performing approach using the validation data
- Test phase: estimate how well the selected model has been trained, and its properties, on data it has never seen
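The three phases can be sketched with two successive calls to `train_test_split`. The 60/20/20 proportions, the Iris dataset, and the two candidate models are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval
)

# Training phase: fit each candidate on the training set.
# Validation phase: compare candidates on the validation set.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
}
val_scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in candidates.items()
}
best_name = max(val_scores, key=val_scores.get)

# Test phase: evaluate the selected model once on untouched data.
test_score = candidates[best_name].score(X_test, y_test)
print(best_name, val_scores, test_score)
```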

Train the model

- Select a model
- Initially use default values
- Dimensionality reduction could be applied (e.g., PCA, autoencoders)
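As a sketch of these steps: a classifier with default hyperparameters, preceded by PCA, on scikit-learn's bundled digits dataset. The choice of `KNeighborsClassifier` and of 16 principal components is an assumption for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# 8x8 digit images flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce 64 features to 16 components, then classify with default settings.
pipe = make_pipeline(PCA(n_components=16), KNeighborsClassifier())
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
print("features in:", X.shape[1])
print("components kept:", pipe.named_steps["pca"].n_components_)
print("test accuracy:", accuracy)
```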

What data science methods are used?

source: https://www.kaggle.com/surveys/2017

Choosing the right estimator

source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Test data

- Use a K-Folds cross-validator: it splits the data into k consecutive folds; each fold is then used once as a validation set while the k - 1 remaining folds form the training set
- Use several loss, score, and utility functions to measure model performance (e.g., mean error for numeric predictors; precision, recall, F1 score, ROC curve for classifiers)
- Be aware of common problems of ML (e.g., overfitting, curse of dimensionality, data leakage)
  * Data leakage is the creation of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions. https://www.kaggle.com/wiki/Leakage
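A sketch of K-fold cross-validation with several classifier metrics. Wrapping the scaler and the model in one pipeline means the scaler is re-fitted on the training folds only, which avoids one common form of data leakage. The dataset, the model, and `shuffle=True` are choices made for this example (plain `KFold` uses consecutive folds).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing lives inside the pipeline, so it never sees validation folds.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds: each one is used once for validation, the other 4 for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["precision", "recall", "f1", "roc_auc"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

mean_scores = {m: results[f"test_{m}"].mean() for m in metrics}
print(mean_scores)
```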

Improve your model

- Try several algorithms
- Hyper-parameter tuning
- Try a different distance metric
- Try a different set of features
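Hyper-parameter tuning and trying a different distance metric can be combined in one grid search, sketched here with `GridSearchCV` over a k-nearest-neighbours classifier. The grid values and the Iris dataset are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Every combination of k and distance metric is scored by 5-fold CV.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

For larger grids, `RandomizedSearchCV` samples a fixed number of combinations instead of trying all of them.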

Tools

What tools are used at work?

source: https://www.kaggle.com/surveys/2017

Tools

- Python 3, IPython and Jupyter Notebook
- Pandas, SciPy, NumPy, NetworkX
- Scrapy, Statsmodels
- Matplotlib, Seaborn, Bokeh
- Scikit-learn, Keras (TensorFlow or Theano)
- NLTK, Gensim

How to install all the packages?

Manual installation with pip, or install Anaconda:
https://docs.anaconda.com/anaconda/packages/py3.6_osx-64

Use case #1

You need to install different versions of the same package on your system: use a Python virtualenv, an isolated working copy of Python.

    $ mkdir vvenv
    $ virtualenv vvenv/my_app
    $ source vvenv/my_app/bin/activate
    (my_app) $ pip install networkx==1.9
    (my_app) $ python3 -c "import networkx as nx; print(nx.__version__)"
    1.9
    (my_app) $ deactivate

Packages are installed in: vvenv/my_app/lib/python3.6/site-packages/

Use case #2

You need to easily reproduce your results on different systems: use a Docker container.
https://hub.docker.com/r/continuumio/anaconda3/

Examples with sklearn

Some examples

Linear regression
https://github.com/justmarkham/scikit-learn-videos/blob/master/06_linear_regression.ipynb
http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv

Classification
https://www.kaggle.com/ash316/ml-from-scratch-part-2
https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/diabetes.csv

TPOT
https://github.com/jimmy-sonny/practical-intro-ml/blob/master/sklean%20LinearRegression%20vs%20TPOT.ipynb

AutoML

AutoML

ML success crucially relies on human experts to perform the following tasks:
- Preprocess the data
- Select appropriate features
- Select an appropriate model family
- Optimize model hyperparameters
- Postprocess machine learning models
- Critically analyze the results obtained

source: http://www.ml4aad.org/automl/

AutoML

- There is a growing community around creating tools that automate the tasks that are part of the machine learning workflow
- The scope of AutoML is ambitious, but is it really effective?
- It depends: most machine learning problems require domain knowledge and human judgement to set up correctly
- Tasks like exploratory data analysis, pre-processing of data, hyperparameter tuning, model selection and putting models into production can be automated to some extent with an Automated Machine Learning framework

source: https://medium.com/airbnb-engineering/automated-machine-learning-a-paradigm-shift-that-accelerates-data-scientist-productivity-airbnb-f1f8a10d61f8
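To give a feel for what "automating model selection and hyperparameter tuning" means, here is a small sketch using only scikit-learn: a single grid search that swaps the model family as well as its hyperparameters. This is not how TPOT or auto-sklearn work internally; the dataset, candidate models, and grids are assumptions for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# The "model" step is a placeholder that the search replaces.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# One sub-grid per model family, each with its own hyperparameters.
search_space = [
    {"model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10.0]},
    {"model": [SVC()],
     "model__C": [0.1, 1.0, 10.0],
     "model__kernel": ["linear", "rbf"]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__n_estimators": [50, 100]},
]

search = GridSearchCV(pipe, search_space, cv=5)
search.fit(X, y)

best_family = type(search.best_estimator_.named_steps["model"]).__name__
print("best model family:", best_family)
print("best CV accuracy:", search.best_score_)
```

Dedicated AutoML frameworks go further, for example by also searching over preprocessing steps and by using smarter search strategies than an exhaustive grid.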

AutoML

More info at: https://blog.keras.io/the-future-of-deep-learning.html

AutoML with sklearn

- TPOT: https://github.com/EpistasisLab/tpot
- auto-sklearn: https://github.com/automl/auto-sklearn
- machineJS: https://github.com/ClimbsRocks/machineJS

AutoML with NN

source: https://research.googleblog.com/2018/03/using-evolutionary-automl-to-discover.html

AutoML with NN

- auto-ml: https://github.com/ClimbsRocks/auto_ml
- autokeras: https://github.com/jhfjhfj1/autokeras

Andrea Marcelli Ph.D. Student

jimmy-sonny.github.io
