scikit-learn

C. Travis Johnson April 27, 2017 Mines Linux Users Group

Introduction

Machine Learning - What is it really?

• Goal: Extract knowledge from data
• Sometimes called predictive analysis or statistical learning
• Given a large matrix of observations X, fit a function f(x) that maps an observation x to a response variable y

Important Terms

• Classifiers: Algorithms that learn functions to map observations to a discrete response. E.g., is this tumor malignant or benign? Is this email spam or not?
• Regressors: Algorithms that learn functions to map observations to a continuous response. E.g., how much should this house cost?
• Underfitting: The learned function is too simple. “We barely studied for the exam.”
• Overfitting: The learned function is too complex. “We memorized all the practice problems, but don’t understand the material.”
• Generalization: How well does the learned function extend to new observations?
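To make underfitting and overfitting concrete, here is a minimal sketch that trains decision trees of different depths on scikit-learn's bundled breast cancer dataset. The depths, the `random_state` values, and the `results` dictionary are illustrative assumptions, not part of the original talk.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

results = {}
# max_depth=None lets the tree grow until every leaf is pure.
for depth in (1, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) for this depth.
    results[depth] = (tree.score(X_train, y_train),
                      tree.score(X_test, y_test))
    print("max_depth={}: train={:.3f}, test={:.3f}".format(
        depth, *results[depth]))
```

A very shallow tree tends to score modestly on both sets (underfitting). An unrestricted tree can memorize the training set, so a gap between its training and testing accuracy hints at overfitting.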

Scikit-Learn: Machine Learning in Python

• Provides many machine learning tools with a common Estimator interface [1]
• Built-in helpers for common ML tasks (e.g., metrics, preprocessing)
• Easily combine algorithms to make a complex pipeline [2]
• Relies heavily on numpy and scipy; often used with pandas

[1] http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects
[2] Sound familiar?
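The common Estimator interface means different algorithms can be swapped without changing the surrounding code. The sketch below is an illustration only: the three classifiers and the `scores` dictionary are arbitrary choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

scores = {}
models = (DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(),
          GaussianNB())
for model in models:
    model.fit(X_train, y_train)          # every estimator has fit
    predictions = model.predict(X_test)  # every predictor has predict
    # For classifiers, score returns mean accuracy.
    scores[type(model).__name__] = model.score(X_test, y_test)

for name, acc in scores.items():
    print("{}: {:.3f}".format(name, acc))
```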

Supervised Learning

Learning to Predict Breast Cancer

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    cancer = load_breast_cancer()  # Get some data
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target,
        stratify=cancer.target, random_state=1337)

    tree = DecisionTreeClassifier(random_state=7331)
    tree.fit(X_train, y_train)  # Learn a decision function

Evaluating Accuracy of a Model

    # How well did we do?
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print("Training Accuracy: {:.3f}".format(train_acc))
    print("Testing Accuracy: {:.3f}".format(test_acc))
    # Training Accuracy: 1.000
    # Testing Accuracy: 0.923

Other Supervised Learning Models

• Decision trees are a common first step, because they’re easy to interpret and don’t require much preprocessing
• Decision trees are prone to overfitting, so a good improvement is the RandomForest
• Support Vector Machines, Logistic/Linear Regression, and Artificial Neural Networks are commonly the first algorithms studied
• See the scikit-learn documentation for a comprehensive guide to the available algorithms
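As a rough sketch of the random forest suggestion, the example below fits a `RandomForestClassifier` on the same split used for the decision tree earlier. The `n_estimators` value and the forest's `random_state` are assumptions chosen for illustration, and the scores will vary with them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, random_state=1337)

# A forest averages many randomized trees, which usually reduces the
# variance (overfitting) of any single tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_train_acc = forest.score(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)
print("Training Accuracy: {:.3f}".format(forest_train_acc))
print("Testing Accuracy: {:.3f}".format(forest_test_acc))
```

Because the forest exposes the same `fit` and `score` methods, it is a drop-in replacement for the single tree.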

Becoming a “Data Scientist”

1. Get some (more) data
2. Pick an algorithm (or algorithm chain)
3. Train the model
4. Test generalization ability of trained model
5. Good enough? Done. Else, go back to step 1 or 2.

Then, tell people you’re a genius ... it’s that easy!

Unsupervised Learning

Distinction from Supervised Learning

• Supervised Learning: You tell the model what the correct answers are for training examples.
• Unsupervised Learning: You ask the model to extract information from a dataset.
• Unsupervised Clustering: Partition data into similar groups. Example: K-Means Clustering
• Unsupervised Transformations: Create new representations of data. Example: Principal Component Analysis
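A minimal sketch of both unsupervised examples, again on the breast cancer data but without using the labels. Scaling the features first, asking for two clusters, and keeping two principal components are assumptions made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
# Both methods are distance/variance based, so put features on one scale.
X_scaled = StandardScaler().fit_transform(cancer.data)

# Clustering: partition the observations into 2 groups (no labels used).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Transformation: project the 30 original features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

Note that the cluster numbers are arbitrary: cluster 0 need not correspond to class 0 in `cancer.target`.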

Model Evaluation and Improvement

Choice of Evaluation Metric

• Accuracy is not always the best metric for your system
• Plenty of others exist; pick the one that best reflects your business costs
• Look in the sklearn.metrics module for alternatives
• You can also use your own evaluation function!
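The sketch below shows a few alternatives from `sklearn.metrics` and one way to plug in your own evaluation function via `make_scorer`. The function `misdiagnosis_cost` and its 10-to-1 cost ratio are hypothetical, invented purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (confusion_matrix, f1_score, make_scorer,
                             recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# In this dataset, label 0 is malignant and label 1 is benign.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true label, cols: predicted


def misdiagnosis_cost(y_true, y_pred):
    """Hypothetical cost: a missed malignant tumor costs 10x a false alarm."""
    counts = confusion_matrix(y_true, y_pred, labels=[0, 1])
    missed_malignant = counts[0, 1]  # truly malignant, predicted benign
    false_alarm = counts[1, 0]       # truly benign, predicted malignant
    return 10 * missed_malignant + false_alarm


# greater_is_better=False makes the scorer return the negated cost,
# so that "higher is better" still holds for model selection.
cost_scorer = make_scorer(misdiagnosis_cost, greater_is_better=False)
cv_costs = cross_val_score(DecisionTreeClassifier(random_state=0),
                           X_train, y_train, cv=5, scoring=cost_scorer)
print(malignant_recall, f1, cv_costs)
```

The resulting `cost_scorer` can be passed anywhere scikit-learn accepts a `scoring` argument, including `GridSearchCV`.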

Cross Validation

Never Fit Models to Test Data! Ever!

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
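One way to estimate generalization without touching the test set is k-fold cross validation on the training data. This is a minimal sketch using `cross_val_score`; the choice of 5 folds and the `random_state` values are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
# Each of the 5 folds is held out once while a fresh copy of the model
# is trained on the other 4; the test set is never used here.
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)

print("Fold accuracies:", cv_scores)
print("Mean accuracy: {:.3f}".format(cv_scores.mean()))
```

The held-out test set is then used only once, at the very end, to report the final performance of the chosen model.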

Grid Search with Cross Validation

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    cancer = load_breast_cancer()  # Get some data
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target,
        stratify=cancer.target, random_state=1337)

    tree = DecisionTreeClassifier(random_state=7331)
    search_grid = {'criterion': ['gini', 'entropy'],
                   'max_depth': [5, 10, 15, 20]}
    # search_grid could also be a list of dicts
    search = GridSearchCV(tree, search_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)
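As a sketch of the list-of-dicts form, each dict below defines its own sub-grid, which is handy when some parameters only make sense together. The particular parameter values, including `min_samples_leaf`, are assumptions picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, random_state=1337)

# Two independent sub-grids: 2 candidates from the first dict,
# 2 * 2 = 4 candidates from the second.
search_grid = [
    {'criterion': ['gini'], 'max_depth': [5, 10]},
    {'criterion': ['entropy'], 'max_depth': [3, 5],
     'min_samples_leaf': [1, 5]},
]
search = GridSearchCV(DecisionTreeClassifier(random_state=7331),
                      search_grid, cv=5)
search.fit(X_train, y_train)

n_candidates = len(search.cv_results_['params'])
print(n_candidates, search.best_params_)
# By default the best model is refit on all training data, so the
# search object can be scored on the held-out test set directly.
test_acc = search.score(X_test, y_test)
print("Testing Accuracy: {:.3f}".format(test_acc))
```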

Pipelines

Pipelines

Use Pipeline to combine multiple estimators into a single estimator. This serves two purposes:

1. Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
2. Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.

A Simple Pipeline

    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.svm import SVC
    >>> from sklearn.decomposition import PCA
    >>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
    >>> pipe = Pipeline(estimators)
    >>> pipe
    Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
    n_components=None, random_state=None, svd_solver='auto', tol=0.0,
    whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))])

Grid Search - Tuning a Complex Pipeline

    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    pipe = make_pipeline(PCA(), StandardScaler(), SVC())
    params = dict(pca__n_components=[2, 5, 10],
                  svc__C=[0.1, 10, 100])
    grid = GridSearchCV(pipe, param_grid=params)

    # Next, call grid.fit on some training data.
    # This uses cross validation to estimate performance for each
    # combination of pipeline parameters in the params dict.

    # With the fitted model:
    print(grid.best_params_)
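To round out the pipeline example, here is a sketch that actually fits the grid search on the breast cancer training split. The dataset, the split, and `cv=5` are assumptions carried over from the earlier slides; the pipeline and parameter grid match the ones above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,
    stratify=cancer.target, random_state=1337)

# make_pipeline names each step after its lowercased class name
# ('pca', 'standardscaler', 'svc'), which gives the <step>__<param> keys.
pipe = make_pipeline(PCA(), StandardScaler(), SVC())
params = dict(pca__n_components=[2, 5, 10],
              svc__C=[0.1, 10, 100])
grid = GridSearchCV(pipe, param_grid=params, cv=5)

# Within each fold, every pipeline step is fit on that fold's training
# portion only, so no information leaks from the held-out portion.
grid.fit(X_train, y_train)

print(grid.best_params_)
print("Best CV accuracy: {:.3f}".format(grid.best_score_))
pipeline_test_acc = grid.score(X_test, y_test)
print("Testing Accuracy: {:.3f}".format(pipeline_test_acc))
```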

Questions?

Copyright Notice

This presentation was from the Mines Linux Users Group. A mostly-complete archive of our presentations can be found online at https://lug.mines.edu. Individual authors may have certain copyright or licensing restrictions on their presentations. Please be certain to contact the original author to obtain permission to reuse or distribute these slides.
