Higgs Boson Machine Learning Challenge

Alexander Lavin

Overview

The Higgs Challenge is a machine learning (ML) competition coordinated by CERN and Kaggle. The goal of the challenge is to use advanced ML methods to improve the discovery significance of the ATLAS experiment. The task at hand is to classify particle collision events as either signal (Higgs boson decay) or background noise.

Code can be found on my Github. Contact me with comments and questions.



Data and Classification

• CSV files of training and test datasets, with 250,000 and 550,000 samples, respectively.
• Samples are vectors of values (floating point and integer) for 30 features. Missing data is represented with -999.
• Classification is either signal Higgs event (S) or background noise (B). Classification ranks are also required.
• Classification is to optimize the approximate median significance (AMS) metric – a function of the weighted numbers of correctly and incorrectly labeled signal events (defined below).
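For reference, the challenge defines the AMS in terms of the weighted number of true positives s and weighted number of false positives b, with a constant regularization term b_r:

    AMS = sqrt( 2 [ (s + b + b_r) ln(1 + s/(b + b_r)) - s ] ),   with b_r = 10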

Machine Learning Approach

• Gradient boosting classifier (GBC) with Python scikit-learn.
• Boosting: an ensemble technique, where each member of the ensemble is an expert on the errors of its predecessor, iteratively re-weighting training examples based on errors.
• Gradient boosting: a statistical view on boosting, or a generalization of boosting to arbitrary loss functions (see [2]).
• Builds a prediction model as an ensemble of decision trees.
• Boosted decision trees have proven useful in high energy physics analyses [1].

GBC Advantages
• Works with heterogeneous data (features measured on different scales).
• Supports different loss functions to fit a specific problem.
• Automatically detects (non-linear) feature interactions.
• Transparency – not a total black box like neural networks.
• Relatively robust to over-fitting, so the boosting can be applied many times.

GBC Disadvantages
• Needs careful tuning.
• Slow to train.
• Cannot extrapolate.

A minimal code sketch of this setup follows below.
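The following is a minimal sketch of the scikit-learn gradient boosting setup described above, not the tuned competition model; the file name and hyperparameter values are illustrative assumptions, and the EventId/Weight/Label column layout follows the Kaggle data description.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    # Load the Kaggle training data (file name assumed).
    train = pd.read_csv("training.csv")

    # 30 feature columns remain after dropping the bookkeeping columns;
    # -999 marks missing values in this dataset.
    X = train.drop(columns=["EventId", "Weight", "Label"]).values
    y = (train["Label"] == "s").astype(int).values  # 1 = signal, 0 = background

    # Illustrative hyperparameters, not the tuned competition values.
    gbc = GradientBoostingClassifier(
        n_estimators=200,   # number of trees
        max_depth=5,        # depth controls feature-interaction order
        learning_rate=0.1,  # shrinkage
        subsample=0.9,      # stochastic gradient boosting
    )
    gbc.fit(X, y)

    # Probability of being a signal event, used for ranking and thresholding.
    p_signal = gbc.predict_proba(X)[:, 1]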



Method Details

GBC parameters:
• Model complexity is controlled by the number of trees and the depth of the individual trees, and carries the consequence of over-fitting. The deeper the tree, the more variance can be explained; depth controls the degree of feature interaction.
• Shrinkage: slow down learning by shrinking the predictions of each tree by some small scalar, the learning rate (formalized below). A lower learning rate – dampening the gradient step of the algorithm with a constant value – requires a greater number of estimators, a trade-off between accuracy and runtime.
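In the standard gradient boosting formulation, shrinkage enters as a constant factor ν (the learning rate) on each stage's update, so the ensemble after stage m is

    F_m(x) = F_{m-1}(x) + ν · h_m(x),   0 < ν ≤ 1

where h_m is the tree fit at stage m; smaller ν means each tree contributes less, which is why more estimators are then needed.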



• Stochastic gradient boosting: subsample the training set before growing each tree, and subsample the features before finding the best split node.
• Hyperparameter tuning: optimize the classifier parameters by cross-validation, randomly pooling the data into training (90%) and validation (10%) sets for each run (see the sketch below).
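A sketch of one such tuning run, reusing X and y from the earlier snippet; the parameter grid is illustrative, and AUC is used here as a simple stand-in for a validation score (scoring by AMS itself would also require the event weights).

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Randomly pool the data into 90% training / 10% validation.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10)

    best_score, best_depth = -1.0, None
    for depth in (3, 4, 5, 6):  # illustrative grid over tree depth
        clf = GradientBoostingClassifier(max_depth=depth, subsample=0.9)
        clf.fit(X_train, y_train)
        score = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if score > best_score:
            best_score, best_depth = score, depth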



• Classification threshold: probability predictions above 85% are classified as signal events; this upper 15% cutoff was chosen to maximize the AMS metric, and is shown as the blue shaded region in the probability prediction plot of Fig. 2.
• Test set classifications are ranked by their probabilities of being signal events (see the sketch below).
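A sketch of the thresholding and ranking step, assuming the fitted gbc from above; the test file name is assumed, and the EventId/RankOrder/Class layout and rank direction follow my reading of the Kaggle submission format.

    import numpy as np
    import pandas as pd

    test = pd.read_csv("test.csv")  # file name assumed
    X_test = test.drop(columns=["EventId"]).values
    p_signal = gbc.predict_proba(X_test)[:, 1]

    # Rank events from most background-like (1) to most signal-like (N).
    order = np.argsort(p_signal)
    rank = np.empty_like(order)
    rank[order] = np.arange(1, len(p_signal) + 1)

    submission = pd.DataFrame({
        "EventId": test["EventId"],
        "RankOrder": rank,
        "Class": np.where(p_signal > 0.85, "s", "b"),  # 85% threshold from above
    })
    submission.to_csv("submission.csv", index=False)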

• Over-fitting diagnostic: the deviance plot (Fig. 1) compares training and validation error as trees are added, to determine if the model is over-fitting (a plotting sketch follows the caption below).

Figure 1: Training and validation error (deviance) as a function of the number of trees (model complexity). Typically the training and validation lines would diverge, but this plot shows the model is robust to over-fitting.
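A sketch of how such a plot can be produced, reusing clf and the validation split from the tuning snippet; train_score_ and staged_predict_proba are standard GradientBoostingClassifier attributes, and the two curves may differ by a constant scale factor (the trend is what matters).

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import log_loss

    # Binomial deviance on the held-out split after each boosting stage.
    val_deviance = [log_loss(y_val, proba)
                    for proba in clf.staged_predict_proba(X_val)]

    stages = np.arange(1, clf.n_estimators + 1)
    plt.plot(stages, clf.train_score_, label="training deviance")
    plt.plot(stages, val_deviance, label="validation deviance")
    plt.xlabel("number of trees")
    plt.ylabel("deviance")
    plt.legend()
    plt.show()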



Tools

• Written in Python
• Scikit-learn toolkit
• Numpy
• Pandas

Results

• Best AMS score as of the above date: 3.62977
• 1.76 times the baseline naïve Bayes model (2.06036)
• 4.63% behind the competition-winning AMS score of 3.80581

Several method alternatives/additions:

• Initial base estimators – make predictions prior to running the GBC. The aim is to improve generalizability and robustness over a single estimator.
• Principal component analysis (PCA) – prior to running the classifier, reduce the dimensionality of the data into components that explain a maximum amount of the variance.
• Data preprocessing – imputation and standard scaling prior to running the model (a pipeline sketch follows this list).
• Support vector machines (SVM), which typically perform well with high-dimensional data.
• Rerunning the GBC over the samples classified with marginal confidence, 0.6 < P < 0.9.
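The preprocessing and PCA ideas could be combined as a scikit-learn pipeline; this is a sketch, with the imputation strategy and component count as illustrative assumptions.

    from sklearn.decomposition import PCA
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        # -999 marks missing values in this dataset; replace with column means.
        ("impute", SimpleImputer(missing_values=-999, strategy="mean")),
        ("scale", StandardScaler()),    # zero mean, unit variance per feature
        ("pca", PCA(n_components=15)),  # illustrative component count
        ("gbc", GradientBoostingClassifier()),
    ])
    pipeline.fit(X, y)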

Figure 2: Probability predictions plot for the GBC classifier. The training signal events and background are plotted in blue and red, respectively. Testing data is plotted in black, normalized to the training dataset.

References

[1] "Boosted Decision Trees", TMVA: http://tmva.sourceforge.net/#mva_bdt
[2] P. Prettenhofer, "Gradient Boosted Regression Trees in scikit-learn": https://www.youtube.com/watch?v=IXZKgIsZRm0
[3] Baumgartel, "Data in Practice": http://dbaumgartel.wordpress.com/about
