The Higgs Challenge is a machine learning (ML) competition coordinated by CERN and Kaggle. The goal of the challenge is to use advanced ML methods to improve the discovery significance of the ATLAS experiment. The task at hand is to classify particle collision events as either signal (Higgs boson decay) or background noise.
Data and Classification
• CSV files of training and test datasets, with 250,000 and 550,000 samples, respectively.
• Samples are vectors of values (floating-point and integer) for 30 features. Missing data is represented by -999.
• Classification is either signal Higgs event (S) or background noise (B). A ranking of the classifications is also required. The classifier is optimized for the approximate median significance (AMS) metric, a function of the weighted numbers of correctly and incorrectly labeled signal events.
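For reference, the AMS as defined in the challenge documentation, where s and b are the weighted numbers of correctly and incorrectly selected signal events and b_reg = 10 is a regularization constant:

\[
\mathrm{AMS} = \sqrt{2\left((s + b + b_{\mathrm{reg}})\,\ln\!\left(1 + \frac{s}{b + b_{\mathrm{reg}}}\right) - s\right)}
\]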
Gradient boosting: a statistical view of boosting, generalizing it to arbitrary loss functions (more).
• Boosting: an ensemble technique in which each member of the ensemble is an expert on the errors of its predecessor, iteratively re-weighting training examples based on those errors.
• Builds a prediction model as an ensemble of decision trees.
• Boosted decision trees have proven useful in high-energy physics analyses [1].
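To make the boosting idea concrete, here is a toy least-squares gradient-boosting loop (an illustrative sketch only, not the competition classifier): each tree is fit to the residual errors of the ensemble built so far, and its contribution is damped by the learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data standing in for real features/targets.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)  # start from a trivial base prediction
trees = []
for _ in range(100):
    residual = y - prediction                      # errors of the predecessor ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # shrinkage: damp each step
    trees.append(tree)
```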
GBC Advantages
• Works with heterogeneous data (features measured on different scales).
• Supports different loss functions to fit a specific problem.
• Automatically detects (non-linear) feature interactions.
• Offers transparency: not a total black box like neural networks.
• Relatively robust to over-fitting, so boosting can be applied many times.
• A deviance plot (Fig. 1) is a diagnostic to determine whether the model is overfitting.

GBC Disadvantages
• Needs careful tuning.
• Slow to train.
• Cannot extrapolate beyond the range of the training data.
GBC parameters
• Model complexity is controlled by the number of trees and the depth of the individual trees; higher complexity carries the consequence of overfitting.
• The deeper the tree, the more variance can be explained; depth controls the degree of feature interaction.
• Shrinkage: slow down learning by shrinking the predictions of each tree by some small scalar (the learning rate).
• A lower learning rate (dampening the gradient step of the algorithm by a constant value) requires a greater number of estimators.
• There is a trade-off between accuracy and runtime.

Figure 1: Training and validation error (deviance) as a function of the number of trees (model complexity). Typically the training and validation curves would diverge, but this plot shows the model is robust to overfitting.
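A minimal sketch of how these parameters map onto scikit-learn's GradientBoostingClassifier. The data is a synthetic stand-in and the hyperparameter values are illustrative, not the tuned competition values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the challenge data (the real sets have 30 features).
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
# 90%/10% split, mirroring the cross-validation scheme described below.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=200,   # number of trees: more trees = higher model complexity
    max_depth=5,        # tree depth: controls the degree of feature interaction
    learning_rate=0.1,  # shrinkage: lower values require more estimators
)
gbc.fit(X_train, y_train)

# Data for a deviance plot like Fig. 1: per-stage training deviance is stored
# in gbc.train_score_; validation deviance comes from the staged predictions.
val_deviance = [log_loss(y_val, p) for p in gbc.staged_predict_proba(X_val)]
```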
Method Details
• Gradient boosting classifier (GBC) with Python scikit-learn.
• Hyperparameter tuning: optimize the classifier parameters by cross-validation.
• Cross-validation by randomly splitting the data into training (90%) and validation (10%) sets for each run.

Stochastic gradient boosting:
• Subsampling the training set before growing each tree.
• Subsampling the features before finding the best split node.

• Test set classifications are ranked by their probability of being a signal event.
• Probability predictions above 85% are classified as signal events. This is shown in the probability prediction plot of Fig. 2, where the blue shaded region is this upper 15%. The 15% cut was chosen to maximize the AMS metric (see the sketch below).

Figure 2: Probability predictions for the GBC. Training signal and background events are plotted in blue and red, respectively. Testing data is plotted in black, normalized to the training dataset.
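A sketch of the threshold choice, using the AMS formula from above. The helper best_threshold is hypothetical; proba, y, and w stand for validation-set signal probabilities, true labels, and the event weights supplied with the training file:

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    # Approximate median significance, as defined by the challenge.
    return np.sqrt(2 * ((s + b + b_reg) * np.log(1 + s / (b + b_reg)) - s))

def best_threshold(proba, y, w, cuts=np.linspace(0.50, 0.99, 50)):
    # Scan candidate probability cuts and keep the one maximizing AMS.
    scores = []
    for cut in cuts:
        selected = proba > cut            # events classified as signal
        s = w[selected & (y == 1)].sum()  # weighted correctly labeled signal
        b = w[selected & (y == 0)].sum()  # weighted incorrectly labeled signal
        scores.append(ams(s, b))
    return cuts[int(np.argmax(scores))]
```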
Results
• Best AMS score as of above date = 3.62977.
• 1.76 times the baseline naïve Bayes model (2.06036).
• 4.63% behind the competition-winning AMS score of 3.80581.

Several method alternatives/additions:
• Initial base estimators: make predictions prior to running the GBC, aiming to improve generalizability and robustness over a single estimator.
• Principal component analysis (PCA): prior to running the classifier, reduce the dimensionality of the data into components that explain the maximum amount of variance.
• Data preprocessing: imputation and standard scaling prior to running the model (a pipeline sketch follows this list).
• Support vector machines (SVMs), which typically perform well on high-dimensional data.
• Rerunning the GBC over the samples classified with marginal confidence (0.6 < P < 0.9).
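A sketch of the preprocessing and PCA additions as a scikit-learn pipeline. It assumes the modern SimpleImputer API; the median strategy and the 95% variance threshold are illustrative choices, not ones from the original analysis:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    SimpleImputer(missing_values=-999, strategy="median"),  # impute the -999 sentinels
    StandardScaler(),                                       # standard scaling
    PCA(n_components=0.95),  # keep components explaining 95% of the variance
    GradientBoostingClassifier(),
)
# pipeline.fit(X_train, y_train), then pipeline.predict_proba(X_test) as before.
```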
Tools
• Written in Python
• Scikit-learn toolkit
• NumPy
• Pandas

Code can be found on my GitHub. Contact me with comments and questions.

References
[1] "Boosted Decision Trees", TMVA: http://tmva.sourceforge.net/#mva_bdt
[2] P. Prettenhofer, "Gradient Boosted Regression Trees in scikit-learn": https://www.youtube.com/watch?v=IXZKgIsZRm0
[3] Baumgartel, "Data in Practice": http://dbaumgartel.wordpress.com/about