JMLR: Workshop and Conference Proceedings 1–8

Building and Evaluating Interpretable Models using Symbolic Regression and Generalized Additive Models

Khaled Sharif

[email protected]

Abstract

In this paper, we propose a novel method for generating interpretable regression models by forming a generalized additive model from symbolic constituents. We then outline the construction of a high-level Python framework that simplifies and automates the construction of this hybrid model, and we describe further operations on the learned model that aid its interpretation and analysis. We apply this method to a meteorological observations dataset and recover intuitions from the generated model that scientists would agree with. We also apply the framework to multiple regression datasets and compare prediction performance to decision trees, random forests, and symbolic regression; our framework achieves competitive results in both accuracy and interpretability.

Keywords: automatic machine learning; regression; generalized additive models; symbolic regression; interpretable machine learning.

1. Introduction and Related Work

The idea of Generalized Additive Models (GAMs) stems from the Kolmogorov-Arnold representation theorem, which states that every multivariate continuous function can be represented as a superposition of continuous functions of one variable together with addition (Braun and Griebel (2009)). A substantial body of work has since used GAMs, consistently citing as their main advantage the production of accurate models of the underlying processes (Yee and Mitchell (1991); Guisan et al. (2002); Dominici et al. (2002)). GAMs are also known for their interpretability: Caruana et al. (2015) presented two case studies in which high-performance GAMs with pairwise interactions were applied to real healthcare problems, yielding intelligible models with state-of-the-art accuracy. Symbolic regression (SR) has previously been used to distill natural laws from experimental data; Schmidt and Lipson (2009) demonstrated the discovery of physical laws, from scratch, directly from experimentally captured data by means of a computational search. In this paper, we first outline the proposed model, which forms a GAM from symbolic constituents in a novel way. We then describe the algorithm that iteratively generates increasingly complex and accurate models through a genetic search, briefly explain how our framework abstracts and automates both this process and the analysis of the resulting model, and finally compare prediction accuracy with other frameworks and algorithms.

© K. Sharif.


2. Generating the Proposed Additive Model

The proposed model takes the form of a linear additive combination of constituents. Each constituent takes the following form, where $a$ is an input parameter, $f$ is a differentiable function, and the $x_n$ are free variables (coefficients to be fitted):

$$x_0 \cdot f(x_1 \cdot a + x_2) + x_3 \quad (1)$$

This form was chosen for its simplicity (the function $f$ in the previous equation can be a sine, cosine, exponential, etc.) and serves as a "building block" from which highly complex models can be built. Assuming the function in the constituent is easily differentiable, the coefficients that best fit the constituent to the output can be found using Gradient Descent (GD) or a similar optimization method. It is also possible to capture relationships between input parameters in this model $M$. We do this by extending the constituent form to two dimensions, as shown below:

$$M = x_0 \cdot f(x_1 \cdot a + x_2) \cdot g(x_3 \cdot b + x_4) + x_5 \quad (2)$$

We can then extend this to multiple dimensions $n$, and therefore capture relationships between any number of parameters in one constituent $C_n$:

$$C_n = x_0 + x_1 \cdot \prod_{i=1}^{n} f_i(x_i \cdot a_i + x_{i+1}) \quad (3)$$

Our final model $M$ is therefore the summation of a defined number $m$ of constituents, each able to incorporate multiple parameters as input:

$$M = \sum_{j=1}^{m} \left[ x_j + x_{j+1} \cdot \prod_{i=1}^{n} f(x_i \cdot a_i + c) \right] \quad (4)$$

Finally, we define the cost function that allows us to iteratively maximize the correlation between the model $M$ and the desired output $O$ as $M$ evolves and grows in complexity. We illustrate this model evolution in the next section.

$$J(x_0, x_1, \ldots, x_n) = \mathrm{corr}\!\left(O,\; \sum_{j=1}^{m} \left[ x_j + x_{j+1} \cdot \prod_{i=1}^{n} f(x_i \cdot a_i + c) \right]\right) \quad (5)$$
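To make the model form concrete, the following is a minimal NumPy sketch of how a constituent (Eq. 3), the additive model (Eq. 4), and the correlation-based cost (Eq. 5) could be evaluated. The data structures (`funcs`, `coeffs`) and function names are illustrative assumptions for this sketch, not the framework's actual API.

```python
import numpy as np

def constituent(X, funcs, coeffs):
    """One multi-dimensional constituent (Eq. 3):
    bias + scale * prod_i f_i(a_i * X[:, i] + b_i)."""
    prod = np.ones(X.shape[0])
    for i, f in enumerate(funcs):
        a, b = coeffs["inner"][i]          # per-dimension scale and offset
        prod *= f(a * X[:, i] + b)
    return coeffs["bias"] + coeffs["scale"] * prod

def additive_model(X, constituents):
    """The full model (Eq. 4): the sum of all constituents."""
    return sum(constituent(X, funcs, coeffs) for funcs, coeffs in constituents)

def correlation_cost(y_true, y_pred):
    """Negative Pearson correlation (Eq. 5); minimizing it maximizes corr(O, M)."""
    return -np.corrcoef(y_true, y_pred)[0, 1]
```

For instance, a single two-dimensional constituent with `funcs = [np.sin, np.exp]` corresponds directly to the form in Eq. (2).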

3. Iterative Evolution of the Model

The evolution of a model mimics a genetic algorithm search. It begins by randomly generating a number of models, each with random constituents. The constituents are random in their number (the total number of constituents per model), in their chosen functions, and in the parameters placed within those functions (or the multiple parameters within multiple functions in the case of multi-dimensional constituents). Each member of a generation is, in effect, a representation of the hyperparameters of a model. Each member of the generation is trained on the same data, and we remove all members that did not converge, as well as those that produced models with very little correlation (less than 5%). We then rank the remaining models by the same metric used in the training loss function and select only the highest-performing percentage of the generation. If no models remain to rank, we discard the entire generation, randomly generate a new one, and repeat the process. We prune the best models by removing their negligible constituents. We consider a constituent negligible if it has very little effect on the output, given average values for the inputs of that constituent and the output (we call this the constituent contribution factor). Doing this between generations keeps the model clean in the sense that all of its constituents are of value. It also favors models with fewer constituents and avoids generating overly complex models in situations where a simpler model would have achieved comparable accuracy. Using the best models from the previous generation, we can easily cross over models to produce new ones; this is possible because of the additive structure of our models. We assume that, given two high-performing models, adding their constituents together forms a new model that, once re-trained, should perform better. The same additive property also allows us to mutate models. Mutation in a genetic algorithm is beneficial because it lets the search explore new pathways and escape local minima; in our setting, mutation amounts to randomly removing a constituent from a model or randomly adding an entirely new one.

Algorithm 1: Iterative Model Evolution

1. Normalize all input parameters to lie between 0 and 1.
2. Randomly generate a fixed number G of models.
3. For k = 1 to the maximum number of iterations:
   (a) Use an optimization algorithm to fit each model in G.
   (b) Evaluate each model in G and select the Pareto front P.
   (c) Cross over, mutate, and prune each model in P to form the next G:
       i. Cross over: adding each model in P to one another.
       ii. Mutation: randomly adding a constituent to, or removing one from, each model in P.
       iii. Pruning: for each model in P, removing negligible constituents.
4. Use the best model generated to produce the final symbolic equation.

It is a good idea to speed up convergence of the genetic algorithm by carefully selecting the initial generation of the search, instead of relying on a completely random procedure. The method proposed by the framework performs feature selection using feature-importance-based methods, such as a Decision Tree Regressor. We select a number of input parameters that are of high importance (and are in some way highly correlated with the output). We then place them in functions that suit their correlation with the output; for example, if parameter X is correlated with the output Y, but the transformation f(X) has a higher correlation with the output, then we place that parameter in a constituent that has f as its function. This can easily be replicated for multi-dimensional constituents too.

Algorithm 2: Head-Starting an Evolutionary Search

1. For a = 1 to the maximum number of constituents:
   (a) Randomly select n input parameters: x0, x1, ..., xn.
   (b) Iterate over candidate functions f and evaluate the correlation Co between the output and f(x0, x1, ..., xn).
   (c) Evaluate the average correlation Ci between the selected input parameters.
   (d) Add the resulting constituent to the model if Co is above a certain threshold and Ci is below another threshold.
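As an illustration of Algorithm 2, the sketch below seeds an initial generation using feature importances from a Decision Tree Regressor and a small set of candidate functions. The candidate set, thresholds, and return format are assumptions made for this example rather than the framework's actual interface; the inter-parameter correlation check (Ci) for multi-dimensional constituents is omitted for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative candidate functions; the framework may use a different set.
CANDIDATE_FUNCS = {"sin": np.sin, "exp": np.exp, "identity": lambda v: v}

def head_start(X, y, n_constituents=5, corr_threshold=0.3):
    """Seed constituents: pick important input parameters, then keep the
    (parameter, function) pair whose transform correlates best with y."""
    importances = DecisionTreeRegressor().fit(X, y).feature_importances_
    ranked = np.argsort(importances)[::-1]              # most important first
    seeds = []
    for idx in ranked[:n_constituents]:
        name, f = max(CANDIDATE_FUNCS.items(),
                      key=lambda kv: abs(np.corrcoef(kv[1](X[:, idx]), y)[0, 1]))
        if abs(np.corrcoef(f(X[:, idx]), y)[0, 1]) > corr_threshold:
            seeds.append((idx, name))                    # parameter index and function name
    return seeds
```

Because Algorithm 1 normalizes all inputs to lie between 0 and 1, candidate transforms such as the exponential remain well-behaved in this screening step.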

4. Implementation Details

The framework uses an automatic differentiation library, TensorFlow, to differentiate and iterate over the large, complex equation formed from an arbitrary number of constituents when fitting it with GD. In practice, plain GD has many downsides, and the framework can be used with more advanced optimization algorithms, such as the Adam optimizer. TensorFlow not only allows the framework to generate arbitrarily large equations that mix simple functions and their compositions; it also allows the optimization to run on the GPU, yielding a substantial reduction in the time needed for the algorithm to converge to a minimum. We believe this use of the library and of hardware acceleration makes our approach to symbolic regression highly performant compared to other methods. The framework code is open-sourced on GitHub.
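A minimal sketch of how a single constituent's coefficients could be fitted with TensorFlow's automatic differentiation and the Adam optimizer is shown below; the function and variable names are illustrative, not the framework's own.

```python
import tensorflow as tf

def fit_constituent(a, y, func=tf.sin, steps=2000, lr=0.01):
    """Fit x0 * f(x1 * a + x2) + x3 (Eq. 1) by maximizing Pearson correlation."""
    a = tf.convert_to_tensor(a, tf.float32)
    y = tf.convert_to_tensor(y, tf.float32)
    x = [tf.Variable(tf.random.normal([])) for _ in range(4)]   # x0..x3
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            pred = x[0] * func(x[1] * a + x[2]) + x[3]
            # negative Pearson correlation as the loss (Eq. 5)
            pc = pred - tf.reduce_mean(pred)
            yc = y - tf.reduce_mean(y)
            loss = -tf.reduce_sum(pc * yc) / (tf.norm(pc) * tf.norm(yc) + 1e-8)
        opt.apply_gradients(zip(tape.gradient(loss, x), x))
    return [v.numpy() for v in x]
```

Because the computation is expressed as TensorFlow operations, the same loop runs unchanged on a GPU when one is available.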

5. Empirical Analysis

We evaluate the framework on a meteorological dataset obtained from multiple weather stations in the same city; the task is therefore time-series regression. The dataset contains over two years of observations from 10 weather stations distributed across the city. Observations are spaced 5 minutes apart, and there are gaps in the data for some stations. Throughout this example, we model the ambient temperature (in Celsius), with all other observations (such as humidity, wind speed, and rain rate) as input parameters. We split the dataset into 80% training data and 20% testing data, ensuring there is no overlap between the two sets. We repeat each experiment 10 times and average the resulting evaluation metrics. We refrain from K-fold cross-validation because of the inherent serial correlation and potential non-stationarity of the data; when evaluating a time-series prediction, an out-of-sample (OOS) evaluation is usually preferred (Bergmeir et al. (2015)). During our testing, we found a significant difference between the output of the two methods (the OOS method produced evaluation metrics with significantly lower accuracy). We therefore present only the results of the OOS method. We also include below additional results from three other datasets obtained from the UCI Machine Learning Repository, namely the Beijing PM2.5, Airfoil Self-Noise, and Istanbul Stock Exchange datasets (Liang et al. (2015); Lau et al. (2006); Akbilgic et al. (2014)). For all datasets we use the same OOS protocol described above. For comparison, we use the Decision Tree Regressor (default parameters) and Random Forest (250 estimators) implementations from the Scikit-Learn library, and we compare against traditional symbolic regression using the gplearn library (100 generations); a minimal sketch of this baseline comparison follows Table 2. To compare the different frameworks, we use the R2 score (also known as the explained variance score) to evaluate performance and accuracy on each dataset; the results are shown in Table 1. We also report the Pearson correlation coefficient in Table 2. We find that symbolic regression failed to produce meaningful results under the R2 score and scored worst on average under the Pearson correlation coefficient. Our framework achieved results comparable to a Random Forest and significantly better results on some datasets (notably the meteorological dataset). It is worth noting, however, that a Random Forest is a very complex model that is much harder to interpret than the symbolic equation produced by our framework.

Dataset       Symbolic Regr.   Decision Tree   Random Forest   Our Work
Meteorology   -∞               -19.8           37.4            58.7
Beijing       -∞               -15.9           31.3            31.5
Airfoil       -∞                51.2           69.4            62.6
Istanbul      -∞               -17.9           38.9            43.3

Table 1: Comparison between different frameworks using the Explained Variance Score. Higher numbers indicate increased accuracy.

Dataset       Symbolic Regr.   Decision Tree   Random Forest   Our Work
Meteorology    1.5             28.1            37.4            79.0
Beijing       12.2             36.6            56.2            56.4
Airfoil       27.4             73.8            84.2            81.0
Istanbul      52.6             42.2            64.1            66.2

Table 2: Comparison between different frameworks using the Pearson Correlation Coefficient. Higher numbers indicate increased accuracy.
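For reproducibility, the baseline comparison described above can be sketched as follows. Data loading and the OOS train/test split are assumed to be done elsewhere, and scores are scaled to percentages as reported in the tables.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from gplearn.genetic import SymbolicRegressor

def evaluate_baselines(X_train, y_train, X_test, y_test):
    """Out-of-sample comparison with the baselines used in Tables 1 and 2."""
    baselines = {
        "Decision Tree": DecisionTreeRegressor(),                # default parameters
        "Random Forest": RandomForestRegressor(n_estimators=250),
        "Symbolic Regr.": SymbolicRegressor(generations=100),
    }
    results = {}
    for name, estimator in baselines.items():
        pred = estimator.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "r2": 100 * r2_score(y_test, pred),                  # Table 1
            "pearson": 100 * np.corrcoef(y_test, pred)[0, 1],    # Table 2
        }
    return results
```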



6. Constituent Contribution Analysis

We previously defined each individual term in the generated model as a constituent. To produce the output, each constituent is evaluated and the evaluations are summed; hence the term additive model. We define the ratio of a constituent's evaluation to the total output of the additive model as the constituent contribution. The constituent contribution is expressed as a percentage of the output, given average values of the input parameters and the corresponding output. Simply put, if a constituent contribution is found to be X%, then on average that constituent evaluates to X% of the total sum of the additive model it belongs to. Our framework makes it easy to generate this analysis for any generated model. We used the previously described meteorological dataset to produce an accurate model for ambient temperature and then used the framework to produce a constituent contribution analysis. Upon interpretation, the contributions of each constituent made sense to a meteorologist: the highest contributor to temperature was the time of day (because of the presence or absence of sunlight), and the lowest contributor was ultraviolet light (which has little effect on ambient temperature).
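A minimal sketch of the constituent contribution computation is given below; `constituent_fns` is an assumed representation of the model (one callable per constituent), not the framework's internal one.

```python
import numpy as np

def constituent_contributions(constituent_fns, X):
    """Constituent contribution factors: the share of the additive model's
    output attributable to each constituent at the average input, in percent."""
    x_mean = X.mean(axis=0, keepdims=True)      # average value of every input parameter
    values = np.array([np.asarray(fn(x_mean)).ravel()[0] for fn in constituent_fns])
    return 100 * np.abs(values) / (np.abs(values).sum() + 1e-12)
```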

7. Partial Differential Analysis

The models produced by the framework are symbolic (they use the Sympy library) and are therefore very easy to partially differentiate with respect to individual parameters. The framework allows this to be done during or after model creation. Moreover, the framework can provide correlation analysis between the partial derivatives obtained from the built model and those estimated from the true data. This is done by iterating over all the parameters of a particular model, obtaining the partial derivative of the output with respect to each parameter, and then evaluating these derivative expressions over the dataset. These evaluations are estimates of the partial derivatives of the output with respect to the parameters. We then correlate them with the corresponding estimates obtained directly from the dataset and display the correlations as metrics to the framework user, alongside the partial derivative expressions themselves. This automatic discovery of underlying relationships in the dataset is extremely useful to scientists studying the data. We again used the framework to model ambient temperature, as before, and assessed the partial derivative of temperature with respect to humidity. Through our interpretation of the partial derivative obtained, we observed a non-linear negative correlation between ambient temperature and humidity; that is, an increase in relative humidity yields a non-linear decrease in surface temperature. While this is, of course, a gross over-simplification, it holds true (on average) from a meteorological point of view.
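The partial-derivative analysis could be sketched with Sympy as follows. The data layout (a dictionary of NumPy arrays keyed by symbol name) and the finite-difference estimate of the empirical derivative are assumptions made for this example; the framework's own estimate from the dataset may differ.

```python
import numpy as np
import sympy as sp

def partial_derivative_correlations(expr, symbols, data, output):
    """Differentiate the symbolic model w.r.t. each input, evaluate the
    derivative over the dataset, and correlate it with a crude
    finite-difference estimate taken from the data itself."""
    correlations = {}
    inputs = [data[str(s)] for s in symbols]
    for s in symbols:
        d_expr = sp.diff(expr, s)                       # symbolic partial derivative
        d_fn = sp.lambdify(symbols, d_expr, "numpy")
        model_grad = np.broadcast_to(d_fn(*inputs), output.shape)
        empirical = np.gradient(output) / (np.gradient(data[str(s)]) + 1e-12)
        correlations[str(s)] = np.corrcoef(model_grad, empirical)[0, 1]
    return correlations
```

In the meteorology example above, the humidity entry of the returned dictionary corresponds to the temperature-humidity relationship discussed in this section.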

8. Conclusions

Our proposed model and accompanying framework simplify the automatic generation of interpretable symbolic models that achieve competitive accuracy on a number of regression datasets. The framework also simplifies and abstracts the automatic analysis and interpretation of the generated models.


References

Oguz Akbilgic, Hamparsum Bozdogan, and M. Erdal Balaban. A novel hybrid RBF neural networks model as a forecaster. Statistics and Computing, 24(3):365–375, 2014.

Christoph Bergmeir, Rob J. Hyndman, Bonsoo Koo, et al. A note on the validity of cross-validation for evaluating time series prediction. 2015.

Jürgen Braun and Michael Griebel. On a constructive proof of Kolmogorov's superposition theorem. Constructive Approximation, 30(3):653, 2009.

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730. ACM, 2015.

Francesca Dominici, Aidan McDermott, Scott L. Zeger, and Jonathan M. Samet. On the use of generalized additive models in time-series studies of air pollution and health. American Journal of Epidemiology, 156(3):193–203, 2002.

Antoine Guisan, Thomas C. Edwards, and Trevor Hastie. Generalized linear and generalized additive models in studies of species distributions: setting the scene. Ecological Modelling, 157(2):89–100, 2002.

Kevin Lau, R. López, E. Oñate, E. Ortega, R. Flores, M. Mier-Torrecilla, S. R. Idelsohn, C. Sacco, and E. González. A neural networks approach for aerofoil noise prediction. Master's thesis, 2006.

Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. In Proc. R. Soc. A, volume 471, page 20150257. The Royal Society, 2015.

Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.

Thomas W. Yee and Neil D. Mitchell. Generalized additive models in plant ecology. Journal of Vegetation Science, 2(5):587–602, 1991.



Appendix A. Answering the Reviewer's Questions

a. How is the method described in this paper different from, or better than, Schmidt and Lipson (2009)?

The framework in that paper uses symbolic regression only. The authors perform experiments similar to those in this paper: they automatically extract natural laws from experimental data through a process that favours a balance between accuracy and parsimony (in this context, a parsimonious model is a simple one). Their framework is based solely on symbolic regression (SR). In our paper, we automatically form a generalized additive model using a method similar to traditional SR. Our framework therefore uses a more complex algorithm that can generate arbitrarily large equations with more intricate forms than those produced by a traditional SR method. In addition, our method produces models faster and with greater accuracy than the SR method.

b. Do you expect the GPU to give a large speedup for low-dimensional, complex functions?

Yes, most significantly when the model itself is sufficiently large. Our experiments have shown a speedup of around 10x for sufficiently large models generated by the framework, compared to performance on a CPU.

c. Why do you maximize correlation, instead of RMSE, or even better, predictive log-probability?

We chose to maximize correlation in the experiments in this paper because it produced better metrics overall. The framework allows a number of different optimization metrics to be used, including RMSE. Minimizing RMSE in regression tasks can sometimes result in a model that always predicts the average value. We do not use predictive log-probability because this framework deals with regression, not classification.

d. Does your method use both Tensorflow and Sympy to differentiate the models?

The creation and training of the model (and all differentiation related to those steps) is done entirely in Tensorflow. The framework provides both the training and an abstract link between Tensorflow and Sympy so that, to the user, it appears as though they are training a Sympy equation. The framework does, however, use Sympy to differentiate the model after it has been produced (for analysis purposes).

