Data Transformation and Attribute Subset Selection: Do they Help Make Differences in Software Failure Prediction?

Hao Jia1, Fengdi Shu1, Ye Yang1, Qi Li2
1 Institute of Software, Chinese Academy of Sciences, China
2 University of Southern California, USA
{jiahao, fdshu, ye}@itechs.iscas.ac.cn, [email protected]

Abstract

Data transformation (DT) and attribute subset selection (AttrSS) have been adopted to improve software defect/failure prediction methods. However, little consensus has been achieved on their effectiveness. This paper reports a comparative study of these two kinds of techniques combined with four classifiers and datasets from two projects. The results indicate that data transformation displays no obvious influence on improving the performance, while attribute subset selection methods show distinguishably inconsistent output. Besides, consistency across releases and the discrepancy between the open-source and in-house maintenance projects in the evaluation of these methods are discussed.

1. Introduction

Various code attributes have been widely used as independent predictors. However, the distributions of many numeric attributes in defect datasets tend to be asymmetric (skewed), violating statistical criteria regarding distributional assumptions. Besides, in practice most datasets contain attributes that are irrelevant or even harmful to a particular prediction goal. Hence, how to dispose of the data and attributes effectively becomes a challenging issue. Among the methods used to remove the potential inconsistency, irrelevancy, and redundancy, data transformation (DT) and attribute subset selection (AttrSS) are the two commonly adopted strategies, and both have been introduced into the defect prediction literature. Our study aims at improving the convergence of the selection of DT and AttrSS methods for predicting binaries with post-release defects across releases of iteratively developed projects. Another goal is to explore the discrepancy in the evaluation of these methods between an open-source project (i.e. Eclipse) and a closed-source project (i.e. a software quality management platform, QMP).

2. Study design

With respect to the selection criteria of reported satisfactory effects and ease of use, we study the DT and AttrSS methods shown in Table 1.

Table 1. Studied DT and AttrSS methods.

Strategy                     Method                                               Abbr.
Data Transformation          Normalization: min-max normalization                 min-max
                             Normalization: z-score normalization                 z-score
                             Log transformation                                   log
Attribute Subset Selection   Statistical: principal component analysis            PCA
                             Search-based: information gain attribute ranking     InfoGain
                             Search-based: correlation-based feature selection    CFS
                             Search-based: consistency-based subset evaluation    CBS

We conduct the study on datasets from two projects of different types: the open-source Eclipse and the closed-source QMP. To mitigate the potential bias caused by using a single classifier, we select four classifiers: J48 decision tree (J48), Naïve Bayes (NB), IB1, and Random Forest (RF). Besides, to remedy the problems in measuring predictive performance caused by the routinely adopted accuracy indicator, the area under the receiver operating characteristic curve (AUC) is calculated and averaged over ten runs of 10-fold cross-validation. This study predicts failure-prone binaries with the following procedures: 1) using classifiers directly on the original data (raw classifiers for short); 2) applying DT methods and, if they improve the performance, applying AttrSS methods on the transformed data; otherwise, applying AttrSS methods directly on the original data.

Figure 1. Procedure of predicting failure-proneness. (Pipeline: extract project and defect data from the project and its defect database → select relevant attributes → data transformation → attribute subset selection → predict failure-proneness → performance evaluation.)

3. Results and interpretation

 Effects of data transformation methods
The DT methods rarely improve or debase the performance, as shown in Figure 2 a). In detail, most transformation methods show little superiority over the raw classifiers, recording few wins. Specifically, for NB and IB1 most transformation methods result in degradations, while for J48 and RF neither improvement nor degradation is discerned, with few exceptions.

 Effects of attribute subset selection methods
More than 80% of the cases record distinguishable differences between predictions with and without AttrSS methods, as shown in Figure 2 b). By pairwise comparisons, each classifier is improved by at least one AttrSS method, even though there is no consistency in their ranks across classifiers. In particular, the performance of J48 is improved by all methods at a significant level. A slight exception is that, for RF, even though CBS and InfoGain improve it, the differences are not evident enough to bear out their superiority.

 Consistency analysis across releases and granularities
In the evaluation of data transformation methods, the Eclipse datasets can be treated as insensitive to different releases and granularities, whereas for the QMP datasets much more discrepancy emerges. In the evaluation of attribute subset selection methods, for the Eclipse datasets evident consistency across releases is found with NB and IB1, and with J48 and RF the discrepancies among methods grow obscure as the granularity becomes coarser; for the QMP datasets, similar conditions occur. In conclusion, unless InfoGain is available with its fittest threshold, the choice of AttrSS methods displays little consistency in preference across releases and granularities.

Figure 2. Mean AUC values obtained by a) DT methods and b) AttrSS methods. (Both panels plot mean AUC, roughly in the 0.55-0.8 range, for NB, IB1, J48, and RF; panel a) compares raw, log, min-max, and z-score, and panel b) compares raw, PCA, InfoGain, CFS, and CBS.)

Poster session: 2:00pm - 3:30pm, Wednesday September 23, 2009, Empire Ballroom.
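The measurement protocol behind these AUC values (Section 2: AUC averaged over ten runs of 10-fold cross-validation, for each of the four classifiers) can be sketched with scikit-learn. Note the assumptions: the classifiers below are common open-source analogues of the Weka implementations named in the paper (e.g. DecisionTreeClassifier for J48, 1-nearest-neighbour for IB1), and the dataset is synthetic, so the resulting numbers are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a defect dataset: rows are binaries/modules,
# columns are code attributes, label 1 = failure-prone (the minority class).
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

classifiers = {
    "NB": GaussianNB(),
    "IB1": KNeighborsClassifier(n_neighbors=1),
    "J48": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Ten repetitions of stratified 10-fold cross-validation scored by AUC,
# then averaged -- 100 folds per classifier in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
results = {name: cross_val_score(clf, X, y, scoring="roc_auc", cv=cv).mean()
           for name, clf in classifiers.items()}
print(results)
```

Averaging over repeated stratified folds, as the study does, damps the variance of a single 10-fold split, which matters when comparing methods whose AUC differences are small.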

4. Discussion

 Evaluation of data transformation methods
   Z-score normalization works better than min-max normalization.
   Log transformation yields unacceptable results with NB and IB1.
   Recommendations: data transformation displays no obvious influence on improving classifiers' effectiveness for failure prediction, and log transformation is very likely to degrade Naïve Bayes.
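For concreteness, the three DT methods compared above can be sketched in a few lines of NumPy; the metric values below are invented for illustration, and the formulas are the standard textbook definitions rather than the study's exact implementations.

```python
import numpy as np

# Invented, skewed code-metric column (e.g. lines of code per binary):
# one large outlier dominates the scale, as is typical of defect data.
x = np.array([12.0, 30.0, 45.0, 800.0, 90.0, 7.0])

# min-max normalization: rescale into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# z-score normalization: zero mean, unit variance.
z_score = (x - x.mean()) / x.std()

# log transformation: log(1 + x) compresses the skewed right tail.
log_t = np.log1p(x)

print(min_max.round(3))
print(z_score.round(3))
print(log_t.round(3))
```

Min-max and z-score only relocate and rescale the values, while the log transform changes the shape of the distribution itself, which may help explain why it interacts so differently with distribution-sensitive classifiers such as Naïve Bayes.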



 Evaluation of attribute subset selection methods
   Information gain behaves best with its fittest threshold.
   Predictive performance is generally degraded by PCA.
   CBS and CFS perform generally well across classifiers.
   Recommendations: attribute subset selection is a useful way to enhance predictive performance. InfoGain is optimal, but when the fittest subset is not at hand (its computation is effort-consuming), its performance becomes unsatisfactory; PCA seems useless or even harmful; CBS and CFS may be a reasonable tradeoff.
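To make the contrast between the search-based rankers and PCA tangible, here is a minimal scikit-learn sketch on synthetic data. Mutual information stands in for Weka's InfoGain ranker (closely related but not identical), and the choice of k plays the role of the "fittest threshold" discussed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic defect-style data: only 5 of the 15 attributes are informative.
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           n_redundant=4, random_state=1)

# Information-gain-style ranking: keep the k attributes that share the most
# information with the failure label. Supervised, so it can discard
# irrelevant attributes -- but k (the threshold) must be tuned.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
kept = np.flatnonzero(selector.get_support())
print("attributes kept by the ranking:", kept)

# PCA, by contrast, builds unsupervised linear combinations of all
# attributes and never looks at the class label -- one plausible reason
# it tended to degrade predictive performance in this study.
X_pca = PCA(n_components=5).fit_transform(X)
print("PCA-reduced shape:", X_pca.shape)
```

The ranking also preserves the original attributes (and thus their interpretability), whereas PCA components mix every attribute together.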

 Discrepancy in the evaluation between Eclipse and QMP
   Eclipse datasets are at the file and package level, whereas QMP datasets are at the module level.
   Recommendations: given the distinctive properties of these two types of projects, separate strategies should be planned for each.

 Threats to validity
The main threat to external validity is that we cannot assume a priori that the results of a study on one project generalize to all projects of the same type, even though we took actions to reduce the potential bias by adopting subject projects with different types and development methodologies, various classifiers, and several candidate methods. Possible threats to internal validity include the choice of AUC as the performance indicator, the selection of statistical test methods, and the four classifiers used in our study.

5. Conclusions

This paper studied the applicability and efficiency of data transformation and attribute subset selection methods in failure prediction. It offers various beneficial recommendations for selecting data transformation and attribute subset selection methods to improve classifiers' predictive performance, which is helpful in maintaining iteratively developed projects. Future work includes extending the scope of the current analysis to cover more projects, so as to improve the applicability of our results and to search for more concrete and conclusive recommendations in practice.
