2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement

When a Patch Goes Bad: Exploring the Properties of Vulnerability-Contributing Commits

Andrew Meneely, Harshavardhan Srinivasan, Ayemi Musa, Alberto Rodríguez Tejeda, Matthew Mokary, Brian Spates
Department of Software Engineering
Rochester Institute of Technology
Rochester, NY, USA
[email protected], {hxs8839, ajm661, acr921, mxm6060, bxs4361}@rit.edu

Abstract—Security is a harsh reality for software teams today. Developers must engineer secure software by preventing vulnerabilities, which are design and coding mistakes that have security consequences. Even in open source projects, vulnerable source code can remain unnoticed for years. In this paper, we traced 68 vulnerabilities in the Apache HTTP server back to the version control commits that contributed the vulnerable code originally. We manually found 124 Vulnerability-Contributing Commits (VCCs), spanning 17 years. In this exploratory study, we analyzed these VCCs quantitatively and qualitatively with the over-arching question: "What could developers have looked for to identify security concerns in this commit?" Specifically, we examined the size of the commit via code churn metrics, the amount developers overwrite each others' code via interactive churn metrics, exposure time between VCC and fix, and dissemination of the VCC to the development community via release notes and voting mechanisms. Our results show that VCCs are large: more than twice the code churn of non-VCCs on average, even when normalized against lines of code. Furthermore, a commit was twice as likely to be a VCC when the author was a new developer to the source code. The insight from this study can help developers understand how vulnerabilities originate in a system so that security-related mistakes can be prevented or caught in the future.

Index Terms—vulnerability, churn, socio-technical, empirical.

I. INTRODUCTION

Security is a harsh reality for software teams today. Insecure software is not only expensive to maintain, but can cause immeasurable damage to a brand, or worse, to the livelihood of customers, patients, and citizens. To software developers, the key to secure software lies in preventing vulnerabilities. Software vulnerabilities are special types of "faults that violate an [implicit or explicit] security policy" [1]. If developers want to find and fix vulnerabilities, they must focus beyond making the system work as specified and prevent the system's functionality from being abused. According to security experts [2]–[4], finding vulnerabilities requires expertise in both the specific product and in software security in general. The field of engineering secure software has a plethora of security practices for finding vulnerabilities, such as threat modeling, penetration testing, code inspections, misuse and abuse cases [5], and automated static analysis [2]–[4]. While these practices have been shown to be effective, they can also be inefficient. Development teams are then faced with the challenge of prioritizing their fortification efforts within the entire development process. Developers might know what is possible, but lack a firm grip on what is probable. As a result, an uninformed development team can easily focus on the wrong areas for fortification.

Fortunately, a historical, longitudinal analysis of how vulnerabilities originated in professional products can inform fortification prioritization. Understanding the specific trends of how vulnerabilities can arise in a software development product can help developers understand where to look and what to look for in their own product. Some of these trends have been quantified in vulnerability prediction studies [6]–[10] using metrics aggregated at the file level, but little has been done to explore the original coding mistakes that contributed to the vulnerabilities in the first place. In this study, we have identified and analyzed original coding mistakes as Vulnerability-Contributing Commits (VCCs), or commits in the version control repository that contributed to the introduction of a post-release vulnerability.

A myriad of factors can lead to the introduction and lack of detection of vulnerabilities. A developer may make a single massive change to the system, leaving his peers with an overwhelmingly large review. Furthermore, a developer may make small, incremental changes, but his work might be affecting the work of many other developers. Or, a developer may forget to disseminate her work in the change notes, and so the code may miss out on being reviewed entirely.

The objective of this research is to improve software security by analyzing the size, interactive churn, and community dissemination of VCCs. We conducted an empirical case study of the Apache HTTP Server project (HTTPD). Using a multi-researcher, cross-validating, semi-automated, semi-manual process, we identified the VCCs for each known post-release vulnerability in HTTPD. To explore commit size, we analyzed three code churn metrics. Interactive churn is a suite of five recently-developed [6] socio-technical variants of code churn metrics that measure the degree to which developers' changes overwrite each others' code at the line level. To explore community dissemination, we analyzed the


exposure time of vulnerabilities, mined the change notes, mined the voting mechanisms, and identified if the VCC was a new source code file. In total, the contributions of this study are:

- Results of a quantitative study regarding code churn metrics, interactive churn metrics, and exposure time in relation to VCCs;
- Results of a qualitative study regarding the amount of dissemination a VCC had to the rest of the development community.

II. BACKGROUND AND DEFINITIONS

In this paper, we use several terms regarding version control systems and process metrics. The interactive churn metrics we propose are socio-technical metrics, a term we borrow from sociology to describe the connection between two people in the context of work-related collaboration [11], which is of a social and technical nature. The term "technical" is not referring to technology-related activities, but to the more general idea of technicality, skill, and labor. Thus, a socio-technical metric focuses on people and their interactions with others.

We use the term commit to describe a recorded, individual change to the source code as recorded by the version control system. Some version control systems call these "revisions" or "change sets". Each commit contains a diff, or a compared difference between a source code file prior to the commit and after the commit. Another name for a diff is a "patch". A diff is often computed by a built-in tool, such as GNU Diff (http://www.gnu.org/software/diffutils). Each commit also contains an author, which is the name and email of the person who made the commit.

Most modern version control systems have a blame tool (e.g. git blame, svn blame, cvs annotate). The blame tool is a command that traverses the history of a given source code file, aggregating all of the changes, to show the most recent author and commit for each line in the file. We also make use of the Git (http://git-scm.com/) feature called git bisect [12]. With bisect, we can automatically conduct a binary search over the revision history to identify a commit where a vulnerability was introduced. Section IV details how we used git bisect.

III. RELATED WORK

Analyzing version control repositories for code churn and vulnerabilities is a common occurrence in today's software engineering research. Code churn has been used in several prediction models, and often appears among the stronger correlations with faults and vulnerabilities. To our knowledge, no other researchers have conducted a rigorous empirical study based on identifying VCCs in a large software system.

Munson and Elbaum [13], [14] were among the first to explore the concept of code churn. Analyzing code churn in conjunction with several complexity and size metrics, the authors combined their metrics together to predict faults in a large embedded software system. Regarding code churn by itself, they found a strong correlation (Pearson r=0.728, p<0.05) between the code churn of modules and the number of faults in those modules. Compared to this paper, their analysis was focused on the number of faults as opposed to specific commits that contributed to vulnerabilities.

Nagappan et al. [15]–[17] have provided several empirical analyses of code churn on case studies such as Microsoft Windows. In one study, Nagappan and Ball [15] proposed that researchers normalize their code churn measure by the number of source lines of code of the file. In that study, they found that normalized, or "relative", code churn measures were better predictors of product quality metrics. In our empirical study, we made use of the relative form of code churn and normalized our interactive churn metrics.

Casebolt et al. [18] examined the concept of "author entropy" regarding developer collaboration. Author entropy is an analysis of author contribution to a given file. In their analysis, files with low author entropy have a dominant author, while files with high entropy do not. The author entropy metric provides a snapshot contribution summary but does not provide details of the relationship between the developers in a given file; it only identifies the spread of contribution based on the number of commits. Our research provides a finer-grained analysis of authors changing code at the line level.

In previous studies [6]–[9], we investigated code churn along with complexity and developer activity metrics as indicators of security vulnerabilities. We studied the likelihood of a file being vulnerable by collecting the metrics before a release and using them to predict the reported vulnerabilities post-release. We also developed some socio-technical metrics for the study, such as metrics dealing with contribution to a file and the socio-technical developer network surrounding a file. We conducted studies on the Red Hat Enterprise Linux kernel and Mozilla Firefox. The combined model of code churn, complexity, and developer activity metrics predicted 80% of known vulnerable files with less than 25% false positives. We also have studied and developed socio-technical metrics in various other case studies [19]–[22].

Some researchers have developed methods of identifying commits that contributed to bugs. The most notable of these is the SZZ algorithm [23]–[25]. The main difference between their algorithm and our methodology is our employment of git bisect instead of their use of git blame (or equivalent). We also applied more human judgment and cross-checking than is reported in studies that apply the SZZ algorithm.

IV. METHODOLOGY FOR IDENTIFYING VCCS

The process we used for identifying VCCs from source code is both manual and automated in nature. In related work [23]–[25], other researchers suggest using the equivalent of git blame to determine the commit that introduced a vulnerability. While this methodology is sound, we found it to be less efficient than using git bisect [12], a feature designed for this express purpose. The git bisect command utilizes a binary search over the commit history to determine which commit was the first



commit where the system started exhibiting some behavior. Of particular interest to us was how git bisect takes into account divergent branches to make sure no commit is missed. With git bisect, the researcher defines the last known "good" version of the system, then marks a known commit when the system was vulnerable. Then, the researcher provides a program that automatically identifies if the system has a given vulnerability, and git bisect will tell the researcher which commit contributed to the introduction of the vulnerability. We note that we do not claim that our methodology is novel; rather, we are providing this description for transparency and reproducibility.

Related literature [13], [23]–[25] uses terms such as "injecting", "fix-inducing", or "fault introducing" commits to describe what we call VCCs. We find those words to be misleading, as the original commits identified by this method did not cause a developer to immediately fix them as words like "induce" imply; rather, they contributed to the problem. Furthermore, we find the word "contributing" more apt for describing these commits because we found that, upon closer scrutiny, multiple commits often contributed to the introduction of a given vulnerability. Thus, one VCC may not always "introduce" a vulnerability per se, but can still be a coding mistake that played a part in the vulnerability. Semantics aside, VCCs are a subset of what other researchers have dubbed "fix-inducing" commits.

Three researchers in all conducted this portion of the project, including the first author. Two researchers made the original identifications, and the fourth author was assigned to randomly and independently re-create the bisects to check for correctness. The first author inspected all of these identifications as well. This process of identifying VCCs required hundreds of person-hours over six months to collect and vet this data set. Our VCC identification process can be summarized in these steps:
1. Identify the fix commit(s) of the vulnerability;
2. From the fix, write an ad hoc detection script to identify the coding mistake for the given vulnerability automatically;
3. Use git bisect to binary search the commit history for the VCC;
4. Inspect the potential VCC, revising the detection script and re-running bisect as needed.

1. Identify the fix commit(s) of the vulnerability. A fix commit is a specific change in the system's version control repository where developers altered the code to fix a vulnerability. For this step, we conducted a manual investigation into each vulnerability to determine the fix commit. Sometimes the development team kept records of which commit fixed a vulnerability; other times we needed to search using information in NIST's National Vulnerability Database (NVD, http://nvd.nist.gov/), relevant project-specific notes (e.g. the CHANGES or STATUS file), commit messages in Git, and other vulnerability disclosure information.

When a given vulnerability had multiple fix commits, we considered those commits to be one complete fix. If a vulnerability was incorrectly fixed (i.e. a regression), and the development team discovered this regression at a later date, we deferred to the development team as to whether or not the regression was considered a new vulnerability (in HTTPD, regressions typically were considered new vulnerabilities when significant time had passed). In our case study, we were able to trace every reported vulnerability from HTTPD back to its original fix commits in the system.

2. From the fix, write an ad hoc detection script to identify the coding mistake for the given vulnerability automatically. For this step, we examined the fix, description, and surrounding information for the given vulnerability and identified the specific coding mistakes that led to the vulnerability. Thus, the code prior to the fix was the vulnerable code that our script would detect statically via a string search based on our understanding of the vulnerability as a whole. For example, if the vulnerability was Cross-Site Scripting, then the coding mistake the developer made was to output data that was not sanitized of HTML characters. Specifically, the developer output user input and forgot to make an API call to sanitize the HTML characters. To detect if this vulnerability existed, the string search would look for that instance where a developer output user input without sanitization. Context would be added to the detection script to ensure that the search does not produce false positives (although a false positive would be clearly apparent to the researcher anyway).

When the vulnerability fix involves entirely new code, we examine the context to determine what error of omission the developer made. For example, if the vulnerability involved forgetting to check for a null pointer, then our script would detect when the surrounding lines would pass from one line to the next without checking for a null pointer.

Some vulnerabilities might have no context with which we can bisect. For example, the fix for a vulnerability might include declaring a new function in a utility library. Developers can place new functions at a variety of locations in a given file when order does not matter, so the surrounding context of that new function has nothing to do with the code being vulnerable. Thus, in that situation, the vulnerable file cannot be bisected and that data point was removed from the study. Some vulnerabilities might have multiple regions in a given file where a fix was required. In that case, we treat each of those regions with a separate detection script to maximize the number of potential VCCs we can find. As described in Section V, we found that only 12 of 134 file fixes could not be bisected for this reason. In each of those situations, other VCCs were identifiable for the same vulnerability. Thus, no vulnerabilities were missed by the analysis, only a few files.

This step in particular requires human understanding of the security concerns of the case study system and the specific methods of mitigation that the developers undertook to fix the vulnerability. This step cannot be automated, and requires time




and expertise to hone. As discussed above, to mitigate subjectivity we used three researchers who cross-checked each other’s work and helped each other understand the specific coding mistakes the developers made.
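To make step 2 concrete, the following is a minimal sketch of what one of our ad hoc detection scripts might look like, written for use with git bisect run. The file path and vulnerable pattern below are hypothetical stand-ins rather than an actual HTTPD vulnerability; a real script encodes the specific coding mistake gleaned from the fix.

#!/usr/bin/env python
# Hypothetical detection script for one vulnerability, run by
# `git bisect run` at each candidate commit. Exit 0 means the
# vulnerable code is absent ("good"); exit 1 means it is present ("bad").
import sys

SOURCE_FILE = "server/example_module.c"         # hypothetical file
VULNERABLE_PATTERN = "ap_rvputs(r, user_input"  # hypothetical mistake

def main():
    try:
        with open(SOURCE_FILE) as f:
            source = f.read()
    except IOError:
        sys.exit(0)  # file does not exist this far back: not vulnerable yet
    if VULNERABLE_PATTERN in source:
        sys.exit(1)  # coding mistake present: mark this commit "bad"
    sys.exit(0)      # coding mistake absent: mark this commit "good"

if __name__ == "__main__":
    main()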

TABLE I. SUMMARY STATISTICS OF HTTPD

Vulnerabilities Traced:                     83
Non-3rd-Party Vulnerabilities:              68
Commits:                                24,061
Commit-File data points:                25,847
VCCs found:                                124
Vulnerability data points with no VCC:      12
Vulnerabilities with no VCCs:                0
Number of Authors of VCCs:                  31

3. Use git bisect to binary search the commit history. Once we are satisfied that our script detects the vulnerable chunk of code, we use the command git bisect run to search the history of the vulnerable file. Git bisect conducts a binary search of the commits and runs our detection script to determine the commit where the code base was initially vulnerable. The output of this step is a commit.
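The following sketch shows how step 3 might be driven from a script; the repository path and commit hashes are hypothetical placeholders, and detect.py stands in for the step 2 script. The same commands can equally be typed at the shell.

# Hypothetical driver for step 3, shelling out to git bisect.
import subprocess

REPO = "httpd"    # hypothetical clone location
FIX = "abc1234"   # hypothetical fix commit for the vulnerability
OLD = "def5678"   # hypothetical old commit known to predate the flaw

def git(*args):
    result = subprocess.run(["git", "-C", REPO] + list(args),
                            capture_output=True, text=True)
    return result.stdout

git("bisect", "start")
git("bisect", "bad", FIX + "^")  # parent of the fix still has the flaw
git("bisect", "good", OLD)       # known-good ancestor
# Run the detection script at each step; exit 0 = good, exit 1 = bad.
print(git("bisect", "run", "python", "detect.py"))
git("bisect", "reset")           # restore the working tree afterwards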


4. Inspect the resulting commit, revising the detection script and re-running as needed. Upon running the bisect command, the researcher inspects the resulting commit to see if it was, in fact, contributing to the introduction of a vulnerability. For repeatability, the researchers would correct the script and re-run the bisect until the resulting commit was clearly the one that contributed to the introduction of the vulnerability. For example, when a file had been renamed, the detection script would need to check the old file. Or, if the context of the vulnerability had been refactored in a way irrelevant to the vulnerability (e.g. a developer renamed a method or changed comments), the string search would be updated to handle the changing context.

V. CASE STUDY: APACHE HTTP SERVER

The Apache HTTP web server (HTTPD) has been the most commonly used web server in the world since 1996, and as of December 2012 is the server for 63.7% of active websites on the World Wide Web [26]. Since 2002, the HTTPD team has been publicly documenting their vulnerabilities. HTTPD is primarily written in C, and provides a range of functionality via various modules and networking protocols.

In this study, we were able to trace all 83 documented vulnerabilities back to their original source code fix. We found two vulnerabilities in the NVD that were never acknowledged by HTTPD and never fixed, so they were removed from this study. Some of the vulnerabilities reported by the HTTPD team were actually vulnerabilities in third-party libraries. As a result, the HTTPD team did not release a fix to their own code base for those vulnerabilities; they simply used a patched version of the library. We removed 15 such vulnerabilities from this study since no VCC can exist for them. One notable external project that was extensively used by HTTPD was the Apache Portable Runtime (APR), which we considered to be a dependency. Thus, this study covers 68 vulnerabilities from HTTPD. To keep comparisons of code churn the same across languages, we studied only source code in the C language. Two vulnerabilities involved configuration of environment shell scripts, and we removed those files from the code churn study. No other languages were involved in the 68 vulnerabilities.

A single vulnerability could, and often did, involve multiple VCCs due to multiple files and/or multiple developer mistakes. Fixing a vulnerability can involve multiple files, so we bisected each of those files to find VCCs for each of them. Furthermore, multiple regions of a single file may be modified in a given fix, so we bisected each region of a fix to identify VCCs there. Finally, a given VCC can also affect more files than the vulnerable file in question. Thus, we define a VCC as a modification to a given file, and consider the not-known-to-be-vulnerable files of the offending commit to be not a VCC.

In HTTPD, we had 25,847 commit-file data points. Of those data points, 124 were VCCs. Furthermore, a total of 12 files from 7 vulnerabilities had no possible VCC and were removed from the study. These were situations where the vulnerability fix involved defining a new constant or function in the system (as described in Section IV). For each of those 7 vulnerabilities, other VCCs were found among its other files, so all 68 vulnerabilities remained covered by our analysis. Table I depicts these metrics.

VI. SIZE OF VULNERABILITY-CONTRIBUTING COMMITS

One of the most common methods of measuring version control commits is with code churn metrics [9], [14], [15], [27]. In recent history, software researchers have discovered evidence of a "churn" effect: frequently-changing code is statistically more likely to have faults and vulnerabilities. The motivation for code churn is intuitively appealing: the more developers change code, the more problems they are likely to introduce. But is code churn truly to blame for introducing faults and vulnerabilities? If the motivation for studying code churn is that high amounts of change in the code base can lead to developers making mistakes, then VCCs ought to have higher code churn than non-VCCs, on average. Thus, we examine the research question:

Q1. Churn. Are VCCs larger than non-VCCs?

To measure commit size, we use three metrics: Code Churn, Relative Churn, and 30-Day Churn. The Code Churn metric is typically computed for a given commit of a file. The version control system computes a diff, which is a matching of which lines of code were changed in the given source code file, denoted by lines deleted and lines added. More changes to a file mean more lines added or deleted, and therefore more Code Churn. One concern from Nagappan et al. [15] regarding the raw Code Churn metric is its relation to the overall size of the file. Code churn of 100 lines has a much different meaning for a file of 200 lines than for one of 20,000 lines. Thus, we provide an additional metric, called Relative Churn, where we normalize the code churn of the file by the number of lines of code (LOC) in the file after the commit. For example, a file with 100 lines of churn that ended with 200 LOC would have a Relative Churn of 50%.


We note that if a file was fully rewritten in a single commit, it can have a Relative Churn exceeding 100%. A disadvantage of these two metrics is that they do not take into account what has been happening to the system recently. For example, a commit with 10 lines of Code Churn may be considered a low risk, but if that commit is taken together with a burst of five other 10-line commits in the last month, the probability of introducing vulnerabilities may increase. One commit may not explain the entire story of a given source code file in its temporal context. Thus, we collected a third churn metric: 30-Day Churn. We decided upon 30 days as our threshold after an analysis of the commit regularity of HTTPD developers. The 30-Day Churn metric covers only the commits in the 30 days immediately prior, so 30-Day Churn does not double-count the commit itself. Also, throughout our data collection, we set all of our diff utilities to ignore whitespace changes. In summary, our metric definitions are:

- Code Churn is the number of lines inserted plus the number of lines deleted for a commit diff, ignoring whitespace.
- Relative Churn is the Code Churn divided by the total lines of code for the file after the commit.
- 30-Day Churn is the sum of Code Churn for a given source code file in the 30 days prior to the commit.

As an example of computing code churn, Fig. 1 shows an abbreviated example diff for a single commit (git hash 08c38d0831) in HTTPD. The commit involved changing API calls, depicted by three insertions and three deletions, so the Code Churn in this example is six. The file had 560 lines after this commit, so the Relative Churn was 1.1%. To evaluate how the churn metrics are related to security vulnerabilities, we examine the difference between VCCs and non-VCCs in terms of each metric.
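As a sketch of how these two single-commit metrics can be computed with stock git commands, consider the following; the commit hash and path come from the Fig. 1 example, and the helper functions are illustrative assumptions rather than our actual collection scripts.

# Minimal sketch: Code Churn and Relative Churn for one commit of one file.
import subprocess

def git_lines(*args):
    out = subprocess.run(["git"] + list(args), capture_output=True,
                         text=True, check=True).stdout
    return out.splitlines()

def code_churn(commit, path):
    # -w ignores whitespace-only changes, as in our data collection
    diff = git_lines("diff", "-w", "-U0", commit + "^", commit, "--", path)
    inserted = sum(1 for l in diff
                   if l.startswith("+") and not l.startswith("+++"))
    deleted = sum(1 for l in diff
                  if l.startswith("-") and not l.startswith("---"))
    return inserted + deleted

def relative_churn(commit, path):
    loc = len(git_lines("show", "%s:%s" % (commit, path)))
    return code_churn(commit, path) / float(loc) if loc else 0.0

# Fig. 1 example: churn of 6 over 560 lines, i.e. about 1.1%
print(code_churn("08c38d0831", "server/util_mutex.c"))
print(relative_churn("08c38d0831", "server/util_mutex.c"))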

TABLE II. ASSOCIATION RESULTS: MEANS AND MWW TEST

Metric            Mean (VCC)   Mean (non-VCC)   MWW p-value
Code Churn        608.5        42.2             p < 0.00001
Relative Churn    55.7%        23.1%            p < 0.00001
30-Day Churn      1012.3       266.7            p < 0.00001

Because it carries no normality assumption, and as suggested in other metrics validation studies [28]–[30], we use the non-parametric Mann-Whitney-Wilcoxon (MWW) test. We compare the means for VCCs and non-VCCs, and we compare the p-value against 0.05. We present the results of our association analysis in Table II. Based on these results, we find that code churn metrics are empirically associated with VCCs: Code Churn, Relative Churn, and 30-Day Churn are all higher for VCCs on average. Thus, we can conclude that bigger commits and bursts of big commits have historically been associated with the commits that have been found to contribute to the introduction of vulnerabilities in HTTPD.

VII. SOCIO-TECHNICAL CONCERNS

Code churn metrics alone do not take into account a critical factor of any software project: the developers. Specifically, code churn metrics ignore who is making the changes and who is affected by those changes. We examine this additional, socio-technical form of code churn to better gauge how developers are interacting via source code changes. The result, however, is not a measurement of commit size, but of developer interaction via commits. We examine two research questions:

- Q2. Interactive Churn. Are VCCs associated with churn that affects other developers?
- Q3. New Effective Author. Is a commit more likely to be a VCC when the author is a new committer to the code?

We recently introduced several interactive churn metrics [6]. In that study, we found that interactive churn metrics, when aggregated at the file level, are statistically associated with source code files that had post-release vulnerabilities in the PHP programming language. This is our first study analyzing interactive churn metrics at the commit level, along with the 30-Day versions of interactive churn metrics. In this section, we first explain our five interactive churn metrics (Section A), then empirically answer our research questions (Section B), providing a brief discussion about how interactive churn metrics can be actionable to software developers.
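For readers who want to reproduce this style of analysis, a minimal sketch of the MWW test using SciPy follows; the metric values shown are fabricated placeholders, not our data.

# Sketch of the association test: MWW on a metric for VCCs vs. non-VCCs.
from scipy.stats import mannwhitneyu

# Hypothetical Code Churn values; in the study these would be the 124
# VCC commit-file points versus the ~25,000 non-VCC points.
vcc_churn = [320, 610, 1450, 95, 780]
non_vcc_churn = [12, 40, 7, 55, 23, 90, 31]

stat, p = mannwhitneyu(vcc_churn, non_vcc_churn, alternative="two-sided")
print("MWW p-value: %.5f" % p)  # compared against 0.05 in Table II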

$ git log -1 -p -U0 08c38d0831
commit 08c38d0831c46ed5b62e2f83e42a4c84e111d553
Author: Jeff Trawick
Date:   Tue Aug 7 14:49:44 2012 +0000

    Mutex directive: finish support of DefaultRuntimeDir

diff --git a/server/util_mutex.c b/server/util_mutex.c
--- a/server/util_mutex.c
+++ b/server/util_mutex.c
@@ -120 +120 @@ AP_DECLARE(apr_status_t) ap_parse_mutex[…]
-    *mutexfile = ap_server_root_relative(pool, file);
+    *mutexfile = ap_runtime_dir_relative(pool, file);
@@ -307 +307 @@ static const char *get_mutex_filename(a[…]
-    return ap_server_root_relative(p,
+    return ap_runtime_dir_relative(p,
@@ -555 +555 @@ AP_CORE_DECLARE(void) ap_dump_mutexes(a[…]
-    dir = ap_server_root_relative(p, mxcfg->dir);
+    dir = ap_runtime_dir_relative(p, mxcfg->dir);

Fig. 1. Abbreviated commit and diff in HTTPD from Git

A. Computing Interactive Churn Metrics

The idea behind interactive churn [6] is to examine who is making the changes and whose code is being changed at the line level in source code. In a single commit, a developer may be revising her own code, or changing her colleague's code. While developers may not have records of explicit code ownership practices, the version control system can provide a listing of who was the last person to modify a given line of

$ git blame 08c38d0831^ -- server/util_mutex.c | \
    grep -e " 120)" -e " 307)" -e " 555)"
55fcb2ed (Jim Jagielski   120) *mutexfile = ap_server[…]
c391b9d1 (Jeff Trawick    307) return ap_server_root_[…]
ff444d9c (Stefan Fritsch  555) dir = ap_server_root_r[…]

Fig. 2. Abbreviated blame output for computing PIC and NAA metrics


code via a built-in blame tool. Each line that a developer's commit affects was last modified either by him- or herself, or by another developer. This line-level analysis provides a fine-grained record of developers interacting (knowingly or unknowingly) via specific lines of source code. To compute interactive churn metrics for a given commit and a given file, we do the following:
1. Process the commit diff to identify the author and the lines of code that were affected by the given commit.
2. Run the blame tool on the file to identify the authors of the lines affected.
3. Look up the lines affected by the diff in the blame output, aggregating the different authors affected by the commit.

For example, Fig. 1 shows that the author of the commit was Jeff Trawick and that the three lines affected were at lines 120, 307, and 555 (denoted by @@). To compute interactive churn, we need to know if the three affected lines were last modified by Jeff or by someone else, so we use the git blame tool. Fig. 2 shows the output of the blame command, filtered by the three lines in question. The output shows that two of the three lines affected by Jeff's commit were previously modified by two other developers, Jim and Stefan. Thus, we say that the number of interactive churn lines in this commit is two out of a possible three. We note that Fig. 2 displays a basic grep command for filtering the output for the example. This specific method is too simplistic for actual data collection, as it can lead to false positives. Our scripts for collecting interactive churn metrics handle the blame output more carefully to prevent such false positives.

We also leverage the blame tool to measure the developer activity regarding the most recent authors of the file. When a file undergoes high amounts of change activity, many different developers may be involved. Those developers can be overwriting each others' changes, so prior developers' code may no longer exist in the latest version of the file. Furthermore, for a given commit, a developer may be making changes to a file for the first time (or, effectively for the first time, if his code was rewritten since his last commit). To account for whether developers are new to the file, we define an Effective Author as one who appears on at least one line in the blame output. In summary, we define three single-commit interactive churn metrics:

- PIC is the percentage of the interactive churn lines to the total number of lines of code affected by the commit. If no lines of code were affected (e.g. only insertions), PIC is undefined due to division by zero;
- NAA is the number of distinct authors besides the commit author whose lines were affected by the given commit. If PIC is undefined, NAA is zero;
- NEA? is a nominal "Yes" or "No" for whether the author is a New Effective Author, i.e. not found by the blame command anywhere in the file prior to the commit.

Following our example from Fig. 1 and Fig. 2, the PIC of the commit is 66% (two of the three affected lines were

last modified by authors other than Jeff), and the NAA is two (for Jim and Stefan). We know from line 307 that Jeff had previously changed the file, so his NEA? is a "No". Additionally, in the same spirit as 30-Day Churn, we examined 30-Day PIC and 30-Day NAA. We define those metrics as:

- 30-Day PIC is the total percentage of lines that affected other developers for all commits to the given file in the prior 30 days.
- 30-Day NAA is the total number of distinct developers affected by the last 30 days of commits for the given file.

In practice, we recommend that developers use PIC, NAA, and NEA? as supplements to Code Churn and Relative Churn because together they provide a more complete picture of how the code changed in terms of people. Historically, Code Churn has been a prominent predictor of bugs and vulnerabilities [15], [31]. However, one of the disadvantages of Code Churn is how developers can interpret it. One may simplistically view high Code Churn as an indication to just avoid changing code. For developers, however, code must change so it can be improved upon. Thus, we find the Code Churn metric to lack the property of being actionable. Consider the following scenarios, all of which involve high amounts of Code Churn:

- Self Churn: A developer is making major revisions to mostly her own code. Her commits overall would have high Code Churn, but low PIC and low NAA, and she is not a NEA.
- Small Team Churn: A team of three developers is revising a large feature together. In aggregate, these commits would have high Code Churn and high PIC, but a low NAA, and the developers would not be NEAs.
- Newbie Rewrite: A new developer is making massive changes to large parts of the system, some of which he is completely unfamiliar with. These commits would have high Code Churn, high PIC, high NAA, and often a NEA.

In a software development team, each of those scenarios can be risky or not depending on the context. More importantly, however, all three of those scenarios are considerably different situations, yet they would all yield high Code Churn. With interactive churn metrics, we can separate out these different situations for a better understanding of what is happening in terms of developer activity.

B. Analyzing Interactive Churn Metrics and VCCs

To examine whether developer interaction at the source code line level is associated with making security-related mistakes, we ask the following question:

Q2. Interactive Churn. Are VCCs associated with churn that affects other developers?

We used the PIC, NAA, 30-Day PIC and 30-Day NAA metrics for this analysis. We also used the same analysis as for the churn metrics (the MWW test of association). Our results are detailed in Table III.
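As an illustration of steps 1-3 above, a simplified sketch of the computation follows. It assumes the affected line numbers have already been parsed from the diff, and it is more naive than our actual collection scripts (for instance, it matches authors by name only).

# Sketch: PIC, NAA, and NEA? for one commit of one file.
import re
import subprocess

def run(*args):
    return subprocess.run(["git"] + list(args), capture_output=True,
                          text=True, check=True).stdout

def blame_authors(commit, path):
    """Map line number -> last author, for the file just before `commit`."""
    out = run("blame", "--line-porcelain", commit + "^", "--", path)
    authors, lineno = {}, 0
    for line in out.splitlines():
        header = re.match(r"^[0-9a-f]{40} \d+ (\d+)", line)
        if header:
            lineno = int(header.group(1))
        elif line.startswith("author "):
            authors[lineno] = line[len("author "):]
    return authors

def interactive_churn(commit, path, commit_author, affected_lines):
    authors = blame_authors(commit, path)
    touched = [authors[n] for n in affected_lines if n in authors]
    others = [a for a in touched if a != commit_author]
    pic = len(others) / float(len(touched)) if touched else None
    naa = len(set(others))
    nea = commit_author not in authors.values()  # new to the whole file?
    return pic, naa, nea

# Fig. 1/Fig. 2 example: expect PIC = 2/3, NAA = 2, NEA? = False
print(interactive_churn("08c38d0831", "server/util_mutex.c",
                        "Jeff Trawick", [120, 307, 555]))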


TABLE III. ASSOCIATION RESULTS: MEANS AND MWW TEST

Metric        Mean (VCC)   Mean (non-VCC)   MWW p-value
PIC           70.8%        66.1%            p = 0.54
NAA           1.78         1.01             p < 0.05
30-Day PIC    55.6%        65.0%            p < 0.01
30-Day NAA    3.45         2.70             p = 0.60

Interestingly, the 30-Day PIC is lower on average for VCCs than for non-VCCs. Thus, historically in HTTPD, VCCs happened more often when the prior 30 days of commits had self churn rather than interactive churn. This intriguingly counter-intuitive result is consistent with our prior study of the vulnerabilities of the PHP programming language [6]. The NAA metric was statistically significant as well. This result indicates that, historically, commits that affected more developers (closer to two other developers on average than one) were more likely to be VCCs. Regarding the two statistically insignificant results, PIC and 30-Day NAA, no conclusions can be drawn: statistically speaking, we do not have enough evidence to claim that PIC and 30-Day NAA are any different for VCCs and non-VCCs. This result does not preclude PIC and 30-Day NAA from being useful to developers, however, in identifying different kinds of churn. Thus, our answer to Q2 is "yes, but with a few exceptions".

Beyond interactive churn metrics, we can also discover if a developer was new to a given source code file using the NEA? metric. This leads to the following research question:

Q3. New Effective Author. Is a commit more likely to be a VCC when the author is a new committer to the source code?

We used the NEA? metric defined in the previous section, collecting it for the entire data set to compare VCCs and non-VCCs. Since the NEA? metric is nominal (i.e. has an outcome of "Yes" or "No"), we use the Chi-squared contingency table test, as suggested in the literature [28]–[30]. In total, 52 (41.9%) of VCCs were from a New Effective Author. The contingency table is shown in Table IV, and the results were statistically significant (p<0.0001). Thus, the empirical evidence favors that a commit is more likely to be a VCC if it was authored by a NEA. Furthermore, we note that VCCs are extremely rare occurrences to begin with. But, when the author of a commit is a New Effective Author, the probability of that commit being a VCC more than doubles (from 0.3% to 0.8%). Thus, we conclude that a commit is more likely to be a VCC when the author is effectively a new developer to that file.

TABLE IV. CONTINGENCY TABLE FOR NEA? METRIC

NEA?   VCC? No           VCC? Yes
No     19,206 (99.6%)    72 (0.3%)
Yes    6,517 (99.2%)     52 (0.8%)
(row percentages in parentheses)
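A minimal sketch of this contingency-table test, using the counts from Table IV with SciPy, might look like the following.

# Sketch of the Chi-squared contingency test for the NEA? metric.
from scipy.stats import chi2_contingency

#         VCC? No   VCC? Yes
table = [[19206, 72],    # NEA? No
         [6517,  52]]    # NEA? Yes

chi2, p, dof, expected = chi2_contingency(table)
print("Chi-squared p-value: %.6f" % p)  # reported as p < 0.0001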

VIII. COMMUNITY DISSEMINATION

One of the key components of any open source software project is leveraging the community of developers to review changes. Eric Raymond declared in his famous essay [32] that "many eyes make all bugs shallow". A key part of leveraging the development community, however, is to disseminate one's work so that it can be reviewed. In HTTPD, every commit is automatically sent to a mailing list, and the developers have various venues for discussing their changes to the system. But some commits are explicitly recorded in other fashions, such as the change logs that get released to users. In this section, we investigate community dissemination with the following research questions:

- Q4. Exposure. How long did vulnerabilities remain in the system?
- Q5. Baseline. How often was a VCC part of an original source code import?
- Q6. Known Offender. How many VCCs occurred in files that had already been patched for a different vulnerability?
- Q7. Notable Changes. Were VCCs likely to be noted in the change log or status files?

The typical time between a VCC and its corresponding fix indicates how much time is available for developers to conduct their code reviews and tests.

Q4. Exposure. How long did vulnerabilities remain in the system?

We found that most vulnerabilities remained in the system from VCC to fix for a long time: on average 1,175 days, with a median of 853 days. Figure 3 depicts a visualization of all source code files that had vulnerabilities, counting the number of vulnerabilities in the system at each given time. Figure 4 depicts a histogram of the number of days between a VCC and its fix. Only 5% of the VCCs had an exposure of fewer than 30 days, and 26% of the VCCs were in the system for fewer than 365 days. By contrast, 6% of the VCCs remained in the system for over a decade.

In mature systems such as HTTPD, however, code can remain unchanged in the system for a long period of time. But, as developers make commits to each file, they get an opportunity to review and test the code for other problems. To measure this, we computed the number of commits between each VCC and its corresponding fix. Figure 5 depicts a histogram, with outliers 484 and 586 not shown for visual reasons. The median number of commits between VCC and fix was 48, with the average skewed by outliers to 76.9 commits. 100% of VCCs had at least one commit between VCC and fix. Interestingly, the two outlier files in commits between VCC and fix were http_protocol.c and mod_rewrite.c, respectively. Both of these files parse un-trusted user data and have each had three vulnerabilities over the years, yet vulnerabilities remained unnoticed for a very long time despite consistent developer


activity. Our analysis of Known Offenders (Q6) examines this phenomenon more deeply.
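To make the exposure measurements concrete, a small sketch follows; the commit hashes are hypothetical placeholders, and the real analysis iterates over all VCC-fix pairs.

# Sketch: exposure in days and in commits between a VCC and its fix.
import subprocess

def run(*args):
    return subprocess.run(["git"] + list(args), capture_output=True,
                          text=True, check=True).stdout.strip()

def exposure_days(vcc, fix):
    timestamp = lambda c: int(run("show", "-s", "--format=%ct", c))
    return (timestamp(fix) - timestamp(vcc)) / 86400.0

def commits_between(vcc, fix, path):
    # count commits touching the file after the VCC, up to the fix
    return int(run("rev-list", "--count", vcc + ".." + fix, "--", path))

# Hypothetical hashes for one vulnerability's VCC and fix commits
print(exposure_days("deadbeef", "cafef00d"))
print(commits_between("deadbeef", "cafef00d", "server/http_protocol.c"))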

Since Q1 told us that most VCCs are big commits, and Q4 indicates most vulnerabilities are old, one possibility is that most vulnerabilities pre-existed in the initial writing of the system or of a given feature. When a new feature is written, the initial commit tends to be large, and HTTPD has been a stable and mature product for well over a decade. In terms of community dissemination, developers may be able to find more vulnerabilities by focusing on new source code files. We examine this possibility in Q5.

Q5. Baseline. How often was a VCC part of an original source code import?

To count this, we identified which VCCs were "baseline" commits, that is, commits where the file was new to the system. The git tool identified baseline commits for us, and we manually inspected the results to ensure that the source file was truly new to the system and not renamed or reorganized in the directory structure. We also consulted the change notes and other artifacts to triangulate this data. Additionally, 31 (22.7%) of the VCCs were associated with more than one vulnerability, so we only count those commits once. In total, only 13.5% of VCCs were baseline commits. If we double-count a VCC for each of its multiple vulnerabilities, then 23.5% of VCCs were baseline commits. While these numbers are not small, they do not account for the majority of VCCs. In other words, vulnerabilities arise in pre-existing source code more often than in new source code files. We note here that we do not equate "new features" with "new source code files"; we did not have enough contextual information to classify each VCC as a bug fix or a feature.

Another element of community dissemination is the ability of the developers to react to past vulnerabilities in the system. As Figure 3 shows, the fixes for the 68 vulnerabilities in HTTPD have been spread out over long periods of time. Thus, some source code files can be "Known Offenders", or files that have been fixed for a vulnerability in the past. If VCCs primarily occur on Known Offender files, then developers can focus their efforts more acutely on the small subset of files that have been affected by a post-release vulnerability.

Q6. Known Offender. How many VCCs occurred in files that had already been patched for a different vulnerability?

To measure Q6, we counted the number of VCCs that occurred after the fix of a different vulnerability on the same file. We found that 33 (26.6%) of VCCs occurred on files that were Known Offenders. These 33 VCCs covered 14 (20.1%) of the vulnerabilities that HTTPD has patched over the years. This result may seem less surprising given the result of Exposure (Q4), that vulnerabilities remain in the system for long periods of time. Thus, while we believe the Known Offender property of VCCs is prevalent enough for developers to pay attention to, it does not explain the majority of VCCs, nor even a quarter of the vulnerabilities in HTTPD.

Finally, the HTTPD project has a wide variety of ways developers collaborate on commits. Two that we were able to mine were the STATUS file and the CHANGES file in their source tree. The STATUS file is an informal record, checked into version control, of what each developer is currently working on; developers also use this file for brief discussion or voting on an issue at hand. The CHANGES file is a more formal document used for logging major changes to the system, such as new features or bug fixes that affect users. Thus, if a developer notes her changes in one of these files, she is effectively disseminating the work to the HTTPD community for potential review.

Q7. Notable Changes. Were VCCs likely to be noted in the change log or status files?

To answer this question, we examined each VCC and determined if the developer decided to disseminate his or her change to the community via the CHANGES or STATUS file. We looked at four days on either side of each VCC to catch situations where the developers checked in their votes or change note statements in a separate commit. This investigation was a qualitative one that involved triangulating developer discussions and notes. To mitigate the subjective nature of this evaluation, we always had at least two researchers make this assessment individually, then resolve any disagreements by consensus. To compare our results against non-VCCs, we also examined a control group by randomly sampling 150 commits to HTTPD, limiting our sampling to commits that affected C source code files. Again, we note that 22.7% of our VCCs were associated with more than one vulnerability, so we only count those commits once. To compare the difference between VCCs and our control group of non-VCCs, we conducted a Chi-squared test of contingency tables. We consider a commit to be notable if we found it in the STATUS or CHANGES file.

Our results for CHANGES and STATUS can be found in Table V. Our Chi-squared test of the bottom row of Table V showed that, in fact, VCCs are more frequently noted in STATUS or CHANGES (p<0.05). This result indicates that VCCs are being publicized more often than their counterparts, although not by much. Thus, the issue is not necessarily that VCCs are any less publicized than non-VCCs; rather, developers may be forgetting security concerns when they review changes to HTTPD.

Fig. 4. Histogram of days of exposure between VCC and fix

Fig. 5. Histogram of commits between VCC and fix.

Fig. 3. Timeline for source code files in HTTPD with vulnerabilities


TABLE VI. SUMMARY OF RESULTS

Question (Section)               Selected Results
Q1. Churn (VI)                   VCCs average 608.5 lines of churn (vs. 42.2 for non-VCCs), or 55.7% relative churn (vs. 23.1%)
Q2. Interactive Churn (VII)      VCCs average 1.78 authors affected (vs. 1.01 for non-VCCs)
Q3. New Effective Author (VII)   41.9% of VCCs were by a New Effective Author; 0.8% of NEA commits were VCCs (vs. 0.3% of other commits)
Q4. Exposure (VIII)              Median days from VCC to fix was 853; median commits between VCC and fix was 48
Q5. Baseline (VIII)              13.5% of VCCs were original source code imports
Q6. Known Offenders (VIII)       26.6% of VCCs were to known offender files, covering 20% of the vulnerabilities
Q7. Notable Changes (VIII)       48.6% of VCCs were mentioned in CHANGES or STATUS (vs. 44.0% of sampled non-VCCs)


IX. DISCUSSION

We summarize the results of Q1-Q7 in Table VI. To us, the most telling results are that VCCs tend to be large commits (Q1), yet tend not to be baseline commits (Q5). The fact that vulnerabilities exist in the code for many years at a time (Q4) and that only around a quarter of the VCCs were in Known Offender files (Q6) may indicate that many more vulnerabilities still exist in HTTPD to be discovered. The Baseline (Q5) result also indicates that VCCs are likely to occur in the future, since vulnerabilities tend to be introduced as part of evolution, not as the initial import. Furthermore, VCCs historically have been Notable Changes (Q7), yet the development community missed the security concerns when the commits entered the system. Vulnerabilities also tend to be spread out across the system, as shown by our timeline in Figure 3. In particular, the figure depicts most vulnerable files as having only one or two vulnerabilities exposed at any given time. Finally, none of the properties in this study covered the majority of VCCs or vulnerabilities. Having pored over these vulnerabilities one by one, we can attest that these vulnerabilities and their VCCs are quite a diverse set when examined qualitatively. Trends exist, but no single property we know of can yet explain all vulnerabilities, or even the majority of vulnerabilities.

X. LIMITATIONS

The VCC identification process involves a mixture of automation coupled with human judgment about what constitutes a coding mistake that led to a vulnerability. The human judgment can lead to some subjectivity, and potentially variability in the data. To mitigate this, we used three researchers to check each others' work, debate differences until agreement, and make corrections as necessary. Furthermore, our method of identifying VCCs leans more toward sound results than complete results. In other words, we do not know that our VCCs were, in fact, the only VCCs in the system; we are confident, though, that the ones we found are correct (i.e. sound). Fortunately, the vastness of the non-VCCs in the data set means that a few false negatives would not skew the results very far.

Also, since our process of identifying VCCs was static, we do not know if the given vulnerability was truly exploitable at the time of the VCC. Most vulnerabilities in systems do not have public exploits, and constructing exploits for vulnerabilities would have made this project infeasible in terms of time and expertise. But we mitigated this by focusing on the static coding mistakes of the vulnerabilities from the fixes and by taking a holistic view of each vulnerability by way of all the relevant artifacts.

Finally, we do not know that the 68 vulnerabilities in HTTPD were, in fact, all of the vulnerabilities that HTTPD has. New vulnerabilities are being found in HTTPD all the time, so more VCCs may exist that we do not know about. However, this is true of nearly every empirical study of bugs or vulnerabilities; we cannot draw conclusions about bugs or vulnerabilities that we do not know about.

XI. SUMMARY

The objective of this research is to improve software security by exploring code churn and other socio-technical properties of VCCs. We adapted a semi-automated methodology for identifying the coding mistakes that led to 68 vulnerabilities in the Apache HTTPD server. We identified 124 VCCs in HTTPD and conducted an exploratory analysis of various properties of these VCCs. We examined seven research questions covering a wide variety of potential properties that contribute to vulnerabilities. We analyzed code churn metrics and interactive churn metrics, and explored questions of community dissemination. Developers can use this insight to better understand how vulnerabilities arise in a software project. In the future, we plan to expand this research into more artifacts such as email discussions, into more case studies, into the reliability realm, to improve churn metrics, and to increase insight into the meaning of interactive churn metrics in the context of software processes and socio-technical concerns.

TABLE V. SUMMARY OF COMMUNITY DISSEMINATION RESULTS

                     VCCs          Non-VCC Sample
Noted in STATUS      9 (8.6%)      20 (13.3%)
Noted in CHANGES     46 (43.8%)    41 (27.3%)
STATUS or CHANGES    51 (48.6%)    66 (44.0%)


REFERENCES
[1] I. V. Krsul, "Software Vulnerability Analysis," PhD Dissertation, Purdue University, 1998.
[2] J. Allen, S. Barnum, R. Ellison, G. McGraw, and N. Mead, Software Security Engineering, 1st ed. Addison-Wesley Professional, 2008.
[3] G. McGraw, Software Security: Building Security In. Addison-Wesley Professional, 2006.
[4] C. Wysopal, L. Nelson, D. D. Zovi, and E. Dustin, The Art of Software Security Testing: Identifying Software Security Flaws, 1st ed. Addison-Wesley Professional, 2006.
[5] P. Hope, G. McGraw, and A. I. Anton, "Misuse and abuse cases: getting past the positive," IEEE Security & Privacy, vol. 2, no. 3, pp. 90–92, Jun. 2004.
[6] A. Meneely and O. Williams, "Interactive Churn: Socio-Technical Variants on Code Churn Metrics," in Int'l Workshop on Software Quality, 2012, pp. 1–10.
[7] A. Meneely and L. Williams, "Strengthening the Empirical Analysis of the Relationship Between Linus' Law and Software Security," in Empirical Software Engineering and Measurement, Bolzano-Bozen, Italy, 2010, pp. 1–10.
[8] A. Meneely and L. Williams, "Secure Open Source Collaboration: an Empirical Study of Linus' Law," in Int'l Conference on Computer and Communications Security (CCS), Chicago, Illinois, USA, 2009, pp. 453–462.
[9] Y. Shin, A. Meneely, L. Williams, and J. Osborne, "Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities," TSE, vol. 37, no. 6, pp. 772–787, 2011.
[10] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, "Predicting vulnerable software components," in Computer and Communications Security, New York, NY, USA, 2007, pp. 529–540.
[11] E. L. Trist and K. W. Bamforth, "Some social and psychological consequences of the longwall method of coal-getting," Technology, Organizations and Innovation: The Early Debates, p. 79, 2000.
[12] "git-bisect." [Online]. Available: https://www.kernel.org/pub/software/scm/git/docs/git-bisect.html. [Accessed: 01-Apr-2013].
[13] S. G. Elbaum and J. C. Munson, "Getting a handle on the fault injection process: validation of measurement tools," in Proc. International Symposium on Software Metrics (METRICS), 1998, p. 133.
[14] J. C. Munson and S. G. Elbaum, "Code churn: a measure for estimating the impact of code change," in Proc. International Conference on Software Maintenance, 1998, pp. 24–31.
[15] N. Nagappan and T. Ball, "Use of Relative Code Churn Measures to Predict System Defect Density," in 27th International Conference on Software Engineering (ICSE), St. Louis, MO, USA, 2005, pp. 284–292.
[16] N. Nagappan, B. Murphy, and V. Basili, "The Influence of Organizational Structure on Software Quality: An Empirical Case Study," in 30th International Conference on Software Engineering (ICSE), Leipzig, Germany, 2008, pp. 521–530.
[17] N. Nagappan, "Toward a Software Testing and Reliability Early Warning Metric Suite," in Proceedings of the 26th International Conference on Software Engineering, 2004, pp. 60–62.
[18] J. R. Casebolt, J. L. Krein, A. C. MacLean, C. D. Knutson, and D. P. Delorey, "Author entropy vs. file size in the GNOME suite of applications," in Mining Software Repositories, 2009, pp. 91–94.
[19] A. Meneely, L. Williams, W. Snipes, and J. Osborne, "Predicting Failures with Developer Networks and Social Network Analysis," in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Atlanta, Georgia, 2008, pp. 13–23.
[20] A. Meneely and L. Williams, "Socio-Technical Developer Networks: Should We Trust Our Measurements?," in International Conference on Software Engineering (ICSE), Waikiki, Hawaii, USA, 2011, pp. 281–290.
[21] A. Meneely, M. Corcoran, and L. Williams, "Improving developer activity metrics with issue tracking annotations," in WETSoM 2010 Workshop on Emerging Trends in Software Metrics, Cape Town, South Africa, 2010, pp. 75–80.
[22] A. Meneely and L. Williams, "On the Use of Issue Tracking Annotations for Improving Developer Activity Metrics," Advances in Software Engineering, vol. 2010, pp. 1–9, 2010.
[23] C. Williams and J. Spacco, "SZZ revisited: verifying when changes induce fixes," in Proceedings of the 2008 Workshop on Defects in Large Software Systems, New York, NY, USA, 2008, pp. 32–36.
[24] S. Kim, T. Zimmermann, K. Pan, and E. J. Whitehead, "Automatic Identification of Bug-Introducing Changes," in 21st IEEE/ACM International Conference on Automated Software Engineering (ASE '06), 2006, pp. 81–90.
[25] J. Śliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?," SIGSOFT Softw. Eng. Notes, vol. 30, no. 4, pp. 1–5, May 2005.
[26] Netcraft, "December 2012 Web Server Survey," Internet Research. [Online]. Available: http://news.netcraft.com/archives/2012/12/04/december-2012-web-server-survey.html. [Accessed: 15-Feb-2013].
[27] S. A. Ajila and R. T. Dumitrescu, "Experimental use of code delta, code churn, and rate of change to understand software product line evolution," J. Syst. Softw., vol. 80, no. 1, pp. 74–91, 2007.
[28] A. Meneely, B. Smith, and L. Williams, "Validating Software Metrics: A Spectrum of Philosophies," TOSEM, vol. 21, no. 4, pp. 24–48, Oct. 2012.
[29] N. F. Schneidewind, "Validating Software Metrics: Producing Quality Discriminators," pp. 225–232, May 1991.
[30] N. F. Schneidewind, "Methodology for Validating Software Metrics," IEEE Transactions on Software Engineering (TSE), vol. 18, no. 5, pp. 410–422, 1992.
[31] Y. Shin, A. Meneely, L. Williams, and J. Osborne, "Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities," IEEE Trans. Softw. Eng., vol. 37, no. 6, pp. 772–787, 2011.
[32] E. S. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, 1st ed. O'Reilly Media, 2010.

