Data Mining for Software Engineering

Tao Xie, North Carolina State University, www.csc.ncsu.edu/faculty/xie
Jian Pei, Simon Fraser University, www.cs.sfu.ca/~jpei

An up-to-date version of this tutorial is available at http://ase.csc.ncsu.edu/dmse/dmse.pdf

Outline
• Introduction
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Case studies
10 minute break
• Conclusions

T. Xie and J. Pei: Data Mining for Software Engineering

Introduction
• A large amount of data is produced in software development
  – Data from software repositories
  – Data from program executions
• Data mining techniques can be used to analyze software engineering data
  – Understand software artifacts or processes
  – Assist software engineering tasks

Examples
• Data in software development
  – Programming: versions of programs
  – Testing: execution traces
  – Deployment: error/bug reports
  – Reuse: open source packages
• Software development needs data analysis
  – How should I use this Java class?
  – Where are the bugs?
  – How to implement a typical functionality?

Overview
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance
• Data mining techniques: association/patterns, classification, clustering
• Software engineering data: code bases, change history, program states, structural entities, bug reports

Software Engineering Tasks
• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Software Categorization – Why?
• SourceForge hosts 70,000+ software systems
  – How can one find the software needed?
  – How can developers collaborate effectively?
• Why software categorization?
  – SourceForge categorizes software according to its primary function (editors, databases, etc.)
    • Software foundries – related software
  – Keep developers informed about related software
    • Learn the “best practice”
    • Promote software reuse
[Kawaguchi et al. 04]

Software Categorization – What?
• Organize software systems into categories
  – Software systems in each category share a broadly similar theme
  – A software system may belong to one or multiple categories
• What are the categories?
  – Defined by domain experts manually
  – Discovered automatically
• Example system: MUDABlue
[Kawaguchi et al. 04]

API Usage
• How should an API be used correctly?
  – An API may serve multiple functionalities
  – Different styles of API usage
• “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05]
  – Can we synthesize jungloid code fragments automatically?
  – Given a simple query describing the desired code in terms of input and output types, return a code segment
• “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie & Pei 06]

Software Framework Reuse
• Issues in reusing software frameworks
  – Which components should I use?
  – What is the right way to use them?
  – Multiple components are often used in combination, e.g., Smalltalk’s Model/View/Controller
• Frequent patterns help
  – Specifically, inheritance information is important
  – Example: most application classes inheriting from library class Widget tend to override its member function paint(); most application classes instantiating library class Painter and calling its member function begin() also call end()
[Michail 99/00]

Source Code Search
• What does the current code segment look like in previous versions?
  – How has it changed across versions?
• Using standard search tools, e.g., grep?
  – Source code may not be well documented
  – The code may have changed
• Can we have source-code-friendly search engines?
  – E.g., www.koders.com, corp.krugle.com, demo.spars.info

Software Engineering Tasks
• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Inferring Errors from Source Code
• A system must follow some correctness rules
  – Unfortunately, the rules are often undocumented or specified in an ad hoc manner
• Deriving the rules requires a lot of a priori knowledge
• Can data mining detect some errors without knowing the rules?
[Engler et al. 01]

Locating Matching Method Calls
• Many bugs are due to unmatched method calls
  – E.g., failing to call free() to deallocate a data structure
  – One-line code changes: many bugs can be fixed by changing only one line of source code
• Problem: how to find highly correlated pairs of method calls
[Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]

Detecting Copy-Paste-Related Bugs
• Copy-pasted code is common in large systems
  – Code reuse
• Prone to bugs
  – E.g., identifiers are not changed consistently
• How to detect copy-pasted code?
  – How to scale up to large software?
  – How to handle minor modifications?
[Li et al. 04]

Software Engineering Tasks
• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Mining Specifications
• Specifications are very useful for checking program behavior during testing
• Major obstacle: specifications are often unavailable
  – Example: what is the right way to use the Vector class or socket API?
• How can data mining help?
  – If a usage pattern holds in well-tested programs (i.e., in their executions), it is likely valid
[Ammons et al. 02]

Mining Object Behavior
• How do we know whether an object behaves correctly during program execution?
[Figure: behavior model for the Java Vector class. Picture from “Mining object behavior with ADABU”, Dallmeier et al. WODA 06]

Mining API Protocols
[Figure: a code example and a specification]
Does the above code follow the correct socket API protocol?
[Ammons et al. 02]

Software Engineering Tasks
• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Analyzing Bug Reports
• Most open source software development projects have bug repositories
  – Report and track problems and potential enhancements
  – Valuable information for both developers and users
• Bug repositories contain duplicate bug reports
  – Duplicate detection
• Bug report assignment is time-consuming
  – Developer recommendation
[Anvik et al. 06]

Fault Localization
• Running tests produces execution traces
  – Some tests fail and the other tests pass
• Given many execution traces generated by tests, can we suggest likely faulty statements? [Liblit et al. 03/05, Liu et al. 05]
  – Some traces may lead to program failures
• Multiple faults complicate the situation

Stabilizing Buggy Applications
• Users report bugs in a program. Can those bug reports be used to prevent the program from crashing?
  – When a user attempts an action that led to errors before, a warning should be issued
• Given a program state S and an event e, predict whether e likely results in a bug
  – Positive samples: past bugs
  – Negative samples: “not bug” reports
[Michail&Xie 05]

Software Engineering Tasks
• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Guiding Software Changes
• Programmers start changing some locations
  – Suggest locations that other programmers have changed together with this location
    E.g., “Programmers who changed this function also changed …”
• Mine association rules from change histories
  – Coarse-granular entities: directories, modules, files
  – Fine-granular entities: methods, variables, sections
[Zimmermann et al. 04, Ying et al. 04]
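The "programmers who changed this also changed" idea can be sketched as pairwise association rules over change transactions. This is a minimal illustration, not the method of the cited tools: the function name `mine_cochange_rules`, the thresholds, and the sample history are all assumptions made up for the example.

```python
from collections import Counter
from itertools import permutations

def mine_cochange_rules(transactions, min_support=2, min_confidence=0.6):
    """Return rules (a, b, support, confidence): 'changed a => also changed b'."""
    single = Counter()  # number of transactions touching each entity
    pair = Counter()    # number of transactions touching both a and b
    for changed in transactions:
        entities = set(changed)
        single.update(entities)
        pair.update(permutations(sorted(entities), 2))
    rules = []
    for (a, b), support in pair.items():
        confidence = support / single[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    # Strongest rules first; ties broken by name for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

# Hypothetical change history: each set is one transaction.
transactions = [
    {"parse()", "Lexer.java"},
    {"parse()", "Lexer.java", "README"},
    {"parse()", "emit()"},
    {"README"},
]
rules = mine_cochange_rules(transactions)
for a, b, support, confidence in rules:
    print(f"changed {a} => also changed {b} (support={support}, confidence={confidence:.2f})")
```

Rules are directional: here every change to Lexer.java also touched parse(), but only two of the three changes to parse() touched Lexer.java.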

Aspect Mining
• Discover cross-cutting concerns that can potentially be consolidated into one place (an aspect in aspect-oriented programs)
  – E.g., logging, timing, communication
• Mine recurring execution patterns
  – Event traces [Breu&Krinke 04, Tonella&Ceccato 04]
  – Source code [Shepherd et al. 05]

Software Engineering Data
• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Code Entities

Source data | Mined info
Variable names and function names | Software categories [Kawaguchi et al. 04]
Statement seq in a basic block | Copy-paste code [Li et al. 04]
Set of functions, variables, and data types within a C function | Programming rules [Li&Zhou 05]
Sequence of methods within a Java method | API usages [Xie&Pei 05]
API method signatures | API Jungloids [Mandelin et al. 05]

Relationships btw Code Entities
• Mine framework reuse patterns [Michail 00]
  – Membership relationships
    • A class contains member functions
  – Reuse relationships
    • Class inheritance/instantiation
    • Function invocations/overriding
• Mine software plagiarism [Liu et al. 06]
  – Program dependence graphs
[Michail 99/00] http://codeweb.sourceforge.net/ for C++

Software Engineering Data
• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Concurrent Versions System (CVS) Comments
[Chen et al. 01] http://cvssearch.sourceforge.net/

CVS Comments
• cvs log – displays all revisions of each file and their comments
    RCS file: /repository/file.h,v
    Working file: file.h
    head: 1.5
    ...
    description:
    ----------------------------
    Revision 1.5
    Date: ...
    cvs comment ...
    ----------------------------
    ...
• cvs diff – shows differences between different versions of a file
    …
    RCS file: /repository/file.h,v
    …
    9c9,10
    < old line
    ---
    > new line
    > another new line
• Used for program understanding
[Chen et al. 01] http://cvssearch.sourceforge.net/

Code Version Histories
• CVS provides file versioning
  – Group individual per-file changes into individual transactions: checked in by the same author with the same check-in comment close in time
• CVS manages only files and line numbers
  – Associate syntactic entities with line ranges
• Filter out long transactions not corresponding to meaningful atomic changes
  – E.g., feature requests, bug fixes, branch merging
• Used to mine co-change entities
[Ying et al. 04] [Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/
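The grouping of per-file CVS changes into transactions can be sketched as below. This is only an illustration under stated assumptions: the `FileChange` record, the 200-second window, and the 30-file cap are hypothetical choices for the example, not values given on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileChange:
    path: str
    author: str
    comment: str
    timestamp: int  # seconds since some epoch

def group_transactions(changes, window=200, max_files=30):
    """Group per-file changes by (author, comment) when they are close in time."""
    transactions = []
    latest = {}  # (author, comment) -> most recent transaction for that key
    for change in sorted(changes, key=lambda c: c.timestamp):
        key = (change.author, change.comment)
        current = latest.get(key)
        if current and change.timestamp - current[-1].timestamp <= window:
            current.append(change)  # same author and comment, close in time
        else:
            current = [change]      # start a new transaction
            latest[key] = current
            transactions.append(current)
    # Drop very long transactions, which rarely reflect one atomic change.
    return [t for t in transactions if len(t) <= max_files]

# Hypothetical check-in log.
changes = [
    FileChange("a.c", "ann", "fix #1", 100),
    FileChange("b.c", "ann", "fix #1", 150),
    FileChange("c.c", "bob", "refactor", 160),
    FileChange("a.c", "ann", "fix #1", 1000),
]
transactions = group_transactions(changes)
paths = [[c.path for c in t] for t in transactions]
print(paths)
```

The last change by ann carries the same comment but falls outside the time window, so it starts a separate transaction.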

Software Engineering Data
• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Method-Entry/Exit States
• State of an object
  – Values of transitively reachable fields
• Method-entry state
  – Receiver-object state, method argument values
• Method-exit state
  – Receiver-object state, updated method argument values, method return value
• Used to mine object behavior
[Ernst et al. 02] http://pag.csail.mit.edu/daikon/
[Xie&Notkin 04/05] [Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/

Other Profiled Program States
• Values of variables at certain code locations [Hangal&Lam 02]
  – Object/static field read/write
  – Method-call arguments
  – Method returns
• Sampled predicates on values of variables [Liblit et al. 03/05]
• Used to detect or locate bugs
[Hangal&Lam 02] http://diduce.sourceforge.net/
[Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/

Software Engineering Data
• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Executed Structural Entities
• Executed branches/paths, def-use pairs
• Executed function/method calls
  – Group methods invoked on the same object
• Profiling options
  – Execution hit vs. count
  – Execution order (sequences)
• Used to locate bugs
[Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related

Software Engineering Data
• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Processing Bug Reports
[Figure: bug report workflow. Labels: User, Bug, Report, Bug Repository, Triager, Developer; resolutions: Duplicate, Works For Me, Invalid, Won’t Fix]
Adapted from Anvik et al.’s slides

Sample Bugzilla Bug Report
[Figure: bug report image overlaid with the triage questions: Assigned To? Duplicate? Reproducible?]
Bugzilla: open source bug tracking tool, http://www.bugzilla.org/
[Anvik et al. 06] http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html
Adapted from Anvik et al.’s slides

Eclipse Bug Data
• Defect counts are listed as counts at the plug-in, package, and compilation-unit levels.
• The value field contains the actual number of pre- ("pre") and post-release defects ("post").
• The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits").
[Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/

Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Association Rules
• (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) → buy(X, beer)
  – Dads taking care of babies on weekends drink beer
• Itemsets should be frequent
  – So the rule applies widely
• Rules should be confident
  – With strong prediction capability

Various Types of Rules
• Boolean vs. quantitative associations
  – buys(x, “SQLServer”) ∧ buys(x, “DMBook”) → buys(x, “DM Software”) [0.2%, 60%]
  – age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multi-dimensional associations
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?

A Simple Case
• Finding highly correlated method call pairs
• Confidence of pairs helps
  – Conf(<a,b>) = support(<a,b>) / support(<a>)
• Check the revisions (fixes to bugs) and find the pairs of method calls whose confidence is improved dramatically by frequently added fixes
  – Those are the matching method call pairs that programmers may often violate
[Livshits&Zimmermann 05]
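The confidence of a method-call pair, support of the pair divided by support of its first call, can be computed as in the sketch below. The function name `pair_confidence` and the before/after snapshots are hypothetical; each set stands for the calls found in one method body.

```python
def pair_confidence(call_sets, a, b):
    """conf(<a,b>) = support(<a,b>) / support(<a>), counted over call sets."""
    support_a = sum(1 for calls in call_sets if a in calls)
    support_ab = sum(1 for calls in call_sets if a in calls and b in calls)
    return support_ab / support_a if support_a else 0.0

# Hypothetical snapshots of four methods before and after a series of bug fixes.
before_fixes = [{"open", "close"}, {"open"}, {"open", "close"}, {"open"}]
after_fixes = [{"open", "close"}] * 4

conf_before = pair_confidence(before_fixes, "open", "close")
conf_after = pair_confidence(after_fixes, "open", "close")
print(conf_before, conf_after)
```

A pair whose confidence jumps after fixes, as in this toy data, is a candidate matching pair that programmers tend to forget.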

Conflicting Patterns
• 999 out of 1000 times spin_lock is followed by spin_unlock
  – The single time that spin_unlock does not follow is likely an error
• We can detect an error without knowing the correctness rules
[Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]
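The spin_lock example can be turned into a small deviation check: if a rule holds almost everywhere, report the few places where it does not. This is a simplified sketch, not any cited tool's algorithm; `find_violations`, the 0.9 threshold, and the synthetic functions are assumptions.

```python
def find_violations(functions, first, second, min_confidence=0.9):
    """Return functions that call `first` with no later call to `second`,
    but only when the rule first => second holds often enough to be trusted."""
    callers = [name for name, calls in functions.items() if first in calls]
    if not callers:
        return []
    followed = {
        name for name in callers
        if second in functions[name][functions[name].index(first) + 1:]
    }
    confidence = len(followed) / len(callers)
    if confidence < min_confidence:
        return []  # the rule itself is too weak to justify reporting deviations
    return [name for name in callers if name not in followed]

# Synthetic code base: 999 functions pair the calls, one does not.
functions = {f"f{i}": ["spin_lock", "work", "spin_unlock"] for i in range(999)}
functions["buggy"] = ["spin_lock", "work"]

violations = find_violations(functions, "spin_lock", "spin_unlock")
print(violations)
```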

Detect Copy-Paste Code
• Apply closed sequential pattern mining techniques
• Customizing the techniques
  – A copy-paste segment typically does not have big gaps: use a maximum gap threshold to control this
  – Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns
  – Use small copy-pasted segments to form larger ones
  – Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps
[Li et al. 04]

Find Bugs in Copy-Pasted Segments
• For two copy-pasted segments, are the modifications consistent?
  – Identifier a in segment S1 is changed to b in segment S2 3 times, but remains unchanged once: likely a bug
  – The heuristic may not be right all the time
• The lower the (non-zero) unchanged rate of an identifier, the more likely there is a bug
[Li et al. 04]
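The unchanged-rate heuristic can be sketched as follows, assuming the two segments have already been aligned so that identifiers at the same position correspond. The function names and the 0.4 threshold are illustrative assumptions.

```python
from collections import defaultdict

def unchanged_ratios(segment1, segment2):
    """For each identifier in segment1, the fraction of its occurrences
    that stay unchanged at the same position in segment2."""
    total = defaultdict(int)
    unchanged = defaultdict(int)
    for old, new in zip(segment1, segment2):
        total[old] += 1
        if old == new:
            unchanged[old] += 1
    return {ident: unchanged[ident] / total[ident] for ident in total}

def suspicious_identifiers(segment1, segment2, threshold=0.4):
    """Identifiers that were mostly, but not always, renamed."""
    ratios = unchanged_ratios(segment1, segment2)
    return sorted(i for i, r in ratios.items() if 0 < r <= threshold)

# 'a' is renamed to 'b' three times and forgotten once; 'i' is never renamed.
s1 = ["a", "i", "a", "a", "a", "i"]
s2 = ["b", "i", "b", "a", "b", "i"]
ratios = unchanged_ratios(s1, s2)
suspects = suspicious_identifiers(s1, s2)
print(ratios, suspects)
```

A ratio of 0 (always renamed) or 1 (never renamed) is consistent, which is why only the low non-zero ratios are flagged.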

Mining Rules in Traces
• Mine association rules or sequential patterns S → F, where S is a statement and F is the status of program failure
• The higher the confidence, the more likely S is faulty or related to a fault
• Using only one statement on the left side of the rule can be misleading, since a fault may be caused by a combination of statements
  – Frequent patterns can be used to improve this
[Denmat et al. 05]
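The confidence of a rule S → F can be estimated from test runs as the fraction of runs executing S that failed. The sketch below uses a hypothetical trace format (a set of executed statement ids plus a pass/fail flag) and a made-up function name.

```python
from collections import Counter

def failure_confidence(runs):
    """conf(S -> F) = (#failing runs executing S) / (#runs executing S)."""
    executed = Counter()
    failed = Counter()
    for statements, run_failed in runs:
        executed.update(statements)
        if run_failed:
            failed.update(statements)
    return {s: failed[s] / executed[s] for s in executed}

# Hypothetical runs: (statements executed, whether the run failed).
runs = [
    ({"s1", "s2"}, False),
    ({"s1", "s3"}, True),
    ({"s1", "s2", "s3"}, True),
    ({"s1"}, False),
]
confidence = failure_confidence(runs)
ranking = sorted(confidence, key=lambda s: (-confidence[s], s))
print(ranking)
```

Here s3 ranks first because every run that executed it failed; as the slide warns, single-statement rules like these can mislead when a fault needs a combination of statements.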

Mining Emerging Patterns in Traces
• A method executed only in failing runs is likely to point to the defect
  – Comparing the coverage of passing and failing program runs helps
• Mining patterns frequent in failing program runs but infrequent in passing program runs
  – Sequential patterns may be used
[Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]

Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Classification: A 2-step Process
• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – Classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate accuracy of the model using an independent test set
  – Acceptable accuracy → apply the model to classify tuples with unknown class labels

GUI-Application Stabilizer
• Given a program state S and an event e, predict whether e likely results in a bug
  – Positive samples: past bugs
  – Negative samples: “not bug” reports
• A k-NN based approach
  – Consider the k closest cases reported before
  – Compare Σ 1/d for bug cases and not-bug cases, where d is the distance between the current state and a reported state
  – If the current state is closer to the bug cases, predict a bug
[Michail&Xie 05]
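The Σ 1/d comparison can be sketched with a distance-weighted k-NN vote. The state encoding (a tuple of dialog, selection, and event), the Hamming distance, and the small `eps` that guards against division by zero are all assumptions for illustration; the cited tool may measure distance differently.

```python
def hamming(state1, state2):
    """Number of positions at which two equally long state tuples differ."""
    return sum(x != y for x, y in zip(state1, state2))

def predict_bug(current, reports, k=3, eps=1e-9):
    """Weighted k-NN vote: compare the sum of 1/d over bug and not-bug neighbors."""
    nearest = sorted(reports, key=lambda r: hamming(current, r[0]))[:k]
    bug_score = sum(1 / (hamming(current, s) + eps) for s, is_bug in nearest if is_bug)
    ok_score = sum(1 / (hamming(current, s) + eps) for s, is_bug in nearest if not is_bug)
    return bug_score > ok_score

# Hypothetical past reports: ((dialog, selection, event), is_bug).
reports = [
    (("prefs", "none", "click_ok"), True),
    (("prefs", "text", "click_ok"), True),
    (("main", "text", "copy"), False),
    (("main", "none", "copy"), False),
]
risky = predict_bug(("prefs", "none", "click_apply"), reports)
safe = predict_bug(("main", "text", "paste"), reports)
print(risky, safe)
```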

Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
[Figure: scatter plot showing Cluster 1, Cluster 2, and outliers]

Clustering and Categorization
• Software categorization
  – Partitioning software systems into categories
• Categories predefined: a classification problem
• Categories discovered automatically: a clustering problem

Software Categorization - MUDABlue
• Understanding source code
  – Use latent semantic analysis (LSA) to find similarity between software systems
  – Use identifiers (e.g., variable names, function names) as features
    • “gtk_window” represents some window
    • The source code near “gtk_window” contains some GUI operation on the window
• Extracting categories using frequent identifiers
  – “gtk_window”, “gtk_main”, and “gpointer” → GTK-related software system
  – Use LSA to find relationships between identifiers
[Kawaguchi et al. 04]

Data Mining Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Sampling Programs
• During the execution of a program, each execution of a statement is sampled with some probability
  – Performance slowdown unnoticeable by users
  – Many traces can be collected at many user sites
  – Sampling large programs becomes feasible
• Bug isolation by analyzing samples
  – Correlation between specific statements or function calls and program errors/crashes
[Liblit et al. 03/05]

Outline
• Introduction
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Case studies
• Conclusions

Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
• DynaMine: mining error/usage patterns from code revision histories [Livshits&Zimmermann 05]
• BugTriage: learning bug assignments from historical bug reports [Anvik et al. 06]

Motivation
• APIs in class libraries or frameworks are widely reused in software development.
• An example programming task: “instrument the bytecode of a Java class by adding an extra method to the class”
  – org.apache.bcel.generic.ClassGen
    public void addMethod(Method m)

First Try: ClassGen Java API Doc

addMethod
    public void addMethod(Method m)
    Add a method to this class.
    Parameters: m - method to add

Second Try: Code Search Engine


MAPO Approach
• Analyze code segments relevant to a given API and disclose the inherent usage patterns
  – Input: an API characterized by a method, class, or package
  – Code search engine: used to search relevant source files from open source repositories
  – Frequent sequence miner: use BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
  – Output: a short list of frequent API usage patterns related to the API

Sequence Extraction
• Method sequences: extracted from Java source files returned from code search engines

Source code:
    public void generateStubMethod(ClassGen c) {
        InstructionList il = new InstructionList();
        MethodGen m = genFromISList(il);
        m.setMaxLocals();
        m.setMaxStack();
        c.addMethod(m.getMethod());
        System.out.println(“…”);
        …
    }

Call sequence:
    InstructionList.<init>()
    genFromISList(InstructionList)
    MethodGen.setMaxStack()
    MethodGen.setMaxLocals()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    PrintStream.println(String)
    …

Sequence Preprocessing
• Remove common Java library calls
• Inline callees of the same class
• Remove sequences that contain no query words: ClassGen and addMethod
[Figure: the generateStubMethod source code and call sequence from the Sequence Extraction slide]

Frequent Seq Postprocessing
• Remove sequences that contain no query words: ClassGen and addMethod
• Compress consecutive calls of the same method into one, e.g., abbba → aba
• Remove duplicate frequent sequences after the compression, e.g., aba, aba → aba
• Reduce a seq if it is a subseq of another, e.g., aba, abab → abab
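The four postprocessing steps can be sketched as below, using single letters for method calls as the slide's examples do. The function names are hypothetical, and "contains a query word" is read here as "some call contains some query word", which is an assumption about the intended filter.

```python
def compress(seq):
    """Collapse consecutive repeats of the same call: abbba -> aba."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return tuple(out)

def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    remaining = iter(long)
    return all(call in remaining for call in short)

def postprocess(sequences, query_words):
    # 1. Keep only sequences that mention at least one query word.
    kept = [s for s in sequences if any(w in call for call in s for w in query_words)]
    # 2 and 3. Compress repeats, then drop duplicates while preserving order.
    unique = list(dict.fromkeys(compress(s) for s in kept))
    # 4. Drop any sequence that is a subsequence of another one.
    return [s for s in unique
            if not any(s != t and is_subsequence(s, t) for t in unique)]

# Each string is a sequence of one-letter calls; "a" plays the query word.
result = postprocess(["abbba", "aba", "abab", "cd"], query_words={"a"})
print(result)
```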

Tool Architecture
[Figure: tool architecture, with a code search engine (e.g., koders.com)]

Sample Tool Output
    InstructionList.<init>()
    InstructionFactory.createLoad(Type, int)
    InstructionList.append(Instruction)
    InstructionFactory.createReturn(Type)
    InstructionList.append(Instruction)
    MethodGen.setMaxStack()
    MethodGen.setMaxLocals()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    InstructionList.dispose()

Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
• DynaMine: mining error/usage patterns from code revision histories [Livshits&Zimmermann 05]
• BugTriage: learning bug assignments from historical bug reports [Anvik et al. 06]

Co-Change Pattern
• Pattern formed by things that are frequently changed together
• E.g., co-added method calls (addPartListener and removePartListener below)

    public void createPartControl(Composite parent) {
        ...
        // add listener for editor page activation
        getSite().getPage().addPartListener(partListener);
    }

    public void dispose() {
        ...
        getSite().getPage().removePartListener(partListener);
    }

Adapted from Livshits et al.’s slides

DynaMine
[Figure: DynaMine pipeline]
• Revision history mining: mine CVS histories, patterns, rank and filter
• Dynamic analysis: instrument relevant method calls, run the application, post-process
• Reporting: usage patterns, error patterns, unlikely patterns; report patterns, report bugs
Adapted from Livshits et al.’s slides

Mining Patterns
[Figure: the DynaMine pipeline (revision history mining, dynamic analysis, reporting), repeated]
Adapted from Livshits et al.’s slides

Mining Method Calls
Method calls per file revision:
    Foo.java 1.12: o1.addListener(), o1.removeListener()
    Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println()
    Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
    Qux.java 1.41: o4.addListener(), System.out.println()
    Qux.java 1.42: o4.removeListener()
Adapted from Livshits et al.’s slides

Finding Pairs
Method calls per file revision, with the number of pairs found:
    Foo.java 1.12: o1.addListener(), o1.removeListener() (1 pair)
    Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println() (1 pair)
    Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next() (2 pairs)
    Qux.java 1.41: o4.addListener(), System.out.println() (0 pairs)
    Qux.java 1.42: o4.removeListener() (0 pairs)
Adapted from Livshits et al.’s slides
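One reading that reproduces the pair counts on this slide is to pair up distinct methods invoked on the same receiver within one revision. The sketch below implements that reading; it is an assumption about how the slide counts pairs, and `co_added_pairs` is a made-up helper name.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_added_pairs(added_calls):
    """Pairs of distinct methods invoked on the same receiver in one revision."""
    by_receiver = defaultdict(set)
    for call in added_calls:
        receiver, _, method = call.rpartition(".")
        by_receiver[receiver].add(method)
    pairs = []
    for methods in by_receiver.values():
        pairs.extend(combinations(sorted(methods), 2))
    return sorted(pairs)

# The revisions listed on the slide.
revisions = {
    "Foo.java 1.12": ["o1.addListener()", "o1.removeListener()"],
    "Bar.java 1.47": ["o2.addListener()", "o2.removeListener()", "System.out.println()"],
    "Baz.java 1.23": ["o3.addListener()", "o3.removeListener()", "list.iterator()",
                      "iter.hasNext()", "iter.next()"],
    "Qux.java 1.41": ["o4.addListener()", "System.out.println()"],
    "Qux.java 1.42": ["o4.removeListener()"],
}
pair_counts = {rev: len(co_added_pairs(calls)) for rev, calls in revisions.items()}
support = Counter(p for calls in revisions.values() for p in co_added_pairs(calls))
print(pair_counts)
print(support.most_common())
```

Summing pairs across revisions gives the support count that the ranking step later relies on.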

Finding Patterns
Find “frequent itemsets” (with Apriori)
[Figure: four transactions, each containing o.enterAlignment(), o.exitAlignment(), o.redoAlignment(), iter.hasNext(), iter.next()]
∴ {enterAlignment(), exitAlignment(), redoAlignment()}
Adapted from Livshits et al.’s slides

Ranking Patterns
Method calls per file revision:
    Foo.java 1.12: o1.addListener(), o1.removeListener()
    Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println()
    Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
    Qux.java 1.41: o4.addListener(), System.out.println()
    Qux.java 1.42: o4.removeListener() (This is a fix!)
• Support count = number of occurrences of a pattern
• Confidence = strength of a pattern, P(A|B)
• Rank removeListener() patterns higher
Adapted from Livshits et al.’s slides

Dynamic Validation
[Figure: the DynaMine pipeline (revision history mining, dynamic analysis, reporting), repeated]
Adapted from Livshits et al.’s slides

Pattern classification
Post-process: v validations, e violations
• Usage patterns: e < v/10
• Error patterns: v/10 <= e <= 2v
• Unlikely patterns: otherwise
Adapted from Livshits et al.’s slides
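The thresholds on validations v and violations e translate directly into a small classifier. This sketch assumes the first, partly lost condition on the slide is the complement of the second (e < v/10); `classify_pattern` is a hypothetical name.

```python
def classify_pattern(v, e):
    """Classify a mined pattern from v validations and e violations."""
    if e < v / 10:
        return "usage"     # almost always obeyed at run time
    if e <= 2 * v:
        return "error"     # often obeyed, but violated noticeably
    return "unlikely"      # violated far more often than it holds

examples = {(100, 5): None, (100, 10): None, (100, 50): None, (10, 30): None}
for v, e in examples:
    examples[(v, e)] = classify_pattern(v, e)
print(examples)
```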

Results

              | JEDIT   | ECLIPSE
since         | 2000    | 2001
developers    | 92      | 112
lines of code | 700,000 | 2,900,000
revisions     | 40,000  | 400,000

Total: 57 patterns
Adapted from Livshits et al.’s slides

Case Studies
• MAPO: mining API usages from open source repositories [Xie&Pei 06]
• DynaMine: mining error/usage patterns from code revision histories [Livshits&Zimmermann 05]
• BugTriage: learning bug assignments from historical bug reports [Anvik et al. 06]

Assigning a Bug
• Many considerations
  – who has the expertise?
  – who is available?
• Not always an obvious or correct assignment
  – multiple developers may be suitable
  – difficult to know what the bug is about
Adapted from Anvik et al.’s slides

Assigning a Bug Today
[Figure: a bug report assigned to a single developer address]
Adapted from Anvik et al.’s slides

Recommending assignment
[Figure: a bug report with three recommended developer addresses]
Adapted from Anvik et al.’s slides

Overview of approach
Approach tuned using Eclipse and Firefox
[Figure: Resolved Bug Reports, Machine Learning Algorithm, Assignment Recommender, recommended developer addresses]
Adapted from Anvik et al.’s slides

Steps to the approach
1. Characterize the reports
2. Label the reports
3. Select the reports
4. Use a machine learning algorithm
Adapted from Anvik et al.’s slides

Step 1: Characterizing a report
• Based on two fields
  – textual summary
  – description
• Use text categorization approach
  – represent with a word vector
  – remove stop words
  – intra- and inter-document frequency
Adapted from Anvik et al.’s slides
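A word-vector representation with stop-word removal and intra-/inter-document frequency weighting can be sketched as a basic TF-IDF. The stop-word list, the whitespace tokenizer, and the sample reports are illustrative assumptions; the slide does not specify the exact weighting used.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "is", "on", "when", "in", "to", "and"}

def tfidf_vectors(docs):
    """One {word: weight} vector per document, weight = tf * log(N / df)."""
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            w: (count / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        })
    return vectors

# Hypothetical report summaries.
reports = ["Crash when saving the file", "Crash on startup", "Editor font is wrong"]
vectors = tfidf_vectors(reports)
print(vectors[0])
```

Words that appear in many reports, such as "crash" here, get lower weights than words specific to one report.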

Step 2: Labeling a report
• Must determine who really fixed it
  – “Assigned-to” field is not accurate
• Project-specific heuristics
  – simple
      If a report is FIXED, label with who marked it as fixed. (Eclipse)
      If a report is DUPLICATE, use the label of the report it duplicates. (Eclipse and Firefox)
  – complex
      If the report is FIXED and has attachments that are approved by a reviewer, then (Firefox)
      – If one submitter of patches, use their name.
      – If more than one submitter, choose name of who submitted the most patches.
      – If cannot determine submitters, label with the person assigned to the report.
  – unclassifiable
      Reports marked as WONTFIX are often resolved after discussion and developers reaching a consensus. (Firefox)
      – Unknown who would have fixed the bug
      – Report is labeled unclassifiable.

                 Eclipse   Firefox
Simple              5         4
Complex             2         1
Unclassifiable      1         4
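The example heuristics above could be sketched roughly as follows. The dictionary keys (`status`, `fixed_by`, `duplicate_of`, `has_approved_attachments`, `patch_submitters`, `assigned_to`) are hypothetical names chosen for this illustration, not the schema of any real bug tracker, and only the heuristics quoted on the slides are covered.

```python
from collections import Counter

UNCLASSIFIABLE = None  # marker for reports that cannot be labeled


def label_report(report, reports_by_id, project):
    """Return the developer label for a report, or UNCLASSIFIABLE.

    `report` is a dict with hypothetical keys; `project` is "eclipse" or
    "firefox". Duplicate chains are assumed to be acyclic.
    """
    status = report.get("status")
    if status == "DUPLICATE":
        # Eclipse and Firefox: reuse the label of the duplicated report.
        original = reports_by_id.get(report.get("duplicate_of"))
        if original is None:
            return UNCLASSIFIABLE
        return label_report(original, reports_by_id, project)
    if status == "FIXED" and project == "eclipse":
        # Eclipse: whoever marked the report as fixed.
        return report.get("fixed_by", UNCLASSIFIABLE)
    if status == "FIXED" and project == "firefox":
        if not report.get("has_approved_attachments"):
            return UNCLASSIFIABLE  # case not covered by the slides
        submitters = report.get("patch_submitters") or []
        if submitters:
            # One submitter, or the one who submitted the most patches.
            return Counter(submitters).most_common(1)[0][0]
        # Submitters cannot be determined: fall back to the assignee.
        return report.get("assigned_to", UNCLASSIFIABLE)
    # WONTFIX and anything else: unknown who would have fixed the bug.
    return UNCLASSIFIABLE


eclipse_reports = {
    1: {"status": "FIXED", "fixed_by": "alice"},
    2: {"status": "DUPLICATE", "duplicate_of": 1},
}
firefox_report = {
    "status": "FIXED",
    "has_approved_attachments": True,
    "patch_submitters": ["bob", "carol", "bob"],
    "assigned_to": "dave",
}
eclipse_label = label_report(eclipse_reports[2], eclipse_reports, "eclipse")
firefox_label = label_report(firefox_report, {}, "firefox")
```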


Step 3: Selecting the reports
• Exclude those with no label
• Include those of active developers
  – developer profiles

[Figure: two developer-profile charts; y-axis from 0 to 40, x-axis by month from Sep-04 to Apr-05]

Step 3: Selecting the reports

[Figure: developer-profile chart; y-axis from 0 to 40, x-axis by month from Sep-04 to Apr-05, annotated "3 reports / month"]
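A minimal sketch of the selection step, assuming that the "3 reports / month" annotation is an activity cutoff applied to a developer's average over the observed months. That reading, the function name, and the input layout are assumptions made for illustration.

```python
from collections import Counter


def select_reports(labeled_reports, months, min_per_month=3):
    """Keep labeled reports whose developer is active.

    `labeled_reports` is a list of (report_id, developer) pairs, where
    developer is None for reports without a label. A developer counts as
    active when they average at least `min_per_month` reports over
    `months` months (the averaging window is an assumption).
    """
    labeled = [(rid, dev) for rid, dev in labeled_reports if dev is not None]
    per_developer = Counter(dev for _, dev in labeled)
    active = {dev for dev, count in per_developer.items()
              if count / months >= min_per_month}
    return [(rid, dev) for rid, dev in labeled if dev in active]


# 24 reports by alice over 8 months (3 per month), one by bob, one unlabeled.
history = [(i, "alice") for i in range(24)] + [(100, "bob"), (101, None)]
selected = select_reports(history, months=8)
```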


Step 4: Use a ML algorithm
• Supervised Algorithms
  – Naïve Bayes
  – C4.5
  – Support Vector Machines
• Unsupervised Algorithms
  – Expectation Maximization
• Incremental Algorithms
  – Naïve Bayes
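To make the Naïve Bayes option concrete, here is a small multinomial Naïve Bayes recommender written from scratch. It is a sketch under stated assumptions (bag-of-words input, Laplace smoothing, a hypothetical class name), not the implementation used by Anvik et al. Because training only updates counts, it can also be fed reports one at a time, which is what makes the algorithm suitable for incremental use.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesRecommender:
    """Multinomial Naive Bayes over bag-of-words reports (Laplace smoothing)."""

    def __init__(self):
        self.report_counts = Counter()           # developer -> number of reports
        self.word_counts = defaultdict(Counter)  # developer -> word -> count
        self.vocabulary = set()

    def train(self, words, developer):
        """Add one labeled report; can be called incrementally."""
        self.report_counts[developer] += 1
        self.word_counts[developer].update(words)
        self.vocabulary.update(words)

    def recommend(self, words, k=3):
        """Return the k developers with the highest posterior score."""
        total_reports = sum(self.report_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for developer, n_reports in self.report_counts.items():
            counts = self.word_counts[developer]
            total_words = sum(counts.values())
            score = math.log(n_reports / total_reports)
            for word in words:
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[developer] = score
        return sorted(scores, key=scores.get, reverse=True)[:k]


model = NaiveBayesRecommender()
model.train(["editor", "crash", "save"], "alice")
model.train(["editor", "font", "render"], "alice")
model.train(["debugger", "breakpoint", "hang"], "bob")
top = model.recommend(["debugger", "hang", "step"], k=1)
```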


Evaluating Recommenders

\[
\text{Precision} = \frac{\#\text{ of relevant recommendations}}{\#\text{ of recommendations made}}
\]

\[
\text{Recall} = \frac{\#\text{ of relevant recommendations}}{\#\text{ of possibly relevant developers}}
\]
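The two measures can be computed per report as in the sketch below. The function name and the example developers are made up for illustration; how the set of "possibly relevant developers" is obtained is left outside the sketch.

```python
def precision_recall(recommended, relevant):
    """Precision and recall of one recommendation list.

    `recommended` is the list of developers proposed for a report and
    `relevant` is the set of developers considered able to fix it.
    """
    recommended = set(recommended)
    relevant = set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    recommended=["alice", "bob", "carol"],
    relevant={"alice", "carol", "dave", "erin"},
)
```

Here two of the three recommendations are relevant (precision 2/3), and two of the four relevant developers are recommended (recall 1/2).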


Precision vs. Recall
A small set of “right” developers (precision) more important than the set of all possible developers (recall)

[Figure: two bar charts, Precision and Recall (y-axis from 0% to 100%), comparing Multi. NB, C4.5 and SVM on Eclipse, Firefox and gcc]

Case Studies: Summary

Tools       SE Data          Mining algs              SE Tasks
MAPO        Code bases       Frequent seq miner       API usages
DynaMine    Code revisions   Frequent itemset miner   Error/usage patterns
BugTriage   Bug reports      Classifier               Bug assignments
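To give a feel for the frequent itemset mining paired with code revisions above, the sketch below counts pairs of method calls that are added together in several revisions. It is a deliberately simplified stand-in restricted to itemsets of size two, not DynaMine's actual miner, and the method names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=2):
    """Return pairs of items that co-occur in at least `min_support` transactions."""
    pair_counts = Counter()
    for items in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical method calls added in each revision.
revisions = [
    {"lock", "unlock", "log"},
    {"lock", "unlock"},
    {"open", "close", "log"},
    {"open", "close"},
]
patterns = frequent_pairs(revisions)
```

With a minimum support of two, only the pairs (lock, unlock) and (close, open) survive; pairs involving log occur once each and are dropped.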


Demand-Driven Or Not
• Any-gold mining
  – Examples: DynaMine, …
  – Advantages: Surface up only cases that are applicable
  – Issues: How much gold is good enough given the amount of data to be mined?
• Demand-driven mining
  – Examples: MAPO, BugTriage, …
  – Advantages: Exploit demands to filter out irrelevant information
  – Issues: What percentage of cases would work well?

Code vs. Non-Code
• Code/Programming Langs
  – Examples: MAPO, DynaMine, …
  – Advantages: Relatively stable and consistent representation
• Non-Code/Natural Langs
  – Examples: BugTriage, CVS/Code comments, emails, docs
  – Advantages: Common source of capturing programmers’ intentions
  – Issues: What project/context-specific heuristics to use?

Static vs. Dynamic
• Static Data: code bases, change histories
  – Examples: MAPO, DynaMine, …
  – Advantages: No need to set up exec environment; More scalable
  – Issues: How to reduce false positives?
• Dynamic Data: prog states, structural profiles
  – Examples: Spec discovery, …
  – Advantages: More-precise info
  – Issues: How to reduce false negatives? Where do tests come from?

Snapshot vs. Changes
• Code snapshot
  – Examples: MAPO, …
  – Advantages: Larger amount of available data
• Code change history
  – Examples: DynaMine, …
  – Advantages: Revision transactions encode more-focused entity relationships
  – Issues: How to group CVS changes into transactions?
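One heuristic often used for the grouping question is a sliding time window: per-file CVS changes by the same author with the same log message are merged while each follows the previous one closely in time. The slides do not prescribe this method; the window length, the tuple layout, and the function name below are assumptions made for illustration.

```python
def group_into_transactions(changes, window=200):
    """Group per-file CVS changes into transactions.

    `changes` is a list of (timestamp_seconds, author, message, path)
    tuples. Changes by the same author with the same log message are put
    in one transaction while each change follows the previous one within
    `window` seconds (a sliding window; the default is an assumption).
    """
    transactions = []
    last_seen = {}  # (author, message) -> (time of last change, transaction)
    for timestamp, author, message, path in sorted(changes):
        key = (author, message)
        previous = last_seen.get(key)
        if previous is not None and timestamp - previous[0] <= window:
            transaction = previous[1]
        else:
            transaction = []
            transactions.append(transaction)
        transaction.append(path)
        last_seen[key] = (timestamp, transaction)
    return transactions


changes = [
    (1000, "alice", "fix null check", "A.java"),
    (1030, "alice", "fix null check", "B.java"),
    (1040, "bob", "update docs", "README"),
    (5000, "alice", "fix null check", "C.java"),
]
transactions = group_into_transactions(changes)
```

In this toy history the first two changes form one transaction, while the later change with the same message falls outside the window and starts a new one.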


Characteristics in Mining SE Data
• Improve quality of source data: data preprocessing
  – MAPO: inlining, reduction
  – DynaMine: call association
  – BugTriage: labeling heuristics, inactive-developer removal
• Reduce uninteresting patterns: pattern postprocessing
  – MAPO: compression, reduction
  – DynaMine: dynamic validation
• Source data may not be sufficient
  – DynaMine: revision histories
  – BugTriage: historical bug reports

SE-Domain-Specific Heuristics are important

Conclusions
• Software development generates a large amount of different types of data
• Data mining and data analysis can help software engineering substantially
• Successful cases
  – What software engineering data can be mined?
  – What software engineering tasks can be helped?
  – How to conduct the mining

Challenges
• Complexity in software engineering
  – Specific data mining techniques are needed
• Software engineering processes are dynamic and user-centered
  – Interactive data mining
  – Visual data mining and analysis
  – Online, incremental mining

Questions?

Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources
