Data Mining in Resilient Identity Crime Detection

by

Chun Wei Clifton Phua, BBusSys(Hons), DipIT

Dissertation submitted by Chun Wei Clifton Phua in fulfilment of the requirements for the degree of

Doctor of Philosophy

Supervisors: Prof. Kate Smith-Miles and Assoc. Prof. Vincent Lee
Associate Supervisor: Dr. Ross Gayler

Clayton School of Information Technology
Monash University
December 2007

© Copyright by Chun Wei Clifton Phua 2007

Keywords: resilience, adaptivity, quality data, identity crime detection, credit application fraud detection, string and phonetic matching, communal detection, spike detection, data mining-based fraud detection, security, data stream mining, and anomaly detection

For my parents Chye Twee and Siok Moy

献给我敬爱的父母再对和惜美


Contents

List of Tables
List of Figures
Notation and Abbreviations
Abstract
Acknowledgments

1 Introduction
  1.1 Definitions of Identity Crime
    1.1.1 Credit Application Fraud
    1.1.2 Fraudster Attack Cycle
  1.2 Challenges for Data Mining-based Detection Systems
    1.2.1 Resilience
    1.2.2 Other Challenges
  1.3 Existing Detection System
  1.4 Objectives
  1.5 Contributions
  1.6 Outline

2 Data Mining-based Detection
  2.1 Commercial Interest
  2.2 Fraud Detection
    2.2.1 Overview
    2.2.2 Key Ideas
  2.3 Adversarial-related Detection
    2.3.1 Terrorism
    2.3.2 Financial Crime
    2.3.3 Computer Network Intrusion and Spam
  2.4 Identity Crime-related Detection
  2.5 Summary

3 Data and Measure
  3.1 Responsibility
  3.2 Identity Data
    3.2.1 Real Application DataSet (RADS)
    3.2.2 Name Datasets
  3.3 Evaluation Measure
  3.4 Summary

4 Name Detection
  4.1 Personal Names
  4.2 Related Work
  4.3 Algorithm Design
    4.3.1 Step 1: Name Authenticity
    4.3.2 Step 2: Name Order
    4.3.3 Step 3: Name Gender
  4.4 Experimental Design
  4.5 Results and Discussion
  4.6 Summary

5 Communal Detection
  5.1 Adaptive Whitelist
  5.2 Related Work
  5.3 Algorithm Design
    5.3.1 Step 1: Multi-attribute Link
    5.3.2 Step 2: Single-link Communal Detection
    5.3.3 Step 3: Single-link Average Previous Score
    5.3.4 Step 4: Multiple-links Score
    5.3.5 Step 5: Parameter's Value Change
    5.3.6 Step 6: Whitelist Change
  5.4 Experimental Design
  5.5 Results and Discussion
  5.6 Concluding Remarks

6 Spike Detection
  6.1 Adaptive Attributes
  6.2 Related Work
  6.3 Algorithm Design
    6.3.1 Step 1: Single-step Scaled Count
    6.3.2 Step 2: Single-value Spike Detection
    6.3.3 Step 3: Multiple-values Score
    6.3.4 Step 4: SD Attributes Selection
    6.3.5 Step 5: CD Attribute Weights Change
  6.4 Experimental Design
  6.5 Results and Discussion
  6.6 Summary

7 Conclusion
  7.1 Chapter Contributions
  7.2 Future Research Directions
    7.2.1 Graph Theory
    7.2.2 Utility Measures
    7.2.3 Web-based Identity Crime Detection
  7.3 Closing Remarks

Appendix A Glossary
Appendix B Attributes
Appendix C Name Verification DataSet (NVDS)
Appendix D Real Application DataSet (RADS) Fraud Patterns
Appendix E Parameter Values
Appendix F CD F-Measures on Sets b and c
Appendix G Monthly F-measures on Experiments a2 and a4
Appendix H Organisations' F-measures on Experiments a2 and a4
Appendix I CD and SD Visualisations
Vita

List of Tables

1.1 Contributions to credit application fraud detection
3.1 Confusion matrix
3.2 Evaluation measures
4.1 Name Detection (ND) algorithm
5.1 Communal Detection (CD) algorithm
5.2 CD experimental design
5.3 Adaptive CD experimental design
6.1 Spike Detection (SD) algorithm
6.2 SD best attributes experimental design
6.3 SD and strengthened CD experimental design

List of Figures

1.1 Resilient credit application fraud detection system outline
2.1 Data mining-based detection overview
3.1 Daily application volume for two months
3.2 Fraud percentage across months
3.3 Daily fraud percentage for two months
4.1 Name algorithms' time
4.2 Name authenticity F-measures
4.3 Name order F-measures
4.4 Name gender F-measures
5.1 Communal Detection (CD) F-measures on set a
5.2 Monthly F-measures on experiment a1
5.3 Organisations' F-measures on experiment a1
5.4 Adaptive CD F-measures on set d
6.1 Spike Detection (SD) F-measures on set e
6.2 SD F-measures on set f
6.3 SD attribute weights on experiments f2, f3, and f4
C.1 Name Verification DataSet (NVDS) region
C.2 NVDS order, gender, culture
D.1 Ratio of fraud percentage to average fraud percentage by hour
D.2 Ratio of fraud percentage to average fraud percentage by state
D.3 Top forty postcodes by ratio of fraud percentage to average fraud percentage
D.4 Ratio of fraud percentage to average fraud percentage by organisation
D.5 Top ten organisations by ratio of fraud percentage to average fraud percentage
F.1 CD F-measures on set b
F.2 CD F-measures on set c
G.1 Monthly F-measures on experiment a2
G.2 Monthly F-measures on experiment a4
H.1 Organisations' F-measures on experiment a2
H.2 Organisations' F-measures on experiment a4
I.1 CD visualisation of known fraud application links
I.2 CD visualisation of known fraud attribute links
I.3 SD visualisation of all attributes
I.4 SD visualisation of attribute sparsity

Notation and Abbreviations

General
RADS: Real Application DataSet.
tp: number of true positives.
fp: number of false positives.
fn: number of false negatives.
tn: number of true negatives.
X: number of decision thresholds.

Name Detection
ND: Name Detection.
NVDS: Name Verification DataSet.
a_{h,name}: NVDS name.
â_{h,name}: encoded NVDS name.
c_{h,order}: NVDS order label.
c_{h,gender}: NVDS gender label.
c_{h,culture}: NVDS culture label.
NDS: Name DataSet.
a_{i,name}: NDS name.
â_{i,name}: encoded NDS name.
c_{i,fraud}: NDS fraud label.
c_{i,order}: NDS order label.
c_{i,gender}: NDS gender label.
T_{similarity}: string similarity threshold between two values.
a_{i,firstname}: current application's first name.
a_{i,lastname}: current application's last name.
a_{i,name-authenticity}: derived authenticity value.
a_{i,name-order}: derived order value.
a_{i,name-gender}: derived gender value.

Communal Detection
CD: Communal Detection.
G: overall continuous stream.
g_x: current mini-discrete stream.
x: fixed interval of the current month, fortnight, or week in the year.
p: variable number of micro-discrete streams in a mini-discrete stream.
u_{x,y}: current micro-discrete stream.
y: fixed interval of the current day, hour, minute, or second.
q: variable number of applications in a micro-discrete stream.
v_i: unscored current application.
N: number of attributes.
a_{i,k}: current value.
W: moving window of previous applications.
v_j: scored previous application.
a_{j,k}: previous value.
Ω_{x,y}: short-term current average score.
δ_{x-1}: long-term previous average links.
δ_{x,y-1}: short-term previous average links.
δ_{x,y}: short-term current average links.

Spike Detection
SD: Spike Detection.
t: number of steps in the moving window.
θ: time difference filter.
a_{i,j}: single-attribute match between current value and previous values.
s_τ(a_{i,k}): scaled matches in each step.
L(a_{i,k}): set of previous values within each step which the current value matches.
κ: number of values in each step.
S(a_{i,k}): current value score.


Abstract

Identity crime is prominent, prevalent, and costly. To combat identity crime, data mining searches for patterns in a principled fashion. These patterns can be highly indicative of early symptoms of identity crime, especially synthetic identity fraud. Credit application fraud is a specific case of identity crime which utilises duplicates: personal identity details which are worded exactly the same as, or slightly differently from, the current application. This thesis argues that each successful credit application fraud pattern is represented by a sudden and sharp spike in duplicates within a short time, relative to the established baseline level.

There are two existing, ad-hoc, non-data mining layers of defence against credit application fraud: business rules and scorecards, and known fraud matching. The main objective of this research is to achieve resilience by adding three new, real-time, data mining-based layers to complement the two existing ad-hoc layers. Resilience is the ability to degrade gracefully when under most real attacks. The three data mining layers are Name Detection (ND), Communal Detection (CD), and Spike Detection (SD).

The third-layer defence proposed in this dissertation is ND, which verifies personal names for authenticity, order, and gender. The fourth-layer defence is CD, which finds real social relationships to reduce the suspicion score, and is tamper-resistant to synthetic social relationships; it is the whitelist-oriented approach on a fixed set of attributes. To complement and strengthen CD, the fifth-layer defence is SD, which finds spikes in duplicates that increase the suspicion score, and reduces the probing of attributes used in score calculation; it is the attribute-oriented approach on a variable-size set of attributes.

Experiments were carried out on these new layers with several million real credit applications and thousands of real names. The best detection result is provided by the CD algorithm, strengthened by the SD algorithm's attribute weights; with the right parameter setting, this delivers superior results compared to all other experiments. It gives the best result because it can detect more types of attacks, better account for changing legal behaviour, and remove redundant attributes. Results on the data support the thesis that successful credit application fraud patterns are characterised by sudden and sharp spikes in duplicates.

Although this research is specific to credit application fraud detection, it is also potentially generalisable to future work in graph theory, utility measures, and Web-based identity crime detection. More importantly, the concepts of resilience, adaptivity, and quality data discussed in the dissertation are general to the design, implementation, and evaluation of all detection systems.



Declaration

I declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given.

Chun Wei Clifton Phua
April 16, 2008


Acknowledgments

Over the course of my PhD candidature, I received generous funding and support from a number of organisations.

• 2007: Endeavour Research Fellowship from the Australian Government's Department of Education, Science and Training (DEST) under Identity number 0141-2007.

• 2006: Student Travel Award from the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).

• 2005 to 2007: PhD School travel and accommodation allowances from the Australian Research Council's (ARC) Network in Enterprise Information Infrastructure (EII).

• 2005 to 2007: Australian Postgraduate Award - Industry (APA-I); Veda Advantage (formerly Baycorp Advantage) research fund; Monash postgraduate travel allowance. These funds were provided by the ARC, Veda Advantage, and Monash University under Linkage grant number LP0454077.

• 2004 and 2005: International Postgraduate Research Scholarship (IPRS) and Monash Graduate Scholarship (MGS); Ethics approval from Monash Standing Committee on Ethics for Research involving Humans (SCERH) under Project number 694ED-2005.


Throughout my candidature, the research and the dissertation write-up were advised by an industry supervisor and two academic supervisors. I am grateful to each of them.

• Dr. Ross Gayler initiated the project, and supplied real data, ideas, and perspectives on behalf of Veda Advantage.

• Assoc. Prof. Vincent Lee was encouraging in getting the tested ideas published and often provided important general advice.

• Prof. Kate Smith-Miles gave me freedom to explore new ideas but made sure they were explained in a simple, clear, and logical manner (maybe, sometimes we have to trade a small degree of academic precision for a much larger degree of explanatory power).

At various times during my candidature, the direction was influenced, aided, or supported by a number of other people.

• Dr. Warwick Graco and Dr. Peter Christen gave helpful comments on many of my publication drafts. Ms. Judith Morgan proof-read this dissertation.

• Prof. Sungzoon Cho and Mr. Pilsung Kang hosted me in Seoul National University's (SNU) data mining laboratory to exchange security ideas and methods.

• Prof. David Hand, Assoc. Prof. David Jensen, and Dr. Tom Fawcett enabled me to better understand my research issues on data mining-based fraud detection. Dr. Wengkeen Wong did likewise on early anomaly detection, and Prof. John Galloway on visual link analysis.

• Mr. Paul van Haaster and Mr. Gerhard Fries freed up dozens of precious desktops for my computationally intensive experiments.

• The developers of the freely available software I used: Simmetrics, the FEBRL data set generator, the yEd graph editor, MATLAB time series libraries, the Ironic chart plotter, the PERF evaluation tool, the Picnik online photo editor, and the LaTeX 2ε typesetter.


My PhD life was enriched by my colleagues. I am most grateful to them.

• Bob, Xuebing, Yeeling, and the rest of my Monash Information Technology fellow-sufferers, for meaningful conversations over coffee and nice dinners to encourage and support one another (perhaps, as PhD students, we constantly feel vulnerable and angry at the vicissitudes of fate).

• Junhyup, Dongkyun, Hyoungjoo, and others in SNU, for teaching me about Korean culture.

• Van, Khai, Flora, Hoyoung, Damian, and other attendees of EII PhD schools, for the great fun we had.

• Rasika, Kumari, Zoe, Guoyi, and Malini during Business Systems days, for many informal gatherings on the cozy red couch, home-cooked meals, and regular badminton sessions.

I also need to thank the following Singaporeans, Melburnians, and Seoulites who played such an important role in my personal life.

• Mr. and Mrs. Kwansoo Yoon, and Mr. and Mrs. Waihim Syn, for their prayers.

• Atlantic Street and Jean Avenue housemates, for the good times.

• Boonwei, Alvin, Sandy, Seungyea, Namjung, Yaobin, and Han-cheng, for being great friends.

• Xigu, Toayee, and Jeegu, for regularly keeping in touch with me.

• My parents, brother Yat, and girlfriend Shups, for their love, patience, and encouragement throughout these several years away from home.

Clifton Phua

潘骏卫 반준위

Monash University December 2007


“... Exercise caution in your business affairs; for the world is full of trickery. But let this not blind you to what virtue there is; many persons strive for high ideals; and everywhere life is full of heroism. ...” - Max Ehrmann, Desiderata



Chapter 1

Introduction

“We have given away so much information that anyone anywhere can become anybody at any time.” - Frank Abagnale, 2002, Interview

This chapter begins with definitions of identity crime and application fraud, and explains why they are serious and urgent social problems. Challenges for data mining-based detection systems are discussed in the second section. The subsequent sections describe the existing credit application detection system. This is followed by the dissertation’s objectives, contributions, and outline.

1.1 Definitions of Identity Crime

In this dissertation, identity crime is defined as broadly as possible. At one extreme, synthetic identity fraud refers to the use of plausible but fictitious identities. These are effortless to create but more difficult to apply successfully. At the other extreme, real identity theft refers to the illegal use of innocent people's complete identity details. These can be harder to obtain (although large volumes of some identity data are widely available) but easier to apply successfully. In reality, identity crime can be committed with a mix of both synthetic and real identity details.

Identity crime has become prominent because there is so much real identity data available on the Web, and confidential data accessible through unsecured mailboxes. It has also become easy for perpetrators to hide their true identities. This can happen in a myriad of insurance, credit, and telecommunications fraud, as well as


other more serious crimes. In addition, identity crime is prevalent and costly in developed countries that do not have nationally registered identity numbers. It was the fastest growing crime in the early twenty-first century (Abagnale, 2001) and its monetary cost to organisations is still in the billions of dollars, with hundreds of thousands of victims a year. Substantial identity crime can be found in private and commercial databases containing information collected about customers, employees, suppliers, and rule violators. The same situation occurs in public and government-regulated databases such as birth, death, patient, and disease registries, and taxpayer, residents' address, bankruptcy, and criminal lists.

To succeed in reducing identity crime, the most important textual identity attributes (see Appendix B for a non-exhaustive list), such as personal name, Social Security Number (SSN), Date-of-Birth (DoB), and address, must be used. The following publications support this argument: Jonas (2006) ranks SSN as most important, followed by personal name, DoB, and address. Jost (2004) assigns the highest weights to permanent attributes (such as SSN and DoB), followed by stable attributes (such as last name and state), and transient (or ever-changing) attributes (such as mobile phone number and email address). Malin (2005) states that DoB, gender, and postcode can uniquely identify more than eighty percent of the United States (US) population. Head (2006) and Kursun et al. (2006) regard name, gender, DoB, and address as the most important attributes.

The most important identity attributes differ from database to database. They are least likely to be manipulated, and are easiest to collect and investigate. They also have the fewest missing values, the fewest spelling and transcription errors, and no encrypted values.
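The cited weighting schemes can be sketched as a weighted identity match. The attribute tiers below follow Jost's (2004) ordering (permanent above stable above transient), but the specific weight values, attribute names, and scoring function are illustrative assumptions, not values taken from any of the cited studies:

```python
# Hypothetical weights: permanent attributes weigh most, then stable,
# then transient. The numbers are illustrative only.
ATTRIBUTE_WEIGHTS = {
    "ssn": 0.35, "dob": 0.25,             # permanent
    "last_name": 0.15, "state": 0.10,     # stable
    "mobile_phone": 0.10, "email": 0.05,  # transient
}

def weighted_identity_match(a: dict, b: dict) -> float:
    """Score in [0, 1]: the total weight of the attributes on which
    two identity records agree exactly."""
    return sum(w for k, w in ATTRIBUTE_WEIGHTS.items() if a.get(k) == b.get(k))
```

Under this sketch, two records agreeing on everything except a transient email address still score 0.95, while agreement on transient attributes alone scores at most 0.15.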

1.1.1 Credit Application Fraud

Credit applications are Internet or paper-based forms with written requests by potential customers for credit cards, mortgage loans, and personal loans. They contain a large number of identity values. Credit application fraud is a specific case of identity crime, involving synthetic identity fraud and real identity theft.


As in identity crime, credit application fraud has reached a critical mass of fraudsters who are highly experienced, organised, and sophisticated (Oscherwitz, 2005). Their visible patterns can differ from each other and constantly change. They are persistent, due to the high financial rewards and the minimal risk and effort involved. They can use software automation to manipulate particular values within an application and increase the frequency of successful values.

1.1.2 Fraudster Attack Cycle

In this dissertation, duplicates (or matches) refer to applications which share common values. There are two types of duplicates: exact (or identical) duplicates have all the same values; near (or approximate) duplicates have some same values (or characters), some similar values with slightly altered spellings, or both. This thesis argues that each successful credit application fraud pattern is represented by a sudden and sharp spike in duplicates within a short time, relative to the established baseline level. It will be shown later in the dissertation that many fraudsters operate this way and that their characteristic pattern of behaviour can be detected by the methods reported.

The following fraudster attack cycle is the modus operandi of credit application fraudsters, based on anecdotal observations of experienced credit application investigators.

Step 1: Flood Attacks are many patterns which probe for new weaknesses in the detection system. This can happen to any organisation at any time.

Step 2: Focused Attacks are patterns which focus on the new weaknesses before being stopped. Fraudsters then repeat flood attacks again. Furthermore, in some cases, the detection system can fail when fraudsters exploit many new weaknesses (such as names from a particular culture, some communal relationships or data errors, and certain attributes).

Although these two attack steps are specific to credit application fraud, they are also generalisable to identity crime and possibly to other fraud domains such as casinos (Jonas, 2006), banking (ASIC, 2007), and immigration.
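As a minimal sketch of the two duplicate types, assuming a hypothetical five-attribute application record, illustrative thresholds, and Python's standard-library `SequenceMatcher` standing in for the string matching measures discussed later in the dissertation:

```python
from difflib import SequenceMatcher

# Hypothetical attribute set; the dissertation's full list is in Appendix B.
ATTRIBUTES = ("first_name", "last_name", "dob", "address", "phone")
T_SIMILARITY = 0.8  # illustrative string similarity threshold

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_type(current: dict, previous: dict) -> str:
    """Classify a pair of applications as an exact duplicate, a near
    duplicate, or not a duplicate, based on shared attribute values."""
    exact = sum(current[k] == previous[k] for k in ATTRIBUTES)
    near = sum(
        current[k] != previous[k]
        and similarity(current[k], previous[k]) >= T_SIMILARITY
        for k in ATTRIBUTES
    )
    if exact == len(ATTRIBUTES):
        return "exact"
    if exact + near >= 2:  # illustrative link threshold
        return "near"
    return "none"
```

`duplicate_type` returns "exact" when every value matches, "near" when enough values match exactly or approximately (here an illustrative threshold of two attributes), and "none" otherwise.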


1.2 Challenges for Data Mining-based Detection Systems

Throughout this dissertation, data mining is defined as the search for patterns in a principled (or systematic) fashion. These patterns can be highly indicative of early symptoms in identity crime, especially synthetic identity fraud (Oscherwitz, 2005). The next subsections examine the challenges (or desired concepts) for data mining-based detection systems.

1.2.1 Resilience

The basic question asked of all detection systems is whether they can achieve resilience. Resilience is the ability to degrade gracefully when under most real attacks. To do so, a system trades off a small degree of efficiency (degraded processing speed) for a much larger degree of effectiveness (improved security by detecting most real attacks). The detection system needs "defence-in-depth": multiple, sequential, and independent layers of defence (Schneier, 2003) to cover different types of attacks. These layers are needed to reduce false negatives. In other words, any flood or focused attack has to pass every layer of defence without being detected.

The two greatest challenges for the data mining-based layers of defence are adaptivity and the use of quality data. These challenges need to be addressed in order to reduce false positives. Adaptivity accounts for morphing fraud behaviour, as the attempt to observe fraud changes its behaviour. Less obvious, but equally important, is the need to also account for legal (or legitimate) behaviour within a changing environment.

One issue of changing legal behaviour is communal relationships. They are found amongst near duplicates which reflect social relationships ranging from tight familial bonds to casual acquaintances: family members, housemates, colleagues, neighbours, or friends (Jost, 2004). Self-relationships highlight the same applicant as a result of rational behaviour. The detection system needs to exercise caution with applications which reflect communal and self-relationships.
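The defence-in-depth principle can be sketched as a chain of layer predicates, where an application is flagged if any layer detects it. The two layer functions below are hypothetical placeholders, not the dissertation's actual layers (business rules and scorecards, known fraud matching, ND, CD, and SD):

```python
# Each layer is a predicate over an application; the names and logic
# below are illustrative placeholders only.
def layer_business_rules(app: dict) -> bool:
    return app.get("identity_points", 0) < 100   # fails a document check

def layer_known_fraud(app: dict, blacklist: frozenset) -> bool:
    return app.get("phone") in blacklist         # matches a blacklist entry

def layered_detection(app: dict, blacklist: frozenset) -> bool:
    """Defence-in-depth: an application is flagged if ANY layer detects
    it, so an attack must evade every layer to succeed."""
    layers = [
        layer_business_rules,
        lambda a: layer_known_fraud(a, blacklist),
        # ... ND, CD, and SD layers would be appended here
    ]
    return any(layer(app) for layer in layers)
```

Because the layers are combined with a logical OR, adding an independent layer can only reduce false negatives, at the cost of some efficiency and potentially more false positives.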


Another specific case is external events, such as the entry of new organisations and the exit of existing ones, and organisations' marketing campaigns. These external events' legal behaviour can be hard to distinguish from fraud behaviour. Such external events are also likely to cause three natural changes in attribute weights, where attribute weights refer to the degree of importance of attributes: volume drift, where overall volume fluctuates; population drift, where the volumes of the fraud and legal classes fluctuate independently of each other; and concept drift, where changing legal characteristics can become similar to fraud characteristics. The detection system also needs to make allowance for certain external events.

Quality data is highly desirable for data mining, and data quality can be improved through the real-time removal of data errors (or noise). The detection system needs to filter duplicates which have been re-entered due to human error or for other reasons. It also needs to remove redundant attributes which have many missing values, among other issues.

1.2.2 Other Challenges

This dissertation also needs to address other major challenges which hinder detection systems.

Responsibility refers to adherence to privacy laws, non-disclosure agreements, and ethical codes of conduct. The designer of the detection system has to ensure that explicit consent has been given by the applicant to use their identity data for fraud detection, and that all identity data is protected from unauthorised disclosure or intelligible interception.

Scalability is the ability to handle data streams (Kleinberg, 2005), the large, uninterrupted, and fast flow of applications into the database, and the database's long-term growth. The detection system needs to scale up to the comparison of every current application, in real-time with a moving window, against other previous applications with a string similarity algorithm. In doing so, the detection system must find the balance between effectiveness and efficiency. Effectiveness is the ability to achieve objectives, and efficiency is the ability to conserve time and effort.
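The moving-window comparison can be sketched under assumed values: a window of the 1,000 most recent applications, three hypothetical attributes, illustrative thresholds, and the standard-library `SequenceMatcher` standing in for a string similarity algorithm such as those in Simmetrics:

```python
from collections import deque
from difflib import SequenceMatcher

W = 1000  # illustrative moving-window size (previous applications kept)

def link_score(current: dict, previous: dict, attrs) -> float:
    """Fraction of attributes whose values are similar enough to link."""
    return sum(
        SequenceMatcher(None, current[k], previous[k]).ratio() >= 0.8
        for k in attrs
    ) / len(attrs)

def stream_compare(applications, attrs=("name", "address", "phone")):
    """Compare each incoming application against a moving window of the
    W most recent applications; yield (application, link count)."""
    window = deque(maxlen=W)
    for app in applications:
        links = sum(link_score(app, prev, attrs) >= 0.5 for prev in window)
        window.append(app)
        yield app, links
```

The bounded deque keeps memory and per-application work proportional to the window size, trading a little effectiveness (links older than the window are forgotten) for efficiency on an uninterrupted stream.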


Implicitness means there is no explicit link information between applications, and no unique identity attribute. The detection system needs to create implicit links with common identity attributes which capture suspicious links, communal relationships, and data errors.

Sparsity is the inverse of density. It refers to sparse attributes, each with an enormous number of possible values where many of the values are unique. This cripples most clustering, association rule, emerging pattern, and correlation analysis algorithms. The detection system has to use appropriate attributes such that when values in an application become denser, the application becomes suspicious.

Imbalanced Class refers to known frauds being many times fewer than unknowns. Known frauds, which can be synthetic identity fraud or real identity theft, are applications reported by organisations as fraudulent. Unknowns are mostly legal applications. However, unknowns can be fraudulent applications that have not yet been revealed, were inadvertently overlooked, or were intentionally not provided. Collectively, known frauds and unknowns are termed class labels. The detection system needs to achieve high true positives with relatively low false positives.

Multiple Sources refers to many contributing organisations in different industries. They pay a fee for the application fraud detection service to the credit bureau. The centralised, cross-industry approach to application fraud has shared benefits, as more than one organisation is usually attacked with the same patterns. The detection system needs to process a combined data stream from multiple organisations.

Evaluation requires measuring performance under implicitness and sparsity, which will result in few links and low scores. The detection system needs to use a suitable multiple-valued measure which allows evaluation of results under imbalanced class and different thresholds.
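The F-measure, built from the confusion matrix counts tp, fp, and fn defined in the notation, is one such multiple-valued measure: it ignores tn, which dominates under imbalanced class. A sketch, with an illustrative threshold sweep over hypothetical scores and labels:

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (harmonic mean of precision and recall) from confusion
    matrix counts; tn is deliberately not used."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, thresholds):
    """Pick, from X candidate decision thresholds, the one with the
    highest F-measure on the given suspicion scores and class labels."""
    def f_at(t):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        return f_measure(tp, fp, fn)
    return max(thresholds, key=f_at)
```

Sweeping thresholds this way evaluates a scoring system as a whole rather than at one arbitrary operating point, which matters when links are few and scores are low.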
Beyond the scope of this dissertation is the concept of utility which accounts for costs, benefits, and constraints of detection (Hand et al., 2007). User-friendliness allows better understanding and use of scores for users (or investigators). For the purposes of this dissertation, it is not regarded as a challenge here, since it is hard to objectively evaluate the user-interface and its visualisation results. Also, it depends on each contributing organisation’s severity of fraud, and

1.3. EXISTING DETECTION SYSTEM

7

timeliness, and availability of their known frauds (Jensen, 1997). If implications of fraud are not severe, and known frauds are timely and available, a fully automated system with minimal user intervention is better. If it is severe, an interactive system which allows users to annotate, add attributes, or change attribute weights is more suitable. The detection system had two forms of visualisation tools which helped in the better understanding of known frauds. They are simple directed graphs of linked applications which can be annotated and drilled down into values, and time series charts of values.
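The implicitness challenge above can be made concrete with a small sketch. An inverted index from attribute values to application identifiers creates an implicit link wherever two applications share a value; the attributes and sample data below are illustrative assumptions, not the dissertation's data:

```python
# Hypothetical sketch of the implicitness challenge: with no explicit links and
# no unique identity attribute, implicit links are created wherever two
# applications share an identity attribute value. Data is illustrative only.
from collections import defaultdict
from itertools import combinations

applications = [
    {"id": 1, "name": "ann lee", "address": "2 high st", "phone": "95550002"},
    {"id": 2, "name": "bob tan", "address": "2 high st", "phone": "95550003"},
    {"id": 3, "name": "cat ong", "address": "9 low rd", "phone": "95550004"},
]

# Inverted index: each (attribute, value) pair maps to the applications holding it.
index = defaultdict(set)
for app in applications:
    for attr in ("name", "address", "phone"):
        index[(attr, app[attr])].add(app["id"])

# Any value held by two or more applications implicitly links them.
links = set()
for (attr, value), ids in index.items():
    for a, b in combinations(sorted(ids), 2):
        links.add((a, b, attr))

print(links)  # applications 1 and 2 are linked through their shared address
```

Note how sparsity shows up directly: most values here are unique, so only one link is formed, which is precisely why the evaluation measure must cope with few links and low scores.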

1.3 Existing Detection System

There are non-data mining layers of defence to protect against credit application fraud, each with its unique strengths and weaknesses.

The first-layer defence is made up of business rules and scorecards. In Australia, one business rule is the hundred-point physical identity check, which requires the applicant to provide sufficient point-weighted identity documents face-to-face; the documents must add up to at least one hundred points. Another business rule is to contact (or investigate) the applicant over the telephone or Internet. These two business rules are highly effective, but human resource intensive. To rely less on human resources, a common business rule is to match an application's identity number, address, or phone number against external databases. This is convenient, but in Australia the public telephone and address directories, the semi-public voters' register, and credit history data can have data quality issues of accuracy, completeness, and timeliness. In addition, scorecards for credit scoring can catch the small percentage of fraud which does not look creditworthy, but they also remove outlier applications, which have a higher probability of being fraudulent.

The second-layer defence is known fraud matching. Known frauds are recorded in a periodically updated blacklist, and current applications are then matched against the blacklist. This has the benefit and clarity of hindsight, because patterns often repeat themselves. However, there are two main problems in using known frauds. First, they are untimely, due to long time delays, in days or months, for fraud to reveal itself and be reported and recorded. This provides a window of opportunity for fraudsters. Second, the recording of frauds is highly manual. This means known frauds can be incorrect (Hand, 2006), expensive, and difficult to obtain (Neville et al., 2005; Brockett et al., 2002), and have the potential of breaching privacy.

This dissertation argues against the use of classification (or supervised) algorithms which use class labels as a third-layer defence. In addition to the problems of using known frauds, these algorithms, such as logistic regression, neural networks, or Support Vector Machines (SVMs), cannot achieve scalability or handle the extreme imbalanced class (Hand, 2006) in credit application data streams. To explain further: as fraud and legal behaviour change frequently, the classification algorithms must be retrained on new data, but the training time is too long for a real-time detection system because the new training data has too many derived numerical attributes (converted from the original, sparse string attributes) and too few known frauds.
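As an aside, the hundred-point check described in the first-layer defence is simple to express in code. The document point weights below are illustrative assumptions only, not the official Australian schedule:

```python
# Hypothetical sketch of Australia's hundred-point identity check: each document
# type carries a point weight, and the supplied documents must total at least
# one hundred points. The weights below are illustrative assumptions only.
POINT_WEIGHTS = {
    "birth_certificate": 70,
    "passport": 70,
    "driver_licence": 40,
    "credit_card": 25,
    "utility_bill": 25,
}

def passes_hundred_point_check(documents, required=100):
    """True if the point weights of the supplied documents reach the threshold."""
    total = sum(POINT_WEIGHTS.get(doc, 0) for doc in documents)
    return total >= required

print(passes_hundred_point_check(["passport", "driver_licence"]))   # 110 points
print(passes_hundred_point_check(["credit_card", "utility_bill"]))  # only 50 points
```

The check itself is trivial to compute; its cost, as noted above, lies in the face-to-face verification of the documents by a human.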

1.4 Objectives

The main objective of this research is to achieve resilience by adding three new, real-time, data mining-based layers to complement the two existing ad-hoc layers discussed in the previous section (this real-time data mining tool is Hesperus, named after the evening star in Greek mythology). These new layers will improve the detection of fraudulent applications because the detection system can detect more types of attacks, better account for changing legal behaviour, and remove redundant attributes. These new layers are not human resource intensive. They represent patterns as a score: the higher the score for an application, the higher the suspicion of fraud (or anomaly). In this way, only the highest scores require human intervention. Two of the three new layers, communal and spike detection, do not use external databases, but only the credit application database itself. Crucially, these two layers are unsupervised algorithms which are not completely dependent on known frauds, using them only for evaluation.
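The score-based triage just described can be sketched minimally as follows. The scores, application identifiers, and review budget are illustrative assumptions:

```python
# Hypothetical sketch of score-based triage: the layers output a suspicion score
# per application, and only the highest-scoring applications are queued for
# human review. Scores and the review budget are illustrative assumptions.
scored = [
    ("app-001", 0.12),
    ("app-002", 0.91),
    ("app-003", 0.47),
    ("app-004", 0.88),
]

def triage(scored_apps, review_budget=2):
    """Return the top-scoring applications, up to the investigators' capacity."""
    ranked = sorted(scored_apps, key=lambda item: item[1], reverse=True)
    return [app_id for app_id, score in ranked[:review_budget]]

print(triage(scored))  # the two most suspicious applications go to investigators
```

This is what makes the new layers light on human resources: investigators see only the top of the ranking, and the threshold (or budget) can be tuned to their capacity.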

1.5 Contributions

The main contribution of this dissertation is the demonstration of resilience, with adaptivity and quality data, in real-time data mining-based detection algorithms (all experiments were performed on Pentium IV 3.0GHz workstations with 1 or 2Gb RAM, running on a Windows XP platform; the name, communal, and spike detection algorithms, as well as the evaluation measures, were coded in Visual Basic .NET, and the name and application data were stored in a Microsoft Access database). Table 1.1 summarises the specific contributions to credit application fraud detection in Chapters 4, 5, and 6. The third-layer defence is Name Detection (ND): the verification of personal names for authenticity, order, gender, and culture. The fourth-layer defence is Communal Detection (CD): the whitelist-oriented approach on a fixed set of attributes. To complement and strengthen CD, the fifth-layer defence is Spike Detection (SD): the attribute-oriented approach on a variable-size set of attributes.

The second contribution is the coverage of the other challenges in Chapters 3, 5, and 6. Table 1.1 includes the responsible handling of imbalanced application data from multiple organisations, and the evaluation measure for results, in Chapter 3.

The third contribution is the survey of data mining-based detection algorithms in Chapter 2. It shows that this dissertation significantly extends knowledge in credit application fraud detection, because publications in this area are rare. In addition, it draws key ideas from other related domains to design the credit application fraud detection algorithms.

Finally, the last contribution is the recommendation of credit application fraud detection as one of the solutions to identity crime. Being at the first stage of the credit life cycle, credit application fraud detection also prevents some credit transactional fraud.

Table 1.1: Contributions to credit application fraud detection (rows: resilience, adaptivity, quality data, responsibility, scalability, implicitness, sparsity, imbalanced class, multiple sources, and evaluation; columns: Chapters 3, 4, 5, and 6)

1.6 Outline

Figure 1.1 gives an overview of the dissertation. Chapter 2 presents an overview of fraud, adversarial, and identity crime-related detection. Chapter 3 considers the legal and ethical responsibility of handling application and name data, and describes the data and the evaluation measure. Name data is used in Chapter 4 to compare string similarity and phonetic algorithms, in order to propose the ND algorithm. Application data is used in Chapters 5 and 6, where the CD and SD algorithms are presented respectively. Chapter 7 concludes the dissertation with the contributions of each chapter, followed by future work and closing remarks.

Figure 1.1: Resilient credit application fraud detection system outline (first- and second-layer defence: Chapter 1, Introduction; Chapter 2, Data Mining-based Detection; Chapter 3, Data and Measures; third-layer defence: Chapter 4, Name Detection; fourth-layer defence: Chapter 5, Communal Detection; fifth-layer defence: Chapter 6, Spike Detection; Chapter 7, Conclusion)
