
Detecting Telecommunications Fraud based on Signature Clustering Analysis

Pedro Gabriel Ferreira, Ronnie Alves, Orlando Belo, Joel Ribeiro

University of Minho, Department of Informatics, Campus of Gualtar, 4710-057 Braga, Portugal
{pedrogabriel,ronnie,obelo}@di.uminho.pt

Abstract. In telecommunications services, fraud has a significant business impact. Because of the massive amounts of data handled, fraud detection is a difficult and challenging task. In this paper, we propose the application of dynamic clustering over signatures to support this task. Traditional static clustering is applied to determine the characteristics of the clusters, and dynamic clustering analysis is used to identify changes in cluster membership over time. This approach eliminates the bias caused by special situations such as marketing campaigns or holidays. To overcome scalability issues arising from the huge volume of data involved, a partition-clustering approach is also proposed. Experimental evaluation demonstrates the scalability of the method and its ability to detect previously known fraud cases as well as new potential fraud situations.

1 Introduction

Telecommunications companies generate and store massive amounts of data. Such data may provide valuable knowledge about customer behavior. In particular, the detection of anomalous changes in customer behavior is of critical importance. These changes can be indicative either of a churn situation, where the customer is inclined to change to another service provider, or of a fraud situation, where a customer makes inappropriate use of the resources of the telecommunications company (or of another customer) for their own profit. Essentially, two main types of fraud can be distinguished [13]: subscription fraud and superimposition fraud. In the former, fraudsters create a new account, using fake identification, with no intention of paying for the services used. Typically, these cases reveal intensive usage right from the beginning. In the latter, fraudsters make illegitimate use of a legitimate account through diverse techniques, which means, for instance, that some abnormal usage is blurred into the characteristic usage of the account. This type of fraud is usually more difficult to detect and poses a bigger challenge.

To evaluate potential fraud situations we can make use of several data mining techniques. Clustering appears as a promising approach, since it can be applied to find groups of customers with similar calling behaviors. In recent years considerable effort has been devoted to scalable clustering methodologies [9, 15, 2], which can be of particular interest in the telecommunications context. However, traditional clustering has the important limitation of providing only a static perception of the structure of the data. This becomes a major drawback in this kind of application, where fraud is detected through temporal changes in customer behavior. Therefore, an analysis of cluster membership through time is an essential requirement. Customer changes between clusters with different characteristics can be indicative of anomalous situations and provide evidence of fraud.

To capture the characteristics of a customer's behavior, the concept of signature can be applied. A signature corresponds to a set of information that captures the typical behavior of a customer. For example, it may include the average number of calls, the duration of the calls or the area where the calls are made, to name just a few. This concept has already been applied successfully to anomaly detection in other areas such as credit card usage [11] and network intrusion [12, 11] and, in particular, in telecommunications fraud [5, 14, 4, 6, 3]. In the fraud detection context, if at a given moment a customer deviates from the typical behavior expressed by its signature, this can be a reason to trigger an alarm for further analysis of that customer.

In this paper we tackle the problem of anomaly detection in customer behavior for mobile telecommunications environments. We propose the application of dynamic clustering analysis over signatures. Traditional static clustering is applied to determine the characteristics of the clusters, and dynamic clustering analysis is used to determine the changes in cluster membership over time. Our main motivation is that these changes will provide fraud analysts with evidence for establishing potential fraud situations.

2 Signatures

A signature of a customer corresponds to a vector of feature variables whose values are determined over a certain period of time. This information is obtained directly from the coded fields of a set of Call Detail Records (CDRs), which describe the details of the calls. The variables can be simple, if they consist of a single atomic value (e.g. an integer or a real number), or complex, if they consist of two co-dependent statistical values, the average and the standard deviation of a given feature. A signature S is then obtained from a function ϕ, which consists of a set of functions, for a given time unit w, where S = ϕ(w). A time unit is the period during which CDRs are accumulated; at the end of this period they are processed to obtain the signature. Typically a time unit of one day is considered.

The choice of the variables present in the signature depends on several factors, such as the complexity of the feature described, the data available and the computation required. A feature like the duration of the calls shows significant variability and is therefore better expressed through an average/standard-deviation variable. A feature like the number of international calls is typically much less frequent, so the respective frequency value is sufficient to describe it. The selection of the variables was already discussed in previous work [8]. Signatures provide data reduction, making the application of clustering feasible in the context of fraud detection; applying clustering directly to CDR data would be both impractical and meaningless. Signatures include the following feature variables [3, 8]:

Complex: Duration of Calls, Number of Calls (NOC) - Working Days, NOC - Weekends and Holidays, NOC - Working Time (8h-20h), NOC - Night Time (20h-8h).

Simple: NOC - To the different National Networks, NOC - As Caller (Origin), NOC - As Called (Destination), NOC - As Caller in Roaming, NOC - As Called in Roaming.
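As an illustration of how a signature with one complex and two simple variables might be derived from the CDRs of a single time unit, consider the following minimal sketch. The `CDR` record, its field names and the `build_signature` helper are hypothetical and are not taken from the paper; the use of the population standard deviation is likewise an assumption, since the text does not say which estimator is used.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class CDR:
    """Hypothetical, heavily simplified call detail record."""
    caller: str
    callee: str
    duration_s: float


Signature = Dict[str, Union[int, Tuple[float, float]]]


def build_signature(customer: str, cdrs: List[CDR]) -> Signature:
    """Summarize one time unit of CDRs for a single customer."""
    own = [r for r in cdrs if customer in (r.caller, r.callee)]
    durations = [r.duration_s for r in own]
    # Complex variable: (average, standard deviation) of the call duration.
    # Assumption: population standard deviation; (0.0, 0.0) when no calls exist.
    duration = (mean(durations), pstdev(durations)) if durations else (0.0, 0.0)
    return {
        "duration": duration,
        # Simple variables: plain frequency counts.
        "noc_as_caller": sum(r.caller == customer for r in own),
        "noc_as_called": sum(r.callee == customer for r in own),
    }
```

A real signature would carry all the variables listed above; the sketch keeps only three to show the distinction between simple and complex variables.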

3 Clustering Signatures

Signatures are composed of simple and complex variables. This makes traditional similarity measures, such as the Euclidean distance, the Pearson correlation or the Jaccard measure, unsuitable for signature comparison, since they cannot deal with the co-dependent nature of the elements of complex variables. Thus, a new similarity measure for comparing signatures needs to be devised. This new measure makes it possible to determine a similarity space between the signatures, in the form of a similarity matrix. This matrix can then be processed by a clustering algorithm in order to provide an overall clustering solution. Since feature variables have different types, the signature similarity function results from the combination of the similarity functions for each of the different variables. Therefore, comparing two signatures corresponds to the pairwise comparison of the respective feature variables.

3.1 Similarity for Simple Feature Variables

A simple feature variable is defined by a single value, which corresponds to the average or the frequency value of the considered feature. For simple variable comparison we make use of a ratio-scaled function. This type of function makes a positive measurement on a non-linear scale, which in this case is the exponential scale. The idea behind the use of this function is that larger differences have significantly more impact than smaller ones. The function takes values within the range [0, 1] and is defined according to Equation 1.

$$d(I_x, I_y) = e^{-\frac{|I_x - I_y|}{Amp}} \qquad (1)$$

In Equation 1, $I_x$ and $I_y$ are the two simple variables under comparison and $Amp$ is the amplitude (the difference between the maximum and the minimum value) of the respective feature over the whole signature space.

3.2 Similarity for Complex Feature Variables

Complex feature variables are defined by two co-dependent variables, which correspond respectively to the average and the standard deviation of the considered feature. For two complex variables, $C_x = (I_x, \sigma_x)$ and $C_y = (I_y, \sigma_y)$, the similarity function is defined in Equation 2 and takes values within the range [0, 1].

$$d(C_x, C_y) = d(I_x, I_y) \times \frac{|C_x \cap C_y|}{|C_x \cup C_y|} \qquad (2)$$

Equation 2 is the combination of two formulas: the similarity function for simple variables (Equation 1) and the ratio $\frac{|C_x \cap C_y|}{|C_x \cup C_y|}$. This ratio provides the degree of overlap between the ranges $X = [I_x - \sigma_x, I_x + \sigma_x]$ and $Y = [I_y - \sigma_y, I_y + \sigma_y]$ of the two complex variables $C_x$ and $C_y$; it takes values between 0 and 1.

3.3 Similarity for Signatures

The similarity between two signatures is then obtained through the combination of the previous similarity measures for the different variables of the signature. For two signatures $A$ and $B$, where $A_i$ and $B_i$ are respectively the feature variable $i$ of $A$ and $B$, and for $n$ possible variables, the similarity measure is defined in Equation 3.

$$D(A, B) = \sqrt{W_1 \cdot d_1(A_1, B_1)^2 + \ldots + W_n \cdot d_n(A_n, B_n)^2} \qquad (3)$$

Again, $D(A, B) \in [0, 1]$. $W_i$ defines the weight of feature $i$, with $\sum_{i=1}^{n} W_i = 1$. The weighting factors are defined by fraud analysts, although they could be induced from the knowledge base (success rate) of previous fraud situations detected by the proposed method. With the signature similarity measure defined, an all-against-all comparison of the signatures can be made. This yields an $N \times N$ matrix that summarizes the similarities among the $N$ signatures. This similarity matrix is then used as input for a traditional clustering algorithm.
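Equations 1 to 3 can be sketched in a few lines of Python. The names `ComplexVar`, `simple_similarity`, `overlap_ratio`, `complex_similarity` and `signature_similarity` are illustrative only. The handling of degenerate cases (zero amplitude, or two zero-width ranges) is an assumption, since the text does not discuss them, and the overlap is computed on interval lengths, which is one plausible reading of the ratio in Equation 2.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ComplexVar:
    mean: float
    std: float


def simple_similarity(x: float, y: float, amp: float) -> float:
    """Equation 1: exp(-|x - y| / Amp)."""
    if amp <= 0:  # assumption: a constant feature is identical for everyone
        return 1.0 if x == y else 0.0
    return math.exp(-abs(x - y) / amp)


def overlap_ratio(cx: ComplexVar, cy: ComplexVar) -> float:
    """Length of the intersection over length of the union of the two
    ranges [mean - std, mean + std]."""
    lo_x, hi_x = cx.mean - cx.std, cx.mean + cx.std
    lo_y, hi_y = cy.mean - cy.std, cy.mean + cy.std
    inter = max(0.0, min(hi_x, hi_y) - max(lo_x, lo_y))
    union = (hi_x - lo_x) + (hi_y - lo_y) - inter
    if union == 0:  # assumption: two zero-width ranges overlap only if equal
        return 1.0 if lo_x == lo_y else 0.0
    return inter / union


def complex_similarity(cx: ComplexVar, cy: ComplexVar, amp: float) -> float:
    """Equation 2: simple similarity of the means times the overlap ratio."""
    return simple_similarity(cx.mean, cy.mean, amp) * overlap_ratio(cx, cy)


def signature_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Equation 3: weighted combination of the per-feature similarities."""
    if len(sims) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("one weight per feature is required and weights must sum to 1")
    return math.sqrt(sum(w * d * d for w, d in zip(weights, sims)))
```

For example, two duration variables with means 5 and 6 and standard deviation 1 give the ranges [4, 6] and [5, 7], which overlap on one third of their union, so the complex similarity is one third of the Equation 1 value for the two means.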

4 Cluster Migration Analysis

Clustering is the process of determining groups of similar objects. Whether the nature of these objects is static or dynamic, clustering only provides a snapshot of the arrangement of the objects at a specific moment in time. Applying clustering methods to dynamic or temporal data at different moments will most certainly provide different results. This is especially true in the telecommunications scenario, where a constant stream of events, such as marketing campaigns, new price plans and special days (Christmas, Easter, ...), influences customer behavior. In addition to these expected changes, situations like fraud or churn also result in changes of cluster membership. The analysis of changes in the cluster topology over a period of time provides support for a better understanding of the usage patterns. In particular, the detection of abrupt changes in cluster membership, referred to as anomaly detection, may provide strong evidence of a fraud situation.

In this section, we describe our method for anomaly detection. First, we describe how signatures are assigned to clusters based on a provided cluster topology. Next, we show how to detect changes in cluster membership, and which of these changes can be considered anomalous situations.

4.1 Signature Assignment

According to the moment of the week, different usage patterns can be found [8, 3]. For example, some customers show a high usage profile during week days, with a larger number of originated calls and medium average call duration. Other customers show a low usage profile during work hours and intensive usage during night hours, with a few long calls. The characteristics of these usage profiles can be provided by the telecommunications analysts or can be obtained by automatic inspection of the data. The latter is our case: usage profiles are obtained by means of signature clustering analysis, according to the method described in Section 3. Therefore, for each day of the week a cluster topology is obtained from the signatures corresponding to that particular day. This topology describes customers' usage patterns during that period. To allow both the automatic and the manual definition of the clusters, each cluster is described by the characteristics of its centroid. A centroid is defined as a signature, which allows a direct comparison between signatures and cluster centroids. The comparison is made according to the similarity measure defined by Equation 3. Algorithm 1 summarizes the assignment of signatures to clusters. The list of signatures and the list of cluster centroids are calculated beforehand. Each signature is compared to all centroids (lines 1 to 3) and is assigned to the cluster C to which it has the smallest distance (line 7).

    input: SignList (List of Signatures); CentroidLst (List of Centroids)
    /* Compare each Signature against Centroids */
    1  foreach Sign in SignList do
    2      foreach Centroid in CentroidLst do
    3          if minDist > D(Sign, Centroid) then
    4              minDist = D(Sign, Centroid); minCluster = Centroid;
    5          end
    6      end
    7      C[Sign] = minCluster;
    8  end

Algorithm 1: Procedure for the assignment of the signatures to the clusters.
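A runnable counterpart of Algorithm 1 might look as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function name, the dictionary-based inputs and the `distance` callable are hypothetical, and the sketch makes explicit the per-signature reset of the running minimum, which the pseudocode leaves implicit. If Equation 3 is read as a similarity whose maximum means "identical" (as Section 4.2 states), one option, which is our assumption, is to pass `lambda a, b: 1 - D(a, b)` as the distance.

```python
import math
from typing import Callable, Dict, Hashable, TypeVar

S = TypeVar("S")  # signature type; centroids share it


def assign_signatures(
    signatures: Dict[Hashable, S],
    centroids: Dict[Hashable, S],
    distance: Callable[[S, S], float],
) -> Dict[Hashable, Hashable]:
    """Map each signature id to the id of its closest centroid."""
    assignment = {}
    for sig_id, sig in signatures.items():
        # Reset the running minimum for every signature.
        min_dist, min_cluster = math.inf, None
        for cluster_id, centroid in centroids.items():
            d = distance(sig, centroid)
            if d < min_dist:
                min_dist, min_cluster = d, cluster_id
        assignment[sig_id] = min_cluster
    return assignment
```

With one-dimensional toy signatures `{"s1": 1.0, "s2": 9.0}`, centroids `{0: 0.0, 1: 10.0}` and the absolute difference as distance, `s1` is assigned to cluster 0 and `s2` to cluster 1.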

4.2 Absolute and Relative Similarity

In order to compare signatures against cluster centroids, two types of similarity measures can be defined: absolute and relative similarity. Absolute similarity defines the similarity value between the signature and the centroid at a given time instant $t$. This value is given by Equation 3, where the maximum value corresponds to a signature exactly equal to the centroid. This type of similarity was already used in lines 3 and 4 of Algorithm 1 to determine the assignment of the signatures to the clusters. Relative similarity assesses the similarity difference between instants $t$ and $t + 1$. It provides the percentage of the signature variation between two consecutive time instants. This value is defined according to Equation 4, where $\{S_i\}_t$ corresponds to the signature of customer $i$ at instant $t$ and $C[\{S_i\}_t]$ to the cluster that $S_i$ belongs to at instant $t$.

$$\Delta = \left\{ 1 - \frac{D(\{S_i\}_{t+1}, C[\{S_i\}_t])}{D(\{S_i\}_t, C[\{S_i\}_t])} \right\} \times 100\% \qquad (4)$$

[Figure 1: two panels in a 2-feature space (axes Feature X and Feature Y), each showing a signature at instants t and t+1 relative to the centroids of clusters 0, 1 and 2. Panel (a): similarity values 0.7 and 0.5, Δ = 28.5%. Panel (b): similarity values 0.5 and 0.8, Δ = -60%.]

Fig. 1. (a) Positive variation on the relative similarity of the signature; (b) Negative variation on the relative similarity of the signature; examples are provided for a 2-feature space.

Figure 1(a) depicts a situation where the signature $S_i$ is closer to the centroid of cluster 0 at instant $t$ than at $t + 1$. This represents a positive variation. On the other hand, Figure 1(b) represents a negative variation, where $S_i$ gets closer to the cluster centroid at instant $t + 1$. A negative value of the relative similarity at instant $t + 1$ indicates that the signature $S_i$ is now closer to the centroid of the cluster it belonged to at instant $t$. Nevertheless, a cluster membership change may still occur, since $S_i$ can now be even closer to another cluster. This situation is illustrated in Figure 2.

[Figure 2: a 2-feature space (axes Feature X and Feature Y) showing a signature at instants t and t+1 relative to the centroids of clusters 0, 1 and 2, with similarity values 0.5, 0.75 and 0.9 and Δ = -50%.]

Fig. 2. Negative variation on the relative similarity of the signature and cluster membership change; example is provided for a 2-feature space.
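The relative similarity of Equation 4 reduces to a one-line computation. The sketch below uses a hypothetical function name and checks it against the values shown in Figures 1 and 2; pairing each figure value with instant t or t+1 is our reading of the figures, chosen because it reproduces the reported percentages.

```python
def relative_similarity(d_t: float, d_t1: float) -> float:
    """Equation 4, in percent.

    d_t  : similarity between the signature at instant t and the centroid of
           the cluster it belongs to at t.
    d_t1 : similarity between the signature at instant t+1 and that same
           centroid.
    """
    if d_t == 0:
        raise ValueError("similarity at instant t must be non-zero")
    return (1.0 - d_t1 / d_t) * 100.0


print(relative_similarity(0.7, 0.5))   # Figure 1(a): about 28.6 (reported as 28.5%)
print(relative_similarity(0.5, 0.8))   # Figure 1(b): about -60
print(relative_similarity(0.5, 0.75))  # Figure 2: about -50
```

A positive result means the signature moved away from its previous centroid, and a negative result means it moved towards it.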

Definition 1. (Cluster Membership Change) A signature $S$ changes its cluster membership at instant $t + 1$ to cluster $C_j$ ($\{C_j\}_{t+1}$) if it belongs to cluster $C_i$ at instant $t$, and at instant $t + 1$ the distance $D(\{S\}_{t+1}, \{C_j\}_{t+1})$ is minimal with respect to all the clusters and $D(\{S\}_{t+1}, \{C_j\}_{t+1}) < D(\{S\}_t, \{C_i\}_t)$.

All the data on the cluster membership of the signatures are kept for later analysis. These data, which we call historical data, make it possible to analyze the evolution of customer behavior through time.

4.3 Analysis Report

To give the analyst a tool for a better examination of cluster membership changes during a pre-defined time period, analysis reports can be calculated. An analysis report is based on the calculation of the variation of a set of signatures during a given time period, typically between two consecutive time instants, under some pre-defined conditions. The variation is assessed through the relative similarity measure (Equation 4). Examples of conditions that can be supported by these reports are the set of customer signatures, the set of clusters, and upper and lower bounds for the variation in each feature variable. Each analysis report includes the identification of all the conditions used, the average and standard deviation of the signature variations, and the maximum, minimum and average values for all the signature features. These values provide the basis for detecting anomalous situations. The signatures that show some deviating behavior are included in a list called the "blacklist", for later analysis by the fraud analyst [8, 3].

4.4 Detecting Anomalies

After an analysis report is obtained, signatures are scanned to detect anomalous situations. The anomaly criterion is defined as follows:

Definition 2. (Anomaly Criterion) A signature $S$ is considered to be in an anomalous situation at a given instant $t$ when its variation, $\Delta$, is outside the range $[M - 2\sigma, M + 2\sigma]$. $M$ and $\sigma$ are respectively the average and the standard deviation of the variation over all the signatures, calculated through the respective analysis report.

Signatures that fulfill the criterion in Definition 2 are considered suspect and inserted in the blacklist. This list provides the analysts with a set of customers who have shown a deviating or anomalous behavior, since the respective signature variation was considerably greater than the variation of the remaining customers. This approach not only detects deviating behaviors in normal situations, but also eliminates the bias caused by special situations such as new marketing campaigns, holidays and special events. This is because the comparison baseline is provided by the variation of all the customers at that given instant.
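Definition 2 can be illustrated with a short sketch. The function name and the dictionary input are hypothetical, and the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev
from typing import Dict, List


def find_anomalies(variations: Dict[str, float]) -> List[str]:
    """Return the customers whose variation lies outside [M - 2*sigma, M + 2*sigma].

    `variations` maps a customer id to its relative similarity (Equation 4)
    for one instant. M and sigma are computed over all customers, so a shift
    that affects everyone (e.g. a holiday) moves the baseline with it.
    """
    values = list(variations.values())
    if not values:
        return []
    m = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    lower, upper = m - 2 * sigma, m + 2 * sigma
    return sorted(cid for cid, v in variations.items() if v < lower or v > upper)
```

With nine customers at 0% variation and one at 100%, the average is 10 and the standard deviation is 30, so the range is [-50, 70] and only the last customer is blacklisted. If every customer varies by the same amount, the range collapses onto that value and nobody is flagged, which is the bias-removal property described above.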

5 Scalability Issues

Mobile telecommunications companies are known for producing a large volume of data, because they handle calls from hundreds of thousands to millions of customers every day. Each customer is described by its respective signature. If the similarity among all the customers is evaluated, then for $N$ signatures an $N \times N$ similarity matrix is needed. This is unfeasible in practice. For instance, for a million ($10^6$) customers, a matrix with one trillion ($10^{12}$) values would be required. If each value is stored with 3 decimal places (6 bytes/value), about $6 \times 10^{12}$ bytes, or roughly 5.5 terabytes, are necessary to store this matrix. In addition to the size of the similarity matrix, it is also necessary to take into account that each signature can be made up of an arbitrary number of feature variables. This has great impact not only in terms of memory space but also in terms of processing time. Our preliminary evaluation, with half a million customers, verified that combining the methodology described in Section 2 with typical clustering algorithms is not viable, due to the large size of the similarity matrix and the high processing time.

To overcome these scalability issues we propose a variation of the clustering algorithm. In the proposed methodology the data is partitioned, the partitions are processed independently, and the resulting cluster information is merged through traditional clustering. In the context of our work we call this scheme Partitional Clustering; it follows the general ideas of the approach proposed in [9].

5.1 Partitional Clustering of Signatures

The basic idea behind partitional clustering is to divide the original block of data $D$ into a set of mutually exclusive partitions $D'_i$, in order to make the processing of each partition feasible. The size of each partition should be chosen according to the characteristics of the processing computer. It should not be too small, in order to avoid unnecessary I/O operations, and not too large, because the physical memory of the system limits what can be processed in reasonable time.

This is a two-step process. In the first step, the original data set $D$ is partitioned into blocks of adequate size and each block is processed independently. After all partitions have been processed, the second step consists of merging all the clustering information that resulted from the individual partition processing. The parameters that describe the cluster topology obtained for each partition are gathered in a single set $D'_f$. These parameter vectors are now considered the data objects. The data set $D'_f$ is then processed and the final $K$ clusters are obtained. This approach makes it possible to process a considerable number of signatures (typically more than one million) in reasonable time, without exceeding the processing limits of the machine.

5.2 Performance Study

In order to evaluate scalability issues, different experiments were conducted on signature datasets of different sizes. All the experiments were conducted on a machine with a 1.9GHz processor and 512MB (DDR) of main memory. These processing characteristics are well below what one would expect to employ for this type of project. Nevertheless, they provide a lower bound for the performance of our methods. Figure 3(a) presents the running time for the processing of the similarity matrix and clustering for different numbers of signatures. In this case the data is treated as a single partition. As we can see, even for a small number of signatures, the processing time does not scale linearly with the number of signatures. Figure 3(b) shows the processing time, matrix construction plus clustering, for different numbers of signatures per partition (matrix). This evaluation is repeated for different numbers of signatures in the initial dataset $D$. These figures reveal a tendency in which smaller partitions have a lower processing time. Nevertheless, we should remember that a smaller partition size implies more I/O operations. The choice of the partition size should be a tradeoff between these two components. From our performance studies, the selected partition size was 1500. Figure 4(a) shows the processing time for up to a million signatures, with a partition size of 1500. We believe that the use of more powerful computational resources, in particular more main memory, will result in significant runtime reductions. Nevertheless, we are also interested in investigating the impact of other clustering methodologies for very large databases, such as [15, 2].

Fig. 3. (a) Processing time w.r.t. the total number of signatures; (b) Processing time w.r.t. the total number of signatures, with different number of signatures per partition.

5.3 Clustering Algorithm

The CLUTO [10] clustering toolkit was used to compute the clustering solutions. CLUTO provides access to its various clustering and analysis algorithms via the vcluster and scluster stand-alone programs. The scluster program, which operates on the similarity space between the objects, is used to compute the clustering solution.
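To make the two-step scheme of Section 5.1 concrete, the following sketch partitions the data, clusters each block independently, gathers the resulting centroids, and clusters those centroids again. It is only an illustration under strong assumptions: a plain k-means on numeric vectors stands in for the similarity-based clustering that the paper delegates to CLUTO, cluster centroids are used as the "parameters" gathered from each partition, and all function names are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence

Vector = Sequence[float]


def partitions(data: List[Vector], size: int) -> Iterator[List[Vector]]:
    """Split the data into mutually exclusive blocks of at most `size` items."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Toy k-means with a deterministic start (stand-in for the real clusterer)."""
    step = max(1, len(points) // k)
    centroids = [list(points[min(i * step, len(points) - 1)]) for i in range(k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def partitional_clustering(
    data: List[Vector],
    partition_size: int,
    k: int,
    cluster: Callable[[List[Vector], int], List[List[float]]] = kmeans,
) -> List[List[float]]:
    """Step 1: cluster each block. Step 2: cluster the gathered centroids."""
    gathered: List[Vector] = []
    for block in partitions(data, partition_size):
        gathered.extend(cluster(block, min(k, len(block))))
    return cluster(gathered, min(k, len(gathered)))
```

The memory benefit comes from the fact that only one block, and later only the small set of gathered centroids, has to be held and compared at a time, instead of the full all-against-all matrix.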

6 Experimental Evaluation

To assess the quality of the proposed scheme in detecting anomalous behaviors, an evaluation study with real and synthetic data was performed. First, real data provided by the telecommunications company was tested. Since the number of clearly identified fraud cases present in this dataset is very limited, data with planted alarms was then generated so that a precision/recall analysis could be made.

The first test was made on data corresponding to a week of voice calls from a Portuguese mobile telecommunications network. The complete set of CDRs corresponds to approximately 2.5 million records, with 700 thousand signatures processed per day. Figure 4(b) shows the cluster topology obtained for the signatures of the first day (Monday) of the week. The quality of the clusters is maximized with 8 clusters. The fraud analyst provided us with a small list of 12 customers (fraudsters in the referenced week), since no noteworthy database of previously known fraud cases existed. Of these 12 customers, 11 were detected and added to the blacklist of the system. Additionally, customers outside this list were also detected. Table 1(a) shows the distribution of the number of alarms raised on three different days of this week. Note that cluster 8 has a large number of alarms, which is proportional to the average number of calls made by the customers in this cluster (see Figure 4(b) and Table 1(b)).

Fig. 4. (a) Processing time w.r.t. the total number of signatures with a partition size of 1500. (b) Cluster topology for the first day of the week (Monday).

(a)
Cluster   Tue   Wed   Sat
1           3     9     1
2           9     7    23
3           3    12    17
4           5    17    16
5          23    21    22
6          20    31    40
7           8    11    26
8          52    72     0

(b)
Cluster   Avg. number of calls
8         11.4
6          6.7
5          4.5
7          3.2
1          2.7
4          2.2
2          1.6
3          1.2

Table 1. (a) Alarms raised per cluster for three particular days of the week; (b) Clusters sorted according to the average number of calls.

To better understand the impact of each feature variable on the detection of an anomalous situation, we have evaluated the top-10 signatures with the highest variation (see Equation 4). Figure 5(a) shows the impact of each feature variable on the final solution. As can be seen, the variables "number of international calls" and "duration of the calls" have a major impact on the detection of deviating behaviors. This is also a typical observation from previous studies on fraud detection in telecommunications [8, 3].

Figures 5(b) and 5(c) show examples of detected anomalous behaviors. Each detected anomaly represented here starts with the customer identification (masked mobile phone number) in the first line. The following lines contain the day on which the analysis was made, the cluster that the signature belongs to, a true (T) or false (F) flag indicating whether there was a cluster membership change, and the absolute and relative similarity. Detected anomalies are marked with the symbol (A), which represents an alarm and the consequent addition to the blacklist. In Figure 5(b), the first and the second customers have a change in cluster membership, since they pass from clusters with a lower average number of calls (2 and 1) to the cluster with the highest number of calls (8), on days 4 and 3 respectively. The third customer in this example, although always in the same cluster, registered a significant variation between days 5 and 6. The significant variation observed for these customers resulted in the triggering of several alarms and subsequent addition to the blacklist.

Fig. 5. (a) Impact of each feature variable for the Top-10 high alarms; (b) Examples of customers with an anomalous increase in the number of calls; (c) Examples of customers with an anomalous decrease in the number of calls.

The three cases represented in Figure 5(c) correspond to the opposite situation from the one in the previous example. Here, all three customers go from a high usage cluster to a cluster with a lower average number of calls. These changes in cluster membership, together with a variation outside the bounds of Definition 2, resulted in the detection of an anomalous situation. Typically, a telecommunications analyst would consider these last three cases (Figure 5(c)) as potential churn cases and the first three cases (Figure 5(b)) as potential fraud cases.

The credibility of the system was assessed in the second part of the evaluation. Synthetic signature data is generated and abnormal usage is planted in normal data. A signature is considered to have a deviating behavior if it deviates by more than 50% from one day to the next. We take this value as a lower bound for abnormal usage. According to this criterion, data was generated to simulate a week of calls for 10000 signatures. Of these, 9450 signatures have a normal behavior during the entire week and 550 have a deviation on one day of the week. The distribution of the deviating moments over the days of the week is presented in Table 2(a). A topology of 7 clusters, one per day of the week, was obtained. According to Definition 2, anomalous situations were detected and counted. We consider four cases: Tp, the number of anomalous signatures detected as anomalous; Tn, the number of normal signatures detected as normal; Fp, the number of normal signatures detected as anomalous; and Fn, the number of anomalous signatures detected as normal. Credibility was then evaluated by the Recall, Precision, Specificity and Accuracy formulas. Results are shown in Table 2(b). From the presented results we can see that the system performs well on all aspects except recall. This is because we have assumed that a deviating behavior occurs when a signature varies by 50% of its value. In practice one would expect a much higher value. Higher variations are easier to detect and will therefore yield a higher rate of true positives. On the other hand, the rate of false discoveries is kept at low values.

(a)
Day     Number      %
2          107   19.5
3           91   16.5
4           81   14.5
5           88   16.0
6           86   15.6
7           97   17.6
Total      550  100.0

(b)
Day      Tp   Fp     Tn    Fn   Recall   Precision   Specificity   Accuracy
2        80    1   9892    27    0.748       0.988         1.000      0.997
3        63    5   9904    28    0.692       0.926         0.999      0.997
4        52    7   9912    29    0.642       0.881         0.999      0.996
5        55    3   9909    33    0.625       0.948         1.000      0.996
6        55    4   9910    31    0.640       0.932         1.000      0.997
7        69    7   9896    28    0.711       0.908         0.999      0.997
Total   374   27  59423   176    0.680       0.933         1.000      0.997

Table 2. (a) Distributions of the planted alarms during the days of the week; (b) Synthetic data evaluation values.
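The paper names the four measures without spelling them out. Assuming the standard definitions, the short sketch below (function name hypothetical) reproduces the "Total" row of Table 2(b) from its Tp, Fp, Tn and Fn counts.

```python
from typing import Dict


def credibility(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix measures (assumed to match the paper's)."""
    return {
        "recall": tp / (tp + fn),            # planted anomalies that were found
        "precision": tp / (tp + fp),         # alarms that were real anomalies
        "specificity": tn / (tn + fp),       # normal signatures left alone
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


total = credibility(tp=374, fp=27, tn=59423, fn=176)
print({name: round(value, 3) for name, value in total.items()})
# {'recall': 0.68, 'precision': 0.933, 'specificity': 1.0, 'accuracy': 0.997}
```

Because normal signatures vastly outnumber planted anomalies, specificity and accuracy stay close to 1 regardless of recall, which is why recall and precision are the more informative columns here.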

7 Conclusions

In this work we have applied dynamic clustering analysis over signatures to detect anomalous situations in a mobile telecommunications environment. We have made use of the concept of a signature, which summarizes the customer behavior during a certain period of time. Experimental evaluation performed with real data from a week of voice calls, and the respective comparison with a list of previously detected fraud cases, allowed us to detect the majority of the cases described in the list. Additionally, the proposed methods detected other fraud situations which had not previously been identified by the analysts. Evaluation with synthetic data shows that the system has a low false discovery rate, while it has a satisfactory true positive rate even for relatively small signature variations. Preliminary discussion with fraud analysts gave us positive feedback about the results presented in this work. Although we have achieved good results in the qualitative evaluation of the proposed method, work still needs to be done to improve the quantitative results. In particular, the problem of time and space scalability needs to be fully overcome in order to handle even larger datasets. As future work we plan to adapt recently proposed efficient methods for clustering stream data [7, 1] to our system.

References

1. C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of 30th VLDB Conference, 2004.
2. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of ACM SIGMOD Conference, 1998.
3. R. Alves, P. Ferreira, O. Belo, J. Lopes, J. Ribeiro, and L. Cortesao. Discovering telecom fraud situations through mining anomalous behavior patterns. In Proceedings of the DMBA Workshop, on the 12th ACM SIGKDD, 2006.
4. R. Bolton and D. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–255, January 2002.
5. M. Cahill, D. Lambert, J. Pinheiro, and D. Sun. Handbook of massive data sets, chapter Detecting fraud in the real world, pages 911–929. Kluwer Academic Publishers, 2002.
6. C. Cortes and D. Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, (5):167–182, 2001.
7. F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51–57, 2000.
8. P. Ferreira, O. Belo, R. Alves, and L. Cortesao. Establishing fraud detection patterns based on signatures. In Proceedings of the 7th Industrial Conference on Data Mining, Leipzig - Germany, 2006.
9. S. Guha, R. Rastogi, and K. Shim. Cure: an efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD, pages 73–84, 1998.
10. G. Karypis. CLUTO - A clustering toolkit. University of Minnesota, 2003.
11. Y. Kou, T. Lu, S. Sirwongwattana, and Y. Huang. Survey of fraud detection techniques. In Proceedings of 2004 IEEE International Conference on Networking, Sensing and Control, 2004.
12. T.F. Lunt. A survey of intrusion detection techniques. Computer and Security, (53):405–418, 1999.
13. J. Shawe-Taylor, K. Howker, P. Gosset, M. Hyland, H. Verrelst, Y. Moreau, C. Stoermann, and P. Burge. In Business Applications of Neural Networks, chapter Novel techniques for profiling and fraud detection in mobile telecommunications, pages 113–139. Singapore: World Scientific, 2000.
14. G. Weiss. Data Mining in Telecommunications. Kluwer Academic Publishers, 2004.
15. T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD conference, 1996.
