Int. J. Business Intelligence and Data Mining, Vol. x, No. x, xxxx

Mining significant change patterns in multidimensional spaces

Ronnie Alves*
Institute of Signalling, Developmental Biology and Cancer Research,
Laboratory of Virtual Biology, CNRS UMR 6543,
Parc Valrose, Nice 06108, Cedex 02, France
E-mail: [email protected]
*Corresponding author

Joel Ribeiro and Orlando Belo
Department of Informatics, School of Engineering,
University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
E-mail: [email protected]
E-mail: [email protected]

Abstract: In this paper, we present a new OLAP Mining method for exploring interesting trend patterns. Our main goal is to mine the most (Top-K) significant changes in Multidimensional Spaces (MDS) applying a gradient-based cubing strategy. The challenge lies in finding maximum gradient regions, which maximises the task of detecting Top-K gradient cells. Several heuristics are also introduced to prune MDS efficiently. In this paper, we motivate the importance of the proposed model, and present an efficient and effective method to compute it by
• evaluating significant changes by pushing gradient search into the partitioning process
• measuring Gradient Region (GR) spreadness for data cubing
• measuring the Periodicity Awareness (PA) of a change, assuring that it is a change pattern and not only an isolated event
• devising a Rank Gradient-based Cubing to mine significant change patterns in MDS.

Keywords: change analysis; multidimensional data mining; ranking cubes; cube gradients; OLAP mining.

Reference to this paper should be made as follows: Alves, R., Ribeiro, J. and Belo, O. (xxxx) 'Mining significant change patterns in multidimensional spaces', Int. J. Business Intelligence and Data Mining, Vol. x, No. x, pp.xxx–xxx.


Biographical notes: Ronnie Alves received his PhD in Computer Science from the Department of Informatics at Minho University, Portugal. While pursuing his PhD studies, he also served as a Visiting Researcher in the Data Mining Group (at the DAIS Lab in the University of Illinois at Urbana-Champaign, USA) led by Professor Jiawei Han, and in the Bioinformatics Research Group (at Pablo de Olavide University, Sevilla, Spain) led by Professor Jesus Aguilar. Currently, he is a Postdoctoral Researcher at the Institute of Signalling, Developmental Biology and Cancer in Nice, pursuing research on data mining applied to transcriptomics.

Joel Ribeiro has a Master's Degree in Computer Science from the Department of Informatics at Minho University, Portugal. He has been finishing his thesis on Top-K multidimensional gradients. While pursuing his Master's Degree, he also worked on research and industrial projects related to OLAP and data mining, having published a few papers in international conferences and workshops. In the near future, he will be a PhD candidate at the Eindhoven University of Technology, conducting his research in the process mining area.

Orlando Belo is an Associate Professor in the Department of Informatics at Minho University, Portugal. He is also a member of the Computer Science and Technology Center in the same university, working in areas like data warehousing systems, OLAP, and data mining. His main research topics are related to data warehouse design, implementation and tuning, ETL services, and distributed multidimensional structures processing. During the last few years, he has been involved with several projects in the decision support systems area, designing and implementing computational platforms for specific applications like fraud detection and control in telecommunication systems, data quality evaluation, and ETL systems for industrial data warehousing systems.

1 Introduction

Change Analysis (CA) is a necessary and practical strategy for capturing interesting trend patterns in several business applications. Consider, for instance, Customer Relationship Management (CRM), where business organisations are concerned with understanding customer behaviour in order to anticipate customer needs. In such applications, organisations must be able to differentiate profitable customers from average ones. They will then be able to track sensitive changes in those behaviours (high-profit and low-profit profiles), making it possible to assist each of them according to the organisation's business rules. The basis for tracking customers' commercial paths through the products and services they are interested in is Data Warehousing (DW). Even though managing such huge amounts of data may imply business advantage, its OLAP exploration, especially in large Multidimensional Spaces (MDS), is a hard task. The more dimensions one wants to explore, the higher the computational costs of Multidimensional Data Analysis (MDA).


The key challenge for making CA practical in MDS is handling OLAP mining effectively while ranking potential change patterns for further analysis. Significant changes can be evaluated either by statistical modelling (Sarawagi et al., 1998; Sarawagi and Sathe, 2000; Sathe and Sarawagi, 2001) or through gradient queries (Imielinski et al., 2002; Lam et al., 2004; Alves et al., 2007). However, even with proper gradient constraints, the number of significant changes to evaluate will be quite large in MDS (Lam et al., 2004; Wang et al., 2006). Besides, to the best of our knowledge, there is a particular observation that has not been addressed in any of those previous gradient query studies, namely, the Periodicity Awareness (PA) of gradient patterns.

Example 1: Let us consider the following case of generating alarms for identifying potential fraud situations in a mobile telecom company. Those alarms are generated when abnormal utilisation of the system is detected, meaning that sensitive changes happened with respect to customers' normal behaviour. For example, one may want to see what is associated with significant changes of the average number of calls in the Lisbon area on Monday compared with that on Tuesday, and the answer could be in the form 'the average number of calls during working time in Rossio went up 96%, while calls during night time in Baixa-Chiado went down 15%'. Expressions such as 'calls during working time' correspond to cells in data cubes and describe sectors of the business modelled by the data cube. Given that the number of calls generated on a business day in the Lisbon area is extremely large (hundreds of millions of calls), the fraud analyst would be interested in evaluating just the Top-10 highest changes in that scenario, especially those in the Rossio area. The fraud analyst would then be able to drill through the most interesting changes for further analysis.

In Figure 1, two temporal series associated with the cuboids in Example 1 can be seen, namely S1 = {Rossio} and S2 = {Baixa-Chiado}. For each particular day of the month (x-axis), different peaks with respect to the average number of calls are depicted (y-axis). Let us say that we are looking for changes corresponding to an increase of 84%; then we can clearly see that a first temporal gradient happens on day 6, with an increase of 92%. On the other hand, if we slide this window frame to look for a similar gradient on another day, we will come up with another temporal gradient on day 18. Certainly, this information brings more significance to gradient analysis, assuring that this event is a consistent change and not just a casual coincidence.

From this discussion, we differentiate temporal gradients, the significant changes we discuss in this paper, from classical gradients (Imielinski et al., 2002; Lam et al., 2004; Wang et al., 2006; Alves et al., 2007). Temporal gradients are gradients that are sensitive to time effects, i.e., they occur with a certain probability over temporal window frames. The remaining questions are still the same: how to efficiently enumerate interesting gradients and, once we have all of them, how to devise a preference selection strategy over such multidimensional patterns.
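To make the notion of a temporal gradient concrete, the sketch below scans a daily series such as S1 for days whose change against the previous day meets a minimum delta, and derives a crude repetition rate as a proxy for PA. This is an illustrative Python sketch, not the paper's implementation (which is in Java); the helper names and the PA proxy are our own assumptions.

```python
# Sketch: scanning a daily series for temporal gradients, as in Example 1.
# All names are illustrative; the paper's actual method differs in detail.

def temporal_gradients(series, min_delta):
    """Return the days where the relative change versus the previous
    day meets the minimum delta (e.g., 0.84 for an 84% increase)."""
    hits = []
    for day in range(1, len(series)):
        prev, curr = series[day - 1], series[day]
        if prev > 0 and (curr - prev) / prev >= min_delta:
            hits.append(day)
    return hits

def periodicity_awareness(series, min_delta):
    """Fraction of scanned windows in which the change repeats:
    a rough proxy for the paper's Periodicity Awareness (PA)."""
    hits = temporal_gradients(series, min_delta)
    return len(hits) / max(1, len(series) - 1)

# S1 = average number of calls per day in Rossio (toy values)
s1 = [10, 11, 10, 12, 11, 12, 23, 12, 11, 12, 10, 11, 12, 11, 10, 12, 11, 22]
print(temporal_gradients(s1, 0.84))    # gradients on days 6 and 17
print(periodicity_awareness(s1, 0.84)) # the change repeats over the month
```

A change that fires on a single day yields a low repetition rate, whereas one that recurs across window frames scores higher, which is exactly the intuition behind PA.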

Figure 1 An overall view of temporal gradients at different time instants

Example 2: Before starting the discussion about ranking gradient cells in MDS, let us first look at Figure 2. This figure shows how aggregated values are distributed along the cube's dimensions. The base table used for cubing was generated according to a uniform distribution. It has 100 tuples, 2 dimensions (x, y) with cardinalities varying from 1 to 10, and a measure (m) varying from 1 to 100. From Figure 2 we can see that different aggregation functions (Gray et al., 1996) (distributive and algebraic functions) will provide different density regions in data cubes. When searching for gradients, one may want to start by evaluating a cube region that presents higher spreadness with respect to the aggregated values. The central region corresponds to the rectangle R1 = {[4, 6] : [6, 4]}, which covers all five bins {b0 = {0–20}, b1 = {20–40}, b2 = {40–60}, b3 = {60–80}, b4 = {80–100}}. If it is possible to identify such a region before cubing, i.e., during partitioning, chances are that the most interesting gradients will take place there. We denote those regions as Gradient Regions (GR). One of the problems is that average is an algebraic function with an unpredictable Spreading Factor (SF) with respect to the distribution of aggregated values over the cube. Another problem is checking PA, given that cube cells usually have quite large associated calendar dimensions. From Figure 2, we can also observe that gradients are maximised when walking from bin b0 to bin b4. Therefore, even if a region does not cover all bins, if it at least contains the lowest and the highest one, it is a potential candidate GR for mining gradient cells. We expect the GR with the largest SF to provide the highest gradient cells.
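The bin-coverage test described above is straightforward to express in code. The following Python sketch (illustrative names; the five equal-width bins mirror b0–b4) maps aggregated values to bins and flags a region as a candidate GR when it touches both the lowest and the highest bin.

```python
# Sketch: binning aggregated values into the five bins b0..b4 of Example 2
# and testing whether a cube region is a candidate Gradient Region (GR),
# i.e., whether it touches both the lowest and the highest bin.

def bin_of(value, lo=0, hi=100, nbins=5):
    """Map an aggregated value in [lo, hi] to a bin index 0..nbins-1."""
    width = (hi - lo) / nbins
    return min(nbins - 1, int((value - lo) // width))

def is_candidate_gr(region_values, lo=0, hi=100, nbins=5):
    """A region is a candidate GR if it covers bin 0 and bin nbins-1."""
    bins = {bin_of(v, lo, hi, nbins) for v in region_values}
    return 0 in bins and (nbins - 1) in bins

print(is_candidate_gr([5, 42, 58, 97]))   # True: covers b0 and b4
print(is_candidate_gr([25, 42, 58, 71]))  # False: misses both extremes
```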

Figure 2 Spreading factors for cubing with the average function (see online version for colours)

This paper is organised as follows. In Section 2, we present the research problem addressed in our study, as well as a few definitions. In Section 3, we describe our partitioning strategy, together with a set of measures to assist the search for significant changes in MDS. Our gradient-based cubing method is introduced in Section 4, completing our proposed strategy. In Section 5, we present an experimental study. Related work on mining gradients is reviewed in Section 6. In Section 7, we present our final comments regarding the proposed OLAP Mining strategy.

2 Ranking gradient queries

Consider a base relation table R with categorical attributes A1, A2, …, An and real-valued measures M1, M2, …, Mr. We denote the categorical attributes as selecting dimensions and the measures as ranking dimensions. A classical Top-K query specifies the selection condition over one or more categorical attributes, and evaluates a ranking function 'f' on a subset of one or more measure attributes (Chang et al., 2000; Hristidis et al., 2001). The result is an ordered set of K tuples according to the given ranking condition. A relational Top-K query can be expressed as follows using an SQL-like notation:

Query-1
1: select top k* from R
2: where A1 = a1 and … Aj = aj    // Aj = dimension, aj = cuboid
3: order by f(M1, …, Mr) asc      // f(M) = score function

On the other hand, one could be interested in evaluating the Top-K query above using a multidimensional selection condition (Xin et al., 2006). A possible SQL-like notation would be:


Query-2
1: select top k* from (
2:   select A1, A2, …, Aj, f(M1, …, Mr) from R
3:   group by A1, A2, …, Aj
4:   with cube) as T
5: where A1 = a1 and … Aj = aj
6: order by T.f(M1, …, Mr) asc

One can see that different users may not only propose ad hoc ranking conditions but also use different levels of data abstraction to explore interesting subsets (see Query-2). Indeed, in many data applications, users may need to go through an exhaustive study of the data by taking a multidimensional analysis of the Top-K query results (Xin et al., 2006). Another way to constrain this multidimensional query is to rewrite Query-2, adding a HAVING condition to the GROUP BY aggregation. Even with such query rewriting, finding the Top-K gradients in a multidimensional space requires a special ranking condition constraint not yet addressed by current Top-K methods (Chang et al., 2000; Hristidis et al., 2001; Bruno et al., 2002; Xin et al., 2006). We denote this constraint the ranking gradient constraint. Thus, the aggregation function in Query-2 is constrained by certain properties, focusing the evaluation on particular gradient cells. Besides, the ranking function in line 6 is a gradient function, which means that every aggregation cell should be tested against its neighbourhood cells (mutation, generalisation or specialisation) (Imielinski et al., 2002). This means that one would be interested in locating the most interesting changes (Top-K gradient cells) above a particular threshold. Even with this threshold, one cannot afford to compute the whole cube to locate these cells, so one alternative is to explore these Top-K gradients while cubing. After rewriting Query-2, a possible SQL solution would be the following query:

Query-3
1: select top k* from (                  // Top-K gradient cells
2:   select A1, A2, …, Aj, f(M1, …, Mr) from R
3:   group by A1, A2, …, Aj              // partitioning gradient regions
4:   with cube as T                      // gradient-based cubing
5:   having f(M1, …, Mr) > min-delta)    // change threshold
6: where A1 = a1 and … Aj = aj           // probe cells
7: order by T.f(M1, …, Mr) desc

The problem of mining Top-K cells with multidimensional selection conditions and ranking gradients is significantly more expressive than the classical Top-K query (Bruno et al., 2002) and the cubegrade query (Imielinski et al., 2002). In this work, we are interested in mining best cases (Top-K cells) of temporal gradients in MDS. Furthermore, we provide below a few definitions that help us formalise the problem of mining Top-K gradient cells while setting proper relevance for temporal gradients.


Definition 3.1 (K-Dimensional Cuboid Cell): In an n-dimensional data cube, a cell c = (i1, i2, …, in : m) (where 'm' is a measure and 'i' is a dimension value) is called a k-dimensional cuboid cell (i.e., a cell in a k-dimensional cuboid), if and only if there are exactly k (k ≤ n) values among {i1, i2, …, in} which are not * (i.e., all). If k = n, then c is a base cell. A base cell does not have any descendant. A cell c is an aggregated cell only when it is an ancestor of some base cell (i.e., when k < n). We further denote M(c) = m and V(c) = (i1, i2, …, in), where M(c) corresponds to the aggregated value of a particular cuboid cell V(c).

Definition 3.2 (Temporal Cell): A temporal cell tc is a special k-dimensional cuboid cell where at least one value in V(c) is known from a temporal (or time) dimension. Additionally, each temporal cell is annotated with its time-series average (tcT) and its probability of repetition over time (tcF).

Definition 3.3 (Iceberg Cell): A cuboid cell is called an iceberg cell if it satisfies a threshold constraint on the measure. For example, an iceberg constraint on the measure count is M(c) ≥ min_supp (where min_supp is a user-given threshold).

Definition 3.4 (Closed Cells): Given two cells c = (i1, i2, …, in : m) and c′ = (i1′, i2′, …, in′ : m′), we denote V(c) ≤ V(c′) if for each ij (j = 1, …, n) which is not *, ij′ = ij. A cell c is said to be covered by another cell c′ if for each c″ such that V(c) ≤ V(c″) ≤ V(c′), M(c″) = M(c′). A cell is called a closed cuboid cell if it is not covered by any other cell.

Definition 3.5 (Matchable Cells): A cell c is said to match a cell c′ when they differ from each other in one, and only one, cube modification at a time. These modifications can be: cube generalisation/specialisation, if c is an ancestor of c′ and c′ a descendant of c; or mutation, if c is a sibling of c′, and vice versa.

Definition 3.6 (Probe Cells): Probe cells (pc) are particular cases of cuboid cells that are significant according to some multidimensional selection condition. This condition is more than a query constraint on the base table's dimensions. For instance, in this work, the probe cells are those cells that are iceberg-closed cells.

Definition 3.7 (Gradient Cells): A cell cg is said to be a gradient cell of a probe cell cp when they are matchable cells and their delta change, given by ∆g(cg, cp) ≡ (g(cg, cp) ≥ ∆min), is true, where ∆min is a constant value and g is a gradient function.

Definition 3.8 (Temporal Gradient Cells): A cell ctg is said to be a temporal gradient cell when it is a gradient cell, has at least one known temporal dimension, and its temporal change ∆t(ctgT, ctpT) satisfies a PA according to ∆gt = ∆g × (1 − (∆g − ∆t)) × ctgF × ctpF. The temporal dimensions for analysis are assigned by the user; the set of those special time dimensions defines a temporal series.
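The score in Definition 3.8 combines the classical delta change, the temporal delta and the repetition probabilities. A minimal Python sketch follows; the ratio-based delta helper is an assumption, since the paper leaves the gradient function g abstract.

```python
# Sketch of the temporal-gradient score from Definition 3.8:
#   delta_gt = delta_g * (1 - (delta_g - delta_t)) * ctg_F * ctp_F
# where delta_g is the gradient between the matchable cells, delta_t the
# change between their time-series averages (tcT), and ctg_F / ctp_F
# their probabilities of repetition over time (tcF). The delta() helper
# is one plausible change measure, not necessarily the paper's g.

def delta(a, b):
    """Relative change between two aggregated values."""
    return abs(a - b) / max(abs(a), abs(b))

def temporal_gradient_score(m_g, m_p, tcT_g, tcT_p, tcF_g, tcF_p):
    delta_g = delta(m_g, m_p)        # gradient between cell measures
    delta_t = delta(tcT_g, tcT_p)    # change between temporal averages
    return delta_g * (1 - (delta_g - delta_t)) * tcF_g * tcF_p

# Cells c2 = (x1,*,*:2.5) and c4 = (x1,*,z1:4) of Example 3, tcF = 1.0:
print(temporal_gradient_score(4, 2.5, 4, 2.5, 1.0, 1.0))  # 0.375
```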


Problem: Given a base table R, a set of temporal dimensions T, an iceberg condition IC, and a minimum delta change ∆min, the mining of Top-K multidimensional gradients from R is: find the most interesting (Top-K) temporal gradient-probe pairs (ctg, ctp) such that ∆g(ctg, ctp) is true.

In this study, we confine our discussion to temporal gradients. Average gradients are effective functions for detecting interesting changes in MDS, but they also pose a great challenge to the cubing computation model (Lam et al., 2004; Wang et al., 2006). The first issue is that average is a non-antimonotonic function, which complicates pruning efforts in data cubing (Han et al., 2001). The second issue is that, by using the average gradient as our ranking condition, we have to be able to mine Top-K cells employing a non-convex function.

Example 3: Examples of k-cuboid cells from Table 1 are c1 = (x1, y2, z3 : 1), c2 = (x1, *, * : 2.5), c3 = (*, y1, z2 : 2.5), c4 = (x1, *, z1 : 4) and c5 = (*, *, z3 : 1). The cells c1, c2 and c4 have a temporal cell (x1) associated and may provide a temporal gradient cell. Given an iceberg condition IC (M(c) > 2), cells c5 and c1 are not iceberg cells. The cell c5 is covered by c1, and is thus a non-closed cell too. Suppose that, applying the proposed strategy, we set cell c2 as a probe cell. Then cell c2 is only matchable with c4, and this gradient is evaluated as 0.375.

Table 1 Example of a base table R. X is a temporal dimension

Tid   X    Y    Z    M    tcT    tcF
1     x1   y2   z3   1    1      1.0
2     x2   y1   z2   2    2.5    1.0
3     x3   y1   z2   3    2.5    1.0
4     x1   y3   z1   4    4      1.0

3 Partitioning gradient regions

Recalling our previous discussion of Figure 2, the central region corresponds to a potential Gradient Region GR1 = {[4, 6] : [6, 4]}, which covers all five bins {b0 = {0–20}, b1 = {20–40}, b2 = {40–60}, b3 = {60–80}, b4 = {80–100}}. If it is possible to identify such a region before cubing, i.e., during partitioning, chances are that the most interesting gradient cells will take place there. The problem is that average is an algebraic function with an unpredictable SF. Thus, there will be several possible GRs to look for in the cube before selecting the most interesting ones, and the main challenge relies on partitioning the base table in such a way that maximises this search and, consequently, provides promising gradient cells. Another interesting observation is that even if a region does not cover all bins, if it at least contains the lowest and the highest one, it is a good candidate region in which to search for gradients. We expect the GR with the largest SF to provide the highest gradient cells. This observation motivates us to evaluate gradient cells by using a partitioning method based on a gradient ascent approach.


Gradient ascent is based on the observation that if a real-valued function f(x) is defined and differentiable in a neighbourhood of a point Y, then f(x) increases fastest if one goes from Y in the direction of the gradient of f at Y, ∇f(Y). It follows that if Z = Y + γ∇f(Y) for γ > 0 a small enough number, then f(Y) ≤ f(Z). With this observation, one can start with a guess x0 for a local maximum of f and consider the sequence x0, x1, x2, … such that xn+1 = xn + γn∇f(xn), n ≥ 0. We have f(x0) ≤ f(x1) ≤ f(x2) ≤ …, so we expect the sequence (xn) to converge to the local maximum. Therefore, when evaluating a GR, we first search for the probe cells, i.e., the highest closed cells in GR+ (cells having aggregated values higher than the average of the 'ALL' cuboid) and the lowest ones in GR− (cells having aggregated values lower than the average of the 'ALL' cuboid), and then we calculate their gradients from all possible matchable cells.
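As a one-dimensional illustration of the recurrence above (a toy Python example to fix ideas, not the region search itself):

```python
# Numeric illustration of the gradient ascent recurrence
#   x_{n+1} = x_n + gamma_n * grad_f(x_n)
# that motivates searching GRs from their local maxima.

def gradient_ascent(grad_f, x0, gamma=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x + gamma * grad_f(x)   # move in the direction of the gradient
    return x

# f(x) = -(x - 3)^2 has its maximum at x = 3; grad f(x) = -2(x - 3)
print(gradient_ascent(lambda x: -2 * (x - 3), x0=0.0))  # converges to ~3.0
```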

3.1 Evaluating spreadness

The SF, or spreadness, is measured by the variance. This statistical measure indicates how the values are spread around the expected value. From each GR, we want to capture how large the differences are in the list of measure values Ln = (M1, …, Mn) available from the base relation R. Dimensions are then projected onto the cube's lattice according to this variation, which follows the equation

sf(GR) = Σ_{M ∈ Ln} (M − avg(GR))² / n

for each GR, where GR is a particular Gradient Region, avg(GR) is the average of that GR partition, and Ln and n are the list of measure values available in GR and the number of elements in this list, respectively. For large tables, one could use a variation of the variance equation, using a sample rather than the population; in this case, one replaces 'n' with 'n − 1'. After obtaining the spreadness measure (SF) for each GR, we can order the GRs according to this measure. The intuition is that the regions with the highest SF values will present promising gradient values. The same behaviour happens when projecting GRi :: GRj.

Example 4: With the SF equation, we can compute the SF values of all possible partitions (GR) from Table 1. For the dimension X, partitioning on x1, we have GR[x1] = {1, 4}, so sf(x1) = 2.25. Finally, we can order all partitions in the dimension X as sf(x1) ≥ sf(x2) ≥ sf(x3).
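The spreadness computation is easy to reproduce. The sketch below derives sf for every attribute-value partition of the base table in Table 1 and matches the sf column of Table 2 (e.g., sf(x1) = 2.25); the data layout and helper names are illustrative.

```python
# Sketch: computing the spreading factor sf(GR) (population variance of
# the measure values in each partition) for the base table of Table 1.

BASE = [  # (tid, x, y, z, m)
    (1, 'x1', 'y2', 'z3', 1),
    (2, 'x2', 'y1', 'z2', 2),
    (3, 'x3', 'y1', 'z2', 3),
    (4, 'x1', 'y3', 'z1', 4),
]

def spreadness(values):
    n = len(values)
    avg = sum(values) / n
    return sum((v - avg) ** 2 for v in values) / n

def sf_by_attribute_value(base):
    partitions = {}
    for tid, x, y, z, m in base:
        for value in (x, y, z):           # one partition per dimension value
            partitions.setdefault(value, []).append(m)
    return {value: spreadness(ms) for value, ms in partitions.items()}

for value, sf in sorted(sf_by_attribute_value(BASE).items(),
                        key=lambda kv: -kv[1]):
    print(value, sf)   # x1 2.25, then y1 0.25 and z2 0.25, the rest 0.0
```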

3.2 Three-level partitioning

Once we have all GRs and their corresponding spreadness, the next step is partitioning the table relation. The proposed rank gradient-based cubing method relies on a three-level framework: in the first two levels we focus on ordering and partitioning, and in the third one we do gradient-based aggregation.


Level one. Given a particular descending order of all GRs available in a table relation R, we create a GR-tree (see Figure 3) that allows us to traverse it (in DFS order) and compute K gradient cells from R. To achieve better pruning of non-closed cells, we also take a Top-Down approach (Xin et al., 2006). In this step, we compute just the first level (L − 1) of the GR-tree, on each dimension, keeping only cells that are closed. For each GR, we also set a meta-information bin [aggmin; aggmax] containing the lower and upper bin boundaries (w.r.t. aggregated values). Each GR is also partitioned into two regions: GR+ (covering cells with average higher than 'ALL') and GR− (covering cells with average lower than 'ALL'). Finally, we enumerate all possible sets of dimension (GR partition) values Sd = {d1, d2, …, dn}, which are directly identified by all closed cells in each GR. The bin boundaries are set for each dimension as well. The number of elements in the Sd subset of each GR is obtained by the following equation:

Sd(GR) = Σ_{n=1}^{D} Card(GRn)

where Card(GRn) corresponds to the number of distinct elements of each dimension (D) in that GR.

Figure 3 GR-tree and GR[x1] partition. Cells in italics are candidate gradient pairs

Level two. Given all possible GRs and their associated Sd sets, we can enumerate all candidate GR pairs to project in order to find gradient cells. Additionally, for each pair we set an approximation, the maximum delta (∆max) that could be reached by doing that projection. Next, we can order all GR pairs according to this new information. The intuition is that projecting those GRs first will provide the highest gradient values. Consequently, there is a higher probability of evaluating K valid gradient cells on those first P′ projections.

Lemma 3.1: The maximum number of projections to evaluate according to the gradient ascent search is minimal with respect to the maximum number K for finding the highest gradient cells.


Proof sketch: The maximum number of projections P to evaluate given a set of GRs is given by the following equation:

P(GR) ≈ C(Card(GR) × 2, 2) + Card(GR),   with ∆max(GR) ≥ ∆min,

where C(·, 2) denotes the number of unordered pairs, Card(GR) is the number of GRs in R, ∆max constrains each GR according to its highest delta change, and ∆min is the minimum delta change of interestingness (gradient threshold). Projections follow particular combinations like (GR+, GR−), (GR−, GR+), (GR−, GR−) and then (GR+, GR+). Since projections must satisfy those delta changes, we focus on aggregating only those GRs that maximise the task of finding promising gradients.

The implication of Lemma 3.1 is that the search space provided by P(GR) will be quite large; since we can find those K gradient cells within the first P′(GR) projections, to smooth computational costs we confine the minimum number of projections P′ according to the next equation:

P′(GR) ≈ min(Card(P(GR)), max(10, K × K)),

where K is the number of gradients to rank.

Level three. At this level, we aggregate the promising partitions found in the previous levels and evaluate their candidate probe cells (cp). Candidate probe cells are the ones that are valid aggregated cells (intersecting GR+ and GR−) while projecting the highest (K) cells in GR+ with the lowest (K) cells in GR−. Once all cps are identified in a GR, we can enumerate possible matchable cells to evaluate their corresponding gradients. This provides an ordered list 'gl' of cell pairs over which to evaluate gradients. Again, we can make use of ∆max to aggregate only cells in 'gl' that satisfy this condition. All matchable cells are then aggregated using a gradient-based cubing strategy, which we go through in the next section.

Example 5: Let us use the same partition GR[x1], which is a temporal cell, from Example 3. The Sd[x1] = {x1, y2, z3, y3, z1} subset is obtained by looking at the base cells of all TIDs = {1, 4} associated with GR[x1] (see Table 2). Possible GR pairs (Level two) for further projections taking GR[x1] are {(x1+ : x1−), (x1+ : y2−), …, (x1+ : z1−), …}. This also provides the cells (Level one) c1 = {x1, y2, z3} in GR− and c2 = {x1, y3, z1} in GR+; those cells are also closed ones. For GR[x1], we can enumerate the probe cell (pc) pc1 = {x1, *, *}, resulting from intersecting cells c1 and c2. Possible matchable cells (mc) for pc1 are mc1 = {x1, y2, *}, mc2 = {x1, *, z3}, mc3 = {x1, y3, *} and mc4 = {x1, *, z1}. Since we keep bins for each dimension in a GR, it is possible to evaluate candidate gradient cells before aggregating them. This leads us to the cells (Level three) mc1 = {x1, y2, *} and mc3 = {x1, y3, *}, which express a ∆ = 60% when compared with pc1 = {x1, *, *}. The total number of projections for Table 1 is 162. Assuming that we are searching for the Top-10 gradients with ∆min > 55%, we can then confine our search to the first 100 projections (P′(GR) = 100). Since ∆min > 55% and ∆max[x1] = 75%, we will find those Top-10 values by just evaluating the first C(5, 2) = 10 projections of GR.
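For illustration, the one-step specialisations used in Example 5 can be generated as follows; representing cells as dictionaries and splitting Sd[x1] per dimension are our own simplifications.

```python
# Sketch: enumerating the matchable cells of the probe cell
# pc1 = (x1, *, *) against Sd[x1] = {x1, y2, z3, y3, z1}, as in Example 5.
# A matchable cell differs from the probe cell by exactly one
# modification; here we fill a single '*' with a dimension value seen
# in the partition's base cells.

SD = {'Y': ['y2', 'y3'], 'Z': ['z3', 'z1']}   # per-dimension values in GR[x1]
PROBE = {'X': 'x1', 'Y': '*', 'Z': '*'}

def matchable_specialisations(probe, sd):
    cells = []
    for dim, values in sd.items():
        if probe[dim] == '*':                 # fill exactly one wildcard
            for v in values:
                cell = dict(probe)
                cell[dim] = v
                cells.append(cell)
    return cells

for mc in matchable_specialisations(PROBE, SD):
    print(mc)  # (x1,y2,*), (x1,y3,*), (x1,*,z3), (x1,*,z1)
```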


Table 2 Inverted indices and spreadness for all partitions in Table 1

Attr. value   Tid.Lst   Tid.Sz   sf
x1            1, 4      2        2.25
x2            2         1        0
x3            3         1        0
y1            2, 3      2        0.25
y2            1         1        0
y3            4         1        0
z1            4         1        0
z2            2, 3      2        0.25
z3            1         1        0

3.3 Evaluating closedness

To identify closed cells, we make use of the closedness measure, calculated according to the strategy proposed by Xin et al. (2006). It is also an algebraic measure, and it can be computed from two other measures: the representative tuple id (distributive) and the closed mask (algebraic).

Example 6: Let us say that after cubing Table 1 we obtain a cell like (x1, *, z3). According to Xin et al. (2006), this cell has All mask = (0, 1, 0). Note that the All mask is a property of the cell and can be calculated directly. If C(x1, *, z3) = (0, 1, 1), then its closedness value is (0, 1, 0).

4 Gradient-based cubing

Our cubing method follows the Frag-Cubing approach (Li et al., 2004). We start by building an inverted index and a value-list index from the base relation R, and then assemble high-dimensional cubes from low-dimensional ones. For example, to compute the cuboid cell {x1, y3, *} from Table 1, we intersect the tids (see Table 2) of x1, {1, 4}, and y3, {4}, to get the new list {4}. The same indices also give us tcT and tcF (see Definition 3.2): for cell {x1, y2, z3}, its tcT corresponds to M(*, y2, z3) = 1, and its tcF = TIDsize(*, y2, z3)/min(tcT(y2); tcT(z3)) = 1/1 = 1.

A GR-tree processing tree (see Figure 3), following a Top-Down DFS traversal, is used to aggregate potential GRs. We also calculate spreadness (sf, see Section 3.1) for the individual dimensions X, Y and Z and, consequently, for all possible partitions from Table 1. The GR-tree will follow the order X >> Y >> Z. Given that we want to find large gradients, the highest ones will probably take place when projecting GRs with higher SF. Here comes the first heuristic for pruning (p1, pruning non-valid projections). For example, a threshold constraint on a GR is ∆max ≥ ∆min (where ∆min is a user-given threshold). Let us say that we are interested in searching for gradients on GRs having ∆min > 55%. With this constraint, we only need to evaluate GRs and their projections satisfying this minimum delta value (see Figure 3).
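A minimal sketch of this tid-list intersection, using the inverted index of Table 2 and average as the aggregate (illustrative Python; the actual system is written in Java):

```python
# Sketch of the Frag-Cubing-style aggregation used here: cuboid cells are
# assembled by intersecting the inverted (tid-list) indices of Table 2.
# Measure values are taken from Table 1.

INDEX = {  # attribute value -> tid list (Table 2)
    'x1': [1, 4], 'x2': [2], 'x3': [3],
    'y1': [2, 3], 'y2': [1], 'y3': [4],
    'z1': [4], 'z2': [2, 3], 'z3': [1],
}
MEASURE = {1: 1, 2: 2, 3: 3, 4: 4}  # tid -> m (Table 1)

def cuboid_cell(*values):
    """Intersect tid lists and aggregate the measure with average."""
    tids = set(INDEX[values[0]])
    for v in values[1:]:
        tids &= set(INDEX[v])
    if not tids:
        return None
    return sum(MEASURE[t] for t in tids) / len(tids)

print(cuboid_cell('x1', 'y3'))  # {x1, y3, *}: {1,4} & {4} = {4} -> 4.0
print(cuboid_cell('x1'))        # {x1, *, *}: average of {1, 4} -> 2.5
```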


After cubing those candidate regions from Table 2, it is possible to mine the Top-K gradient cells from them. We also augment each GR pair with its respective bin [aggmin; aggmax], according to the minimum and maximum aggregated values in it, also allowing an approximation of the maximum delta value of each projection. With all this information, we can make use of the second heuristic for pruning (p2, pruning non-valid probe cells). Given a gradient condition such as ∆min ≥ 65%, we can confine the search for the Top-K cells to only those probe cells in the gl list (see Section 3.2, Level three) whose GR delta maximum approximates the delta minimum.

Even with such a constraint, the number of gradient pairs to evaluate remains large in x1, so we must define the set of probe cells pc{} to rank. Remember the discussion about gradient ascent: by taking the local maximum (i.e., a cuboid cell having the highest aggregated value) of a particular GR, all matchable cells containing this local maximum will increase the gradient factors (delta). Thus, our probe cells are given by the set of maximum and minimum aggregated values (GR+, GR−) in that region, maximising the gradient search. For a cuboid cell to be eligible as a probe cell, it should also be a closed cell. For example, in Figure 3 the cuboid cell {x1, y2, z3} is a closed cell in a GR−. Next, intersecting GR+[x1] and GR−[x1] provides a candidate gradient cell {x1, *, *}. Finally, we select possible matchable cells by projecting this cell against the Sd subset of GR[x1]. For example, gradient cells in GR[x1] are evaluated from the cells {{x1, y2, *}; {x1, *, z3}} to {x1, *, *}.

Usually, we will have several valid projections rather than just the one in the previous example. Therefore, the final solution must take into account all valid projections before ranking the Top-K cells. Besides, after having calculated all local cells (by evaluating each projection), we continue searching for other valid gradient cells resulting from matchable cells (i.e., projecting probe cells from a GRi over cuboid cells in GRj). It is important to mention that one just needs to intersect cells from their inverted indices (Li et al., 2004). From the above discussion, we summarise our rank gradient-based cubing (TopKgr-Cube) method as follows.

Pseudo Algorithm 4.1 TopKgr-Cube
Input: a table relation trel; an iceberg condition on average min_avg; a gradient constraint min_delta; the number of Top-K cells K; a set of temporal dimensions T.
Output: the set of gradient-cell pairs TopKgr.
Method:
1. Let ftrel be the relation fact table;
2. Build the inverted index, 1-D cuboids, GR-tree and the temporal series length for each non-temporal dimension;
3. Call TopKgr-Cube()

Procedure TopKgr-Cube (ftrel, Index, T, GRtree, min_avg, min_delta, K)
1:  Get 1-D cuboids ordered by spreadness
    // Level I: building gradient regions
2:  For each maxCell in GRtree do {          // maxCell is a closed cell in the GRtree 1st level
3:    maxCell′ ← reorder dimension values of maxCell
4:    if M(maxCell′) < M(*) then Set GR− ← maxCell′ and their first temporal descendants
      else Set GR+ ← maxCell′ and their first temporal descendants }
    // Level II: partitioning
5:  For each GR{+, −} do {
6:    Set GRprojs ← valid projections of the actual GR{+, −} with all other GRs{+, −}, ordered by ∆max }
      // apply p1 if ∆max < min_delta
    // Level III: rank gradient-based cubing
7:  For each GRproj of GRprojs do {
8:    Set TopKpc ← Top-K probe cells of the first region of GRproj     // Last-K if the region is GR−
9:    Set LastKpc ← Last-K probe cells of the second region of GRproj  // Top-K if the first region is GR−
10:   For each probeCell1 of TopKpc do {
11:     For each probeCell2 of LastKpc do {
12:       Set probeGradient ← gradient of probeCell1 and probeCell2
          // apply p2 if probeGradient < min_delta
          // probeGradient is the maximum gradient over those matchable cells
13:       if probeCell1 and probeCell2 are matchable cells then Set TopK ← probeGradient
14:       Set TopK ← all gradients of all matchable cells of the intersection of both probeCell1 and probeCell2 GR subtrees
15:       if (TopK size == K) min_delta ← max(last value of TopK, min_delta) }}}

Lemma 4.1: Any gradient projection 'GRi :: GRj' that does not satisfy the upper adjacent limit cannot provide the highest gradients.

Proof sketch: Since we guide the whole gradient search by exploring the spreadness of the GRs, we can also estimate the number of projections P′(GR) to handle in order to identify the K gradient cells according to a minimum delta (∆min) from a relation R. Assuming that the first GR′ are the ones with the highest spreadness, and that any projection GR″ from them is maximised through their maximum delta (∆max), we can calculate, approximately, the relevant quartile positions by the following box-plot-based equations:

GR∆max = { GR75% = (3 × Card(GR) + 2) / 4;  GR50% = (Card(GR) + 1) / 2;  GR25% = (Card(GR) + 2) / 4 },
such that sf(GR∆75%) ≥ sf(GR∆50%) ≥ sf(GR∆25%).

From the above equations, we can define an upper bound limit to maximise the gradient search. Differently from a classical box-plot, those limits are in reversed order, such that:


∆upper = ∆(GR75%) + ((∆(GR75%) − ∆(GR25%)) × 1.5).

The implication of Lemma 4.1 is that we can estimate projections in TopKgr-Cube without depending on an iceberg condition. Using this upper adjacent limit, we can say that any projection GR″ that does not satisfy this upper bound constraint cannot provide the highest gradients. While traversing the GR-tree to locate gradients, we may also say that those initial Top-K P(GR) partitions, the denser ones, will be located from the left to the right of the GR-tree.
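A sketch of this fence, computed over the maximum deltas of candidate projections, is shown below; the exact quartile indexing in the paper may differ, so treat the positions as an assumption.

```python
# Sketch of the upper adjacent limit from Lemma 4.1, a box-plot-style
# fence over the maximum deltas of the candidate GR projections:
#   delta_upper = delta(GR_75%) + 1.5 * (delta(GR_75%) - delta(GR_25%))
# The quartile positions loosely follow the proof sketch; illustrative only.

def upper_adjacent_limit(deltas):
    d = sorted(deltas)                          # ascending maximum deltas
    n = len(d)
    q25 = d[max(0, (n + 2) // 4 - 1)]           # position (Card(GR) + 2) / 4
    q75 = d[min(n - 1, (3 * n + 2) // 4 - 1)]   # position (3*Card(GR) + 2) / 4
    return q75 + 1.5 * (q75 - q25)

deltas = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.95]
bound = upper_adjacent_limit(deltas)
# Only projections whose maximum delta reaches the fence can still hold
# the highest gradients; the rest are pruned.
print(bound, [x for x in deltas if x >= bound])  # 0.37 [0.95]
```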

5 Experimental evaluation study

The proposed method, called TopKgr-Cube, was coded in Java 1.5 and evaluated over a synthetically generated database, produced using the data synthesis guidelines described in the next section. We also introduce several change situations that should be detected as temporal gradients by our proposed strategy. All experiments were performed on a 3 GHz Pentium 4 with 2 GB of Java Virtual Machine (JVM) memory, running Windows XP Professional.

5.1 Data synthesis

We generated a sample database that simulates trend situations for a Retail Store (RS). These situations were based on an industry project's data (sales) from a national supermarket case study. The characteristics of each dimension, as well as the tendencies, are as follows:

a year, having 5 distinct values (from 2003 to 2007)
b month, having 12 distinct values (from January to December)
c day, having 31 distinct values (from 1 to 31)
d weekday, having 7 distinct values (from 1 – Sunday to 7 – Saturday)
e product, having 10,000 distinct products
f sub-category, having 100 distinct sub-categories
g category, having 10 distinct categories
h customer, having 1000 distinct customers
i sex, male (0) or female (1)
j store, having 100 distinct stores
k region-store, having 10 distinct store locations
l promotion, having 10 possible active campaigns.

For each sale, the unit price of every product is one. The Sale Measure (SM) then lies in an interval ranging from 85% to 115% of the product unit price, according to its profit value; this interval can also be seen as a kind of data noise. Finally, temporal gradients were injected into the sales based on the following rules (a sketch of this synthesis follows the list):


a R1, if category = 10 and month = 12 then SM = SM * 2
b R2, if category = 1 and sex = 0 then SM = SM * 0.1
c R3, if weekday = 6 then SM = SM * 1.25
d R4, if day between 1 and 10 then SM = SM * 1.75
e R5, if category = promotion then SM = SM * 1.33
f R6, if store = 5 and sub-category = 5 then SM = SM * 1.2
g R7, if customer = product then SM = SM * 2.
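For concreteness, a Python sketch of this synthesis for rules R1–R4 follows (dimension encodings and the RNG seed are illustrative; R5–R7 are omitted for brevity):

```python
# Sketch of the data synthesis: unit price 1, sale measure drawn from
# 85%..115% of it, then the trend rules applied multiplicatively.

import random

def synth_sale(rng):
    row = {
        'year': rng.randint(2003, 2007), 'month': rng.randint(1, 12),
        'day': rng.randint(1, 31), 'weekday': rng.randint(1, 7),
        'category': rng.randint(1, 10), 'sex': rng.randint(0, 1),
    }
    sm = rng.uniform(0.85, 1.15)          # 85%..115% of unit price 1
    if row['category'] == 10 and row['month'] == 12:  # R1
        sm *= 2
    if row['category'] == 1 and row['sex'] == 0:      # R2
        sm *= 0.1
    if row['weekday'] == 6:                           # R3
        sm *= 1.25
    if 1 <= row['day'] <= 10:                         # R4
        sm *= 1.75
    row['SM'] = sm
    return row

rng = random.Random(42)
sales = [synth_sale(rng) for _ in range(10000)]  # RS = 10,000 rows
print(sales[0])
```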

5.2 Scenario one, qualitative analysis

In this first evaluation scenario, we are interested in measuring the quality of the results produced by TopKgr-Cube. We therefore generated a sales database with 10,000 rows containing several temporal gradient patterns. Those significant changes were injected into the database following the rules described previously. Table 3 shows the distribution of all injected patterns in the sales database.

Table 3 Distribution of all injected patterns in the sales database

Rule   Rows affected   SM range (%)
R1     38              [185, 215]
R2     258             [0, 25]
R3     715             [110, 140]
R4     1645            [160, 190]
R5     474             [118, 148]
R6     6               [105, 135]
R7     1               [185, 215]

The Top-100 significant changes, without any constraints, were mined over our sales database. Figure 4 summarises all patterns detected with respect to each of the rules injected into the database.

Figure 4 All rules detected by TopKgr-Cube


As expected, the most frequently detected rules were R2, R1 and R7, since they present high variability. The fact that 27% of the cases match no rule is explained by the aggregation effects resulting from the combination of different rules. Besides, if the number of significant changes were set to a threshold higher than K = 100, the method would certainly detect more trends (rules).

5.3 Scenario two, performance (K vs. ∆min)

In this second study, we are interested in measuring the effects of using the K and ∆min parameters as the main constraints for mining Top-K gradient cells over a database RS = 10,000 (Figures 5 and 6). The effectiveness of our method is evaluated by comparing its results against a naïve approach. The latter uses a search strategy similar to TopKgr-Cube, but without any of the heuristics or pruning conditions used by our strategy. The only constraint used by the naïve method is the maximum limit on the number of projections (see Lemma 4.1); without this constraint, the search space could easily reach huge dimensions, compromising the computational model as well as the processing time. For instance, in our experiments on a database with 5000 rows (RS = 5000), the naïve method takes on average 15 times longer than TopKgr-Cube.

Figure 5 Performance impact of using different K values. X-axis = K and Y-axis = processing time (ms) (see online version for colours)

Figure 6 Performance impact of using different ∆min values. X-axis = ∆min and Y-axis = processing time (ms) (see online version for colours)


It is important to mention that the naïve approach is denoted as deltaMin = N in the figures; this method can only be compared with ours for small K values. In Figure 5, we can observe that the proposed method is quite robust when looking for a small number K of cells. This is an interesting observation since, based on our discussions with business analysts, they would typically evaluate 10 to 25 cuboid cells when applying classical what-if analysis. Besides, independently of which ∆min is used, for small K the method still provides a good performance trade-off (see Figure 6). Needless to say, in a classical what-if analysis the analysts must materialise all dimensions of interest before finding any of those trends.

5.4 Scenario three, TopKgr-Cube's level behaviour

In this third evaluation scenario, we want to see the computational costs involved at each level of our model (Figure 7). By using several combinations of the K and ∆min parameters, we can analyse the performance impact of the method at each of TopKgr-Cube's levels. The method's third level presents the major cost in all situations, especially for high K values. This is expected behaviour, since at this level all partitions are evaluated and temporal gradients are finally searched.

Figure 7 Performance impact on each level using different ∆min and K values. X-axisBottom = K, X-axisTop = ∆min and Y-axis = %processing time (see online version for colours)

5.5 Scenario four, pruning effects

In this last scenario, we are interested in studying the pruning effects of TopKgr-Cube (Figures 8–11). Pruning is accomplished by using two main heuristics, P1 and P2. The first one (P1) allows processing to stop when there are no more valid projections, i.e., when the maximum potential temporal gradient is smaller than the current Top-K results. The second heuristic (P2) constrains the search space while carrying out projections.

Figure 8 Pruning effects of heuristic P1. X-axis = ∆min and Y-axis = %pruned projections (see online version for colours)

Figure 9 Pruning effects of heuristic P2 on RS = 2.5 K. X-axis = K and Y-axis = number of processed cells (see online version for colours)

Figure 10 Pruning effects of heuristic P2 on RS = 5 K. X-axis = K and Y-axis = number of processed cells (see online version for colours)


Figure 11 Pruning effects of heuristic P2 on RS = 2.5 K. X-axis = K and Y-axis = number of processed gradients (see online version for colours)

In Figure 8, we can see that the number of processed projections is strongly related to the ∆min parameter. This parameter is constantly refreshed with the value of the kth temporal gradient detected so far. Even for a small setting of the ∆min parameter (such as ∆min = 0), our method is able to discard worthless projections. From Figures 9 and 10, we can observe that the number of processed cells is closely related to the K parameter and to the dataset characteristics (i.e., size and density). It is also important to mention that for RS = 2.5 K and RS = 5 K the total number of cells that could be processed is 8,540,727 and 25,032,542, respectively. Finally, Figure 11 shows the number of processed gradients as a function of the K and ∆min parameters. Similar results were achieved for all analysed datasets (RS = 2.5 K, RS = 5 K and RS = 10 K), showing that the dataset characteristics have a low impact on the proposed strategy. This is because, during the evaluation of gradients, only useful cells (cells lying in potential GRs) are considered for further analysis.

In Figure 12, the processing time for the Top-100 multidimensional gradients is presented for several datasets with ten dimensions and different sizes (numbers of tuples). Real processing costs are provided for databases ranging from 2500 to 40,000 tuples; for the larger datasets, we provide an estimate (prediction curve) of the expected processing time for up to 5 billion tuples.

Figure 12 Scalability of the method on large datasets. X-axis = number of tuples and Y-axis = processing time (h) (see online version for colours)


From Figure 12, we can observe that the TopKgr-Cube method would take around 3 h to compute the Top-100 multidimensional gradients on a dataset with more than 5 billion tuples.

6 Related work

The problem of mining changes of sophisticated measures in an MDS was first introduced by Imielinski et al. (2002) as the cubegrade problem. The main idea is to explore how changes (delta changes) in a set of measures (aggregates) of interest are associated with changes in the underlying characteristics of sectors (dimensions). In Lam et al. (2004), a method called LiveSet-Driven was proposed, leading to a more efficient solution for mining gradient cells; this is achieved by group processing of live probe cells and by pruning the search space through pushing several constraints deep into the computation. There are also other studies by Sarawagi and colleagues (Sarawagi et al., 1998; Sarawagi and Sathe, 2000; Sathe and Sarawagi, 2001) on mining interesting cells in data cubes. The notion of interestingness in those works is quite different from that explored by gradient-based ones: instead of using a specified gradient threshold relative to a cell's ancestors, descendants, and siblings, it relies on the statistical analysis of a cell's neighbourhood values to determine its interestingness (or outlyingness).

These previous methods employ an idea of interestingness supported by either a statistical or a ratio-based approach. Such approaches still provide a large number of interesting cases to evaluate, whereas in a real application scenario one could be interested in exploring just a small number (the most significant K cells) of gradient cells. Besides, the idea of significant change patterns based on exploring temporal gradients has not been discussed so far. There are several research papers on answering Top-K queries (Chang et al., 2000; Hristidis et al., 2001; Bruno et al., 2002) over large databases, which could be used as a baseline for mining Top-K gradient cells. However, the range of complex delta functions provided by the cube gradient model complicates the direct application of those traditional Top-K query methods. Besides, Top-K queries with multidimensional selection were introduced quite recently by Xin et al. (2006), and that model still relies on the computation of convex functions, which cannot be applied directly to rank gradients.

Recently, in Alves et al. (2007), we presented an approach for ranking gradients in MDS. The main idea behind that strategy relies on partitioning the search space according to the spreadness values of GRs, with basic heuristics provided to smooth the search for gradients. However, in Alves et al. (2007) gradients are treated equally, in the sense that they just need to exceed a particular threshold to be marked as interesting. In the present study, we set interestingness according to the PA of gradient patterns, and several new pruning ideas were devised to assist the search, selection and ranking of temporal gradients.

7 Conclusions

We have presented an effective and efficient OLAP Mining method for ranking gradients in MDS, and verified through several evaluation scenarios that dense databases have workload issues in cubing (Level three) and sparse databases in partitioning (Level two). The most costly step is the third level, which is the key bottleneck of the proposed method.


The third level depends directly on the K parameter: the more gradients one wants, the more pronounced the pruning impact of the heuristics P1 and P2 must be. Such dependency can be seen as a disadvantage for those who want to find a great number of gradients (thousands). It would therefore be reasonable to apply other soft constraints (Wang et al., 2006) on those databases, thus smoothing computational costs at both levels. Given that real applications usually produce sparse databases (Beyer and Ramakrishnan, 1999), our method reduces computational cost by at least an order of magnitude while retrieving a small number (10–25) of K gradient cells, which is a practical requirement of most business analysts. Our performance study indicates that our strategy is effective and efficient in mining the most interesting gradients in MDS. We also plan to investigate Top-K average pruning within GRs, as well as the mining of high-dimensional cells, in future studies.

References

Alves, R., Belo, O. and Ribeiro, J. (2007) 'Mining top-K multidimensional gradients', in Song, I.Y., Eder, J. and Nguyen, T.M. (Eds.): Proceedings of the 9th International Conference on Data Warehousing and Knowledge Discovery (Regensburg, Germany, September 3–7, 2007), Lecture Notes in Computer Science, Vol. 4654, Springer-Verlag, Berlin, Heidelberg, pp.375–384, DOI: http://dx.doi.org/10.1007/978-3-540-74553-2_35

Beyer, K. and Ramakrishnan, R. (1999) 'Bottom-up computation of sparse and iceberg CUBEs', Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, Pennsylvania, United States, May 31 – June 3, 1999), SIGMOD '99, ACM, New York, NY, pp.359–370, DOI: http://doi.acm.org/10.1145/304182.304214

Bruno, N., Chaudhuri, S. and Gravano, L. (2002) 'Top-k selection queries over relational databases: mapping strategies and performance evaluation', ACM Transactions on Database Systems, Vol. 27, No. 2, June, pp.153–187, DOI: http://doi.acm.org/10.1145/568518.568519

Chang, Y., Bergman, L., Castelli, V., Li, C., Lo, M. and Smith, J.R. (2000) 'The onion technique: indexing for linear optimization queries', Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (Dallas, Texas, United States, May 15–18, 2000), SIGMOD '00, ACM, New York, NY, pp.391–402, DOI: http://doi.acm.org/10.1145/342009.335433

Gray, J., Bosworth, A., Layman, A. and Pirahesh, H. (1996) 'Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-total', in Su, S.Y. (Ed.): Proceedings of the Twelfth International Conference on Data Engineering (February 26 – March 1, 1996), ICDE, IEEE Computer Society, Washington, DC, pp.152–159.

Han, J., Pei, J., Dong, G. and Wang, K. (2001) 'Efficient computation of iceberg cubes with complex measures', in Sellis, T. (Ed.): Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, California, United States, May 21–24, 2001), SIGMOD '01, ACM, New York, NY, pp.1–12, DOI: http://doi.acm.org/10.1145/375663.375664

Hristidis, V., Koudas, N. and Papakonstantinou, Y. (2001) 'PREFER: a system for the efficient execution of multi-parametric ranked queries', in Sellis, T. (Ed.): Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, California, United States, May 21–24, 2001), SIGMOD '01, ACM, New York, NY, pp.259–270, DOI: http://doi.acm.org/10.1145/375663.375690

Imieliński, T., Khachiyan, L. and Abdulghani, A. (2002) 'Cubegrades: generalizing association rules', Data Mining and Knowledge Discovery, Vol. 6, No. 3, July, pp.219–257, DOI: http://dx.doi.org/10.1023/A:1015417610840


Lam, J.M., Wang, K. and Zou, W. (2004) 'Mining constrained gradients in large databases', IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 8, August, pp.922–938, DOI: http://dx.doi.org/10.1109/TKDE.2004.28

Li, X., Han, J. and Gonzalez, H. (2004) 'High-dimensional OLAP: a minimal cubing approach', in Nascimento, M.A., Özsu, M.T., Kossmann, D., Miller, R.J., Blakeley, J.A. and Schiefer, K.B. (Eds.): Proceedings of the Thirtieth International Conference on Very Large Data Bases (Toronto, Canada, August 31 – September 3, 2004), VLDB Endowment, pp.528–539.

Sarawagi, S. and Sathe, G. (2000) 'i3: intelligent, interactive investigation of OLAP data cubes', SIGMOD Record, Vol. 29, No. 2, June, p.589, DOI: http://doi.acm.org/10.1145/335191.336564

Sarawagi, S., Agrawal, R. and Megiddo, N. (1998) 'Discovery-driven exploration of OLAP data cubes', in Schek, H., Saltor, F., Ramos, I. and Alonso, G. (Eds.): Proceedings of the 6th International Conference on Extending Database Technology (March 23–27, 1998), Lecture Notes in Computer Science, Vol. 1377, Springer-Verlag, London, pp.168–182.

Sathe, G. and Sarawagi, S. (2001) 'Intelligent rollups in multidimensional OLAP data', in Apers, P.M., Atzeni, P., Ceri, S., Paraboschi, S., Ramamohanarao, K. and Snodgrass, R.T. (Eds.): Proceedings of the 27th International Conference on Very Large Data Bases (September 11–14, 2001), Morgan Kaufmann Publishers, San Francisco, CA, pp.531–540.

Wang, J., Han, J. and Pei, J. (2006) 'Closed constrained gradient mining in retail databases', IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 6, June, pp.764–769, DOI: http://dx.doi.org/10.1109/TKDE.2006.88

Xin, D., Han, J., Cheng, H. and Li, X. (2006) 'Answering top-k queries with multi-dimensional selections: the ranking cube approach', in Dayal, U., Whang, K., Lomet, D., Alonso, G., Lohman, G., Kersten, M., Cha, S.K. and Kim, Y. (Eds.): Proceedings of the 32nd International Conference on Very Large Data Bases (Seoul, Korea, September 12–15, 2006), VLDB Endowment, pp.463–474.

Xin, D., Shao, Z., Han, J. and Liu, H. (2006) 'C-cubing: efficient computation of closed cubes by aggregation-based checking', Proceedings of the 22nd International Conference on Data Engineering (April 3–7, 2006), ICDE, IEEE Computer Society, Washington, DC, DOI: http://dx.doi.org/10.1109/ICDE.2006.31
