Pattern Recognition Letters 33 (2012) 356–363
An affinity-based new local distance function and similarity measure for kNN algorithm

Gautam Bhattacharya (Department of Physics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India), Koushik Ghosh (Department of Mathematics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India), Ananda S. Chowdhury (corresponding author; Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700032, India)
Article history: Received 16 March 2011; Available online 11 November 2011. Communicated by N. Sladoje.
Keywords: kNN; Affinity function; Similarity measure
Abstract

In this paper, we propose a modified version of the k-nearest neighbor (kNN) algorithm. We first introduce a new affinity function for the distance measure between a test point and a training point, an approach based on local learning. A new similarity function using this affinity function is then proposed for the classification of test patterns. The widely used convention k = [√N] is employed, where N is the number of points used for training. The proposed modified kNN algorithm is applied to fifteen numerical datasets from the UCI machine learning data repository, using both 5-fold and 10-fold cross-validation. The average classification accuracy obtained from our method is found to exceed that of several well-known classification algorithms.

© 2011 Elsevier B.V. All rights reserved.
⇑ Corresponding author. Tel.: +91 33 2414 6666x2405; fax: +91 33 2414 6217. E-mail addresses: [email protected] (G. Bhattacharya), [email protected] (K. Ghosh), [email protected] (A.S. Chowdhury).
doi:10.1016/j.patrec.2011.10.021

1. Introduction

Appropriate measures of distance and similarity are two prime issues in the field of pattern recognition. The last century witnessed a series of efforts to explore novel measures of distance and similarity for pattern classification, clustering and information retrieval problems (Cha, 2007; Duda et al., 2001; Deza and Deza, 2006; Zezula et al., 2006; Monev, 2004; Gavin et al., 2003). From the mathematical point of view, distance is a quantitative degree of how far apart two objects are; one synonym for distance is dissimilarity. Distance measures satisfying the metric properties are termed 'metrics', while non-metric distance measures are termed 'divergences'. In contrast, similarity measures proximity. In traditional algebra, similarity is often taken as an inner product in a suitable vector space. This concept is modified and adapted for numerical datasets, where similarity is often expressed through weights attached to the proposed distances (Cha, 2007; Zezula et al., 2006).

Fix and Hodges (1951) introduced a non-parametric method for pattern classification that is known as the nearest neighbor rule. Nearest neighbor is one of the most popular algorithms in pattern recognition, exploratory data analysis and data mining. Typically, the k nearest neighbors of an unknown sample in the training set are found using a predefined distance, and the class label of the unknown sample is consequently predicted to be the most frequent label occurring among those k nearest neighbors. Some advantages of the kNN algorithm are: (a) its inherent simplicity; (b) its robustness to noisy training data, especially if the inverse square of the weighted distance is used as the "distance" measure; and (c) its effectiveness when the training set is large. Many researchers have found that the kNN algorithm achieves good performance in experiments on different datasets (Cover and Hart, 1967; Domeniconi et al., 2002; Michie et al., 1994; Wang et al., 2007; Yang and Liu, 1999; Baoli et al., 2002). Cover and Hart showed that for k = 1 and n → ∞ (n denotes the number of sample points) the kNN classification error is bounded above by twice the Bayes' error rate. Researchers have since proposed rejection approaches (Helman, 1970), refinements with respect to the Bayes' error rate (Fukunaga and Hostetler, 1975), distance-weighted approaches (Dudani, 1976; Bailey and Jain, 1978), soft computing techniques (Bermejo and Cabestany, 2000) and fuzzy methods (Jozwik, 1983; Keller et al., 1985) as possible enhancements to the classical kNN algorithm. Despite the abovementioned advantages, it has been observed that the performance of the kNN algorithm strongly depends on: (a) the value of the parameter k (the number of nearest neighbors); (b) the choice of a proper distance measure (Parvin et al., 2008); and (c) the selection of an appropriate similarity measure (Mitra et al., 2002a). In the present work, we address these critical issues to improve the performance of the kNN algorithm. We take k = [√N] following Mitra et al. (2002a), where N is the number of training points and the symbol '[ ]' stands for the greatest integer function; this choice provides a reasonable analytical estimate of k. In addition to the conventional distances like Euclidean, Minkowski,
Chebyshev, several other distance functions have been proposed over the years. Prominent examples are the Mahalanobis distance (Mahalanobis, 1936), the Xing distance (Xing et al., 2002), the Large Margin Nearest Neighbor (LMNN)-based distance (Weinberger et al., 2005), the Information Theoretic Metric Learning (ITML)-based distance (Davis et al., 2007), the Kernel Relevant Component Analysis (KRCA) distance (Tsang et al., 2005), the Information Geometric Metric Learning (IGML)-based distance (Wang and Jin, 2009) and the Kernel Information Geometric Metric Learning (KIGML)-based distance (Wang and Jin, 2009). In most cases the distance function is linear, owing to its simplicity of description and efficiency of computation; this very simplicity, however, is insufficient to model similarity for many real-world datasets (Wu et al., 2005). In this work, we introduce a new nonlinear affinity function for the distance measure, which closely resembles the non-Mahalanobis approach to local distance functions (Frome et al., 2007; Kulis, 2010) and the local asymmetrically weighted learning captured in the lazy-learning and memory-based learning works of Dietterich et al. (1993), Atkeson et al. (1997) and Ricci and Avesani (1996). Other well-known learning-based distance functions can be found in (Wang and Jin, 2009). Our proposed distance also captures the effects of the other training points for any particular feature. We further introduce a new similarity function (for measuring the proximity of the test point to the training points) using our newly proposed distance function. Depending upon the classes of the first k nearest neighboring training points, scores are allotted to a test point, and the classification of the test point is finally decided on the basis of the assigned scores.
Our goal in this paper is to improve classification accuracy through a proper formulation of the distance and similarity functions, without resorting to detailed metric learning (McFee and Lanckriet, 2010) or Parzen window-based learning (Parzen, 1962; Mitra et al., 2002b). It is, however, relevant to mention that local learning is implicitly used in the proposed affinity-based distance and similarity functions. We compare the mean classification accuracy on eight datasets with C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003), and on six datasets with Tahir and Smith (2010). The mean accuracy obtained from our method (using 10-fold cross-validation with 10 individual random seeds) is found to outperform the mean accuracies of both Yin and Han, and Tahir and Smith. The rest of the paper is organized as follows: in Section 2, we describe in detail the theory of the proposed kNN classifier and provide a time-complexity analysis. In Section 3, we analyze the experimental results and present a performance comparison with the methods mentioned above. The paper is concluded in Section 4 with an outline of future research directions.

2. Proposed kNN classifier algorithm

In this section, we describe the theoretical basis of the proposed kNN classifier. In particular, we introduce an affinity-based distance function and a new similarity measure into the existing kNN classification algorithm. We also analyze the time-complexity of our method (Cormen et al., 2001).
2.1. Algorithmic details

Let us assume that each pattern in a typical pattern classification problem is described by d observable, well-defined, identically distributed and mutually independent features. Hence, each pattern is traced as a point in the d-dimensional feature space. Let n be the total number of sample points and N the total number of training points (patterns). The N training points X_1, X_2, ..., X_N are already classified into M classes C_1, C_2, ..., C_M, where class C_j contains N_j points for j = 1, 2, ..., M, so that \sum_{j=1}^{M} N_j = N. The goal is to classify a new point X in this feature space, given the N sample points and the M classes. The points are expressed as

X_j = (X_{j1}, X_{j2}, \ldots, X_{jd}), \quad j = 1, 2, \ldots, N; \qquad X = (X_1, X_2, \ldots, X_d)

We now describe our modified algorithm in the following steps:

Step 1: In this work we experiment solely with numerical datasets. Each value under a particular attribute is first centered by subtracting the attribute mean and then scaled by dividing by the attribute standard deviation; thus we use a normalized (Z-score) representation of the original data.

Step 2: We employ both 5-fold and 10-fold partitioning and cross-validation of the data. For the 5-fold partitioning, five partitions of the total data are made, each containing approximately [n/5] points for testing with the rest used for training. Similarly, for the 10-fold partitioning, 10 partitions are used, each carrying approximately [n/10] points for testing and the rest for training. The partitioning is done in a completely random manner.

Step 3: The task here is to determine the distances of all the training points from a test point. Many conventional distance functions exist, such as the Euclidean, City block (Manhattan), Chebyshev, Minkowski, Canberra, Bray-Curtis (Sorensen), angular separation, correlation coefficient and Mahalanobis distances. Another important contribution to the measurement of distance between two d-dimensional points X_p and X_q is given by (Cover and Hart, 1967; Domeniconi et al., 2002; Michie et al., 1994):

D_{pq} = \left[ \sum_{j=1}^{d} \left( \frac{X_{pj} - X_{qj}}{\max_j - \min_j} \right)^2 \right]^{1/2} \qquad (1)

where max_j and min_j are the maximum and the minimum values computed over all the training points along the jth axis. A very recent trend is to consider attribute-wise local learning while framing the distance. In this connection, the non-Mahalanobis local distance function is very useful (Frome et al., 2007; Kulis, 2010); there, the distance between an arbitrary (e.g., test) point X_t and a training point X_i is proposed as

d(X_t, X_i) = \sum_{j=1}^{d} w_{ij} \, d_{tij}

where w_{ij} is the weight for the ith training point along the jth feature and d_{tij} is the distance between the test point X_t and the training point X_i along the jth feature. Keeping the concept of the non-Mahalanobis local distance, we propose a new distance formula in terms of an affinity measurement between a training point and a test point. We define the affinity between a d-dimensional test point X_t and any d-dimensional training point X_i as

d(X_t, X_i) = \sum_{j=1}^{d} |X_{tj} - X_{ij}| \left( \frac{\sum_{l=1}^{N} |X_{tj} - X_{lj}|}{\sum_{l=1}^{N} |X_{tj} - X_{lj}| + \sum_{m=1,\, m \neq i}^{N} |X_{ij} - X_{mj}|} + \frac{\left( \sum_{l=1}^{N} |X_{tj} - X_{lj}| \right)^{1/2}}{\left( \sum_{m=1,\, m \neq i}^{N} |X_{ij} - X_{mj}| \right)^{1/2}} \right) \qquad (2)
In this approach, to compute the distance between a test point X_t and a training point X_i (i = 1, 2, ..., N), we consider all the corresponding attribute-wise gaps. The distance is the sum over features of the product of (i) the absolute gap between the attribute-wise entries, which is the first factor on the right-hand side of Eq. (2), and (ii) a weight, the sum of the two terms inside the parentheses of Eq. (2), which plays the role of the weight w_{ij} in the non-Mahalanobis local distance. Our proposed affinity-based distance function cannot be kernel-based, since it does not satisfy the fundamental property d(x, y) = [k(x, x) + k(y, y) - 2k(x, y)]^{1/2}, where d(x, y) and k(x, y) denote the distance and the kernel, respectively, between two points x and y. This is because we do not take the weights w_{ij} to be constants; a non-Mahalanobis local distance function with constant weight coefficients, by contrast, is kernel-based. The numerator of the first term in the weight is the sum of all the attribute-wise absolute gaps between the test point and all the training points. The denominator of the first term has two additive components: (i) the numerator itself and (ii) the sum of all the attribute-wise absolute gaps between the concerned training point and the other training points. The first term thus signifies the relative (fractional) effect of all the training points on the concerned test point, with respect to their combined effect on both the concerned test point and the concerned training point. The numerator of the second term is exactly the same as the numerator of the first term, while its denominator is the sum of the gaps of all the training points from the concerned training point only.
So, the second term gives the relative (fractional) affinity of all the training points to the test point with respect to their affinity to the concerned training point. To illustrate the newly proposed concept we take the help of Figs. 1 and 2. In both figures, the solid lines denote the distances between the test point X_t and the training points (the concerned training point X_1 and the other training points), whereas the dotted lines show the distances of the concerned training point X_1 from the other training points. In both figures the distance between X_t and the first training point X_1 (only the indices of the training points are shown, i.e., the point labelled i denotes X_i) looks alike in the usual schemes (bold black lines), but according to our newly proposed affinity function these distances are not the same, because in Fig. 1 the training points are more clustered than in Fig. 2. In light of these considerations, the numerators and denominators of both terms within the parentheses of the proposed affinity function are greater for Fig. 2 than for Fig. 1, which in turn suggests that the test point X_t may have a different likelihood towards the cluster of training points in the two figures. From the above discussion, it may be stated that our proposed affinity function incorporates more information, since it takes into account not only the traditional distance between a training point and a test point, but also the impact of the spatial distribution of the training points. The first term signifies the relative positional influence of the concerned test point in the system taken as a whole, while the second term depicts the comparative influence of the concerned test point and the concerned training point. Hence we carry out vicinity-level learning while computing the affinity between a selected test point and a selected training point. In Table 3 we compare our affinity function with other distance functions in terms of finding the nearest neighbors, and Table 4 highlights the effectiveness of the first and second terms of the affinity function in Eq. (2). The dotted lines in Figs. 1 and 2 also implicitly measure the cluster density of the training points, i.e., our proposed affinity function is connected with the density of the sample space. Later in the paper we show experimentally (via Table 4) that each of these two terms (each expressed as a ratio), taken individually, bears a strong correlation with d(X_t, X_i). From Eq. (2) it is evident that the formula cannot be applied between two test points, since it depends on the bias of the training points towards a test point and not the reverse; thus the symmetry property cannot be proved for d(X_t, X_i). We now state and prove two other important properties of d(X_t, X_i).

Property 1. d(X_t, X_i) is non-negative.
Proof. Every component of d(X_t, X_i), i.e., the coefficient and the terms appearing in the numerators and denominators of the two additive components (each expressed as a ratio), carries a modulus sign. So d(X_t, X_i) cannot be negative. □

Property 2. d(X_t, X_i) satisfies the identity of indiscernibles.

Proof. If we take X_t = X_i, the first factor in each term of the sum becomes zero, making the entire measure zero. So the proposed measure satisfies the identity of indiscernibles. Thus we can conclude that d(X_t, X_i) is positive definite. Based on these observations, we deem d(X_t, X_i) an affinity-based distance function. □

Step 4: We next discuss the formulation of a similarity function using the corresponding affinity function. A number of choices exist for the similarity function. One classical approach is to take the similarity measure as an inverse distance. Another effective and popular strategy is the use of an exponential similarity.
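Both properties, together with the non-symmetry noted above, can be checked numerically with a direct Python transcription of Eq. (2). This is a minimal sketch, not the authors' code: the function and variable names are ours, and a small `eps` guards against a zero denominator, a degenerate case the paper does not discuss.

```python
import numpy as np

def affinity_distance(x_t, X, i, eps=1e-12):
    """Affinity-based distance of Eq. (2) between a test point x_t and the
    training point X[i]. Per feature j, the absolute gap |x_tj - X_ij| is
    weighted by A_j/(A_j + B_j) + sqrt(A_j/B_j), where
    A_j = sum_l |x_tj - X_lj| (gaps to ALL training points) and
    B_j = sum_{m != i} |X_ij - X_mj| (gaps to the OTHER training points)."""
    A = np.abs(x_t - X).sum(axis=0)                  # shape (d,)
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False
    B = np.abs(X[i] - X[mask]).sum(axis=0)           # shape (d,)
    w = A / (A + B + eps) + np.sqrt(A / (B + eps))   # weight per feature
    return float(np.sum(np.abs(x_t - X[i]) * w))

X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])  # toy training set, N=3, d=2
# Property 1: non-negativity for an arbitrary test point
assert affinity_distance(np.array([0.5, 0.5]), X, 1) >= 0.0
# Property 2: identity of indiscernibles -- zero when X_t coincides with X_i
assert affinity_distance(X[2], X, 2) == 0.0
# Non-symmetry: the weights are biased by the training set, so exchanging
# the roles of the two points generally changes the value
assert affinity_distance(X[1], X, 2) != affinity_distance(X[2], X, 1)
```

With Z-score normalized inputs (Step 1), this function would be evaluated between each test point and every training point.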
[Figs. 1 and 2: schematic distributions of training points 1, 2, 3, ..., N around the test point X_t, with distances d_1, d_2, ..., d_N (solid lines) and angles α_12, α_13, ..., α_1N marked.]

Fig. 1. First type of distribution of training points with respect to the test point.

Fig. 2. Second type of distribution of training points with respect to the test point.
Table 1
Comparison of the proposed method with C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003).

Dataset      C4.5    RIPPER   CBA     CMAR    CPAR    Modified KNN (MKNN)
IRIS         95.3    94.0     94.7    94.0    94.7    94.53
WINE         92.7    91.6     95.0    95.0    95.5    97.19
GLASS        68.7    69.1     73.9    70.1    74.4    74.63
PIMA         75.5    73.1     72.9    75.1    73.8    72.80
BREAST       95.0    95.1     96.3    96.4    96.0    96.58
SONAR        70.2    78.4     77.5    79.4    79.3    86.41
IONOSPHERE   90.0    91.2     92.3    91.5    92.6    93.49
VEHICLE      72.6    62.7     68.7    68.8    69.5    70.54
AVERAGE      82.50   81.90    83.91   83.79   84.48   85.77
The above similarity function exhibits better performance than other existing approaches; this fact is demonstrated elaborately in Table 4. Our newly proposed similarity function may appear structurally similar to the Gaussian radial basis kernel function K(x_t, x_i) = exp(-m ||x_t - x_i||^2) (Hastie et al., 2009), but in our expression the distance/affinity function measurement is used instead. In the exponential function, the denominator is the average distance per feature, which depends on the separations between the training and test data, on the dimension of the training set, and on the cross-validation partitioning; it is therefore not at all the same as the constant m of the radial basis kernel. In fact, the proposed similarity function captures more information than a standard radial basis kernel. In our similarity function we take the exponential of a dimensionless quantity, whereas the radial basis kernel is the exponential of a squared Euclidean distance. Moreover, our affinity function is not symmetric, whereas symmetry is mandatory for a radial basis kernel. The effectiveness of our similarity function for classification is established in Table 5 through comparison with the similarity function used by Mitra et al. (2002a).

Step 5: Using the proposed affinity function in Eq. (2), we obtain the distances of all the training points from any test point. We first sort these distances in ascending order and then mark the k nearest neighboring points of the test point as y_1, y_2, ..., y_k, arranged in order of increasing distance.

Step 6: The next task is to allot a score Sc(X_t, C_j), j = 1, 2, ..., M, to each class relative to the test point X_t. In (Bermejo and Cabestany, 2000), the score function was given by the following equation:
Table 2
Comparison of the proposed method with that of Tahir and Smith (2010).

Dataset      Tahir and Smith (a)   Modified KNN (MKNN) (a)
SONAR        87.1 (6.53)           86.41 (7.82)
IONOSPHERE   92.2 (4.53)           93.49 (4.30)
VEHICLE      70.7 (3.60)           70.54 (4.21)
WDBC         95.5 (2.45)           95.78 (2.30)
SPECTF       70.7 (7.71)           74.55 (7.41)
MUSK1        86.2 (3.84)           88.74 (4.33)
AVERAGE      83.73 (4.78)          84.92 (5.06)

(a) The standard deviations of classification are given in parentheses.
For example, see the work of Mitra et al. (2002a), who developed a similarity function of the form Sim_{pq} = exp(-β D_{pq}), where β is a positive constant given by β = -ln(0.5)/D̄, with D̄ the average distance between data points computed over the entire dataset. This value of β is estimated by taking the most expected value of the similarity between any two points as 0.5, which may sometimes lead to misclassification. A similar form of similarity function (Billot et al., 2008) is Sim(z, x) = exp(-m(x - z)) for some norm m on R^m. In the present work we set up a new similarity function in the following way:
\mathrm{Sim}(X_t, X_i) = 1, \quad \text{if } X_t = X_i

\mathrm{Sim}(X_t, X_i) = \exp\left( - \frac{dN \, d(X_t, X_i)}{\sum_{l=1}^{N} d(X_t, X_l)} \right), \quad \text{if } X_t \neq X_i \qquad (3)
Here, i = 1, 2, ..., N and d(X_t, X_i) is given by Eq. (2). From Eq. (3), we can see that the function Sim(X_t, X_i) tends to zero for extremely large values of d(X_t, X_i).
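Eq. (3) can be sketched in Python as follows. This is a hedged sketch of our reading of Eq. (3), not the authors' code: the helper name and arguments are ours — `dist_ti` is d(X_t, X_i), `all_dists` holds d(X_t, X_l) for all N training points, and `d` is the number of features.

```python
import numpy as np

def similarity(dist_ti, all_dists, d):
    """Similarity of Eq. (3): exp(-dist_ti / avg), where
    avg = sum_l d(X_t, X_l) / (N * d) is the average affinity distance per
    feature. Returns 1 when the test point coincides with the training
    point (zero distance)."""
    if dist_ti == 0.0:                       # X_t = X_i branch of Eq. (3)
        return 1.0
    N = len(all_dists)
    avg_per_feature = np.sum(all_dists) / (N * d)
    return float(np.exp(-dist_ti / avg_per_feature))

# The similarity decays towards zero as the affinity distance grows
dists = np.array([1.0, 2.0, 3.0])            # d(X_t, X_l) for N = 3 points
s_near = similarity(1.0, dists, 2)
s_far = similarity(30.0, dists, 2)
assert 1.0 > s_near > s_far > 0.0
```

Note how the denominator adapts to the test point's overall distances, unlike the fixed constant m of a radial basis kernel.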
Table 3
Comparison of our distance measure with distance measures in (Tahir and Smith, 2010).

Dataset (a)               Euclidean      Sq. Euclidean   City block     Canberra        Sq.-chord       Sq.-chi-squared   Proposed, 5-fold   Proposed, 10-fold
IRIS (150, 4)             95.29 (5.91)   95.15 (6.08)    94.45 (6.39)   93.24 (6.75)    95.14 (6.20)    95.14 (6.20)      94.69 (3.98)       94.53 (6.39)
WINE (178, 13)            96.34 (4.71)   95.95 (4.95)    97.18 (3.72)   92.98 (6.02)    95.95 (4.34)    94.76 (5.48)      96.85 (2.70)       97.19 (4.12)
GLASS (214, 9)            69.67 (8.79)   69.61 (8.43)    73.28 (8.38)   71.73 (8.73)    70.60 (8.96)    69.76 (8.77)      74.31 (5.28)       74.63 (8.58)
PIMA_DIABETES (768, 8)    73.34 (4.56)   72.24 (5.25)    72.42 (4.82)   69.67 (5.08)    70.41 (4.49)    70.40 (4.57)      72.78 (2.67)       72.80 (5.13)
BREAST (699, 10)          95.78 (2.31)   95.15 (2.44)    95.94 (2.25)   96.47 (2.11)    95.04 (2.41)    95.17 (2.59)      96.52 (1.60)       96.58 (2.14)
SONAR (208, 60)           85.45 (8.50)   85.99 (8.21)    87.41 (7.93)   78.82 (9.07)    87.70 (6.96)    84.38 (8.60)      86.15 (5.86)       86.41 (7.82)
IONOSPHERE (351, 34)      87.29 (5.48)   87.09 (5.60)    90.60 (4.69)   78.94 (6.92)    86.26 (5.84)    75.54 (7.18)      93.20 (3.09)       93.49 (4.30)
VEHICLE (846, 18)         69.92 (4.29)   69.85 (4.29)    70.10 (4.32)   68.22 (4.61)    69.49 (4.20)    69.62 (4.48)      70.85 (2.71)       70.54 (4.21)
WDBC (569, 31)            95.40 (2.74)   95.13 (2.53)    95.85 (2.20)   93.81 (3.18)    95.11 (2.44)    94.80 (2.61)      95.90 (1.54)       95.78 (2.30)
SPECTF (267, 44)          71.12 (7.92)   70.21 (7.94)    70.61 (7.99)   71.10 (7.63)    70.89 (8.51)    70.80 (7.38)      74.58 (4.57)       74.55 (7.41)
MUSK1 (476, 166)          88.59 (4.50)   88.53 (4.61)    86.03 (4.67)   76.81 (5.85)    87.90 (4.53)    83.55 (5.24)      87.90 (3.57)       88.74 (4.33)
BREAST-TISSUE (106, 10)   85.84 (9.82)   85.45 (9.70)    85.89 (9.53)   81.41 (11.18)   81.86 (11.06)   79.42 (11.29)     87.15 (6.03)       87.23 (8.93)
PARKINSON (195, 23)       96.20 (4.38)   96.15 (4.26)    96.60 (4.12)   92.98 (5.37)    94.97 (4.39)    96.01 (4.53)      97.08 (2.68)       97.24 (4.00)
SEGMENTATION (210, 18)    85.05 (8.46)   84.67 (8.40)    86.81 (7.27)   85.57 (7.36)    85.57 (7.73)    86.38 (7.80)      86.67 (5.22)       86.62 (7.56)
ECOLI (336, 8)            92.07 (4.22)   92.12 (4.21)    93.02 (4.21)   89.51 (4.78)    91.35 (4.26)    90.61 (4.32)      87.90 (3.57)       91.98 (4.51)
AVERAGE                   85.82 (5.77)   85.55 (5.79)    86.41 (5.50)   82.75 (6.31)    85.22 (5.76)    83.76 (6.07)      86.84 (3.67)       87.22 (5.45)

(a) In this column, for each dataset the figures within brackets denote the total number of points and the total number of attributes.
Table 4
Comparison of scores for the significance of the individual terms of the affinity function. Columns 2-4 give the classification accuracy using both terms, only the first term and only the second term of Eq. (2); Column 5 gives the maximum accuracy from one of (Yin and Han, 2003; Tahir and Smith, 2010); x, y and z are the differences of Columns 2, 3 and 4 from Column 5.

Dataset (a)               Both terms   1st term   2nd term   Max. of Col. 5 sources   x      y      z
IRIS (150, 4)             94.53        94.52      94.53      95.3                     -0.8   -0.8   -0.8
WINE (178, 13)            97.19        96.79      96.96      95.5                      1.7    1.3    1.5
GLASS (214, 9)            74.63        73.94      73.68      74.4                      0.2   -0.5   -0.7
PIMA_DIABETES (768, 8)    72.80        72.55      72.88      75.5                     -2.7   -2.9   -2.6
BREAST (699, 10)          96.58        96.59      96.61      96.4                      0.2    0.2    0.2
SONAR (208, 60)           86.41        86.88      86.55      87.1                     -0.7   -0.2   -0.5
IONOSPHERE (351, 34)      93.49        92.74      93.63      92.6                      0.9    0.1    1.0
VEHICLE (846, 18)         70.54        70.87      70.70      72.6                     -2.1   -1.7   -1.9
WDBC (569, 31)            95.78        95.82      95.69      95.5                      0.3    0.3    0.2
SPECTF (267, 44)          74.55        74.25      74.89      70.7                      3.8    3.5    4.2
MUSK1 (476, 166)          88.74        88.03      88.99      86.2                      2.5    1.8    2.8
Correlation with Col. 2                0.999249   0.999476
Sum of all scores                                                                      3.4    1.2    3.3
Sum of -ve scores                                                                      6.2    6.1    6.6

(a) In this column, for each dataset the figures within brackets denote the total number of points and the total number of attributes of the given dataset respectively.
\mathrm{Sc}(X_t, C_j) = \sum_{i=1}^{k} Z(y_i, C_j), \quad j = 1, 2, \ldots, M \qquad (4)

Note that the function Z in the above equation takes only the two values 0 and 1: Z(y_i, C_j) = 1 if y_i ∈ C_j, and Z(y_i, C_j) = 0 otherwise. The test point X_t is allotted to the class for which the score Sc is maximum. A modification of the score function has been proposed in the following manner (Monev, 2004; Domeniconi et al., 2002):
\mathrm{Sc}(X_t, C_j) = \sum_{i=1}^{k} \mathrm{Sim}(X_t, y_i) \, Z(y_i, C_j), \quad j = 1, 2, \ldots, M \qquad (5)
where Z is the same as in Eq. (4) and Sim(X_t, y_i) is given by Eq. (3). In this paper, we use Eq. (5) to compute the scores.

2.2. Time-complexity analysis

For n samples, B-fold partitioning is used for the purpose of cross-validation. Let N be the number of training samples, and let k and N_C denote the number of nearest neighbor points and the number of classes assigned for classification, respectively. We now present the detailed (worst-case) time-complexity analysis.

Step 1: normalization of n samples: O(n).
Step 2: partitioning of n samples: O(n).
For each test sample:
Steps 3-4: calculating distances to N training samples: O(N).
Step 5: sorting the N distances and calculating the similarity of k samples: O(N log N) + O(k).
Step 6: score calculation: O(kN_C).
The total complexity per test sample is O(N) + O(N log N) + O(k) + O(kN_C). For B-fold partitioning of n samples, each partition contains n/B test samples. Considering all test samples in all partitions, the total complexity is O(B(n/B)N) + O(B(n/B)N log N) + O(B(n/B)k) + O(B(n/B)kN_C) = O(nN) + O(nN log N) + O(nk) + O(nkN_C).
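Steps 5 and 6 above (neighbor ranking and similarity-weighted scoring via Eq. (5)) can be sketched together as follows. This is a sketch, not the authors' implementation: `dist_fn` and `sim_fn` stand in for Eqs. (2) and (3), and the toy Euclidean distance and exponential weight defined below are placeholders used only to keep the example self-contained.

```python
import numpy as np
from collections import Counter

def classify(x_t, X_train, y_train, dist_fn, sim_fn):
    """Step 5: rank training points by distance and keep the k = [sqrt(N)]
    nearest. Step 6: score each class by the summed similarities of its
    members among those neighbours (Eq. (5)) and return the best class."""
    N, d = X_train.shape
    k = int(np.sqrt(N))                      # k = [sqrt(N)], following Mitra et al.
    dists = np.array([dist_fn(x_t, X_train, i) for i in range(N)])
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
    scores = Counter()
    for i in nearest:                        # Sc(X_t, C_j) = sum_i Sim * Z
        scores[y_train[i]] += sim_fn(dists[i], dists, d)
    return scores.most_common(1)[0][0]

# Placeholder distance/similarity so the sketch runs stand-alone
def toy_dist(x_t, X, i):
    return float(np.linalg.norm(x_t - X[i]))

def toy_sim(dist_ti, all_dists, d):
    return float(np.exp(-d * len(all_dists) * dist_ti / np.sum(all_dists)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = [0, 0, 1, 1]
assert classify(np.array([0.0, 0.5]), X, y, toy_dist, toy_sim) == 0
```

The O(N log N) sort in `np.argsort` dominates the per-test-sample cost, matching the complexity analysis above.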
Table 5
Comparison of the proposed similarity function with the one by Mitra et al. (2002a). (a)

Dataset                   Proposed distance with     Proposed distance with   Our distance with the similarity
                          proposed similarity,       proposed similarity,     of Mitra et al. (2002a),
                          k = [√N]                   k = 3                    k = [√N]
IRIS (150, 4)             94.53 (6.39)               93.5 (6.64)              95.43 (5.45)
WINE (178, 13)            97.19 (4.12)               97.13 (3.75)             96.07 (4.18)
GLASS (214, 9)            74.63 (8.58)               73.91 (8.75)             69.39 (9.66)
PIMA_DIABETES (768, 8)    72.80 (5.13)               70.43 (5.27)             75.40 (4.79)
BREAST (699, 10)          96.58 (2.14)               95.81 (2.22)             96.25 (2.20)
SONAR (208, 60)           86.41 (7.82)               86.06 (7.91)             77.95 (9.57)
IONOSPHERE (351, 34)      93.49 (4.30)               93.57 (4.26)             91.94 (4.60)
VEHICLE (846, 18)         70.54 (4.21)               70.57 (4.24)             70.03 (4.67)
WDBC (569, 31)            95.78 (2.30)               95.73 (2.30)             95.38 (2.82)
SPECTF (267, 44)          74.55 (7.41)               74.21 (7.56)             78.58 (7.39)
MUSK1 (476, 166)          88.74 (4.33)               88.74 (4.33)             84.21 (5.98)
BREAST-TISSUE (106, 10)   87.23 (8.93)               86.04 (9.32)             81.55 (10.44)
PARKINSON (195, 23)       97.24 (4.00)               97.71 (3.63)             92.16 (5.21)
SEGMENTATION (210, 18)    86.62 (7.56)               86.62 (7.68)             85.10 (7.87)
ECOLI (336, 8)            91.98 (4.51)               91.29 (4.52)             92.67 (4.72)
AVERAGE                   87.22 (5.45)               86.75 (5.49)             85.47 (5.97)

(a) In the first column, for each dataset the figures within brackets denote the total number of points and the total number of attributes respectively; in the other columns, the figures within brackets are the standard deviations of classification. Comparisons use 10 individual random runs with 10-fold cross-validation.
So, the total complexity for the modified kNN algorithm, taking the normalization (Step 1) and partitioning (Step 2) into account, is O(n) + O(n) + O(nN) + O(nN log N) + O(nk) + O(nkN_C). Since k, N_C ≪ n, N, this reduces to O(nN log N).

3. Experimental results

We compared our modified kNN algorithm with some existing classification methods on standard datasets from the UCI machine learning data repository. In the tables given below, the figures appearing in bold represent the best performance for that particular dataset. Note that no feature selection strategy has thus far been incorporated, and we used Z-score normalized datasets for classification. We first show the average classification performance of our modified kNN method for 10-fold cross-validation with 10 random seeds in comparison to other well-known
methods like C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003). Table 1 demonstrates that our method MKNN yields the highest average accuracy (85.77%, averaged over 10 individual random runs with 10-fold cross-validation) among the above-mentioned classification methods on eight standard datasets. In addition, MKNN is the winner on 5 and the runner-up on 3 of these 8 datasets. We next compare our proposed method with the very recent work of Tahir and Smith (2010). From Table 2, it is evident that we outperform Tahir and Smith on 4 out of 6 datasets; here also we report the average performance over 10 individual random runs with 10-fold cross-validation. Our MKNN method yields a better average accuracy of 84.92% compared to 83.73% for Tahir and Smith. Note that Tahir and Smith (2010) and Yin and Han (2003) used different datasets with some overlap (as indicated by Tables 1 and 2), which is why we use two separate tables for the performance comparisons. We next show the individual impacts of (i) the affinity-based distance function and (ii) the similarity measure, using 10 individual random runs with 10-fold cross-validation, first changing the distance function (with our similarity function fixed) and then changing the similarity function (with our distance function fixed). Throughout, we take k = [√N], where N is the number of points used for training. In Table 3, we show the effectiveness of our affinity function as a measure of distance by comparing it with different distance functions following Tahir and Smith; the similarity function of Eq. (3) is used with all these distances. Our proposed distance function yields 86.84% and 87.22% average classification accuracy for 5-fold and 10-fold partitioning respectively, exceeding the other distance measures over the 10 individual random runs with 10-fold cross-validation.
The proposed affinity-based distance function wins in 9 and 8 cases respectively with 10-fold and 5-fold cross-validation out of a total of 15 datasets. In Fig. 3, we provide the range of accuracy obtained for all fifteen datasets using all the distance functions of Table 3. This is indicated by 15 vertical straight lines (the topmost point of each line indicates the maximum accuracy and the bottommost point the minimum accuracy). We also show the range of accuracy obtained from our MKNN method using fifteen rectangles (the top edge of each rectangle indicates the maximum accuracy and the bottom edge the minimum accuracy). We observe that in all the cases the rectangle intersects the straight line near the top, which clearly indicates the superiority of the proposed affinity-based distance measure.
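The comparison in Table 3 amounts to running the same kNN vote with different plug-in distance functions. A minimal sketch of that setup is shown below; the Euclidean, Manhattan and Chebyshev metrics are our illustrative stand-ins (the paper's affinity-based distance, Eq. (2), is not reproduced here), and `knn_predict` with its default k = [√N] is a hypothetical helper, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, dist, k=None):
    """kNN majority vote with a pluggable distance function;
    k defaults to [sqrt(N)] as used throughout the experiments."""
    if k is None:
        k = max(1, int(round(np.sqrt(len(X_train)))))
    d = np.array([dist(x, t) for t in X_train])
    votes = y_train[np.argsort(d)[:k]]
    return Counter(votes).most_common(1)[0][0]

# Three standard metrics in the spirit of Table 3's comparison
distances = {
    "euclidean": lambda a, b: np.linalg.norm(a - b),
    "manhattan": lambda a, b: np.abs(a - b).sum(),
    "chebyshev": lambda a, b: np.abs(a - b).max(),
}

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
for name, d in distances.items():
    print(name, knn_predict(X, y, np.array([0.2, 0.2]), d))
```

Because the classifier and the similarity function stay fixed while only `dist` changes, any accuracy differences are attributable to the distance measure alone, which is the point of Table 3.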
In Fig. 4, we present the classification accuracy obtained from our proposed kNN method for all fifteen datasets. To make the analysis complete, we include the accuracies for both 5-fold and 10-fold data partitioning in this figure. From the results of Table 3 and Fig. 4, we can conclude that the average classification accuracy for 10-fold partitioning changes only marginally from that of the 5-fold partitioning for the above datasets. Table 4 establishes the importance of the individual terms of the affinity function. The correlation coefficient between the classification accuracy of the affinity-based distance function and that of the first term of the affinity function alone is 0.999249. Likewise, the correlation coefficient between the classification accuracy of the affinity-based distance function and that of the second term alone is 0.999476. The sum of the differences of all scores and the sum of the differences of the negative scores in the last two rows of Table 4 also emphasize the effectiveness of the respective terms. In addition, the correlation coefficient between the relative gain or loss for the overall expression (x) and that with only the first term (y) is computed. Similarly, we obtain the correlation coefficient between the relative gain or loss for the overall expression (x) and that with only the second term (z). The respective values, viz., 0.978849 and 0.984858, are quite high. All these numerical arguments justify the presence of the two different components in the affinity-based distance function. In Table 5, we demonstrate the effectiveness of our proposed similarity function by keeping the affinity-based distance function (from Eq. (2)) unchanged. The proposed similarity function is compared with the one in Mitra et al. (2002a). The similarity function of Mitra et al. (2002a), when used in our proposed kNN algorithm, gives an accuracy of 85.47%.
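The correlation coefficients quoted above are ordinary Pearson correlations between per-dataset accuracy vectors. A minimal sketch, with illustrative accuracy values of our own invention (not the paper's numbers):

```python
import numpy as np

# Hypothetical per-dataset accuracies: the full affinity-based
# distance vs. its first term alone (illustrative values only)
acc_full  = np.array([88.0, 74.5, 92.1, 69.3, 85.0])
acc_term1 = np.array([87.6, 74.0, 91.5, 69.0, 84.4])

# Pearson correlation coefficient, as reported in Table 4
r = np.corrcoef(acc_full, acc_term1)[0, 1]
print(r)
```

A value of r near 1, as in the paper's 0.999249 and 0.999476, indicates that each term of the affinity function on its own tracks the behaviour of the full expression across datasets.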
In contrast, our proposed similarity function yields average classification accuracies of 87.22% and 86.75% for k = [√N] and for k = 3 respectively on the same datasets, using 10 individual random runs with 10-fold cross-validation. Our similarity function wins in 11 and 10 cases respectively for k = [√N] and k = 3 out of 15 cases. For the Ionosphere, Vehicle and Parkinson datasets, k = 3 gives a better result than k = [√N]. The classification accuracies for the Musk1 and Segmentation datasets do not change at all with the change of k. However, for the remaining datasets, k = [√N] gives better results than k = 3. We have additionally compared our results with those of kNN using different distance functions proposed in the literature, viz. the Mahalanobis distance (Mahalanobis, 1936), the Xing distance (Xing et al., 2002), the Large Margin Nearest Neighbor (LMNN)-based distance (Weinberger et al., 2005), the Information Theoretic Metric Learning (ITML)-based distance (Davis et al., 2007), the Kernel Relevant Component
Fig. 3. Comparison profile of the proposed affinity function with respect to the ranges of classification accuracy obtained from different distance functions.
Fig. 4. Classification accuracy for 5-fold and 10-fold data partitioning of 15 datasets using 10 individual random runs.
Table 6
Comparison of the proposed distance function with the distances mentioned by Wang and Jin (2009).

Data         Mahalanobis   Xing   LMNN   ITML   KRCA   IGML   KIGML   Our kNN (10-fold)
Wine         92.5          89.2   95.9   92.3   95.4   95.0   93.9    97.2
Glass        65.1          58.3   65.1   63.8   63.1   64.2   66.7    74.6
Pima         72.2          72.1   72.9   72.2   72.2   72.4   72.2    72.8
Sonar        71.1          71.1   79.7   71.7   73.5   71.9   85.4    86.4
Ionosphere   81.6          89.7   85.0   88.9   82.8   83.4   85.8    93.5
Average      76.5          76.1   79.7   77.8   77.4   77.4   80.8    84.9
Analysis (KRCA) distance (Tsang et al., 2005), the Information Geometric Metric Learning (IGML)-based distance (Wang and Jin, 2009) and the Kernel Information Geometric Metric Learning (KIGML)-based distance (Wang and Jin, 2009). The superiority of the proposed distance function over the above-mentioned distance functions is clearly evident from Table 6. In four out of the five datasets shown, our distance outperforms all seven distances. Moreover, the average accuracy using our distance is approximately 5–10% higher (about 5% higher than KIGML and about 10% higher than Xing) than that of the other seven distances.
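As a quick sanity check, the Average row of Table 6 can be reproduced directly from the per-dataset scores transcribed from the table (only three of the eight columns are shown here for brevity):

```python
# Per-dataset accuracies transcribed from Table 6
# (rows: Wine, Glass, Pima, Sonar, Ionosphere)
table6 = {
    "Xing":  [89.2, 58.3, 72.1, 71.1, 89.7],
    "KIGML": [93.9, 66.7, 72.2, 85.4, 85.8],
    "MKNN":  [97.2, 74.6, 72.8, 86.4, 93.5],  # our kNN, 10-fold
}
avg = {name: sum(v) / len(v) for name, v in table6.items()}
for name, a in avg.items():
    print(name, round(a, 1))
```

The resulting averages (76.1, 80.8 and 84.9) match the table, with MKNN leading KIGML by about 4 points and Xing by about 9 points.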
4. Conclusion and future work

In this paper, we propose a modified version of the classical kNN algorithm. In particular, we introduce an affinity function between a training point and a test point as a measure of distance. We also design a new similarity function using this newly proposed affinity-based distance function. Since the proposed kNN algorithm realizes vicinity-level learning while structuring the proximity functions (i.e., the distance and the similarity function), it can also be categorized as a locally adaptive kNN algorithm (Hastie and Tibshirani, 1996; Domeniconi et al., 2002). We have shown that each of the above modifications has a considerable influence on the performance of
the algorithm. Experimental results clearly indicate that the proposed method outperforms some well-known variants of the kNN algorithm. In recent years, asymmetric proximity functions (distance and similarity functions) have gained popularity over their traditional symmetric counterparts (McFee and Lanckriet, 2010). Note that the proposed affinity-based distance function and similarity function are both asymmetric. In fact, they are directed from a test point to a training point and capture more information, as they realize local-level learning about the concerned training point. A properly chosen value of k can potentially improve the classification results. Thus, in future, we will try to further improve our results by choosing a suitable value of k. We will also explore metric learning (McFee and Lanckriet, 2010) and Parzen-window-based learning (Parzen, 1962; Mitra et al., 2002b) to possibly enhance the performance of the proposed kNN algorithm. In this paper, we have experimented only with numerical data. So, another direction of future research will be to extend the current approach to the classification of categorical data (Boriah et al., 2008).

References

Atkeson, C.G., Moore, A.W., Schaal, S., 1997. Locally weighted learning. Artif. Intell. Rev. 11, 11–73.
Bailey, T., Jain, A., 1978. A note on distance weighted k-nearest neighbour rules. IEEE Trans. Systems Man Cybernet. 8, 311–313.
Baoli, L., Yuzhong, C., Shiwen, Y., 2002. A comparative study on automatic categorization methods for Chinese search engine. In: Proc. 8th Joint Internat. Computer Conf. Zhejiang University Press, Hangzhou, pp. 117–120.
Bermejo, S., Cabestany, J., 2000. Adaptive soft k-nearest-neighbour classifiers. Pattern Recognition 33, 1999–2005.
Billot, A., Gilboa, I., Schmeidler, D., 2008. Axiomatization of an exponential similarity function. Math. Soc. Sci. 55, 107–115.
Boriah, S., Chandola, V., Kumar, V., 2008. Similarity measures for categorical data: A comparative evaluation. In: Proc. SIAM Data Mining Conf., Atlanta, GA, pp. 243–254.
Cha, S.H., 2007. Comprehensive survey on distance/similarity measures between probability density functions. Internat. J. Math. Models and Methods Appl. Sci. 1 (4), 300–307.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2001. Introduction to Algorithms. MIT Press, USA.
Cover, T.M., Hart, P., 1967. Nearest neighbour pattern classification. IEEE Trans. Inform. Theory 13 (1), 21–27.
Davis, J., Kulis, B., Jain, P., Sra, S., Dhillon, I., 2007. Information-theoretic metric learning. In: Proc. ICML, Corvallis, Oregon, pp. 209–216.
Deza, E., Deza, M.M., 2006. Dictionary of Distances. Elsevier.
Dietterich, T.G., Wettschereck, D., Atkeson, C.G., Moore, A.W., 1993. Memory-based methods for regression and classification. In: Proc. NIPS, pp. 1165–1166.
Domeniconi, C., Peng, J., Gunopulos, D., 2002. Locally adaptive metric nearest neighbour classification. IEEE Trans. Pattern Anal. Machine Intell. 24 (9), 1281–1285.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, 2nd ed. John Wiley & Sons, New York.
Dudani, S.A., 1976. The distance-weighted k-nearest-neighbour rules. IEEE Trans. Systems Man Cybernet. 6, 325–332.
Fix, E., Hodges, J.L., 1951. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.
Frome, A., Singer, Y., Sha, F., Malik, J., 2007. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In: Proc. ICCV.
Fukunaga, K., Hostetler, L., 1975. k-Nearest-neighbour Bayes-risk estimation. IEEE Trans. Inform. Theory 21 (3), 285–293.
Gavin, D.G., Oswald, W.W., Wahl, E.R., Williams, J.W., 2003. A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quatern. Res. 60, 356–367.
Hastie, T., Tibshirani, R., 1996. Discriminant adaptive nearest neighbour classification. IEEE Trans. Pattern Anal. Machine Intell. 18 (6), 607–616.
Hastie, T., et al., 2009. The Elements of Statistical Learning, 2nd ed. Springer, p. 172.
Hellman, M.E., 1970. The nearest neighbour classification rule with a reject option. IEEE Trans. Systems Man Cybernet. 3, 179–185.
Jozwik, A., 1983. A learning scheme for a fuzzy k-NN rule. Pattern Recognition Lett. 1, 287–289.
Keller, J.M., Gray, M.R., Givens, J.A., 1985. A fuzzy k-nearest neighbour algorithm. IEEE Trans. Systems Man Cybernet. 15 (4), 580–585.
Kulis, B., 2010. Metric learning. ICML 2010 Tutorial.
Mahalanobis, P.C., 1936. On the generalised distance in statistics. In: Proc. National Institute of Sciences of India 2 (1), pp. 49–55.
McFee, B., Lanckriet, G., 2010. Metric learning to rank. In: Proc. ICML, Haifa, Israel, pp. 775–782.
Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
Mitra, P., Murthy, C.A., Pal, S.K., 2002a. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Machine Intell. 24 (3), 301–312.
Mitra, P., Murthy, C.A., Pal, S.K., 2002b. Density-based multiscale data condensation. IEEE Trans. Pattern Anal. Machine Intell. 24 (6), 734–747.
Monev, V., 2004. Introduction to similarity searching in chemistry. MATCH Commun. Math. Comput. Chem. 51, 7–38.
Parvin, H., Alizadeh, H., Minaei-Bidgoli, B., 2008. MKNN: Modified k-nearest neighbor. In: Proc. World Congress on Engineering and Computer Science (WCECS), San Francisco, USA.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065–1076.
Ricci, F., Avesani, P., 1996. Nearest neighbour classification with a local asymmetrically weighted metric. IRST, Protocol No. 9601-12.
Tahir, M.A., Smith, J., 2010. Creating diverse nearest neighbor ensembles using simultaneous metaheuristic feature selection. Pattern Recognition Lett. 31, 1470–1480.
Tsang, I., Cheung, P., Kwok, J., 2005. Kernel relevant component analysis for distance metric learning. In: Proc. IJCNN.
Wang, J., Neskovic, P., Cooper, L., 2007. Improving nearest neighbour rule with a simple adaptive distance measure. Pattern Recognition Lett. 28, 207–213.
Wang, S., Jin, R., 2009. Information geometry approach for distance metric learning. In: Proc. 12th Internat. Conf. on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 5, Clearwater Beach, Florida, USA.
Weinberger, K., Blitzer, J., Saul, L., 2005. Distance metric learning for large margin nearest neighbor classification. In: Proc. NIPS.
Wu, G., Chang, E.Y., Panda, N., 2005. Formulating distance functions via the kernel trick. In: Proc. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 703–709.
Xing, E., Ng, A., Jordan, M., Russell, S., 2002. Distance metric learning, with application to clustering with side-information. In: Proc. NIPS.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proc. 22nd Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 42–49.
Yin, X., Han, J., 2003. CPAR: Classification based on predictive association rules. In: Proc. SIAM Internat. Conf. on Data Mining (SDM), San Francisco, CA, USA.
Zezula, P., Amato, G., Dohnal, V., Batko, M., 2006. Similarity Search: The Metric Space Approach. Springer.