Pattern Recognition Letters 33 (2012) 356–363


An affinity-based new local distance function and similarity measure for kNN algorithm

Gautam Bhattacharya (a), Koushik Ghosh (b), Ananda S. Chowdhury (c,*)

(a) Department of Physics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India
(b) Department of Mathematics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India
(c) Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700032, India
* Corresponding author. Tel.: +91 33 2414 6666x2405; fax: +91 33 2414 6217. E-mail addresses: [email protected] (G. Bhattacharya), [email protected] (K. Ghosh), [email protected] (A.S. Chowdhury).
doi:10.1016/j.patrec.2011.10.021

Article info: Received 16 March 2011; available online 11 November 2011. Communicated by N. Sladoje.
Keywords: kNN; Affinity function; Similarity measure

Abstract

In this paper, we propose a modified version of the k-nearest neighbor (kNN) algorithm. We first introduce a new affinity function for the distance measure between a test point and a training point; this is an approach based on local learning. A new similarity function using this affinity function is then proposed for the classification of the test patterns. The widely used convention k = [√N] is employed, where N is the number of data points used for training. The proposed modified kNN algorithm is applied to fifteen numerical datasets from the UCI machine learning repository, using both 5-fold and 10-fold cross-validation. The average classification accuracy obtained from our method is found to exceed that of several well-known classification methods. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

Appropriate measures of distance and similarity are two prime issues in the field of pattern recognition. The last century witnessed a series of efforts to explore novel measures of distance and similarity for pattern classification, clustering and information retrieval problems (Cha, 2007; Duda et al., 2001; Deza and Deza, 2006; Zezula et al., 2006; Monev, 2004; Gavin et al., 2003). From a mathematical point of view, distance is defined as a quantitative degree of how far apart two objects are; one synonym for distance is dissimilarity. Distance measures satisfying the metric properties are termed 'metrics', while non-metric distance measures are often called 'divergences'. In contrast, similarity measures proximity. In traditional algebra, similarity is often expressed as an inner product in a suitable vector space. This concept is suitably adapted for numerical datasets, where similarity is often expressed through weights attached to the proposed distances (Cha, 2007; Zezula et al., 2006). Fix and Hodges (1951) introduced a non-parametric method for pattern classification that is known as the nearest neighbor rule. The nearest neighbor rule is one of the most popular algorithms and has long been used in pattern recognition, exploratory data analysis and data mining. Typically, the k nearest neighbors of an unknown sample in the training set are found using a predefined distance, and the class label of the unknown sample is then predicted to be the most frequent label among these k nearest neighbors.

Some advantages of the kNN algorithm are: (a) its inherent simplicity; (b) its robustness to noisy training data, especially if the inverse square of the weighted distance is used as the "distance" measure; and (c) its effectiveness when the training set is large. Many researchers have found that the kNN algorithm achieves good performance in their experiments on different data sets (Cover and Hart, 1967; Domeniconi et al., 2002; Michie et al., 1994; Wang et al., 2007; Yang and Liu, 1999; Baoli et al., 2002). Cover and Hart showed that for k = 1 and n → ∞ (n denotes the number of sample points) the kNN classification error is bounded above by twice the Bayes error rate. Over the years, researchers have proposed new rejection approaches (Helman, 1970), refinements with respect to the Bayes error rate (Fukunaga and Hostetler, 1975), distance-weighted approaches (Dudani, 1976; Bailey and Jain, 1978), soft computing techniques (Bermejo and Cabestany, 2000) and fuzzy methods (Jozwik, 1983; Keller et al., 1985) as possible enhancements to the classical kNN algorithm. It has been observed that, despite the above-mentioned advantages, the performance of the kNN algorithm strongly depends on the following factors: (a) the optimum value of the parameter k (the number of nearest neighbors), (b) the choice of a proper distance measure (Parvin et al., 2008) and (c) the selection of an appropriate similarity measure (Mitra et al., 2002a). In the present work, we address these critical issues to improve the performance of the kNN algorithm. We take k = [√N] following Mitra et al. (2002a), where N is the number of training points and the symbol '[ ]' stands for the greatest integer function; such a choice provides a fair analytic estimate of k.
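For reference, the sketch below shows the classical kNN rule with the k = [√N] convention, using a plain Euclidean distance and a simple majority vote. This is our own illustrative Python code, not from the paper; the function and variable names are assumptions introduced for the example.

```python
import numpy as np

def classical_knn_predict(X_train, y_train, x_test):
    """Classical kNN with Euclidean distance and k = [sqrt(N)] (illustrative sketch)."""
    N = len(X_train)
    k = int(np.sqrt(N))                                 # k = [sqrt(N)], greatest-integer convention
    dists = np.linalg.norm(X_train - x_test, axis=1)    # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority vote among the k neighbours
```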


In addition to conventional distances such as the Euclidean, Minkowski and Chebyshev distances, several other distance functions have been proposed over the years. Some prominent examples are the Mahalanobis distance (Mahalanobis, 1936), the Xing distance (Xing et al., 2002), the Large Margin Nearest Neighbor (LMNN)-based distance (Weinberger et al., 2005), the Information Theoretic Metric Learning (ITML)-based distance (Davis et al., 2007), the Kernel Relevant Component Analysis (KRCA) distance (Tsang et al., 2005), the Information Geometric Metric Learning (IGML)-based distance (Wang and Jin, 2009) and the Kernel Information Geometric Metric Learning (KIGML)-based distance (Wang and Jin, 2009). In most cases the distance function is linear in nature because of its simplicity of description and efficiency of computation; however, this simplicity is often insufficient to model similarity in many real-world datasets (Wu et al., 2005). In this work, a new nonlinear affinity function for the distance measure is introduced, which closely resembles the non-Mahalanobis local distance functions (Frome et al., 2007; Kulis, 2010) and the local, asymmetrically weighted learning captured in the lazy-learning and memory-based learning works of Dietterich et al. (1993), Atkeson et al. (1997) and Ricci and Avesani (1996). Other well-known learning-based distance functions can be found in Wang and Jin (2009). Our proposed distance also captures the effects of the other training points for any particular feature. We further introduce a new similarity function (for measuring the proximity of the test point to the training points) using the newly proposed distance function. Depending upon the classes of the first k nearest-neighbor training points, scores are allotted to a test point, and the final classification decision for the test point is made on the basis of these scores. Our goal in this paper is to improve the classification accuracy through a proper formulation of the distance and similarity functions without using detailed metric learning (McFee and Lanckriet, 2010) or Parzen-window-based learning (Parzen, 1962; Mitra et al., 2002b). It is, however, relevant to mention that local learning is implicitly used in the proposed affinity-based distance and similarity functions. We compare the mean classification accuracy on eight datasets with C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003), and on six datasets with Tahir and Smith (2010). The mean accuracy obtained from our method (using 10-fold cross-validation with 10 individual random seeds) is found to outperform the mean accuracies of both Yin and Han, and Tahir and Smith.

The rest of the paper is organized as follows: in Section 2, we describe in detail the theory of the proposed kNN classifier and provide a time-complexity analysis. In Section 3, we analyze the experimental results and present a performance comparison with the other methods mentioned above. The paper is concluded in Section 4 with an outline of future research directions.

2. Proposed kNN classifier algorithm

In this section, we describe the theoretical basis of the proposed kNN classifier. In particular, we propose an affinity-based distance function and a new similarity measure within the existing kNN algorithm. We also analyze the time complexity of our method (Cormen et al., 2001).

2.1. Algorithmic details

Let us assume that each pattern in a typical pattern classification problem is described by d observable, well-defined, identically distributed and mutually independent features, so that each pattern is represented as a point in the d-dimensional feature space. Let n be the total number of sample points and N be the total number of training points (patterns). These N points in the feature space, namely X_1, X_2, ..., X_N, are already classified into M classes C_1, C_2, ..., C_M, where class C_j contains N_j points for j = 1, 2, ..., M such that \sum_{j=1}^{M} N_j = N. The goal is to classify a new point X in this feature space using the N given sample points and the M given classes. The points are expressed as

X_j = (X_{j1}, X_{j2}, \ldots, X_{jd}), \quad j = 1, 2, \ldots, N; \qquad X = (X_1, X_2, \ldots, X_d)

Now, we describe our modified algorithm through the following steps.

Step 1: In this work we experiment solely with numerical datasets. Each attribute is first centered by subtracting its mean and then scaled by dividing by its standard deviation; thus, we use a normalized representation of the original data.

Step 2: We employ both 5-fold and 10-fold partitioning and cross-validation of the data. For the 5-fold partitioning, five partitions of the total data are made, each containing approximately [n/5] data points for testing and the rest for training. Similarly, for the 10-fold partitioning, 10 partitions are used, each carrying approximately [n/10] data points for testing and the rest for training. The partitioning is done in a completely random manner.

Step 3: The task here is to determine the distances of all the training points from a test point. There exist many conventional distance functions, such as the Euclidean distance, City block (Manhattan) distance, Chebyshev distance, Minkowski distance, Canberra distance, Bray-Curtis (Sorensen) distance, angular separation, correlation coefficient and Mahalanobis distance. One more important contribution for the distance between two d-dimensional points X_p and X_q is given by (Cover and Hart, 1967; Domeniconi et al., 2002; Michie et al., 1994):

D_{pq} = \left[ \sum_{j=1}^{d} \left( \frac{X_{pj} - X_{qj}}{\max_j - \min_j} \right)^2 \right]^{1/2}        (1)

where max_j and min_j are the maximum and the minimum values computed over all the training points along the jth axis. A very recent trend is to consider attribute-wise local learning while framing the distance. In this connection, the non-Mahalanobis local distance function is a very useful one (Frome et al., 2007; Kulis, 2010), in which the distance between an arbitrary (e.g., test) point X_t and a training point X_i is proposed as follows:

d(X_t, X_i) = \sum_{j=1}^{d} w_{ij} \, d_{tij}

where w_{ij} is the weight for the ith training point along the jth feature and d_{tij} is the distance between the test point X_t and the training point X_i along the jth feature. Keeping with the concept of the non-Mahalanobis local distance, we propose a new distance formula in terms of an affinity measurement between a training point and a test point. We define the affinity between a d-dimensional test point X_t and any d-dimensional training point X_i in the following manner:


d(X_t, X_i) = \sum_{j=1}^{d} |X_{tj} - X_{ij}| \left[ \left( \frac{\sum_{l=1}^{N} |X_{tj} - X_{lj}|}{\sum_{l=1}^{N} |X_{tj} - X_{lj}| + \sum_{m=1, m \neq i}^{N} |X_{ij} - X_{mj}|} \right)^{1/2} + \left( \frac{\sum_{l=1}^{N} |X_{tj} - X_{lj}|}{\sum_{m=1, m \neq i}^{N} |X_{ij} - X_{mj}|} \right)^{1/2} \right]        (2)
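To make Eq. (2) concrete, the following Python sketch computes the proposed affinity-based distance exactly as written above. This is our own illustrative code, not part of the paper; it assumes X_train is an N x d NumPy array of normalized training points, x_t a length-d test vector, and non-degenerate data so that no denominator is zero.

```python
import numpy as np

def affinity_distance(x_t, X_train, i):
    """Affinity-based distance d(X_t, X_i) of Eq. (2) between the test point x_t
    and the i-th training point, accumulated attribute by attribute."""
    N, d = X_train.shape
    total = 0.0
    for j in range(d):
        gap = abs(x_t[j] - X_train[i, j])                        # |X_tj - X_ij|
        s_test = np.sum(np.abs(x_t[j] - X_train[:, j]))          # sum_l |X_tj - X_lj|
        others = np.delete(X_train[:, j], i)
        s_train = np.sum(np.abs(X_train[i, j] - others))         # sum_{m != i} |X_ij - X_mj|
        term1 = np.sqrt(s_test / (s_test + s_train))             # relative-effect term
        term2 = np.sqrt(s_test / s_train)                        # relative-affinity term
        total += gap * (term1 + term2)
    return total
```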


In this approach, in order to compute the distance between a test point X_t and a training point X_i (i = 1, 2, ..., N), we consider all the corresponding attribute-wise gaps. To find the distance between a test point and a particular training point, we take the sum of the products of (i) the absolute gap between the attribute-wise entries, which is the first factor on the right-hand side of Eq. (2), and (ii) a weight function, the sum of the two terms shown within the parentheses on the right-hand side of Eq. (2), which can be regarded as the proposed weight function w_{ij} of the non-Mahalanobis local distance. The proposed affinity-based distance function cannot be kernel-based, as it does not satisfy the fundamental property d(x, y) = [k(x, x) + k(y, y) - 2k(x, y)]^{1/2}, where d(x, y) and k(x, y) stand for the distance and the kernel, respectively, between two points x and y. This happens because we do not take the weights w_{ij} as constants; a non-Mahalanobis local distance function with constant weight coefficients would be kernel-based. The numerator of the first term in this weight function is the sum of all the attribute-wise absolute gaps between the test point and all the training points. The denominator of the first term has two additive components, namely (i) the numerator itself and (ii) the sum of all the attribute-wise absolute gaps of all the training points from the concerned training point. The first term therefore signifies the relative (fractional) effect of all the training points on the concerned test point with respect to their effect on both the concerned test point and the concerned training point. The numerator of the second term is exactly the same as that of the first term, while its denominator is the sum of the distances of all the training points from the concerned training point only. So, the second term represents the relative (fractional) affinity of all the training points to the test point with respect to their affinity to the concerned training point.

To illustrate the newly proposed concept we take the help of Figs. 1 and 2. In both figures, the solid lines denote the distances between the test point X_t and the training points (the concerned training point X_1 and the other training points), whereas the dotted lines denote the distances of the concerned training point X_1 from the other training points. In both figures, the distance between X_t and the first training point X_1 (only the indices of the training points are shown, i.e., the point labelled i denotes X_i), as calculated in the usual schemes, looks alike (bold black lines); according to our newly proposed affinity function, however, these distances are not the same. This is because the training points in Fig. 1 are more clustered than in Fig. 2. In light of these considerations, the numerators and denominators of both terms within the parentheses of the proposed affinity function corresponding to Fig. 2 will be greater than those corresponding to Fig. 1, which in turn suggests that the test point X_t may have a different likelihood towards the cluster of training points shown in these two figures. From the above discussion, it may be stated that our proposed affinity function incorporates more information, as it takes into account not only the traditional distance between a training point and a test point but also the impact of the spatial distribution of the training points. The first term signifies the relative positional influence of the concerned test point in the system taken as a whole, while the second term depicts the comparative influence of the concerned test point and the concerned training point. Hence, we perform vicinity-level learning while computing the affinity between a selected test point and a selected training point. In Table 3 we compare our affinity function with other distance functions in terms of finding the nearest neighbors, and Table 4 highlights the effectiveness of the 1st and the 2nd terms of our affinity function given in Eq. (2). The dotted lines in Figs. 1 and 2 also implicitly give a measurement of the cluster density of the training points, i.e., our proposed affinity function has a connection with the density of the sample space. Later in the paper, we show experimentally (via Table 4) that each of these two terms (expressed as a ratio), when taken individually, bears a strong correlation with d(X_t, X_i). From Eq. (2), it is evident that the measure is not applicable between two test points, since it depends on the influence of the training points on a test point and not the reverse. Thus, the symmetry property of d(X_t, X_i) cannot be established. We now state and prove two other important properties of d(X_t, X_i).

Property 1. d(X_t, X_i) is non-negative.

Proof. Each component in d(X_t, X_i), i.e., the attribute-wise coefficient and the terms appearing in the numerators and the denominators of the two additive components (each expressed as a ratio), carries a modulus sign. So, d(X_t, X_i) cannot be negative.

Property 2. d(X_t, X_i) satisfies the identity of indiscernibles.

Proof. If we take X_t = X_i, the first factor in each term of the sum becomes zero, making the entire measure zero. So, the proposed measure satisfies the identity of indiscernibles. Thus, we can conclude that d(X_t, X_i) is positive definite. Based on the above observations, we regard d(X_t, X_i) as an affinity-based distance function.

Step 4: We next discuss the formulation of a similarity function using the corresponding affinity function. A number of choices exist for the similarity function. One classical approach is to take the similarity measure in the form of an inverse distance. Another effective and popular strategy is the use of an exponential similarity. For example, see the work of Mitra et al. (2002a), who developed a similarity function of the form Sim_pq = exp(-β D_pq), where β is a positive constant given by β = -ln 0.5 / D, with D the average distance between data points computed over the entire data set. This value of β is estimated by taking the most expected value of similarity between any two points as 0.5, which sometimes may lead to misclassification. A similar form of similarity function (Billot et al., 2008) is Sim(z, x) = exp(-m(x - z)) for some norm m on R^m. In the present work we set up a new similarity function in the following way:

Sim(X_t, X_i) = 1, \quad \text{if } X_t = X_i

Sim(X_t, X_i) = \exp\left( - \frac{N \cdot d \cdot d(X_t, X_i)}{\sum_{l=1}^{N} d(X_t, X_l)} \right), \quad \text{if } X_t \neq X_i        (3)

Here, i = 1, 2, ..., N and d(X_t, X_i) is given by Eq. (2). From Eq. (3), we can see that Sim(X_t, X_i) tends to zero for extremely large values of d(X_t, X_i).
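A matching sketch of the similarity of Eq. (3) is given below, reusing the affinity_distance function from the previous sketch. Again, this is our own illustrative code under the assumptions stated there, not part of the paper.

```python
import numpy as np

def similarity(x_t, X_train, i, dists):
    """Similarity Sim(X_t, X_i) of Eq. (3). `dists` holds the affinity-based
    distances from x_t to every training point, i.e. dists[l] = d(X_t, X_l)."""
    N, d = X_train.shape
    if dists[i] == 0.0:                        # X_t coincides with X_i (identity of indiscernibles)
        return 1.0
    avg_per_feature = dists.sum() / (N * d)    # average distance per feature (denominator of Eq. (3))
    return float(np.exp(-dists[i] / avg_per_feature))

# Usage sketch:
# dists = np.array([affinity_distance(x_t, X_train, l) for l in range(len(X_train))])
# sims  = [similarity(x_t, X_train, l, dists) for l in range(len(X_train))]
```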

[Figures 1 and 2 show two training-point distributions around the test point X_t; solid lines mark the distances d_1, ..., d_N from X_t and dotted lines the distances α_12, ..., α_1N from X_1.]

Fig. 1. First type of distribution of training points with respect to the test point.

Fig. 2. Second type of distribution of training points with respect to the test point.

Table 1
Comparison of the proposed method with C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003).

Dataset | C4.5 | RIPPER | CBA | CMAR | CPAR | Modified kNN (MKNN)
IRIS | 95.3 | 94.0 | 94.7 | 94.0 | 94.7 | 94.53
WINE | 92.7 | 91.6 | 95.0 | 95.0 | 95.5 | 97.19
GLASS | 68.7 | 69.1 | 73.9 | 70.1 | 74.4 | 74.63
PIMA | 75.5 | 73.1 | 72.9 | 75.1 | 73.8 | 72.80
BREAST | 95.0 | 95.1 | 96.3 | 96.4 | 96.0 | 96.58
SONAR | 70.2 | 78.4 | 77.5 | 79.4 | 79.3 | 86.41
IONOSPHERE | 90.0 | 91.2 | 92.3 | 91.5 | 92.6 | 93.49
VEHICLE | 72.6 | 62.7 | 68.7 | 68.8 | 69.5 | 70.54
AVERAGE | 82.50 | 81.90 | 83.91 | 83.79 | 84.48 | 85.77

We carefully noticed that the above similarity function exhibits better performance than other existing approaches; this fact is demonstrated elaborately in Table 4. It can be noticed that our newly proposed similarity function has some structural similarity with the Gaussian radial basis kernel function K(x_t, x_i) = exp(-m ||x_t - x_i||^2) (Hastie et al., 2009). However, in our expression of similarity we use our own distance/affinity measurement. In the exponential function, the denominator is the average distance per feature, which depends on the separation of the training and test data, on the dimension of the training dataset, and on the cross-validation partitioning; thus it is not at all the same as the constant m of the radial basis kernel. In fact, the proposed similarity function captures more information than a standard radial basis kernel: our similarity takes the exponential of a dimensionless quantity, whereas the radial basis kernel is the exponential of a (scaled) squared Euclidean distance. Moreover, our affinity function is not symmetric, whereas symmetry is mandatory for a radial basis kernel. The effectiveness of our similarity function for classification is established in Table 5 through a comparison with the similarity function used by Mitra et al. (2002a).

Step 5: Using the proposed affinity function of Eq. (2), we obtain the distances of all the training points from any test point. We first sort these distances in ascending order and then mark the k nearest neighboring points of the test point as Y_1, Y_2, ..., Y_k, arranged in order of increasing distance.

Step 6: The next task is to allot a score to each class relative to the test point X_t, given by Sc(X_t, C_j) for j = 1, 2, ..., M. In Bermejo and Cabestany (2000), the score function was given by the following equation:

Table 2
Comparison of the proposed method with that of Tahir and Smith (2010).

Dataset | Tahir and Smith (a) | Modified kNN (MKNN) (a)
SONAR | 87.1 (6.53) | 86.41 (7.82)
IONOSPHERE | 92.2 (4.53) | 93.49 (4.30)
VEHICLE | 70.7 (3.60) | 70.54 (4.21)
WDBC | 95.5 (2.45) | 95.78 (2.30)
SPECTF | 70.7 (7.71) | 74.55 (7.41)
MUSK1 | 86.2 (3.84) | 88.74 (4.33)
AVERAGE | 83.73 (4.78) | 84.92 (5.06)

(a) The values in parentheses are the standard deviations of classification.


Table 3
Comparison of our distance measure with distance measures in (Tahir and Smith, 2010). For each dataset, the figures in parentheses after its name denote the total number of points and attributes; in the accuracy columns they denote standard deviations.

Dataset | Euclidean | Squared Euclidean | City block | Canberra | Squared-chord | Squared-chi-squared | Proposed distance (5-fold) | Proposed distance (10-fold)
IRIS(150, 4) | 95.29 (5.91) | 95.15 (6.08) | 94.45 (6.39) | 93.24 (6.75) | 95.14 (6.20) | 95.14 (6.20) | 94.69 (3.98) | 94.53 (6.39)
WINE(178, 13) | 96.34 (4.71) | 95.95 (4.95) | 97.18 (3.72) | 92.98 (6.02) | 95.95 (4.34) | 94.76 (5.48) | 96.85 (2.70) | 97.19 (4.12)
GLASS(214, 9) | 69.67 (8.79) | 69.61 (8.43) | 73.28 (8.38) | 71.73 (8.73) | 70.60 (8.96) | 69.76 (8.77) | 74.31 (5.28) | 74.63 (8.58)
PIMA_DIABETES(768, 8) | 73.34 (4.56) | 72.24 (5.25) | 72.42 (4.82) | 69.67 (5.08) | 70.41 (4.49) | 70.40 (4.57) | 72.78 (2.67) | 72.80 (5.13)
BREAST(699, 10) | 95.78 (2.31) | 95.15 (2.44) | 95.94 (2.25) | 96.47 (2.11) | 95.04 (2.41) | 95.17 (2.59) | 96.52 (1.60) | 96.58 (2.14)
SONAR(208, 60) | 85.45 (8.50) | 85.99 (8.21) | 87.41 (7.93) | 78.82 (9.07) | 87.70 (6.96) | 84.38 (8.60) | 86.15 (5.86) | 86.41 (7.82)
IONOSPHERE(351, 34) | 87.29 (5.48) | 87.09 (5.60) | 90.60 (4.69) | 78.94 (6.92) | 86.26 (5.84) | 75.54 (7.18) | 93.20 (3.09) | 93.49 (4.30)
VEHICLE(846, 18) | 69.92 (4.29) | 69.85 (4.29) | 70.10 (4.32) | 68.22 (4.61) | 69.49 (4.20) | 69.62 (4.48) | 70.85 (2.71) | 70.54 (4.21)
WDBC(569, 31) | 95.40 (2.74) | 95.13 (2.53) | 95.85 (2.20) | 93.81 (3.18) | 95.11 (2.44) | 94.80 (2.61) | 95.90 (1.54) | 95.78 (2.30)
SPECTF(267, 44) | 71.12 (7.92) | 70.21 (7.94) | 70.61 (7.99) | 71.10 (7.63) | 70.89 (8.51) | 70.80 (7.38) | 74.58 (4.57) | 74.55 (7.41)
MUSK1(476, 166) | 88.59 (4.50) | 88.53 (4.61) | 86.03 (4.67) | 76.81 (5.85) | 87.90 (4.53) | 83.55 (5.24) | 87.90 (3.57) | 88.74 (4.33)
BREAST-TISSUE(106, 10) | 85.84 (9.82) | 85.45 (9.70) | 85.89 (9.53) | 81.41 (11.18) | 81.86 (11.06) | 79.42 (11.29) | 87.15 (6.03) | 87.23 (8.93)
PARKINSON(195, 23) | 96.20 (4.38) | 96.15 (4.26) | 96.60 (4.12) | 92.98 (5.37) | 94.97 (4.39) | 96.01 (4.53) | 97.08 (2.68) | 97.24 (4.00)
SEGMENTATION(210, 18) | 85.05 (8.46) | 84.67 (8.40) | 86.81 (7.27) | 85.57 (7.36) | 85.57 (7.73) | 86.38 (7.80) | 86.67 (5.22) | 86.62 (7.56)
ECOLI(336, 8) | 92.07 (4.22) | 92.12 (4.21) | 93.02 (4.21) | 89.51 (4.78) | 91.35 (4.26) | 90.61 (4.32) | 87.90 (3.57) | 91.98 (4.51)
AVERAGE | 85.82 (5.77) | 85.55 (5.79) | 86.41 (5.50) | 82.75 (6.31) | 85.22 (5.76) | 83.76 (6.07) | 86.84 (3.67) | 87.22 (5.45)


Table 4
Comparison of scores for the significance of the individual terms of the affinity function. For each dataset, the figures in parentheses denote the total number of points and attributes. Here x, y and z denote the differences between, respectively, the 'Both terms', '1st term only' and '2nd term only' accuracies and the maximum accuracy from one of (Yin and Han, 2003; Tahir and Smith, 2010).

Dataset | Both terms of Eq. (2) | 1st term only | 2nd term only | Max. accuracy from (Yin and Han, 2003; Tahir and Smith, 2010) | x | y | z
IRIS(150, 4) | 94.53 | 94.52 | 94.53 | 95.3 | -0.8 | -0.8 | -0.8
WINE(178, 13) | 97.19 | 96.79 | 96.96 | 95.5 | 1.7 | 1.3 | 1.5
GLASS(214, 9) | 74.63 | 73.94 | 73.68 | 74.4 | 0.2 | -0.5 | -0.7
PIMA_DIABETES(768, 8) | 72.80 | 72.55 | 72.88 | 75.5 | -2.7 | -2.9 | -2.6
BREAST(699, 10) | 96.58 | 96.59 | 96.61 | 96.4 | 0.2 | 0.2 | 0.2
SONAR(208, 60) | 86.41 | 86.88 | 86.55 | 87.1 | -0.7 | -0.2 | -0.5
IONOSPHERE(351, 34) | 93.49 | 92.74 | 93.63 | 92.6 | 0.9 | 0.1 | 1.0
VEHICLE(846, 18) | 70.54 | 70.87 | 70.70 | 72.6 | -2.1 | -1.7 | -1.9
WDBC(569, 31) | 95.78 | 95.82 | 95.69 | 95.5 | 0.3 | 0.3 | 0.2
SPECTF(267, 44) | 74.55 | 74.25 | 74.89 | 70.7 | 3.8 | 3.5 | 4.2
MUSK1(476, 166) | 88.74 | 88.03 | 88.99 | 86.2 | 2.5 | 1.8 | 2.8
Correlation with the 'Both terms' column | - | 0.999249 | 0.999476 | - | - | - | -
Sum of all scores | - | - | - | - | 3.4 | 1.2 | 3.3
Sum of -ve scores | - | - | - | - | -6.2 | -6.1 | -6.6

Sc(X_t, C_j) = \sum_{i=1}^{k} Z(Y_i, C_j), \quad j = 1, 2, \ldots, M        (4)

Note that the function Z in the above equation can take only two values, 0 and 1: Z(Y_i, C_j) = 1 if Y_i ∈ C_j, and Z(Y_i, C_j) = 0 otherwise. The test point X_t is allotted to the class for which the score Sc is maximum. The score function is modified in the following manner (Monev, 2004; Domeniconi et al., 2002):

Sc(X_t, C_j) = \sum_{i=1}^{k} Sim(X_t, Y_i) \, Z(Y_i, C_j), \quad j = 1, 2, \ldots, M        (5)

where Z is the same as in Eq. (4) and Sim(X_t, Y_i) is given by Eq. (3). In this paper, we use Eq. (5) to compute the scores.

2.2. Time-complexity analysis

For n samples, B-fold partitioning is used for the purpose of cross-validation. Let N be the number of training samples, and let k and N_C denote the number of nearest-neighbor points and the number of classes, respectively. We now present a detailed worst-case time-complexity analysis.

Step 1: Complexity of normalizing n samples: O(n).
Step 2: Complexity of partitioning n samples: O(n).
For each test sample:
Steps 3-4: Complexity of calculating the distances to N training samples: O(N).
Step 5: Complexity of sorting N training samples and calculating the similarity of k samples: O(N log N) + O(k).
Step 6: Complexity of score calculation: O(kN_C).
Total complexity for each test sample: O(N) + O(N log N) + O(k) + O(kN_C).

For B-fold partitioning of n samples, the number of test samples in each partition is n/B. Considering all test samples in all partitions, the total complexity is O(B(n/B)N) + O(B(n/B)N log N) + O(B(n/B)k) + O(B(n/B)kN_C) = O(nN) + O(nN log N) + O(nk) + O(nkN_C). Taking the normalization (Step 1) and partitioning (Step 2) parts into account, the total complexity of the modified kNN algorithm is O(n) + O(n) + O(nN) + O(nN log N) + O(nk) + O(nkN_C). Since k, N_C ≪ n, N, this reduces to O(nN log N).
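Putting Steps 1-6 together, a compact end-to-end sketch of the proposed classifier (z-score normalization, affinity-based distances, similarity weighting, and the class score of Eq. (5)) is given below. This is our own illustrative code under the assumptions stated earlier, reusing affinity_distance and similarity from the previous sketches; cross-validation and the handling of constant attributes are omitted for brevity.

```python
import numpy as np

def mknn_classify(X_train, y_train, x_test):
    """Classify a single test point with the modified kNN of Section 2 (sketch)."""
    # Step 1: z-score normalization using the training statistics
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Xn, xn = (X_train - mu) / sigma, (x_test - mu) / sigma

    N = len(Xn)
    k = int(np.sqrt(N))                                    # k = [sqrt(N)]

    # Steps 3-5: affinity-based distances (Eq. (2)) and the k nearest neighbours
    dists = np.array([affinity_distance(xn, Xn, i) for i in range(N)])
    neighbours = np.argsort(dists)[:k]

    # Step 6: class scores of Eq. (5), weighted by the similarity of Eq. (3)
    scores = {}
    for i in neighbours:
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + similarity(xn, Xn, i, dists)
    return max(scores, key=scores.get)                     # class with the maximum score
```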

Table 5
Comparison of the proposed similarity function with the one used by Mitra et al. (2002a).(a)

Dataset | Proposed distance with proposed similarity, k = [√N] | Proposed distance with proposed similarity, k = 3 | Proposed distance with the similarity of Mitra et al. (2002a), k = [√N]
IRIS(150, 4) | 94.53 (6.39) | 93.5 (6.64) | 95.43 (5.45)
WINE(178, 13) | 97.19 (4.12) | 97.13 (3.75) | 96.07 (4.18)
GLASS(214, 9) | 74.63 (8.58) | 73.91 (8.75) | 69.39 (9.66)
PIMA_DIABETES(768, 8) | 72.80 (5.13) | 70.43 (5.27) | 75.40 (4.79)
BREAST(699, 10) | 96.58 (2.14) | 95.81 (2.22) | 96.25 (2.20)
SONAR(208, 60) | 86.41 (7.82) | 86.06 (7.91) | 77.95 (9.57)
IONOSPHERE(351, 34) | 93.49 (4.30) | 93.57 (4.26) | 91.94 (4.60)
VEHICLE(846, 18) | 70.54 (4.21) | 70.57 (4.24) | 70.03 (4.67)
WDBC(569, 31) | 95.78 (2.30) | 95.73 (2.30) | 95.38 (2.82)
SPECTF(267, 44) | 74.55 (7.41) | 74.21 (7.56) | 78.58 (7.39)
MUSK1(476, 166) | 88.74 (4.33) | 88.74 (4.33) | 84.21 (5.98)
BREAST-TISSUE(106, 10) | 87.23 (8.93) | 86.04 (9.32) | 81.55 (10.44)
PARKINSON(195, 23) | 97.24 (4.00) | 97.71 (3.63) | 92.16 (5.21)
SEGMENTATION(210, 18) | 86.62 (7.56) | 86.62 (7.68) | 85.10 (7.87)
ECOLI(336, 8) | 91.98 (4.51) | 91.29 (4.52) | 92.67 (4.72)
AVERAGE | 87.22 (5.45) | 86.75 (5.49) | 85.47 (5.97)

(a) For each dataset, the figures in parentheses after its name denote the total number of points and attributes; in the other columns they are the standard deviations of classification. Comparisons use 10 individual random runs with 10-fold cross-validation.

3. Experimental results

We compared our modified kNN algorithm with several existing methods on standard datasets from the UCI machine learning repository. Note that no feature-selection strategy is incorporated, and we use z-score-normalized data for classification. We first show the average classification performance of our modified kNN method, using 10-fold cross-validation with 10 random seeds, in comparison to other well-known methods, namely C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003).


Table 1 demonstrates that our method (MKNN) yields the highest average accuracy (85.77%, taking the average performance of 10 individual random runs with 10-fold cross-validation) among the above-mentioned classification methods for eight standard datasets. In addition, MKNN is the winner on 5 and the runner-up on 3 of these 8 datasets. We next compare our proposed method with a very recent work by Tahir and Smith (2010). From Table 2, it is evident that we outperform Tahir and Smith on 4 out of 6 datasets; here also we report the average performance over 10 individual random runs with 10-fold cross-validation. Note that our MKNN method yields a better average accuracy of 84.92%, as compared to 83.73% for Tahir and Smith. Tahir and Smith (2010) and Yin and Han (2003) used different datasets with some overlap (as indicated by Tables 1 and 2), which is why we use two different tables for the performance comparisons. We next show the individual impacts of (i) the affinity-based distance function and (ii) the similarity measure, using 10 individual random runs with 10-fold cross-validation, by changing the distance function (while keeping our similarity function) and, separately, the similarity function (while keeping our distance function). Everywhere we take k = [√N], where N is the number of data points used for training. In Table 3, we show the effectiveness of our affinity function as a measure of distance by comparing it with different distance functions, following Tahir and Smith; the similarity function of Eq. (3) is used with all the distances. Our proposed distance function yields 86.84% and 87.22% average classification accuracy for the 5-fold and 10-fold partitioning, respectively, which exceeds the other distance measures over the 10 individual random runs. The proposed affinity-based distance function wins in 9 and 8 of the 15 datasets with 10-fold and 5-fold cross-validation, respectively. In Fig. 3, we provide the range of accuracy obtained for all fifteen datasets using all the distance functions of Table 3, indicated by 15 vertical straight lines (the topmost point of each line is the maximum accuracy and the bottommost point the minimum accuracy). We also show the range of accuracy obtained from our MKNN method using fifteen rectangles (the top edge of each rectangle is the maximum accuracy and the bottom edge the minimum). We observe that in all cases the rectangle intersects the straight line near the top, which clearly indicates the superiority of the proposed affinity-based distance measure.


In Fig. 4, we present the classification accuracy obtained from our proposed kNN method for all fifteen datasets. To make the analysis complete, we include the accuracies for both 5-fold and 10-fold data partitioning in this figure. From the results of Table 3 and Fig. 4, we conclude that the average classification accuracy for 10-fold partitioning changes only marginally from that of 5-fold partitioning for the above datasets. Table 4 establishes the importance of the individual terms of the affinity function. The correlation coefficient between the classification accuracy of the affinity-based distance function and that of the first term alone is 0.999249; likewise, the correlation coefficient between the classification accuracy of the affinity-based distance function and that of the second term alone is 0.999476. The sum of the differences of all scores and the sum of the differences of the negative scores in the last two rows of Table 4 also emphasize the effectiveness of the respective terms. In addition, the correlation coefficient between the relative gain or loss for the overall expression (x) and that for the first term only (y) is 0.978849, and between the overall expression (x) and the second term only (z) it is 0.984858; both values are quite high. All these numerical arguments clearly justify the presence of the two different components in the affinity-based distance function. In Table 5, we demonstrate the effectiveness of our proposed similarity function while keeping the affinity-based distance function of Eq. (2) unchanged. The proposed similarity function is compared with the one in Mitra et al. (2002a). The similarity function of Mitra et al. (2002a), when used in our proposed kNN algorithm, gives an accuracy of 85.47%. In contrast, our proposed similarity function yields average classification accuracies of 87.22% and 86.75% for k = [√N] and k = 3, respectively, on the same datasets using 10 individual random runs with 10-fold cross-validation. Our similarity function wins in 11 and 10 of the 15 cases for k = [√N] and k = 3, respectively. For the Ionosphere, Vehicle and Parkinson datasets, k = 3 gives better results than k = [√N]; the classification accuracies for the Musk1 and Segmentation datasets do not change at all with the change of k; for the remaining datasets, k = [√N] gives better results than k = 3. We have additionally compared our results with the results of kNN using different distance functions, viz. the Mahalanobis distance (Mahalanobis, 1936), Xing distance (Xing et al., 2002), Large Margin Nearest Neighbor (LMNN)-based distance (Weinberger et al., 2005), Information Theoretic Metric Learning (ITML)-based distance (Davis et al., 2007), Kernel Relevant Component Analysis (KRCA) distance (Tsang et al., 2005), Information Geometric Metric Learning (IGML)-based distance (Wang and Jin, 2009) and Kernel Information Geometric Metric Learning (KIGML)-based distance (Wang and Jin, 2009).

Fig. 3. Comparison profile of the proposed affinity function with respect to the ranges of classification accuracy obtained from different distance functions.


Fig. 4. Classification accuracy for 5-fold and 10-fold data partitioning of 15 datasets using 10 individual random runs.

Table 6
Comparison of the proposed distance function with the distances mentioned by Wang and Jin (2009).

Data | Mahalanobis | Xing | LMNN | ITML | KRCA | IGML | KIGML | Our kNN (10-fold)
WINE | 92.5 | 89.2 | 95.9 | 92.3 | 95.4 | 95 | 93.9 | 97.2
GLASS | 65.1 | 58.3 | 65.1 | 63.8 | 63.1 | 64.2 | 66.7 | 74.6
PIMA | 72.2 | 72.1 | 72.9 | 72.2 | 72.2 | 72.4 | 72.2 | 72.8
SONAR | 71.1 | 71.1 | 79.7 | 71.7 | 73.5 | 71.9 | 85.4 | 86.4
IONOSPHERE | 81.6 | 89.7 | 85 | 88.9 | 82.8 | 83.4 | 85.8 | 93.5
AVERAGE | 76.5 | 76.1 | 79.7 | 77.8 | 77.4 | 77.4 | 80.8 | 84.9

The superiority of the proposed distance function over the above-mentioned distance functions is clearly evident from Table 6. In four out of the five datasets shown, our distance outperforms all seven distances. Moreover, the average accuracy using our distance is approximately 5-10% higher than that of the other seven distances (about 5% higher than KIGML and about 10% higher than Xing).

4. Conclusion and future work

In this paper, we propose a modified version of the classical kNN algorithm. In particular, we introduce an affinity function between a training point and a test point as a measure of distance, and we design a new similarity function using this affinity-based distance. Since the proposed kNN algorithm realizes vicinity-level learning while structuring the proximity functions (i.e., the distance and the similarity function), it can also be categorized as a locally adaptive kNN algorithm (Hastie and Tibshirani, 1996; Domeniconi et al., 2002). We have shown that each of the above modifications has a considerable influence on the performance of the algorithm.

Experimental results clearly indicate that the proposed method outperforms some well-known variants of the kNN algorithm. In recent years, asymmetric proximity functions (distance and similarity functions) have gained popularity over their symmetric counterparts (McFee and Lanckriet, 2010). Note that the proposed affinity-based distance function and similarity function are both asymmetric; in fact, they are directed from a test point to a training point and capture more information, as they realize local-level learning about the concerned training point. A properly chosen value of k can potentially improve the classification results, so in future work we will try to further improve our results by choosing a suitable value of k. We will also explore metric learning (McFee and Lanckriet, 2010) and Parzen-window-based learning (Parzen, 1962; Mitra et al., 2002b) to possibly enhance the performance of the proposed kNN algorithm. Finally, since in this paper we have experimented only with numerical data, another direction of future research will be to extend the current approach to the classification of categorical data (Boriah et al., 2008).

References

Atkeson, C.G., Moore, A.W., Schaal, S., 1997. Locally weighted learning. Artif. Intell. Rev. 11, 11–73.
Bailey, T., Jain, A., 1978. A note on distance weighted k-nearest neighbour rules. IEEE Trans. Systems Man Cybernet. 8, 311–313.

Baoli, L., Yuzhong, C., Shiwen, Y., 2002. A comparative study on automatic categorization methods for Chinese search engine. In: Proc. 8th Joint Internat. Computer Conf. Zhejiang University Press, Hangzhou, pp. 117–120.
Bermejo, S., Cabestany, J., 2000. Adaptive soft k-nearest-neighbour classifiers. Pattern Recognition 33, 1999–2005.
Billot, A., Gilboa, I., Schmeidler, D., 2008. Axiomatization of an exponential similarity function. Math. Soc. Sci. 55, 107–115.
Boriah, S., Chandola, V., Kumar, V., 2008. Similarity measures for categorical data: A comparative evaluation. In: Proc. SIAM Data Mining Conf., Atlanta, GA, pp. 243–254.
Cha, S.H., 2007. Comprehensive survey on distance/similarity measures between probability density functions. Internat. J. Math. Models and Methods Appl. Sci. 1 (4), 300–307.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2001. Introduction to Algorithms. MIT Press, USA.
Cover, T.M., Hart, P., 1967. Nearest neighbour pattern classification. IEEE Trans. Inform. Theory 13 (1), 21–27.
Davis, J., Kulis, B., Jain, P., Sra, S., Dhillon, I., 2007. Information-theoretic metric learning. In: Proc. ICML, Corvallis, Oregon, pp. 209–216.
Deza, E., Deza, M.M., 2006. Dictionary of Distances. Elsevier.
Dietterich, T.G., Wettschereck, D., Atkeson, C.G., Moore, A.W., 1993. Memory-based methods for regression and classification. In: Proc. NIPS 1993, pp. 1165–1166.
Domeniconi, C., Peng, J., Gunopulos, D., 2002. Locally adaptive metric nearest neighbour classification. IEEE Trans. Pattern Anal. Machine Intell. 24 (9), 1281–1285.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, 2nd ed. John Wiley & Sons, New York.
Dudani, S.A., 1976. The distance-weighted k-nearest-neighbour rules. IEEE Trans. Systems Man Cybernet. 6, 325–332.
Fix, E., Hodges, J.L., 1951. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.
Frome, A., Singer, Y., Sha, F., Malik, J., 2007. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In: Proc. ICCV.
Fukunaga, K., Hostetler, L., 1975. k-Nearest-neighbour Bayes-risk estimation. IEEE Trans. Inform. Theory 21 (3), 285–293.
Gavin, D.G., Oswald, W.W., Wahl, E.R., Williams, J.W., 2003. A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quater. Res. 60, 356–367.
Hastie, T., Tibshirani, R., 1996. Discriminant adaptive nearest neighbour classification. IEEE Trans. Pattern Anal. Machine Intell. 18 (6), 607–616.
Hastie, T., et al., 2009. The Elements of Statistical Learning, 2nd ed. Springer, p. 172.
Helman, M.E., 1970. The nearest neighbour classification rule with a reject option. IEEE Trans. Systems Man Cybernet. 3, 179–185.
Jozwik, A., 1983. A learning scheme for a fuzzy k-NN rule. Pattern Recognition Lett. 1, 287–289.


Keller, J.M., Gray, M.R., Givens, J.A., 1985. A fuzzy k-nearest neighbour algorithm. IEEE Trans. Systems Man Cybernet. 15 (4), 580–585.
Kulis, B., 2010. Metric learning. ICML 2010 Tutorial.
Mahalanobis, P.C., 1936. On the generalised distance in statistics. In: Proc. National Institute of Sciences of India 2 (1), pp. 49–55.
McFee, B., Lanckriet, G., 2010. Metric learning to rank. In: Proc. ICML, Haifa, Israel, pp. 775–782.
Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
Mitra, P., Murthy, C.A., Pal, S.K., 2002a. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Machine Intell. 24 (3), 301–312.
Mitra, P., Murthy, C.A., Pal, S.K., 2002b. Density-based multiscale data condensation. IEEE Trans. Pattern Anal. Machine Intell. 24 (6), 734–747.
Monev, V., 2004. Introduction to similarity searching in chemistry. MATCH Commun. Math. Comput. Chem. 51, 7–38.
Parvin, H., Alizadeh, H., Minaei-Bidgoli, B., 2008. MKNN: Modified k-nearest neighbor. In: Proc. World Congress on Engineering and Computer Science (WCECS), San Francisco, USA.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065–1076.
Ricci, F., Avesani, P., 1996. Nearest neighbour classification with a local asymmetrically weighted metric. IRST, Protocol No. 9601-12.
Tahir, M.A., Smith, J., 2010. Creating diverse nearest neighbor ensembles using simultaneous metaheuristic feature selection. Pattern Recognition Lett. 31, 1470–1480.
Tsang, I., Cheung, P., Kwok, J., 2005. Kernel relevant component analysis for distance metric learning. In: Proc. IJCNN.
Wang, J., Neskovic, P., Cooper, L., 2007. Improving nearest neighbour rule with a simple adaptive distance measure. Pattern Recognition Lett. 28, 207–213.
Wang, S., Jin, R., 2009. Information geometry approach for distance metric learning. In: Proc. 12th Internat. Conf. on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 5, Clearwater Beach, Florida, USA.
Weinberger, K., Blitzer, J., Saul, L., 2005. Distance metric learning for large margin nearest neighbor classification. In: Proc. NIPS.
Wu, G., Chang, E.Y., Panda, N., 2005. Formulating distance functions via the kernel trick. In: Proc. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 703–709.
Xing, E., Ng, A., Jordan, M., Russell, S., 2002. Distance metric learning, with application to clustering with side-information. In: Proc. NIPS.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proc. 22nd Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 42–49.
Yin, X., Han, J., 2003. CPAR: Classification based on predictive association rules. In: Proc. SIAM Internat. Conf. on Data Mining (SDM), San Francisco, CA, USA.
Zezula, P., Amato, G., Dohnal, V., Batko, M., 2006. Similarity Search: The Metric Space Approach. Springer.
