Pattern Recognition Letters 33 (2012) 356–363
An affinity-based new local distance function and similarity measure for kNN algorithm

Gautam Bhattacharya (Department of Physics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India), Koushik Ghosh (Department of Mathematics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India), Ananda S. Chowdhury (corresponding author; Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700032, India)
Article history: Received 16 March 2011; Available online 11 November 2011. Communicated by N. Sladoje.
Keywords: kNN; Affinity function; Similarity measure
Abstract

In this paper, we propose a modified version of the k-nearest neighbor (kNN) algorithm. We first introduce a new affinity function for the distance measure between a test point and a training point, an approach based on local learning. A new similarity function using this affinity function is then proposed for the classification of test patterns. The widely used convention k = [√N] is employed, where N is the number of points used for training. The proposed modified kNN algorithm is applied to fifteen numerical datasets from the UCI machine learning data repository, using both 5-fold and 10-fold cross-validation. The average classification accuracy obtained from our method is found to exceed that of several well-known classification algorithms.

© 2011 Elsevier B.V. All rights reserved.
⇑ Corresponding author. Tel.: +91 33 2414 6666x2405; fax: +91 33 2414 6217. E-mail addresses: [email protected] (G. Bhattacharya), [email protected] (K. Ghosh), [email protected] (A.S. Chowdhury).
doi:10.1016/j.patrec.2011.10.021

1. Introduction

Appropriate measures of distance and similarity are two prime issues in the field of pattern recognition. The last century witnessed a series of efforts to explore novel measures of distance and similarity for pattern classification, clustering and information retrieval problems (Cha, 2007; Duda et al., 2001; Deza and Deza, 2006; Zezula et al., 2006; Monev, 2004; Gavin et al., 2003). From the mathematical point of view, distance is a quantitative degree of how far apart two objects are; one synonym for distance is dissimilarity. Distance measures satisfying the metric properties are termed 'metrics', while non-metric distance measures are termed 'divergences'. In contrast, similarity measures proximity. In traditional algebra, similarity is often taken as an inner product in a suitable vector space. This concept is modified and adapted for numerical datasets, where similarity is often expressed through weights attached to the proposed distances (Cha, 2007; Zezula et al., 2006).

Fix and Hodges (1951) introduced a non-parametric method for pattern classification that is known as the nearest neighbor rule. Nearest neighbor is one of the most popular algorithms in pattern recognition, exploratory data analysis and data mining. Typically, the k nearest neighbors of an unknown sample in the training set are found using a predefined distance, and the class label of the unknown sample is consequently predicted to be the most frequent label occurring among those k nearest neighbors. Some advantages of the kNN algorithm are: (a) its inherent simplicity; (b) its robustness to noisy training data, especially if the inverse square of the weighted distance is used as the "distance" measure; and (c) its effectiveness when the training set is large. Many researchers have found that the kNN algorithm achieves good performance in experiments on different datasets (Cover and Hart, 1967; Domeniconi et al., 2002; Michie et al., 1994; Wang et al., 2007; Yang and Liu, 1999; Baoli et al., 2002). Cover and Hart showed that for k = 1 and n → ∞ (n denotes the number of sample points) the kNN classification error is bounded above by twice the Bayes' error rate. Researchers have since proposed rejection approaches (Helman, 1970), refinements with respect to the Bayes' error rate (Fukunaga and Hostetler, 1975), distance-weighted approaches (Dudani, 1976; Bailey and Jain, 1978), soft computing techniques (Bermejo and Cabestany, 2000) and fuzzy methods (Jozwik, 1983; Keller et al., 1985) as possible enhancements to the classical kNN algorithm. Despite the abovementioned advantages, it has been observed that the performance of the kNN algorithm strongly depends on: (a) the value of the parameter k (the number of nearest neighbors); (b) the choice of a proper distance measure (Parvin et al., 2008); and (c) the selection of an appropriate similarity measure (Mitra et al., 2002a). In the present work, we address these critical issues to improve the performance of the kNN algorithm. We take k = [√N] following Mitra et al. (2002a), where N is the number of training points and the symbol '[ ]' stands for the greatest integer function; this choice provides a reasonable analytical estimate of k. In addition to the conventional distances like Euclidean, Minkowski,
Chebyshev, several other distance functions have been proposed over the years. Prominent examples are the Mahalanobis distance (Mahalanobis, 1936), the Xing distance (Xing et al., 2002), the Large Margin Nearest Neighbor (LMNN)-based distance (Weinberger et al., 2005), the Information Theoretic Metric Learning (ITML)-based distance (Davis et al., 2007), the Kernel Relevant Component Analysis (KRCA) distance (Tsang et al., 2005), the Information Geometric Metric Learning (IGML)-based distance (Wang and Jin, 2009) and the Kernel Information Geometric Metric Learning (KIGML)-based distance (Wang and Jin, 2009). In most cases the distance function is linear, owing to its simplicity of description and efficiency of computation; this very simplicity, however, is insufficient to model similarity for many real-world datasets (Wu et al., 2005). In this work, we introduce a new nonlinear affinity function for the distance measure, which closely resembles the non-Mahalanobis approach to local distance functions (Frome et al., 2007; Kulis, 2010) and the local asymmetrically weighted learning captured in the lazy-learning and memory-based learning works of Dietterich et al. (1993), Atkeson et al. (1997) and Ricci and Avesani (1996). Other well-known learning-based distance functions can be found in (Wang and Jin, 2009). Our proposed distance also captures the effects of the other training points for any particular feature. We further introduce a new similarity function (for measuring the proximity of the test point to the training points) using our newly proposed distance function. Depending upon the classes of the first k nearest neighboring training points, scores are allotted to a test point, and the classification of the test point is finally decided on the basis of the assigned scores.
Our goal in this paper is to improve classification accuracy through a proper formulation of the distance and similarity functions, without resorting to detailed metric learning (McFee and Lanckriet, 2010) or Parzen window-based learning (Parzen, 1962; Mitra et al., 2002b). It is, however, relevant to mention that local learning is implicitly used in the proposed affinity-based distance and similarity functions. We compare the mean classification accuracy on eight datasets with C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003), and on six datasets with Tahir and Smith (2010). The mean accuracy obtained from our method (using 10-fold cross-validation with 10 individual random seeds) is found to outperform the mean accuracies of both Yin and Han, and Tahir and Smith. The rest of the paper is organized as follows: in Section 2, we describe in detail the theory of the proposed kNN classifier and provide a time-complexity analysis. In Section 3, we analyze the experimental results and present a performance comparison with the methods mentioned above. The paper is concluded in Section 4 with an outline of future research directions.

2. Proposed kNN classifier algorithm

In this section, we describe the theoretical basis of the proposed kNN classifier. In particular, we introduce an affinity-based distance function and a new similarity measure into the existing kNN classification algorithm. We also analyze the time-complexity of our method (Cormen et al., 2001).
2.1. Algorithmic details

Let us assume that each pattern in a typical pattern classification problem is described by d observable, well-defined, identically distributed and mutually independent features. Hence, each pattern is traced as a point in the d-dimensional feature space. Let n be the total number of sample points and N the total number of training points (patterns). The N training points X_1, X_2, ..., X_N are already classified into M classes C_1, C_2, ..., C_M, where class C_j contains N_j points for j = 1, 2, ..., M, so that \sum_{j=1}^{M} N_j = N. The goal is to classify a new point X in this feature space, given the N sample points and the M classes. The points are expressed as

X_j = (X_{j1}, X_{j2}, \ldots, X_{jd}), \quad j = 1, 2, \ldots, N; \qquad X = (X_1, X_2, \ldots, X_d)

We now describe our modified algorithm in the following steps:

Step 1: In this work we experiment solely with numerical datasets. Each value under a particular attribute is first centered by subtracting the attribute mean and then scaled by dividing by the attribute standard deviation; thus we use a normalized (Z-score) representation of the original data.

Step 2: We employ both 5-fold and 10-fold partitioning and cross-validation of the data. For the 5-fold partitioning, five partitions of the total data are made, each containing approximately [n/5] points for testing with the rest used for training. Similarly, for the 10-fold partitioning, 10 partitions are used, each carrying approximately [n/10] points for testing and the rest for training. The partitioning is done in a completely random manner.

Step 3: The task here is to determine the distances of all the training points from a test point. Many conventional distance functions exist, such as the Euclidean, City block (Manhattan), Chebyshev, Minkowski, Canberra, Bray-Curtis (Sorensen), angular separation, correlation coefficient and Mahalanobis distances. Another important contribution to the measurement of distance between two d-dimensional points X_p and X_q is given by (Cover and Hart, 1967; Domeniconi et al., 2002; Michie et al., 1994):

D_{pq} = \left[ \sum_{j=1}^{d} \left( \frac{X_{pj} - X_{qj}}{\max_j - \min_j} \right)^2 \right]^{1/2} \qquad (1)

where max_j and min_j are the maximum and the minimum values computed over all the training points along the jth axis. A very recent trend is to consider attribute-wise local learning while framing the distance. In this connection, the non-Mahalanobis local distance function is very useful (Frome et al., 2007; Kulis, 2010); there, the distance between an arbitrary (e.g., test) point X_t and a training point X_i is proposed as

d(X_t, X_i) = \sum_{j=1}^{d} w_{ij} \, d_{tij}

where w_{ij} is the weight for the ith training point along the jth feature and d_{tij} is the distance between the test point X_t and the training point X_i along the jth feature. Keeping the concept of the non-Mahalanobis local distance, we propose a new distance formula in terms of an affinity measurement between a training point and a test point. We define the affinity between a d-dimensional test point X_t and any d-dimensional training point X_i as

d(X_t, X_i) = \sum_{j=1}^{d} |X_{tj} - X_{ij}| \left( \frac{\sum_{l=1}^{N} |X_{tj} - X_{lj}|}{\sum_{l=1}^{N} |X_{tj} - X_{lj}| + \sum_{m=1,\, m \neq i}^{N} |X_{ij} - X_{mj}|} + \frac{\left( \sum_{l=1}^{N} |X_{tj} - X_{lj}| \right)^{1/2}}{\left( \sum_{m=1,\, m \neq i}^{N} |X_{ij} - X_{mj}| \right)^{1/2}} \right) \qquad (2)
In this approach, to compute the distance between a test point X_t and a training point X_i (i = 1, 2, ..., N), we consider all the corresponding attribute-wise gaps. The distance is the sum over features of the product of (i) the absolute gap between the attribute-wise entries, which is the first factor on the right-hand side of Eq. (2), and (ii) a weight, the sum of the two terms inside the parentheses of Eq. (2), which plays the role of the weight w_{ij} in the non-Mahalanobis local distance. Our proposed affinity-based distance function cannot be kernel-based, since it does not satisfy the fundamental property d(x, y) = [k(x, x) + k(y, y) - 2k(x, y)]^{1/2}, where d(x, y) and k(x, y) denote the distance and the kernel, respectively, between two points x and y. This is because we do not take the weights w_{ij} to be constants; a non-Mahalanobis local distance function with constant weight coefficients, by contrast, is kernel-based. The numerator of the first term in the weight is the sum of all the attribute-wise absolute gaps between the test point and all the training points. The denominator of the first term has two additive components: (i) the numerator itself and (ii) the sum of all the attribute-wise absolute gaps between the concerned training point and the other training points. The first term thus signifies the relative (fractional) effect of all the training points on the concerned test point, with respect to their combined effect on both the concerned test point and the concerned training point. The numerator of the second term is exactly the same as the numerator of the first term, while its denominator is the sum of the gaps of all the training points from the concerned training point only.
So, the second term gives the relative (fractional) affinity of all the training points to the test point with respect to their affinity to the concerned training point. To illustrate the newly proposed concept we take the help of Figs. 1 and 2. In both figures, the solid lines denote the distances between the test point X_t and the training points (the concerned training point X_1 and the other training points), whereas the dotted lines show the distances of the concerned training point X_1 from the other training points. In both figures the distance between X_t and the first training point X_1 (only the indices of the training points are shown, i.e., the point labelled i denotes X_i) looks alike in the usual schemes (bold black lines), but according to our newly proposed affinity function these distances are not the same, because in Fig. 1 the training points are more clustered than in Fig. 2. In light of these considerations, the numerators and denominators of both terms within the parentheses of the proposed affinity function are greater for Fig. 2 than for Fig. 1, which in turn suggests that the test point X_t may have a different likelihood towards the cluster of training points in the two figures. From the above discussion, it may be stated that our proposed affinity function incorporates more information, since it takes into account not only the traditional distance between a training point and a test point, but also the impact of the spatial distribution of the training points. The first term signifies the relative positional influence of the concerned test point in the system taken as a whole, while the second term depicts the comparative influence of the concerned test point and the concerned training point. Hence we carry out vicinity-level learning while computing the affinity between a selected test point and a selected training point. In Table 3 we compare our affinity function with other distance functions in terms of finding the nearest neighbors, and Table 4 highlights the effectiveness of the first and second terms of the affinity function in Eq. (2). The dotted lines in Figs. 1 and 2 also implicitly measure the cluster density of the training points, i.e., our proposed affinity function is connected with the density of the sample space. Later in the paper we show experimentally (via Table 4) that each of these two terms (each expressed as a ratio), taken individually, bears a strong correlation with d(X_t, X_i). From Eq. (2) it is evident that the formula cannot be applied between two test points, since it depends on the bias of the training points towards a test point and not the reverse; thus the symmetry property cannot be proved for d(X_t, X_i). We now state and prove two other important properties of d(X_t, X_i).

Property 1. d(X_t, X_i) is non-negative.
Proof. Every component of d(X_t, X_i), i.e., the coefficient and the terms appearing in the numerators and denominators of the two additive components (each expressed as a ratio), carries a modulus sign. So d(X_t, X_i) cannot be negative. □

Property 2. d(X_t, X_i) satisfies the identity of indiscernibles.

Proof. If we take X_t = X_i, the first factor in each term of the sum becomes zero, making the entire measure zero. So the proposed measure satisfies the identity of indiscernibles. Thus we can conclude that d(X_t, X_i) is positive definite. Based on these observations, we deem d(X_t, X_i) an affinity-based distance function. □

Step 4: We next discuss the formulation of a similarity function using the corresponding affinity function. A number of choices exist for the similarity function. One classical approach is to take the similarity measure as an inverse distance. Another effective and popular strategy is the use of an exponential similarity.
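Both properties, together with the non-symmetry noted above, can be checked numerically with a direct Python transcription of Eq. (2). This is a minimal sketch, not the authors' code: the function and variable names are ours, and a small `eps` guards against a zero denominator, a degenerate case the paper does not discuss.

```python
import numpy as np

def affinity_distance(x_t, X, i, eps=1e-12):
    """Affinity-based distance of Eq. (2) between a test point x_t and the
    training point X[i]. Per feature j, the absolute gap |x_tj - X_ij| is
    weighted by A_j/(A_j + B_j) + sqrt(A_j/B_j), where
    A_j = sum_l |x_tj - X_lj| (gaps to ALL training points) and
    B_j = sum_{m != i} |X_ij - X_mj| (gaps to the OTHER training points)."""
    A = np.abs(x_t - X).sum(axis=0)                  # shape (d,)
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False
    B = np.abs(X[i] - X[mask]).sum(axis=0)           # shape (d,)
    w = A / (A + B + eps) + np.sqrt(A / (B + eps))   # weight per feature
    return float(np.sum(np.abs(x_t - X[i]) * w))

X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])  # toy training set, N=3, d=2
# Property 1: non-negativity for an arbitrary test point
assert affinity_distance(np.array([0.5, 0.5]), X, 1) >= 0.0
# Property 2: identity of indiscernibles -- zero when X_t coincides with X_i
assert affinity_distance(X[2], X, 2) == 0.0
# Non-symmetry: the weights are biased by the training set, so exchanging
# the roles of the two points generally changes the value
assert affinity_distance(X[1], X, 2) != affinity_distance(X[2], X, 1)
```

With Z-score normalized inputs (Step 1), this function would be evaluated between each test point and every training point.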
[Figs. 1 and 2: schematic distributions of training points 1, 2, 3, ..., N around the test point X_t, with distances d_1, d_2, ..., d_N (solid lines) and angles α_12, α_13, ..., α_1N marked.]

Fig. 1. First type of distribution of training points with respect to the test point.

Fig. 2. Second type of distribution of training points with respect to the test point.
Table 1
Comparison of the proposed method with C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003).

Dataset      C4.5    RIPPER   CBA     CMAR    CPAR    Modified KNN (MKNN)
IRIS         95.3    94.0     94.7    94.0    94.7    94.53
WINE         92.7    91.6     95.0    95.0    95.5    97.19
GLASS        68.7    69.1     73.9    70.1    74.4    74.63
PIMA         75.5    73.1     72.9    75.1    73.8    72.80
BREAST       95.0    95.1     96.3    96.4    96.0    96.58
SONAR        70.2    78.4     77.5    79.4    79.3    86.41
IONOSPHERE   90.0    91.2     92.3    91.5    92.6    93.49
VEHICLE      72.6    62.7     68.7    68.8    69.5    70.54
AVERAGE      82.50   81.90    83.91   83.79   84.48   85.77
The above similarity function exhibits better performance than other existing approaches; this fact is demonstrated elaborately in Table 4. Our newly proposed similarity function may appear structurally similar to the Gaussian radial basis kernel function K(x_t, x_i) = exp(-m ||x_t - x_i||^2) (Hastie et al., 2009), but in our expression the distance/affinity function measurement is used instead. In the exponential function, the denominator is the average distance per feature, which depends on the separations between the training and test data, on the dimension of the training set, and on the cross-validation partitioning; it is therefore not at all the same as the constant m of the radial basis kernel. In fact, the proposed similarity function captures more information than a standard radial basis kernel. In our similarity function we take the exponential of a dimensionless quantity, whereas the radial basis kernel is the exponential of a squared Euclidean distance. Moreover, our affinity function is not symmetric, whereas symmetry is mandatory for a radial basis kernel. The effectiveness of our similarity function for classification is established in Table 5 through comparison with the similarity function used by Mitra et al. (2002a).

Step 5: Using the proposed affinity function in Eq. (2), we obtain the distances of all the training points from any test point. We first sort these distances in ascending order and then mark the k nearest neighboring points of the test point as y_1, y_2, ..., y_k, arranged in order of increasing distance.

Step 6: The next task is to allot a score Sc(X_t, C_j), j = 1, 2, ..., M, to each class relative to the test point X_t. In (Bermejo and Cabestany, 2000), the score function was given by the following equation:
Table 2
Comparison of the proposed method with that of Tahir and Smith (2010).

Dataset      Tahir and Smith (a)   Modified KNN (MKNN) (a)
SONAR        87.1 (6.53)           86.41 (7.82)
IONOSPHERE   92.2 (4.53)           93.49 (4.30)
VEHICLE      70.7 (3.60)           70.54 (4.21)
WDBC         95.5 (2.45)           95.78 (2.30)
SPECTF       70.7 (7.71)           74.55 (7.41)
MUSK1        86.2 (3.84)           88.74 (4.33)
AVERAGE      83.73 (4.78)          84.92 (5.06)

(a) The standard deviations of classification are given in parentheses.
For example, see the work of Mitra et al. (2002a), who developed a similarity function of the form Sim_{pq} = exp(-β D_{pq}), where β is a positive constant given by β = -ln(0.5)/D̄, with D̄ the average distance between data points computed over the entire dataset. This value of β is estimated by taking the most expected value of the similarity between any two points as 0.5, which may sometimes lead to misclassification. A similar form of similarity function (Billot et al., 2008) is Sim(z, x) = exp(-m(x - z)) for some norm m on R^m. In the present work we set up a new similarity function in the following way:
\mathrm{Sim}(X_t, X_i) = 1, \quad \text{if } X_t = X_i

\mathrm{Sim}(X_t, X_i) = \exp\left( - \frac{dN \, d(X_t, X_i)}{\sum_{l=1}^{N} d(X_t, X_l)} \right), \quad \text{if } X_t \neq X_i \qquad (3)
Here, i = 1, 2, ..., N and d(X_t, X_i) is given by Eq. (2). From Eq. (3), we can see that the function Sim(X_t, X_i) tends to zero for extremely large values of d(X_t, X_i).
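Eq. (3) can be sketched in Python as follows. This is a hedged sketch of our reading of Eq. (3), not the authors' code: the helper name and arguments are ours — `dist_ti` is d(X_t, X_i), `all_dists` holds d(X_t, X_l) for all N training points, and `d` is the number of features.

```python
import numpy as np

def similarity(dist_ti, all_dists, d):
    """Similarity of Eq. (3): exp(-dist_ti / avg), where
    avg = sum_l d(X_t, X_l) / (N * d) is the average affinity distance per
    feature. Returns 1 when the test point coincides with the training
    point (zero distance)."""
    if dist_ti == 0.0:                       # X_t = X_i branch of Eq. (3)
        return 1.0
    N = len(all_dists)
    avg_per_feature = np.sum(all_dists) / (N * d)
    return float(np.exp(-dist_ti / avg_per_feature))

# The similarity decays towards zero as the affinity distance grows
dists = np.array([1.0, 2.0, 3.0])            # d(X_t, X_l) for N = 3 points
s_near = similarity(1.0, dists, 2)
s_far = similarity(30.0, dists, 2)
assert 1.0 > s_near > s_far > 0.0
```

Note how the denominator adapts to the test point's overall distances, unlike the fixed constant m of a radial basis kernel.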
Table 3
Comparison of our distance measure with distance measures in (Tahir and Smith, 2010).

Dataset (a)               Euclidean      Sq. Euclidean   City block     Canberra        Sq.-chord       Sq.-chi-squared   Proposed, 5-fold   Proposed, 10-fold
IRIS (150, 4)             95.29 (5.91)   95.15 (6.08)    94.45 (6.39)   93.24 (6.75)    95.14 (6.20)    95.14 (6.20)      94.69 (3.98)       94.53 (6.39)
WINE (178, 13)            96.34 (4.71)   95.95 (4.95)    97.18 (3.72)   92.98 (6.02)    95.95 (4.34)    94.76 (5.48)      96.85 (2.70)       97.19 (4.12)
GLASS (214, 9)            69.67 (8.79)   69.61 (8.43)    73.28 (8.38)   71.73 (8.73)    70.60 (8.96)    69.76 (8.77)      74.31 (5.28)       74.63 (8.58)
PIMA_DIABETES (768, 8)    73.34 (4.56)   72.24 (5.25)    72.42 (4.82)   69.67 (5.08)    70.41 (4.49)    70.40 (4.57)      72.78 (2.67)       72.80 (5.13)
BREAST (699, 10)          95.78 (2.31)   95.15 (2.44)    95.94 (2.25)   96.47 (2.11)    95.04 (2.41)    95.17 (2.59)      96.52 (1.60)       96.58 (2.14)
SONAR (208, 60)           85.45 (8.50)   85.99 (8.21)    87.41 (7.93)   78.82 (9.07)    87.70 (6.96)    84.38 (8.60)      86.15 (5.86)       86.41 (7.82)
IONOSPHERE (351, 34)      87.29 (5.48)   87.09 (5.60)    90.60 (4.69)   78.94 (6.92)    86.26 (5.84)    75.54 (7.18)      93.20 (3.09)       93.49 (4.30)
VEHICLE (846, 18)         69.92 (4.29)   69.85 (4.29)    70.10 (4.32)   68.22 (4.61)    69.49 (4.20)    69.62 (4.48)      70.85 (2.71)       70.54 (4.21)
WDBC (569, 31)            95.40 (2.74)   95.13 (2.53)    95.85 (2.20)   93.81 (3.18)    95.11 (2.44)    94.80 (2.61)      95.90 (1.54)       95.78 (2.30)
SPECTF (267, 44)          71.12 (7.92)   70.21 (7.94)    70.61 (7.99)   71.10 (7.63)    70.89 (8.51)    70.80 (7.38)      74.58 (4.57)       74.55 (7.41)
MUSK1 (476, 166)          88.59 (4.50)   88.53 (4.61)    86.03 (4.67)   76.81 (5.85)    87.90 (4.53)    83.55 (5.24)      87.90 (3.57)       88.74 (4.33)
BREAST-TISSUE (106, 10)   85.84 (9.82)   85.45 (9.70)    85.89 (9.53)   81.41 (11.18)   81.86 (11.06)   79.42 (11.29)     87.15 (6.03)       87.23 (8.93)
PARKINSON (195, 23)       96.20 (4.38)   96.15 (4.26)    96.60 (4.12)   92.98 (5.37)    94.97 (4.39)    96.01 (4.53)      97.08 (2.68)       97.24 (4.00)
SEGMENTATION (210, 18)    85.05 (8.46)   84.67 (8.40)    86.81 (7.27)   85.57 (7.36)    85.57 (7.73)    86.38 (7.80)      86.67 (5.22)       86.62 (7.56)
ECOLI (336, 8)            92.07 (4.22)   92.12 (4.21)    93.02 (4.21)   89.51 (4.78)    91.35 (4.26)    90.61 (4.32)      87.90 (3.57)       91.98 (4.51)
AVERAGE                   85.82 (5.77)   85.55 (5.79)    86.41 (5.50)   82.75 (6.31)    85.22 (5.76)    83.76 (6.07)      86.84 (3.67)       87.22 (5.45)

(a) In this column, for each dataset the figures within brackets denote the total number of points and the total number of attributes.
Table 4
Comparison of scores for the significance of the individual terms of the affinity function. Columns 2-4 give the classification accuracy using both terms, only the first term and only the second term of Eq. (2); Column 5 gives the maximum accuracy from one of (Yin and Han, 2003; Tahir and Smith, 2010); x, y and z are the differences of Columns 2, 3 and 4 from Column 5.

Dataset (a)               Both terms   1st term   2nd term   Max. of Col. 5 sources   x      y      z
IRIS (150, 4)             94.53        94.52      94.53      95.3                     -0.8   -0.8   -0.8
WINE (178, 13)            97.19        96.79      96.96      95.5                      1.7    1.3    1.5
GLASS (214, 9)            74.63        73.94      73.68      74.4                      0.2   -0.5   -0.7
PIMA_DIABETES (768, 8)    72.80        72.55      72.88      75.5                     -2.7   -2.9   -2.6
BREAST (699, 10)          96.58        96.59      96.61      96.4                      0.2    0.2    0.2
SONAR (208, 60)           86.41        86.88      86.55      87.1                     -0.7   -0.2   -0.5
IONOSPHERE (351, 34)      93.49        92.74      93.63      92.6                      0.9    0.1    1.0
VEHICLE (846, 18)         70.54        70.87      70.70      72.6                     -2.1   -1.7   -1.9
WDBC (569, 31)            95.78        95.82      95.69      95.5                      0.3    0.3    0.2
SPECTF (267, 44)          74.55        74.25      74.89      70.7                      3.8    3.5    4.2
MUSK1 (476, 166)          88.74        88.03      88.99      86.2                      2.5    1.8    2.8
Correlation with Col. 2                0.999249   0.999476
Sum of all scores                                                                      3.4    1.2    3.3
Sum of -ve scores                                                                      6.2    6.1    6.6

(a) In this column, for each dataset the figures within brackets denote the total number of points and the total number of attributes of the given dataset respectively.
\mathrm{Sc}(X_t, C_j) = \sum_{i=1}^{k} Z(y_i, C_j), \quad j = 1, 2, \ldots, M \qquad (4)

Note that the function Z in the above equation takes only the two values 0 and 1: Z(y_i, C_j) = 1 if y_i ∈ C_j, and Z(y_i, C_j) = 0 otherwise. The test point X_t is allotted to the class for which the score Sc is maximum. A modification of the score function has been proposed in the following manner (Monev, 2004; Domeniconi et al., 2002):
\mathrm{Sc}(X_t, C_j) = \sum_{i=1}^{k} \mathrm{Sim}(X_t, y_i) \, Z(y_i, C_j), \quad j = 1, 2, \ldots, M \qquad (5)
where Z is the same as in Eq. (4) and Sim(X_t, y_i) is given by Eq. (3). In this paper, we use Eq. (5) to compute the scores.

2.2. Time-complexity analysis

For n samples, B-fold partitioning is used for the purpose of cross-validation. Let N be the number of training samples, and let k and N_C denote the number of nearest neighbor points and the number of classes assigned for classification, respectively. We now present the detailed (worst-case) time-complexity analysis.

Step 1: normalization of n samples: O(n).
Step 2: partitioning of n samples: O(n).
For each test sample:
Steps 3-4: calculating distances to N training samples: O(N).
Step 5: sorting the N distances and calculating the similarity of k samples: O(N log N) + O(k).
Step 6: score calculation: O(kN_C).
The total complexity per test sample is O(N) + O(N log N) + O(k) + O(kN_C). For B-fold partitioning of n samples, each partition contains n/B test samples. Considering all test samples in all partitions, the total complexity is O(B(n/B)N) + O(B(n/B)N log N) + O(B(n/B)k) + O(B(n/B)kN_C) = O(nN) + O(nN log N) + O(nk) + O(nkN_C).
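Steps 5 and 6 above (neighbor ranking and similarity-weighted scoring via Eq. (5)) can be sketched together as follows. This is a sketch, not the authors' implementation: `dist_fn` and `sim_fn` stand in for Eqs. (2) and (3), and the toy Euclidean distance and exponential weight defined below are placeholders used only to keep the example self-contained.

```python
import numpy as np
from collections import Counter

def classify(x_t, X_train, y_train, dist_fn, sim_fn):
    """Step 5: rank training points by distance and keep the k = [sqrt(N)]
    nearest. Step 6: score each class by the summed similarities of its
    members among those neighbours (Eq. (5)) and return the best class."""
    N, d = X_train.shape
    k = int(np.sqrt(N))                      # k = [sqrt(N)], following Mitra et al.
    dists = np.array([dist_fn(x_t, X_train, i) for i in range(N)])
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
    scores = Counter()
    for i in nearest:                        # Sc(X_t, C_j) = sum_i Sim * Z
        scores[y_train[i]] += sim_fn(dists[i], dists, d)
    return scores.most_common(1)[0][0]

# Placeholder distance/similarity so the sketch runs stand-alone
def toy_dist(x_t, X, i):
    return float(np.linalg.norm(x_t - X[i]))

def toy_sim(dist_ti, all_dists, d):
    return float(np.exp(-d * len(all_dists) * dist_ti / np.sum(all_dists)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = [0, 0, 1, 1]
assert classify(np.array([0.0, 0.5]), X, y, toy_dist, toy_sim) == 0
```

The O(N log N) sort in `np.argsort` dominates the per-test-sample cost, matching the complexity analysis above.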
Table 5
Comparison of the proposed similarity function with the one by Mitra et al. (2002a). (a)

Dataset                   Proposed distance with     Proposed distance with   Our distance with the similarity
                          proposed similarity,       proposed similarity,     of Mitra et al. (2002a),
                          k = [√N]                   k = 3                    k = [√N]
IRIS (150, 4)             94.53 (6.39)               93.5 (6.64)              95.43 (5.45)
WINE (178, 13)            97.19 (4.12)               97.13 (3.75)             96.07 (4.18)
GLASS (214, 9)            74.63 (8.58)               73.91 (8.75)             69.39 (9.66)
PIMA_DIABETES (768, 8)    72.80 (5.13)               70.43 (5.27)             75.40 (4.79)
BREAST (699, 10)          96.58 (2.14)               95.81 (2.22)             96.25 (2.20)
SONAR (208, 60)           86.41 (7.82)               86.06 (7.91)             77.95 (9.57)
IONOSPHERE (351, 34)      93.49 (4.30)               93.57 (4.26)             91.94 (4.60)
VEHICLE (846, 18)         70.54 (4.21)               70.57 (4.24)             70.03 (4.67)
WDBC (569, 31)            95.78 (2.30)               95.73 (2.30)             95.38 (2.82)
SPECTF (267, 44)          74.55 (7.41)               74.21 (7.56)             78.58 (7.39)
MUSK1 (476, 166)          88.74 (4.33)               88.74 (4.33)             84.21 (5.98)
BREAST-TISSUE (106, 10)   87.23 (8.93)               86.04 (9.32)             81.55 (10.44)
PARKINSON (195, 23)       97.24 (4.00)               97.71 (3.63)             92.16 (5.21)
SEGMENTATION (210, 18)    86.62 (7.56)               86.62 (7.68)             85.10 (7.87)
ECOLI (336, 8)            91.98 (4.51)               91.29 (4.52)             92.67 (4.72)
AVERAGE                   87.22 (5.45)               86.75 (5.49)             85.47 (5.97)

(a) In the first column, for each dataset the figures within brackets denote the total number of points and the total number of attributes respectively; in the other columns, the figures within brackets are the standard deviations of classification. Comparisons use 10 individual random runs with 10-fold cross-validation.
So, the total complexity for the modified kNN algorithm, taking the normalization (Step 1) and partitioning (Step 2) into account, is O(n) + O(n) + O(nN) + O(nN log N) + O(nk) + O(nkN_C). Since k, N_C ≪ n, N, this reduces to O(nN log N).

3. Experimental results

We compared our modified kNN algorithm with some existing classification methods on standard datasets from the UCI machine learning data repository. In the tables given below, the figures appearing in bold represent the best performance for that particular dataset. Note that no feature selection strategy has thus far been incorporated, and we used Z-score normalized datasets for classification. We first show the average classification performance of our modified kNN method for 10-fold cross-validation with 10 random seeds in comparison to other well-known
methods like C4.5, RIPPER, CBA, CMAR and CPAR (Yin and Han, 2003). Table 1 demonstrates that our method MKNN yields the highest average accuracy (85.77%, averaged over 10 individual random runs with 10-fold cross-validation) among the above-mentioned classification methods on eight standard datasets. In addition, MKNN is the winner on 5 and the runner-up on 3 of these 8 datasets. We next compare our proposed method with the very recent work of Tahir and Smith (2010). From Table 2, it is evident that we outperform Tahir and Smith on 4 out of 6 datasets; here also we report the average performance over 10 individual random runs with 10-fold cross-validation. Our MKNN method yields a better average accuracy of 84.92% compared to 83.73% for Tahir and Smith. Note that Tahir and Smith (2010) and Yin and Han (2003) used different datasets with some overlap (as indicated by Tables 1 and 2), which is why we use two separate tables for the performance comparisons. We next show the individual impacts of (i) the affinity-based distance function and (ii) the similarity measure, using 10 individual random runs with 10-fold cross-validation, first changing the distance function (with our similarity function fixed) and then changing the similarity function (with our distance function fixed). Throughout, we take k = [√N], where N is the number of points used for training. In Table 3, we show the effectiveness of our affinity function as a measure of distance by comparing it with different distance functions following Tahir and Smith; the similarity function of Eq. (3) is used with all these distances. Our proposed distance function yields 86.84% and 87.22% average classification accuracy for 5-fold and 10-fold partitioning respectively, exceeding the other distance measures over the 10 individual random runs with 10-fold cross-validation.
The proposed affinity-based distance function wins in 9 and 8 cases respectively with 10-fold and 5-fold cross-validation out of a total of 15 datasets. In Fig. 3, we provide the range of accuracy obtained for all fifteen datasets using all the distance functions of Table 3. This is indicated by 15 vertical straight lines (the topmost point of each line indicates the maximum accuracy and the bottommost point the minimum accuracy). We also show the range of accuracy obtained from our MKNN method using fifteen rectangles (the top edge of each rectangle indicates the maximum accuracy and the bottom edge the minimum accuracy). We observe that in all the cases the rectangle intersects the straight line near the top, which clearly indicates the superiority of the proposed affinity-based distance measure.
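The comparison in Table 3 amounts to running the same kNN vote with different plug-in distance functions. A minimal sketch of that setup is shown below; the Euclidean, Manhattan and Chebyshev metrics are our illustrative stand-ins (the paper's affinity-based distance, Eq. (2), is not reproduced here), and `knn_predict` with its default k = [√N] is a hypothetical helper, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, dist, k=None):
    """kNN majority vote with a pluggable distance function;
    k defaults to [sqrt(N)] as used throughout the experiments."""
    if k is None:
        k = max(1, int(round(np.sqrt(len(X_train)))))
    d = np.array([dist(x, t) for t in X_train])
    votes = y_train[np.argsort(d)[:k]]
    return Counter(votes).most_common(1)[0][0]

# Three standard metrics in the spirit of Table 3's comparison
distances = {
    "euclidean": lambda a, b: np.linalg.norm(a - b),
    "manhattan": lambda a, b: np.abs(a - b).sum(),
    "chebyshev": lambda a, b: np.abs(a - b).max(),
}

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
for name, d in distances.items():
    print(name, knn_predict(X, y, np.array([0.2, 0.2]), d))
```

Because the classifier and the similarity function stay fixed while only `dist` changes, any accuracy differences are attributable to the distance measure alone, which is the point of Table 3.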
In Fig. 4, we present the classification accuracy obtained from our proposed kNN method for all fifteen datasets. To make the analysis complete, we include the accuracies for both 5-fold and 10-fold data partitioning in this figure. From the results of Table 3 and Fig. 4, we can conclude that the average classification accuracy for 10-fold partitioning changes only marginally from that of the 5-fold partitioning for the above datasets. Table 4 establishes the importance of the individual terms of the affinity function. The correlation coefficient between the classification accuracy of the affinity-based distance function and that of the first term of the affinity function alone is 0.999249. Likewise, the correlation coefficient between the classification accuracy of the affinity-based distance function and that of the second term alone is 0.999476. The sum of the differences of all scores and the sum of the differences of the negative scores in the last two rows of Table 4 also emphasize the effectiveness of the respective terms. In addition, the correlation coefficient between the relative gain or loss for the overall expression (x) and that with only the first term (y) is computed. Similarly, we obtain the correlation coefficient between the relative gain or loss for the overall expression (x) and that with only the second term (z). The respective values, viz., 0.978849 and 0.984858, are quite high. All these numerical arguments justify the presence of the two different components in the affinity-based distance function. In Table 5, we demonstrate the effectiveness of our proposed similarity function by keeping the affinity-based distance function (from Eq. (2)) unchanged. The proposed similarity function is compared with the one in Mitra et al. (2002a). The similarity function of Mitra et al. (2002a), when used in our proposed kNN algorithm, gives an accuracy of 85.47%.
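The correlation coefficients quoted above are ordinary Pearson correlations between per-dataset accuracy vectors. A minimal sketch, with illustrative accuracy values of our own invention (not the paper's numbers):

```python
import numpy as np

# Hypothetical per-dataset accuracies: the full affinity-based
# distance vs. its first term alone (illustrative values only)
acc_full  = np.array([88.0, 74.5, 92.1, 69.3, 85.0])
acc_term1 = np.array([87.6, 74.0, 91.5, 69.0, 84.4])

# Pearson correlation coefficient, as reported in Table 4
r = np.corrcoef(acc_full, acc_term1)[0, 1]
print(r)
```

A value of r near 1, as in the paper's 0.999249 and 0.999476, indicates that each term of the affinity function on its own tracks the behaviour of the full expression across datasets.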
In contrast, our proposed similarity function yields average classification accuracies of 87.22% and 86.75% for k = [√N] and for k = 3 respectively on the same datasets, using 10 individual random runs with 10-fold cross-validation. Our similarity function wins in 11 and 10 cases respectively for k = [√N] and k = 3 out of 15 cases. For the Ionosphere, Vehicle and Parkinson datasets, k = 3 gives a better result than k = [√N]. The classification accuracies for the Musk1 and Segmentation datasets do not change at all with the change of k. However, for the remaining datasets, k = [√N] gives better results than k = 3. We have additionally compared our results with those of kNN using different distance functions proposed in the literature, viz. the Mahalanobis distance (Mahalanobis, 1936), the Xing distance (Xing et al., 2002), the Large Margin Nearest Neighbor (LMNN)-based distance (Weinberger et al., 2005), the Information Theoretic Metric Learning (ITML)-based distance (Davis et al., 2007), the Kernel Relevant Component
Fig. 3. Comparison profile of the proposed affinity function with respect to the ranges of classification accuracy obtained from different distance functions.
Fig. 4. Classification accuracy for 5-fold and 10-fold data partitioning of 15 datasets using 10 individual random runs.
Table 6
Comparison of the proposed distance function with the distances mentioned by Wang and Jin (2009).

Data         Mahalanobis   Xing   LMNN   ITML   KRCA   IGML   KIGML   Our kNN (10-fold)
Wine         92.5          89.2   95.9   92.3   95.4   95.0   93.9    97.2
Glass        65.1          58.3   65.1   63.8   63.1   64.2   66.7    74.6
Pima         72.2          72.1   72.9   72.2   72.2   72.4   72.2    72.8
Sonar        71.1          71.1   79.7   71.7   73.5   71.9   85.4    86.4
Ionosphere   81.6          89.7   85.0   88.9   82.8   83.4   85.8    93.5
Average      76.5          76.1   79.7   77.8   77.4   77.4   80.8    84.9
Analysis (KRCA) distance (Tsang et al., 2005), the Information Geometric Metric Learning (IGML)-based distance (Wang and Jin, 2009) and the Kernel Information Geometric Metric Learning (KIGML)-based distance (Wang and Jin, 2009). The superiority of the proposed distance function over the above-mentioned distance functions is clearly evident from Table 6. In four out of the five datasets shown, our distance outperforms all seven distances. Moreover, the average accuracy using our distance is approximately 5–10% higher (about 5% higher than KIGML and about 10% higher than Xing) than that of the other seven distances.
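As a quick sanity check, the Average row of Table 6 can be reproduced directly from the per-dataset scores transcribed from the table (only three of the eight columns are shown here for brevity):

```python
# Per-dataset accuracies transcribed from Table 6
# (rows: Wine, Glass, Pima, Sonar, Ionosphere)
table6 = {
    "Xing":  [89.2, 58.3, 72.1, 71.1, 89.7],
    "KIGML": [93.9, 66.7, 72.2, 85.4, 85.8],
    "MKNN":  [97.2, 74.6, 72.8, 86.4, 93.5],  # our kNN, 10-fold
}
avg = {name: sum(v) / len(v) for name, v in table6.items()}
for name, a in avg.items():
    print(name, round(a, 1))
```

The resulting averages (76.1, 80.8 and 84.9) match the table, with MKNN leading KIGML by about 4 points and Xing by about 9 points.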
4. Conclusion and future work

In this paper, we propose a modified version of the classical kNN algorithm. In particular, we introduce an affinity function between a training point and a test point as a measure of distance. We also design a new similarity function using this newly proposed affinity-based distance function. Since the proposed kNN algorithm realizes vicinity-level learning while structuring the proximity functions (i.e., the distance and the similarity function), it can also be categorized as a locally adaptive kNN algorithm (Hastie and Tibshirani, 1996; Domeniconi et al., 2002). We have shown that each of the above modifications has a considerable influence on the performance of
the algorithm. Experimental results clearly indicate that the proposed method outperforms some well-known variants of the kNN algorithm. In recent years, asymmetric proximity functions (distance and similarity functions) have gained popularity over their traditional symmetric counterparts (McFee and Lanckriet, 2010). Note that the proposed affinity-based distance function and similarity function are both asymmetric. In fact, they are directed from a test point to a training point and capture more information, as they realize local-level learning about the concerned training point. A properly chosen value of k can potentially improve the classification results. Thus, in future, we will try to further improve our results by choosing a suitable value of k. We will also explore metric learning (McFee and Lanckriet, 2010) and Parzen-window-based learning (Parzen, 1962; Mitra et al., 2002b) to possibly enhance the performance of the proposed kNN algorithm. In this paper, we have experimented only with numerical data. So, another direction of future research will be to extend the current approach to the classification of categorical data (Boriah et al., 2008).

References

Atkeson, C.G., Moore, A.W., Schaal, S., 1997. Locally weighted learning. Artif. Intell. Rev. 11, 11–73.
Bailey, T., Jain, A., 1978. A note on distance weighted k-nearest neighbour rules. IEEE Trans. Systems Man Cybernet. 8, 311–313.
Baoli, L., Yuzhong, C., Shiwen, Y., 2002. A comparative study on automatic categorization methods for Chinese search engine. In: Proc. 8th Joint Internat. Computer Conf. Zhejiang University Press, Hangzhou, pp. 117–120.
Bermejo, S., Cabestany, J., 2000. Adaptive soft k-nearest-neighbour classifiers. Pattern Recognition 33, 1999–2005.
Billot, A., Gilboa, I., Schmeidler, D., 2008. Axiomatization of an exponential similarity function. Math. Soc. Sci. 55, 107–115.
Boriah, S., Chandola, V., Kumar, V., 2008. Similarity measures for categorical data: A comparative evaluation. In: Proc. SIAM Data Mining Conf., Atlanta, GA, pp. 243–254.
Cha, S.H., 2007. Comprehensive survey on distance/similarity measures between probability density functions. Internat. J. Math. Models and Methods Appl. Sci. 1 (4), 300–307.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2001. Introduction to Algorithms. MIT Press, USA.
Cover, T.M., Hart, P., 1967. Nearest neighbour pattern classification. IEEE Trans. Inform. Theory 13 (1), 21–27.
Davis, J., Kulis, B., Jain, P., Sra, S., Dhillon, I., 2007. Information-theoretic metric learning. In: Proc. ICML, Corvallis, Oregon, pp. 209–216.
Deza, E., Deza, M.M., 2006. Dictionary of Distances. Elsevier.
Dietterich, T.G., Wettschereck, D., Atkeson, C.G., Moore, A.W., 1993. Memory-based methods for regression and classification. In: Proc. NIPS, pp. 1165–1166.
Domeniconi, C., Peng, J., Gunopulos, D., 2002. Locally adaptive metric nearest neighbour classification. IEEE Trans. Pattern Anal. Machine Intell. 24 (9), 1281–1285.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, 2nd ed. John Wiley & Sons, New York.
Dudani, S.A., 1976. The distance-weighted k-nearest-neighbour rules. IEEE Trans. Systems Man Cybernet. 6, 325–332.
Fix, E., Hodges, J.L., 1951. Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.
Frome, A., Singer, Y., Sha, F., Malik, J., 2007. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In: Proc. ICCV.
Fukunaga, K., Hostetler, L., 1975. k-Nearest-neighbour Bayes-risk estimation. IEEE Trans. Inform. Theory 21 (3), 285–293.
Gavin, D.G., Oswald, W.W., Wahl, E.R., Williams, J.W., 2003. A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quatern. Res. 60, 356–367.
Hastie, T., Tibshirani, R., 1996. Discriminant adaptive nearest neighbour classification. IEEE Trans. Pattern Anal. Machine Intell. 18 (6), 607–616.
Hastie, T., et al., 2009. The Elements of Statistical Learning, 2nd ed. Springer, p. 172.
Hellman, M.E., 1970. The nearest neighbour classification rule with a reject option. IEEE Trans. Systems Man Cybernet. 3, 179–185.
Jozwik, A., 1983. A learning scheme for a fuzzy k-NN rule. Pattern Recognition Lett. 1, 287–289.
Keller, J.M., Gray, M.R., Givens, J.A., 1985. A fuzzy k-nearest neighbour algorithm. IEEE Trans. Systems Man Cybernet. 15 (4), 580–585.
Kulis, B., 2010. Metric learning. ICML 2010 Tutorial.
Mahalanobis, P.C., 1936. On the generalised distance in statistics. In: Proc. National Institute of Sciences of India 2 (1), pp. 49–55.
McFee, B., Lanckriet, G., 2010. Metric learning to rank. In: Proc. ICML, Haifa, Israel, pp. 775–782.
Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.
Mitra, P., Murthy, C.A., Pal, S.K., 2002a. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Machine Intell. 24 (3), 301–312.
Mitra, P., Murthy, C.A., Pal, S.K., 2002b. Density-based multiscale data condensation. IEEE Trans. Pattern Anal. Machine Intell. 24 (6), 734–747.
Monev, V., 2004. Introduction to similarity searching in chemistry. MATCH Commun. Math. Comput. Chem. 51, 7–38.
Parvin, H., Alizadeh, H., Minaei-Bidgoli, B., 2008. MKNN: Modified k-nearest neighbor. In: Proc. World Congress on Engineering and Computer Science (WCECS), San Francisco, USA.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065–1076.
Ricci, F., Avesani, P., 1996. Nearest neighbour classification with a local asymmetrically weighted metric. IRST, Protocol No. 9601-12.
Tahir, M.A., Smith, J., 2010. Creating diverse nearest neighbor ensembles using simultaneous metaheuristic feature selection. Pattern Recognition Lett. 31, 1470–1480.
Tsang, I., Cheung, P., Kwok, J., 2005. Kernel relevant component analysis for distance metric learning. In: Proc. IJCNN.
Wang, J., Neskovic, P., Cooper, L., 2007. Improving nearest neighbour rule with a simple adaptive distance measure. Pattern Recognition Lett. 28, 207–213.
Wang, S., Jin, R., 2009. Information geometry approach for distance metric learning. In: Proc. 12th Internat. Conf. on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 5, Clearwater Beach, Florida, USA.
Weinberger, K., Blitzer, J., Saul, L., 2005. Distance metric learning for large margin nearest neighbor classification. In: Proc. NIPS.
Wu, G., Chang, E.Y., Panda, N., 2005. Formulating distance functions via the kernel trick. In: Proc. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 703–709.
Xing, E., Ng, A., Jordan, M., Russell, S., 2002. Distance metric learning, with application to clustering with side-information. In: Proc. NIPS.
Yang, Y., Liu, X., 1999. A re-examination of text categorization methods. In: Proc. 22nd Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 42–49.
Yin, X., Han, J., 2003. CPAR: Classification based on predictive association rules. In: Proc. SIAM Internat. Conf. on Data Mining (SDM), San Francisco, CA, USA.
Zezula, P., Amato, G., Dohnal, V., Batko, M., 2006. Similarity Search: The Metric Space Approach. Springer.