IJRIT International Journal of Research in Information Technology, Volume 1, Issue 7, July, 2013, Pg. 55-63

International Journal of Research in Information Technology (IJRIT)

www.ijrit.com

ISSN 2001-5569

Novel Approach for Modification of K-Means Algorithm Based On Genetic Algorithms

1 Ankit Mishra, 2 Prof. Gajendra Singh Chandel

1 Computer Science & Engineering, SSSIST Sehore, Madhya Pradesh, India
2 Professor, Computer Science & Engineering, SSSIST Sehore, Madhya Pradesh, India

[email protected], [email protected]

Abstract

Clustering is an unsupervised learning technique. Cluster analysis is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes, and clustering algorithms can be applied in many domains. Most data categorization processes suffer from the problems of seed generation and content validation. In this paper we apply clustering to data categorization, that is, the grouping of data, with a focus on the K-Means algorithm, one of the most common clustering algorithms. K-Means is a basic partition method in cluster analysis, but it is sensitive to the initial cluster centers: an improper choice of centers can cause the clustering to fail. The value of K must also be determined in advance, which is difficult for users without experience. Because the algorithm is affected by isolated points, each round of cluster-center calculation can deviate and eventually lead to cluster failure. To reduce the problems of seed generation and of choosing the right number of clusters, we use an optimization technique, the genetic algorithm. At present, genetic algorithms are widely used to determine these values and have achieved good results.

Keywords: Data mining, Clustering, K-Means, Genetic Algorithm.

1. Introduction

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data [2].

Ankit Mishra, IJRIT

55

Figure 1.1 Clustering Procedure Steps

1.1 Types of Clustering

There exist a large number of clustering algorithms in the literature. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose. In general, the major clustering methods can be classified into the following categories [9].

1. Partition based: The partitioning method initially creates partitions. An iterative relocation technique is then used to improve the partitioning by moving objects from one group to another.
2. Hierarchical: A hierarchical method creates a hierarchical decomposition of the given set of data objects [8].
3. Density based: The density based approach continues growing a given cluster as long as the density, i.e. the number of objects or data points in the neighborhood, exceeds some threshold.
4. Grid based: Grid based methods quantize the object space into a finite number of cells that form a grid structure.
5. Model based: Model based clustering hypothesizes a model for each of the clusters and finds the best fit of the data to the given model.

The K-means algorithm is a partition based clustering method, and it is one of the most popular methods used in data clustering due to its good computational performance [4]. However, it is well known that its result depends on the initialization process, which is generally done by random selection, so different runs of K-means on the same input data may produce different results. To improve the performance, a new initialization technique has been proposed.

2. K-Means for Clustering

The K-Means algorithm is one of the partitioning based, non-hierarchical clustering techniques [8]. For any given set of numeric objects X and an integer K, the K-Means algorithm searches for a partition of X into K clusters that minimizes the within-group sum of squared errors. The algorithm starts by initializing the K cluster centers. Each input data point is then allocated to one of the existing clusters according to the square of its Euclidean distance from the cluster centers, choosing the closest. The mean (centroid) of each cluster is then computed to update the cluster center. Re-assigning the input vectors and updating the cluster centers is repeated until the value of no cluster center changes any more. K-Means can be regarded as a cunning method because, to obtain a better result, the centroids are kept as far as possible from each other.

The steps of the K-means algorithm are given below:

1. Initialization: randomly choose K input vectors (data points) to initialize the clusters.
2. Nearest-neighbor search: for each input vector, find the cluster center that is closest, and assign that input vector to the corresponding cluster.
3. Mean update: update the cluster center of each cluster using the mean (centroid) of the input vectors assigned to that cluster.
4. Stopping rule: repeat steps 2 and 3 until the values of the means no longer change.

2.1 Drawbacks of the K-Means Algorithm

Despite being used in a wide array of applications, the K-Means algorithm is not free of drawbacks, some of which have been extensively reported in the literature. The most important are listed below:

- The learning algorithm requires a priori specification of the number of cluster centers.
- Exclusive assignment: if two clusters overlap heavily, K-means is not able to resolve that there are two clusters.
- The learning algorithm is not invariant to non-linear transformations, i.e. different representations of the data give different results.
- Euclidean distance measures can weight underlying factors unequally.
- The learning algorithm finds only a local optimum of the squared error function.
- Random selection of the initial cluster centers may not lead to a good result.
- It is applicable only when the mean is defined, i.e. it fails for categorical data.
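The four K-means steps listed in Section 2 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the `seed` and `init_centers` parameters, and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Step 2: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]


def kmeans(points, k, seed=None, max_iter=100, init_centers=None):
    """Plain K-means; returns (centers, labels, sse)."""
    rng = random.Random(seed)
    # Step 1: initialize with K randomly chosen input vectors,
    # unless explicit starting centers are supplied (hypothetical option).
    start = init_centers if init_centers else rng.sample(points, k)
    centers = [tuple(c) for c in start]
    for _ in range(max_iter):
        labels = assign(points, centers)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step 4: stop when no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse
```

Calling `kmeans(points, k, seed=s)` with different values of `s` on the same data can return different centers and a different sum of squared errors, which is the initialization sensitivity and local-optimum behaviour described in the drawbacks above.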

3. Genetic Algorithms

A Genetic Algorithm (GA) is based on the ideas of natural evolution. In general, a GA starts with an initial population in which each individual is encoded as a chromosome, and a new population is then created based on the fitness values of the chromosomes. Fitness measures how good a candidate is; a distance measure is the most common choice [5]. A process called crossover is then applied to the new population, swapping substrings between selected chromosomes to produce new chromosomes. After that, mutation is applied to introduce randomization. This process continues until a termination condition is met.

In the literature, a genetic algorithm used to initialize K-means is known as GA-initialized K-means (GAIK). The purpose of the GA is to optimize the performance of K-means, which is known to depend on the initial centroid selection: the GA provides the initial cluster centroids, which act as the starting point for K-means. To use a GA for clustering, an initial population of random clusters is generated. At each generation, each individual is evaluated and recombined with others on the basis of its fitness, and new individuals are created using crossover and mutation.

3.1 Chromosome representation

The first step of a GA is the representation (or encoding) of chromosomes. The encoding may be done in binary, integer or real numbers, and different studies use different encoding schemes.

3.2 Fitness evaluation

A fitness function is needed to evaluate the fitness of chromosomes, and it should return a real value. The fitness is defined over the clusters, with K being the number of clusters and m_k the centre of cluster C_k, which makes it similar to the objective of the k-means algorithm [6].

3.3 Selection

Choosing the chromosomes that produce the new generation is called selection. Selection is done on the basis of the fitness value: the best-fitted chromosomes are selected for crossover. There are a variety of selection procedures, such as uniform selection, roulette wheel selection and tournament selection.

3.4 Crossover

The purpose of crossover is to create two new chromosomes from two existing chromosomes selected from the current population. Typical variants are one-point crossover, two-point crossover, cycle crossover and uniform crossover.

3.5 Mutation

Mutation is done in order to introduce randomization, and it also extends the search space. It is applied at a predefined rate called the mutation probability: a particular bit is changed randomly with that probability.
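The operators of Sections 3.1 to 3.5 can be illustrated with a small sketch. Several choices here are assumptions made for the example, not taken from the paper: a chromosome is a real-valued list of K cluster centers, the fitness is 1 / (1 + SSE) so that a smaller squared error gives a larger fitness, selection is roulette wheel, crossover is one-point, and mutation replaces a whole center by a random data point instead of flipping a bit.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points, centers):
    """Sum of squared errors when every point uses its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def fitness(points, chromosome):
    """Assumed fitness: larger when the squared error is smaller."""
    return 1.0 / (1.0 + sse(points, chromosome))


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two chromosomes at a random cut point."""
    if len(parent_a) < 2:
        return list(parent_a), list(parent_b)
    cut = rng.randint(1, len(parent_a) - 1)
    child_a = list(parent_a[:cut]) + list(parent_b[cut:])
    child_b = list(parent_b[:cut]) + list(parent_a[cut:])
    return child_a, child_b


def mutate(chromosome, points, mutation_prob, rng):
    """Replace each gene (center) by a random data point with probability mutation_prob."""
    return [rng.choice(points) if rng.random() < mutation_prob else gene
            for gene in chromosome]
```

All random decisions go through an explicit `random.Random` instance (`rng`), so a run can be repeated exactly by reusing the same seed.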

4. Proposed Algorithm

This work evaluates the performance of K-means [1] with VSM and a genetic algorithm on the Human Activity Recognition Using Smartphone, Gas Sensor Arrays in Open Sampling Settings, and Internet Advertisements datasets.

The basic idea is to select the initial cluster centers using a genetic algorithm. In the proposed algorithm, we first use a random function to select K data objects as initial cluster centers to form a chromosome, and select a total of M chromosomes in this way. We then run K-means on each set of cluster centers in the initial population to compute its fitness, select individuals according to the fitness of each chromosome, apply crossover and mutation to the high-fitness chromosomes while eliminating low-fitness ones, and finally form the next generation. In this way the average fitness rises with each new generation, each cluster center moves closer to the optimal cluster center, and the chromosome with the highest fitness is finally selected as the set of initial cluster centers.

4.1 Steps of the Proposed Algorithm

1. Set the parameters: population size M, maximum number of iterations T, number of clusters K, etc.
2. Generate M chromosomes randomly to form the initial population; each chromosome represents a set of initial cluster centers.
3. Carry out K-means clustering from the initial cluster centers given by each chromosome (one K-means run per chromosome), calculate the fitness of the chromosome from the clustering result, and apply the optimal preservation strategy.
4. Apply the selection, crossover and mutation operators to the group to produce a new generation.
5. Check whether the genetic termination conditions are met; if so, stop the genetic operations and go to step 6, otherwise go to step 3.
6. Calculate the fitness of the new generation and compare the fitness of the best individual in the current group with the best fitness found so far, to find the individual with the highest fitness.
7. Carry out K-means clustering from the initial cluster centers represented by the chromosome with the highest fitness, and output the clustering result.

Figure 4.1 Flowchart of Proposed Algorithm
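The steps of Section 4.1 can be put together in one compact sketch. It is a hedged illustration under stated assumptions, not the authors' code: the fitness is taken to be 1 / (1 + SSE) after a K-means run, selection is roulette wheel, crossover is one-point with probability `pc`, mutation swaps a center for a random data point with probability `pm`, the optimal preservation strategy is read as copying the best chromosome so far into the next generation, and the only termination condition is a fixed number of generations. All function and parameter names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_from(points, centers, max_iter=50):
    """K-means started from the given centers; returns (centers, sse)."""
    centers = [tuple(c) for c in centers]
    k = len(centers)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: sq_dist(p, centers[j]))].append(p)
        # An empty cluster keeps its previous center (assumption).
        new_centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


def ga_kmeans(points, k, pop_size=20, generations=30, pc=0.6, pm=0.1, seed=0):
    """GA-selected initial centers followed by a final K-means run."""
    rng = random.Random(seed)

    def fitness(chrom):
        # Step 3: one K-means run per chromosome; assumed fitness 1 / (1 + SSE).
        return 1.0 / (1.0 + kmeans_from(points, chrom)[1])

    # Steps 1-2: M random chromosomes, each holding K data points as centers.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = [fitness(c) for c in population]
        top = max(range(pop_size), key=fits.__getitem__)
        if fits[top] > best_fit:
            best, best_fit = list(population[top]), fits[top]
        # Step 4: elitism, then selection, crossover and mutation.
        next_pop = [list(best)]
        while len(next_pop) < pop_size:
            a, b = rng.choices(population, weights=fits, k=2)
            child = list(a)
            if k > 1 and rng.random() < pc:
                cut = rng.randint(1, k - 1)
                child = list(a[:cut]) + list(b[cut:])
            child = [rng.choice(points) if rng.random() < pm else g for g in child]
            next_pop.append(child)
        population = next_pop
    # Step 6: evaluate the last generation against the best found so far.
    fits = [fitness(c) for c in population]
    top = max(range(pop_size), key=fits.__getitem__)
    if fits[top] > best_fit:
        best = list(population[top])
    # Step 7: final K-means from the fittest chromosome.
    return kmeans_from(points, best)
```

A call such as `ga_kmeans(points, 3, pop_size=50, generations=100)` mirrors the population size and iteration limit reported in Section 5. Because every fitness evaluation is a full K-means run, the cost grows with population size times number of generations, which is the price paid for a better starting point.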

5. Experimental Analysis & Performance Evaluation

We compare the K-means algorithm based on the genetic algorithm (this article) with the original K-means algorithm and two known improved algorithms, referred to as improved algorithm 1 and improved algorithm 2, to verify the effectiveness of selecting the initial cluster centers using a genetic algorithm. To exclude the impact of isolated points, this article improves K-means by using the average value of the subset of objects that are closer to the center as the cluster center for the next round. The same method is also applied to the original K-means algorithm, improved algorithm 1 and improved algorithm 2, so that the comparison is comprehensive.

We added a group of isolated points to each of the two data sets, Iris and Wine.

Iris data, five added isolated points: (10, 3.0, 1.5, 5), (5.8, 3.6, 20, 0.2), (0, 0, 0, 0), (9.0, 6.6, 14, 0), (6.9, 9, 1.4, 9).

Wine data, five added isolated points: (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), (13.34, 94, 2.36, 1.7, 110, 0.55, 0.42, 3.17, 1.02, 1.93, 750, 5.36, 666), (14.34, 1.68, 2.7, 25, 98, 2.8, 31, 0.53, 2.7, 13, 0.57, 1.96, 666), (14.2, 1.76, 2.45, 15.2, 1.12, 3.27, 3.39, 0.34, 1.97, 6.75, 1.05, 2.85, 450), (12.67, 0.98, 2.24, 18, 99, 2.2, 1.94, 0.3, 1.46, 2.62, 123, 3.16, 450).

The experiment parameter settings are as follows: k = 3; pc1 = 0.9; pc2 = 0.6; pm1 = 0.5; pm2 = 0.1; pc = 0.6; pm = 0.1; m (initial population size) = 50; maxgen (maximum number of iterations) = 100.
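The isolated-point handling described above, averaging only the objects that lie closer to the center, can be sketched as a trimmed center update. The paper does not state how the subset is chosen, so the `keep_fraction` parameter and the ranking by distance to the current center are assumptions made for illustration. The example cluster uses Iris-like values together with the first isolated point listed above.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def plain_center(members):
    """Ordinary K-means update: mean of all members."""
    return tuple(sum(x) / len(members) for x in zip(*members))


def robust_center(members, current_center, keep_fraction=0.8):
    """Mean of the members closest to current_center.

    keep_fraction is a hypothetical parameter: the share of members,
    ranked by distance to the current center, that enter the average.
    """
    ranked = sorted(members, key=lambda p: sq_dist(p, current_center))
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return plain_center(ranked[:n_keep])


# Four Iris-like points plus the isolated point (10, 3.0, 1.5, 5).
cluster = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2),
    (5.0, 3.6, 1.4, 0.2),
    (10.0, 3.0, 1.5, 5.0),
]
current = (5.0, 3.3, 1.4, 0.2)
with_outlier = plain_center(cluster)
without_outlier = robust_center(cluster, current, keep_fraction=0.8)
```

In this example the plain mean is pulled toward the isolated point in the first and last attributes, while the trimmed update stays near the four ordinary points.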

5.1 Experiments on Human Activity Recognition Using Smartphone Data Set

Figure 5.1 Result of K-mean algorithm on Human Activity Recognition Using Smartphone Data Set

Figure 5.2 Result of Proposed (KVG) methods on Human Activity Recognition Using Smartphone Data Set

[Bar chart: Threshold, Error Rate and No. of clusters for K-Means and Modified K-Means; vertical axis from 0 to 6]

Figure 5.3 Parameter graph between k-mean and KVG for Human Activity Recognition Using Smartphone Data Set

Table 5.1 Base Parameter comparison on Human Activity Recognition Using Smartphone Data Set

S.N  Clustering Algorithm  Threshold  Iteration  Time     Error Rate
1.   K-means Algorithm     0.8        4          3.79082  4.6621
2.   KVG Algorithm         0.8        5          3.04202  1.73368
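The relative differences implied by Table 5.1 can be worked out directly from its entries. The short calculation below uses only the table values; the helper name `relative_reduction` is hypothetical, and the unit of the Time column is not stated in the source.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Values copied from Table 5.1.
kmeans_time, kvg_time = 3.79082, 3.04202
kmeans_error, kvg_error = 4.6621, 1.73368

time_reduction = relative_reduction(kmeans_time, kvg_time)
error_reduction = relative_reduction(kmeans_error, kvg_error)
print(f"time reduction:  {time_reduction:.1%}")
print(f"error reduction: {error_reduction:.1%}")
```

On these figures KVG needs roughly a fifth less time and has an error rate a little under two thirds lower than K-means, while taking one more iteration (5 against 4).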

[Line chart "K-Means v/s KVG": Iteration (0 to 6) on the vertical axis against Cluster (0 to 3) on the horizontal axis, with series KVG and K-means]

Figure 5.4 Comparison of Iteration through cluster between k-mean and KVG

6. Conclusion and Future Work

In this paper the K-Means algorithm, one of the popular clustering techniques, has been surveyed, and an optimization method, the genetic algorithm, has been applied to improve the unsupervised clustering procedure. Genetic algorithms are population based methods that use genetic operators to process a population of chromosomes. In this research we defined a chromosome string representation and combined K-Means and GA. Simulations over different runs show that K-Means clustering based on the genetic algorithm improves the clustering measures and is considerably more efficient than pure K-Means.

7. References

[1] Anwiti Jain, Anand Rajavat, Rupali Bhartiya, "Design, Analysis and Implementation of Modified K-Mean Algorithm for Large Data-set to Increase Scalability and Efficiency", 2012 Fourth International Conference on Computational Intelligence and Communication Networks, 2012, ISBN: 978-0-7695-4850-0, Vol. No. 12, Page No. 627, IEEE.

[2] F. D. Gibbons, F. P. Roth, "Judging the quality of gene expression-based clustering methods using gene annotation", Genome Res., 12(10), pp. 1574-1581, 2002.

[3] A. K. Jain, M. N. Murty, P. J. Flynn, "Data clustering: A review", ACM Computing Surveys, 31(3), pp. 264-323, 1999.

[4] Kailash Chander, Dinesh Kumar, Vijay Kumar, "Enhancing Cluster Compactness using Genetic Algorithm Initialized K-means", International Journal of Software Engineering Research & Practices, Vol. 1, Issue 1, Jan. 2011.

[5] Qin Ding, Jim Gasvoda, "A genetic algorithm for clustering on image data", International Journal of Computational Intelligence, vol. 1, 2005.

[6] J. Grabmeier, A. Rudolph, "Techniques of cluster algorithms in data mining", Data Mining and Knowledge Discovery, 6, pp. 303-360, 2002.

[7] www.mathworks.com

[8] Qinghe Zhang, Xiaoyun Chen, "Agglomerative Hierarchical Clustering based on Affinity Propagation Algorithm", 3rd International Symposium on Knowledge Acquisition and Modeling, 2010.

[9] Divakar Singh, Anju Singh, "A New Framework for Texture based Image Content with Comparative Analysis of Clustering Techniques".
