IJRIT International Journal of Research in Information Technology, Volume 3, Issue 5, May 2015, Pg.185-190
International Journal of Research in Information Technology (IJRIT) www.ijrit.com
ISSN 2001-5569
Enhance Performance of K-Mean Algorithm Using MCL

Silvi Gupta, Research Scholar, CSE Dept., DIET, Karnal
[email protected]
Dinesh Kumar, Assistant Prof., CSE Dept., DIET, Karnal
[email protected]

ABSTRACT
Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. The quality of the clustering result, and its usefulness in a given application domain, depend on the algorithm used. This research work deals with two widely used clustering algorithms. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centres, so as to minimize the mean squared distance from each data point to its nearest centre. K-means by itself does not determine degrees of membership for the data points. A popular heuristic for k-means clustering is Lloyd's algorithm. In this dissertation, we present a simple and efficient implementation that combines K-means (Lloyd's algorithm) with MCL, yielding a K-MCL clustering which we call the modified algorithm. This algorithm generates a membership-function plot. It is easy to implement, requiring a kd-tree as the only major data structure.

Keywords: Data Mining, Cluster, K-Mean, LIC, Lloyd.

I. INTRODUCTION
Data Mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data Mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by Data Mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data Mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Data Mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data Mining is most useful in an
exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data Mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.
Data Mining is the process of semi-automatically analyzing large databases to find useful patterns. It is closely related to KDD, "Knowledge Discovery in Databases", which attempts to discover rules and patterns from data. Areas of use include:
o Internet – discover the needs of customers
o Economics – predict stock prices
o Science – predict environmental change
o Medicine – match patients with similar problems and cures
1.1 Example of Data Mining
A credit card company wants to discover information about its clients from databases. It wants to find:
o Clients who respond to promotions in "junk mail"
o Clients that are likely to switch to a competitor
o Clients that are likely not to pay
o Services that clients use, in order to promote services affiliated with the credit card company
o Anything else that may help the company provide or promote services for its clients and ultimately make more money

II. MOTIVATION
In a supermarket scenario there are a number of items of different prices and categories. Obviously, every supermarket manager is interested in enhancing profit by any means. Normally a market manager looks at the usual kind of sale, which consists of a large number of items. But beyond that type of sale there exists a special kind of sale, the 'profit maximizing approach', which mainly consists of sales of items that are higher in price and sometimes higher in quantity. We can classify this approach into two different types as follows:
a) High quantity - high price
b) Low quantity - high price
Typical business decisions that supermarket management makes include: what to put together, what to put at the very front, how to design the store layout, what promotional schemes to consider, etc. So, to make all these decisions, in addition to analysis of past and current data, a profit-maximizing strategy is also necessary.
III. ALGORITHM
K-Means Algorithm: Given the cluster number K, the K-means algorithm is carried out in three steps:
• Initialisation: set seed points
• Assign each object to the cluster with the nearest seed point
• Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster)
• Go back to Step 1); stop when there are no more new assignments.

Flow Chart of K-Mean Algorithm
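The three-step loop above can be sketched in code. This is a minimal illustration, not the dissertation's implementation; it assumes Euclidean distance and that no cluster becomes empty, and all names are my own.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k seed points at random from the data
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each point goes to the cluster of its nearest seed point
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each seed point as the centroid of its cluster
        new_centres = np.array([points[labels == c].mean(axis=0)
                                for c in range(k)])
        # Stop when there are no more new assignments (centres are unchanged)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

On two well-separated groups of points, the loop typically converges in a handful of iterations regardless of which seed points are drawn.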
K-Means Algorithm Process
• The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.
• For each data point:
o Calculate the distance from the data point to each cluster.
o If the data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
• Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
• The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion.

IV. CLUSTERING
Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Example: an insurance company could use clustering to group clients by their age, location, and types of insurance purchased. The categories are unspecified, and this is referred to as 'unsupervised learning'.
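The pass-based process described above can be sketched as follows. This is a minimal illustration with names of my own; the text starts from a random partition, but a fixed alternating partition stands in here so the run is reproducible, and the sketch assumes no cluster empties out.

```python
def kmeans_passes(points, k):
    # Roughly equal-sized initial clusters (stand-in for a random partition)
    labels = [i % k for i in range(len(points))]

    def centroid(c):
        members = [p for p, l in zip(points, labels) if l == c]
        return [sum(xs) / len(members) for xs in zip(*members)]

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    moved = True
    while moved:                      # repeat full passes until nothing moves
        moved = False
        for i, p in enumerate(points):
            centres = [centroid(c) for c in range(k)]
            nearest = min(range(k), key=lambda c: sqdist(p, centres[c]))
            if nearest != labels[i]:  # move the point into the closest cluster
                labels[i] = nearest
                moved = True
    return labels
```

Note that the centroids are recomputed after every individual move, which is what makes a full pass with no moves a valid stopping condition.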
Types of Clustering
Vector Clustering: each point has a vector, i.e.
• X coordinate
• Y coordinate
• Color
Graph Clustering: each vertex is connected to others by (weighted or un-weighted) edges.
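The MCL of the title presumably refers to the Markov Cluster algorithm, which operates on exactly this kind of graph; the text does not expand the acronym, so the sketch below is an assumption. MCL alternates expansion (matrix squaring) with inflation (elementwise powering followed by column re-normalisation) until clusters emerge as attractor rows.

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=50, self_loops=1.0):
    # Add self-loops and make the adjacency matrix column-stochastic
    A = np.asarray(adj, dtype=float) + self_loops * np.eye(len(adj))
    M = A / A.sum(axis=0)
    for _ in range(iters):
        M = M @ M                  # expansion: flow along longer paths
        M = M ** inflation         # inflation: strengthen strong flows
        M = M / M.sum(axis=0)      # re-normalise columns
    # In the converged matrix, non-empty rows (attractors) list the
    # members of one cluster each
    return {tuple(int(j) for j in np.nonzero(row > 1e-6)[0])
            for row in M if row.max() > 1e-6}
```

On a graph of two triangles joined by a single bridge edge, this recovers the two triangles as clusters.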
V. RESULTS
Figure 1. After implementation of the K-Mean algorithm

DATASET: There are 5 datasets on which we can implement the algorithm:
Dataset 1 - 100 rows
Dataset 2 - 200 rows
Dataset 3 - 300 rows
Dataset 4 - 400 rows
Dataset 5 - 5000 rows
Figure 2. Select Data Set
Figure 3. Membership-function (MF) plot for Cluster 2 (green)
Figure 4. Accuracy for Dataset 2
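The conclusion proposes a validity measure based on intra-cluster and inter-cluster distances for choosing the number of clusters automatically. The exact formula is not given in the text, so the intra/inter ratio below (smaller is better) is a hypothetical stand-in, and `cluster_fn` is a placeholder for any clustering routine that returns labels and centres.

```python
import numpy as np

def validity(points, labels, centres):
    k = len(centres)
    # intra: mean squared distance of each point to its own centre
    intra = np.mean([np.sum((p - centres[l]) ** 2)
                     for p, l in zip(points, labels)])
    # inter: minimum squared distance between any two centres
    inter = min(np.sum((centres[a] - centres[b]) ** 2)
                for a in range(k) for b in range(a + 1, k))
    return intra / inter

def best_k(points, kmax, cluster_fn):
    # Segment the dataset for k = 2 .. Kmax and keep the best-scoring k
    scores = {}
    for k in range(2, kmax + 1):
        labels, centres = cluster_fn(points, k)
        scores[k] = validity(points, labels, centres)
    return min(scores, key=scores.get)
```

For example, two tight clusters far apart give a very small ratio, signalling a good segmentation.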
VI. CONCLUSION
From all the above calculations we come to the conclusion that the K-Mean algorithm is an excellent algorithm when dealing with small or medium-sized data, and it consistently delivers good performance. A direct algorithm of the k-means method requires time proportional to the product of the number of patterns and the number of clusters per iteration. This is computationally very expensive, especially for large datasets. The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this dissertation we present a simple validity measure based on the intra-cluster and inter-cluster distance measures which allows the number of clusters to be determined automatically. The basic procedure involves producing segmentations of the dataset from 2 clusters up to Kmax clusters, where Kmax represents an upper limit on the number of clusters.