IJRIT International Journal of Research in Information Technology, Volume 3, Issue 5, May 2015, Pg.185-190

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Enhance Performance of K-Mean Algorithm Using MCL

Silvi Gupta, Research Scholar, CSE Dept., DIET, Karnal, [email protected]
Dinesh Kumar, Assistant Prof., CSE Dept., DIET, Karnal, [email protected]

ABSTRACT
Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. The result of the clustering process and its efficiency in the application domain are determined by the algorithm used. This research work deals with two of the most representative clustering algorithms. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centres, so as to minimize the mean squared distance from each data point to its nearest centre. K-Mean does not determine the membership of a data point. A popular heuristic for k-means clustering is Lloyd's algorithm. In this dissertation, we present a simple and efficient implementation of K-Mean clustering combined with Lloyd's clustering algorithm, giving K-MCL clustering, which we call the modified algorithm. This algorithm generates the membership function plot. It is easy to implement, requiring a kd-tree as the only major data structure.

Keywords: Data Mining, Cluster, K-Mean, LIC, Lloyd.

I. INTRODUCTION
Data Mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data Mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by Data Mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data Mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Data Mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data Mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data Mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.

Data Mining is the process of semi-automatically analyzing large databases to find useful patterns. KDD, "Knowledge Discovery in Databases", attempts to discover rules and patterns from data.
Areas of use:
o Internet: discover the needs of customers
o Economics: predict stock prices
o Science: predict environmental change
o Medicine: match patients with similar problems (cure)

1.1 Example of Data Mining
A credit card company wants to discover information about its clients from its databases. It wants to find:
o Clients who respond to promotions in "junk mail"
o Clients who are likely to move to a competitor
o Clients who are likely not to pay
o Services that clients use, in order to promote services affiliated with the credit card company
o Anything else that may help the company provide or promote services to help its clients and ultimately make more money.

II. MOTIVATION
In a supermarket there are a number of items of different prices and categories. Every supermarket manager is naturally interested in increasing profit by any means. Normally a market manager looks at the traditional kind of sale, which consists of a large number of items. Beyond that type of sale there exists one special kind of sale, the 'profit maximizing approach', which mainly consists of the sale of items that are higher in price and sometimes also higher in quantity. We can classify this approach into two types:
a) High quantity, high price
b) Low quantity, high price
Typical business decisions that the management of a supermarket makes include what to put together, what to put at the very front, how to design the store layout, and what promotional schemes to consider. To make all these decisions, a profit maximizing strategy is necessary in addition to the analysis of past and current data.

III. ALGORITHM
K-Mean Algorithm
Given the cluster number K, the K-means algorithm first sets K seed points (initialisation) and is then carried out in three steps:
1) Assign each object to the cluster with the nearest seed point.
2) Compute the seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster).
3) Go back to Step 1); stop when there are no more new assignments.

Flow Chart of K-Mean Algorithm
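The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function names (`squared_distance`, `centroid`, `k_means`, `mean_squared_distance`) are hypothetical, seed points are taken to be K randomly chosen data points (the text does not say how seeds are set), and no kd-tree is used.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean point of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means on a list of tuples; returns (centres, labels)."""
    rng = random.Random(seed)
    # Initialisation: pick k data points as seed points (an assumption).
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Step 3: stop when there are no more new assignments.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: recompute each seed point as the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = centroid(members)
    return centres, labels


def mean_squared_distance(points, centres, labels):
    """The quantity K-means tries to minimise."""
    total = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return total / len(points)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centres, labels = k_means(data, k=2)
    print(labels, mean_squared_distance(data, centres, labels))
```

On the toy data in the usage example, the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random seed.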

K-Means Algorithm Process
• The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points.
• For each data point:
  o Calculate the distance from the data point to each cluster.
  o If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.
• Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
• The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion.

IV. CLUSTERING
Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Example: an insurance company could use clustering to group clients by their age, location and types of insurance purchased. The categories are unspecified, and this is referred to as 'unsupervised learning'.


Types of Clustering

Vector Clustering
Each point has a vector, i.e.
• X coordinate
• Y coordinate
• Color

Graph Clustering
Each vertex is connected to others by (weighted or un-weighted) edges.
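The two kinds of input described above differ mainly in how the data is represented. The following Python sketch shows one possible representation of each; the names `VectorPoint` and `add_edge`, and the encoding of colour as an (r, g, b) triple, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorPoint:
    """Input for vector clustering: every point carries its own feature vector."""
    x: float
    y: float
    color: tuple  # hypothetical numeric encoding of colour, e.g. (r, g, b)

    def as_vector(self):
        # Flatten all features into one tuple so distances can be computed.
        return (self.x, self.y, *self.color)


def add_edge(graph, u, v, weight=1.0):
    """Input for graph clustering: vertices are described only by their edges.

    The graph is an adjacency dictionary; an un-weighted edge is stored
    with the default weight 1.0.
    """
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


if __name__ == "__main__":
    p = VectorPoint(1.0, 2.0, (255, 0, 0))
    print(p.as_vector())

    g = {}
    add_edge(g, "a", "b", weight=0.5)  # weighted edge
    add_edge(g, "b", "c")              # un-weighted edge
    print(g)
```

K-means as described in Section III works on the first representation, because it needs coordinates in order to compute centroids.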

V. RESULTS

Figure 1. After Implementation of K-Mean Algorithm

DATASET: The algorithm can be run on the following five datasets:
• Dataset 1 – 100 rows
• Dataset 2 – 200 rows
• Dataset 3 – 300 rows
• Dataset 4 – 400 rows
• Dataset 5 – 5000 rows


Figure 2. Select Data Set

Figure 3. MF PLOT FOR CLUSTER 2 (GREEN)

Figure 4. Accuracy for Dataset 2

VI. CONCLUSION
From the above results we conclude that the K-Mean algorithm is an excellent algorithm for small or medium sized data. It provides a good performance vector on every run. A direct implementation of the k-means method requires time per iteration proportional to the product of the number of patterns and the number of clusters, which is computationally very expensive for large datasets. The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this dissertation we present a simple validity measure, based on the intra-cluster and inter-cluster distance measures, which allows the number of clusters to be determined automatically. The basic procedure involves producing segmentations of the dataset for every number of clusters from 2 up to Kmax, where Kmax represents an upper limit on the number of clusters.
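The procedure described in the conclusion, running K-means for every K from 2 up to Kmax and scoring each result with a validity measure, can be sketched as follows. The exact formula of the measure is not given in the text, so this sketch assumes one common form: validity = intra / inter, where intra is the mean squared distance of points to their own centre and inter is the smallest squared distance between two centres, with the smallest validity preferred. The names `validity` and `choose_k`, and the compact `k_means` helper with randomly chosen seed points, are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Compact Lloyd-style K-means; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: seeds are random data points
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, labels


def validity(points, centres, labels):
    """Assumed measure: intra-cluster distance divided by inter-cluster distance."""
    intra = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels)) / len(points)
    inter = min(sq_dist(centres[i], centres[j])
                for i in range(len(centres))
                for j in range(i + 1, len(centres)))
    # Two coinciding centres make the clustering meaningless.
    return float("inf") if inter == 0 else intra / inter


def choose_k(points, k_max, seed=0):
    """Cluster for K = 2 .. k_max and return (best K, all validity scores)."""
    scores = {}
    for k in range(2, k_max + 1):
        centres, labels = k_means(points, k, seed=seed)
        scores[k] = validity(points, centres, labels)
    best_k = min(scores, key=scores.get)  # smaller validity is better
    return best_k, scores


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    best_k, scores = choose_k(data, k_max=4)
    print(best_k, scores)
```

For the two well-separated groups in the usage example the measure selects K = 2: splitting either group further reduces the intra-cluster distance only slightly while bringing two centres much closer together.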
