IJRIT International Journal of Research in Information Technology, Volume 2, Issue 9, September 2014, Pg. 209-213

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Minimal Spanning Tree and FAST used in Feature Clustering to get High Dimensional Data

Anusha Perumalla (1), Dr. A. Govardhan (2)

(1) PG Scholar, Department of Computer Science, School of Information Technology, JNTUH Hyderabad, Telangana State, India. [email protected]
(2) Professor, Department of Computer Science, School of Information Technology, JNTUH Hyderabad, Telangana State, India. [email protected]

Abstract

Feature selection involves identifying a subset of the most useful features that produces results compatible with the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view: efficiency concerns the time required to find a subset of features, while effectiveness relates to the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning-tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, Relief-F, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text datasets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.

Index Terms: Feature subset selection, filter method, feature clustering, graph-based clustering

1. Introduction

With the rapid development of computer science and engineering technology, the amount of electronic data is growing very large. This raises a major problem: how to quickly select the most useful information from the very large amounts of data stored in databases. Technologies such as data mining and information retrieval systems are used to solve this problem. Feature selection is the process of finding the most useful features in the large amounts of data stored in databases, and its main goal is to improve the accuracy of classification.

Feature selection algorithms have become especially popular in applications where datasets with thousands of features are available, including text, bio-microarray, and image data. For high-dimensional data, which typically contains a large number of irrelevant and redundant features, feature selection improves both the accuracy and the efficiency of classification. The main objective of feature selection algorithms is therefore to identify and remove as many irrelevant and redundant features as possible, because irrelevant features do not contribute to the accuracy of classification, and for redundant features most of the information contained in one feature is already available in other features.

Feature selection algorithms are broadly categorized into four groups: filter, wrapper, embedded, and hybrid methods. Filter methods use statistical properties of the features to filter out irrelevant and redundant features; they are independent of the learning algorithm and their computational complexity is very low, but the accuracy of the learning algorithm is not guaranteed. Wrapper methods give high learning accuracy but have very large computational complexity. Embedded methods build the selection into the learning process itself; examples are artificial neural networks and decision trees. Hybrid methods combine filter and wrapper methods, achieving good performance with learning algorithms while keeping time complexity low.

The rest of the paper is organized as follows: Section 2 gives background on the FAST algorithm, Section 3 presents the method, Section 4 describes its modules, and Section 5 concludes the paper.

2. Background

The fast clustering-based feature selection algorithm (FAST) works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. Among filter feature selection methods, the application of cluster analysis has been demonstrated to be more effective than traditional feature selection algorithms.

The general graph-theoretic clustering approach is simple: compute a neighborhood graph of instances, then delete any edge in the graph that is much longer/shorter (according to some criterion) than its neighbors. The result is a forest, and each tree in the forest represents a cluster. We adopt minimum spanning tree (MST)-based clustering algorithms, because they do not assume that data points are grouped around centers or separated by a regular geometric curve, and they have been widely used in practice.
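To make this concrete, the following is a minimal Python sketch of MST-based clustering over instances, in the spirit of the procedure just described. The Euclidean edge weights, the "much longer than the average MST edge" cut criterion, and the factor 2.0 are assumptions for illustration, not details fixed by the paper.

from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(points, cut_factor=2.0):
    # Complete graph: pairwise Euclidean distances between (distinct) instances.
    weights = squareform(pdist(points))
    mst = minimum_spanning_tree(weights).toarray()

    # Delete any edge much longer than the average MST edge -- one possible
    # criterion for an "inconsistent" edge; the factor is an assumption.
    edge_lengths = mst[mst > 0]
    mst[mst > cut_factor * edge_lengths.mean()] = 0.0

    # Each tree of the resulting forest is one cluster.
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels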

3. Methods

3.1 Fast Clustering-based Feature Subset Selection Algorithm (FAST)

The fast clustering-based feature subset selection algorithm (FAST) identifies and removes the irrelevant and redundant features of high-dimensional data. It works in two steps: (i) in the first step, the features are divided into clusters by using Prim's algorithm (that is, a graph-based clustering method), and (ii) in the second step, the most representative feature is selected from each cluster to form a useful subset that improves the accuracy of classification. FAST performs very well on text, microarray, and image data.

ALGORITHM 1: FAST
inputs:  D(F1, F2, ..., Fm, C) - the given data set, and θ - the T-Relevance threshold.
output:  S - the selected feature subset.

// Module 1: Irrelevant Feature Removal
for i = 1 to m do
    T-Relevance = SU(Fi, C)
    if T-Relevance > θ then
        S = S ∪ {Fi};
    end if
end loop

// Module 2: Minimum Spanning Tree Construction
G = NULL;   // G is a complete graph
for each pair of features {Fi', Fj'} ⊆ S do
    F-Correlation = SU(Fi', Fj')
    Add Fi' and/or Fj' to G with F-Correlation as the weight of the corresponding edge;
end loop
minSpanTree = Prim(G);

// Module 3: Tree Partition and Representative Feature Selection
Forest = minSpanTree
for each edge Eij ∈ Forest do
    if SU(Fi', Fj') < SU(Fi', C) and SU(Fi', Fj') < SU(Fj', C) then
        Forest = Forest - Eij
    end if
end loop
S = ∅;
for each tree Ti ∈ Forest do
    FR = argmax over Fk' ∈ Ti of SU(Fk', C)
    S = S ∪ {FR}
end loop
return S

4. Modules

4.1 Removal of Irrelevant Features

Feature subset selection is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility, and many feature subset selection methods have been proposed for machine learning applications. Given a dataset D with m features F = {F1, F2, ..., Fm} and class C, the task is to automatically identify the features relevant to the target. For many traditional methods, the generality of the selected features is limited and the computational complexity is large. Hybrid methods combine filter and wrapper methods, using a filter method to reduce the search space that will be considered by the subsequent wrapper. In FAST, this module keeps only those features whose relevance to the class exceeds a threshold, as sketched below.
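As an illustration of this module, here is a minimal sketch in Python, assuming each feature is an already-discretized column of values; the helper su(x, y) is the symmetric uncertainty sketched in Section 4.2 below, and theta plays the role of the T-Relevance threshold θ of Algorithm 1.

def remove_irrelevant(features, target, theta):
    # features: dict mapping feature name -> list of discrete values
    # target:   list of class labels C, aligned with the feature columns
    # Keep only features whose T-Relevance SU(Fi, C) exceeds the threshold.
    return {name: column for name, column in features.items()
            if su(column, target) > theta}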

4.2 T-Relevance and F-Correlation Calculation

The relevance T-Relevance between a feature and the target concept C, the correlation F-Correlation between a pair of features, the feature redundancy F-Redundancy, and the representative feature R-Feature of a feature cluster can all be defined. According to these definitions, feature subset selection can be viewed as the process that identifies and retains the strongly T-Relevant features and selects R-Features from feature clusters. The heuristics behind this are that: (1) irrelevant features have no or weak correlation with the target concept, and (2) redundant features are assembled in a cluster, from which a representative feature can be taken.
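Since the pseudocode of Algorithm 1 scores both T-Relevance and F-Correlation with the symmetric uncertainty SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), a minimal sketch of SU for discrete-valued features may help; the plug-in entropy estimates and the function names are our assumptions.

import numpy as np
from collections import Counter

def entropy(values):
    # Empirical (plug-in) Shannon entropy in bits.
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def su(x, y):
    # Symmetric uncertainty: 2 * I(X; Y) / (H(X) + H(Y)).
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))    # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy       # I(X; Y) = H(X) + H(Y) - H(X, Y)
    denom = hx + hy
    return 2.0 * mutual_info / denom if denom > 0 else 0.0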


4.3 MST Construction

To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study, with extensive experiments comparing FAST against several representative feature selection algorithms. We construct a weighted minimum spanning tree, i.e., an MST that connects all vertices such that the sum of the weights of the edges is minimal, using the well-known Prim's algorithm.
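For illustration, here is a textbook O(V^2) Prim's algorithm in Python for a complete graph given as a weight function; the function and parameter names are our own. For FAST, the vertices would be the features that survived Module 1, and weight(a, b) would return the F-Correlation SU of the two feature columns.

def prim_mst(names, weight):
    # names:  list of vertices (feature names); assumed non-empty
    # weight: function (u, v) -> edge weight of the complete graph
    # best maps each vertex not yet in the tree to its cheapest (weight, parent).
    best = {v: (weight(names[0], v), names[0]) for v in names[1:]}
    edges = []
    while best:
        v = min(best, key=lambda u: best[u][0])   # cheapest edge entering the tree
        w, parent = best.pop(v)
        edges.append((parent, v, w))
        for u in best:                            # relax the remaining vertices
            if weight(v, u) < best[u][0]:
                best[u] = (weight(v, u), v)
    return edges                                  # MST as (u, v, weight) triples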

4.4 Relevant Feature Calculation

During tree partitioning, unnecessary edges are removed; each deletion results in two disconnected trees T1 and T2. After all the unnecessary edges are removed, a forest is obtained, in which each tree represents a cluster. Finally, the most relevant feature of each cluster is selected to comprise the final feature subset.
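Putting the partition and selection steps together, the following sketch cuts every MST edge whose F-Correlation is smaller than the T-Relevance of both of its endpoints (the criterion in Algorithm 1) and then picks one representative feature per remaining tree. The union-find scaffolding and all names are our own; su() and prim_mst() refer to the earlier sketches.

def partition_and_select(names, mst_edges, features, target):
    # Cut edge (u, v) when SU(Fu, Fv) < SU(Fu, C) and SU(Fu, Fv) < SU(Fv, C).
    kept = [(u, v) for (u, v, w) in mst_edges
            if not (w < su(features[u], target) and w < su(features[v], target))]

    # Group the features into the trees of the resulting forest (union-find).
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]         # path halving
            n = parent[n]
        return n
    for u, v in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)

    # One representative per cluster: the feature most relevant to the target.
    return [max(group, key=lambda n: su(features[n], target))
            for group in clusters.values()]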

5. Conclusion

In this paper we discussed the implementation of the FAST algorithm and how it can be used to retrieve relevant information from a high-dimensional dataset in a short time. By following the steps of the FAST algorithm, accurate information can be extracted from high-dimensional data. The experimental results show that FAST retrieves information faster than existing algorithms such as CFS and Relief, and that it is an efficient and reliable feature selection algorithm.


