Statement of Research Interest


Pradeep Kumar
Junior Research Associate
SETLabs, Infosys Technologies Limited
Bangalore, INDIA - 560100
Email: [email protected]

In recent years, advanced information systems have enabled the collection of increasingly large amounts of data. The interdisciplinary field of Knowledge Discovery in Databases (KDD) provides the means to analyze data at this scale. The most important step within the KDD process is data mining, which is concerned with the extraction of valid patterns. Recent research in data mining focuses on stream data mining, sequence data mining, web mining, text mining, visual mining, multimedia mining and multi-relational data mining.

The emphasis of my research work over the past few years has been on sequence data mining, specifically on developing solutions for intrusion detection and web personalization. In general, my research revolves around the analysis of data and the use of data mining, statistical and machine learning techniques in a wide range of real-life applications. My future research work aims at developing scalable, novel data mining algorithms for application areas such as web mining, bioinformatics and forensic science.

PhD Research Work

Thesis Title: An Investigation of Classification and Clustering of Sequential Data
Advisors: Dr. Raju S. Bapi, Reader, University of Hyderabad, INDIA and Dr. P. Radha Krishna, Associate Professor, IDRBT, Hyderabad, INDIA

The extraction of useful and non-trivial information from huge amounts of data is called Data Mining (DM). DM is part of a bigger framework, referred to as Knowledge Discovery in Databases (KDD), that covers a complex process from data preparation to knowledge modeling. DM tasks include classification (assigning each record of a database to one of a predefined set of classes), clustering (finding groups of records that are close according to some user-defined metric), association rule mining (determining implication rules for a subset of record attributes) and many more. A large number of algorithms have been developed to perform these and other tasks, drawing on interdisciplinary fields of science from machine learning to statistics through neural and fuzzy computing.

A huge amount of data is collected every day in the form of sequences. Discovering useful and meaningful patterns in large databases of sequences is a challenging and interesting task. For example, how can we use web log information more efficiently and effectively for web personalization? Based on a user session, how early and how efficiently can we raise an alarm indicating that the current session may belong to a hacker? How do we summarize the differences between biomedical images of retinas in healthy and diseased conditions? Moreover, how do we correlate information from various modalities (e.g., between image and text) for content-based information retrieval? The main goal is to discover patterns that make the information in sequence databases useful and accessible.

Research on sequence data concentrates on the discovery of frequently occurring patterns. Comparatively little work has been carried out on the classification and clustering of sequence data. The contribution of my research work is the development of new methods for classification and clustering of sequence data. It introduces solutions for real-world applications dealing with the classification of system call sequences in intrusion detection and the clustering of web navigational sequences in web usage mining. Experiments were conducted on the DARPA'98 IDS (sequence classification) and msnbc web navigational (sequence clustering) benchmark datasets.

Initially, we established the hypothesis that, when comparing two sequences, considering only the order information or only the content information embedded in the two sequences results in poor performance in classification and clustering tasks. The kNN classification and PAM clustering algorithms were used for the classification and clustering tasks, respectively, and a sliding window technique was used for extracting subsequences. To establish the hypothesis, four distance/similarity measures, namely the Euclidean, Cosine, Jaccard and Binary Weighted Cosine measures, were used along with various sliding window sizes. The motive was to address the following two questions:

• Does incorporating the order of occurrence (sequence aspects) enhance the efficiency of a classifier on sequence data?
• Can a better grouping of data that preserves sequentiality be achieved by incorporating sequence information?

Thus, we empirically established the following two facts:

• Sequence information is important when performing classification and clustering of sequence data.
• Apart from sequence information, content information is also important.
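The role of the sliding window in this setup can be illustrated with a small sketch. It is not the thesis code: the function names (`sliding_windows`, `jaccard`, `cosine`, `knn_predict`), the toy system call sequences and the majority-vote rule are assumptions made for illustration only. It shows that content-only measures cannot tell a sequence from its reversal, whereas the same measure applied to sliding-window subsequences can.

```python
from collections import Counter
from math import sqrt


def sliding_windows(seq, size):
    """Return all contiguous subsequences of length `size` (assumes size >= 1)."""
    return [tuple(seq[i:i + size]) for i in range(len(seq) - size + 1)]


def jaccard(a, b):
    """Jaccard similarity on the sets of items; ignores order and frequency."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def cosine(a, b):
    """Cosine similarity on item-frequency vectors; ignores order."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def knn_predict(query, labelled, k, sim):
    """Majority vote among the k labelled sequences most similar to `query`.

    `labelled` is a list of (sequence, label) pairs (hypothetical layout).
    """
    ranked = sorted(labelled, key=lambda item: sim(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy system call sequences: same content, opposite order.
a = ["open", "read", "write", "close"]
b = list(reversed(a))

print(jaccard(a, b))  # content only: the two sequences look identical
print(jaccard(sliding_windows(a, 2), sliding_windows(b, 2)))  # windows expose the order difference
```

Comparing windows instead of single items is one simple way to let a set-based measure see local order; larger windows capture longer ordered runs at the cost of sparser matches.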
Based on the results of the hypothesis testing, a new similarity measure, S3M, was designed. It considers both the order of occurrence and the content information when computing the similarity between two sequences. Better classification accuracy and clustering quality were achieved with the newly devised measure. In clustering of sequences, the goodness of the clusters was measured using the Average Levenshtein Distance.

A new partitional clustering algorithm for sequence data, SeqPAM, was proposed. SeqPAM differs from PAM in medoid selection as well as in the optimization function. The superiority of the proposed algorithm over the PAM clustering algorithm was demonstrated.

Further, a new indiscernibility-based rough agglomerative hierarchical clustering algorithm for sequential data was proposed. Here, the indiscernibility relation is extended to a tolerance relation, with the transitivity property relaxed. Initial clusters are formed using a similarity upper approximation. Subsequent clusters are formed using the concept of a constrained-similarity upper approximation, wherein a condition of relative similarity is used as the merging criterion. The results of the proposed approach were compared with those of the traditional hierarchical clustering algorithm using vector coding of sequences.

In general, my interest is in solving data mining problems by combining techniques from traditional areas such as machine learning, statistics and databases with those from data analysis and mining. My aim is to develop automated and novel algorithms and methods that extract useful information from massive databases arising in commercial, information retrieval, scientific and life sciences applications, with the goal of extending scientific knowledge. Working at the forefront of a new discipline requires an understanding of the scientific problems, a firm grasp of existing algorithmic and computational techniques, and the ability to adapt and develop innovative analytic and computational techniques for the specific application domain. This is the kind of work I have always enjoyed and excelled at.

As stepping stones toward these solutions, my plan is first to extend the applications of existing algorithms. In the course of this work, new problem domains will be identified, with a focus on continuously developing new techniques to address existing and new application areas. My research work will also include the study of the current state of the art in the application areas of my research. My intention is to provide guidelines to policy makers in the application areas on which my research will focus. It would be a pleasure for me to collaborate with other research groups and contribute my expertise to multi-disciplinary projects.
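As an illustration of the combined order-and-content similarity measure (S3M) and the Average Levenshtein Distance mentioned in the thesis summary above, the following is a minimal sketch under stated assumptions. This statement does not give the formula, so the sketch assumes a weighted sum of a normalized longest-common-subsequence term (order) and a Jaccard term (content). The weight `p`, the function names and the all-pairs averaging within a cluster are hypothetical choices, not the definitions used in the thesis.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def s3m(a, b, p=0.5):
    """Assumed form: p * order similarity + (1 - p) * content similarity."""
    if not a and not b:
        return 1.0
    order = lcs_length(a, b) / max(len(a), len(b))  # order of occurrence
    sa, sb = set(a), set(b)
    content = len(sa & sb) / len(sa | sb)           # shared items (Jaccard)
    return p * order + (1 - p) * content


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def average_levenshtein(cluster):
    """Mean edit distance over all pairs in a cluster (assumed averaging scheme)."""
    pairs = list(combinations(cluster, 2))
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


a = ["open", "read", "write", "close"]
b = list(reversed(a))
print(s3m(a, a))  # identical sequences
print(s3m(a, b))  # same content, reversed order: lower than for identical sequences
```

With `p` close to 1 the sketch behaves like an order-only measure, and with `p` close to 0 like a content-only one, which mirrors the two extremes examined in the hypothesis testing. A lower average Levenshtein distance within a cluster indicates that its member sequences are closer to one another.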

List of Publications from the thesis

Journals

1. Kumar, P., Krishna, P. R., Bapi, R. S. and De, S. K.: Rough clustering of sequential data. Data & Knowledge Engineering (doi:10.1016/j.datak.2007.01.003), 2007.
2. Kumar, Pradeep, Radha Krishna, P. and Raju, B. S.: SeqPam: A clustering algorithm for sequential data. In International Journal of Data Warehousing and Mining 3(1), pp. 29-53, 2007.
3. Kumar, Pradeep, Radha Krishna, P. and Raju, B. S.: A new measure for sequential Data. Pattern Recognition Letters.
4. Kumar, Pradeep, Radha Krishna, P. and Raju, B. S.: Towards a new approach to Classification and Clustering of Sequential Data. Pattern Recognition.

Book Chapter

5. Kumar, Pradeep, Radha Krishna, P., Raju, B. S. and Padmaja, T. M.: Advances in Classification of Sequence Data. Advances in Data Warehousing and Mining, IGI Publisher, 2007.

International Conferences

6. Kumar, Pradeep, Radha Krishna, P., Bapi, R. S. and De, S. K.: Clustering using Similarity Upper Approximation. In IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2006, Vancouver, Canada, July 2006, pp. 4230-4235.
7. Kumar, Pradeep, Rao, M. V., Radha Krishna, P. and Bapi, R. S.: Using Sub-sequence Information with kNN for Classification of Sequential Data. In Proceedings of International Conference on Distributed Computing and Internet Technology (ICDCIT), LNCS Springer Verlag, Vol 3816, Bhubaneswar, India, December 2005, pp. 536-546.
8. Kumar, Pradeep, Rao, M. V., Radha Krishna, P., Bapi, R. S. and Laha, A.: Intrusion detection system using sequence and set preserving metric. In Proceedings of IEEE International Conference on Intelligence and Security Informatics, IEEE-ISI 2005, LNCS Springer Verlag, Vol 3495, Atlanta, Georgia, May 2005, pp. 498-504.
9. Kumar, Pradeep, Radha Krishna, P., Bapi, R. S. and De, S. K.: Web Usage Mining Using Rough Agglomerative Clustering. In Proceedings of 7th International Conference on Enterprise Information Systems (ICEIS (2)), Miami, USA, May 2005, pp. 315-320.

