Statement of Research Interest

Pradeep Kumar
Junior Research Associate
SETLabs, Infosys Technologies Limited
Bangalore, INDIA - 560100
Email: [email protected]

In recent years, advanced information systems have enabled the collection of increasingly large amounts of data. The interdisciplinary field of Knowledge Discovery in Databases (KDD) provides the means to analyze such data. The most important step within the KDD process is data mining, which is concerned with the extraction of valid patterns. Recent research in data mining focuses on stream data mining, sequence data mining, web mining, text mining, visual mining, multimedia mining and multi-relational data mining. The emphasis of my research work over the past few years has been on sequence data mining, specifically on developing solutions for intrusion detection and web personalization. In general, my research revolves around the analysis of data and the use of data mining, statistical and machine learning techniques in a wide range of real-life applications. My future research aims at developing scalable, novel data mining algorithms for application areas such as web mining, bioinformatics and forensic science.

PhD Research Work

Thesis Title: An Investigation of Classification and Clustering of Sequential Data
Advisors: Dr. Raju S. Bapi, Reader, University of Hyderabad, INDIA and Dr. P. Radha Krishna, Associate Professor, IDRBT, Hyderabad, INDIA

The extraction of useful and non-trivial information from huge amounts of data is called Data Mining (DM). DM is part of a bigger framework, referred to as Knowledge Discovery in Databases (KDD), that covers a complex process, from data preparation to knowledge modeling. DM tasks include classification (assigning each record of a database to one of a predefined set of classes), clustering (finding groups of records that are close according to some user-defined metric), association rules (determining implication rules for a subset of record attributes) and many more. A large number of algorithms have been developed to perform these and other tasks, drawing on fields ranging from machine learning to statistics through neural and fuzzy computing.

A huge amount of data is collected every day in the form of sequences. Discovering useful and meaningful patterns in large databases of sequences is a challenging and interesting task. For example, how can we use web log information more efficiently and effectively for web personalization? Given a user session, how early and how efficiently can we raise an alarm indicating that the current session may belong to a hacker? How do we summarize the differences between biomedical images of retinas in healthy and diseased conditions? Moreover, how do we correlate information from various modalities (e.g., between image and text) for content-based information retrieval? The main goal is to discover patterns that make the information in sequence databases useful and accessible.

Research on sequence data concentrates on the discovery of frequently occurring patterns. Comparatively little work has been carried out on the classification and clustering of sequence data. The contribution of my research is the development of new methods for these two tasks. My work introduces solutions for real-world applications: classification of system call sequences in intrusion detection, and clustering of web navigational sequences in web usage mining. Experiments were conducted on the DARPA'98 IDS (sequence classification) and msnbc web navigational (sequence clustering) benchmark datasets.

Initially, we established the hypothesis that, when comparing two sequences, considering only the order information or only the content information embedded in them results in poor performance in classification and clustering tasks. The kNN classification and PAM clustering algorithms were used for the classification and clustering tasks, respectively, and a sliding window technique was used for extracting subsequences. To establish the hypothesis, four distance/similarity measures, namely the Euclidean, Cosine, Jaccard and Binary Weighted Cosine measures, were used along with various sliding window sizes. The motive was to address the following two questions:
• Does incorporating the order of occurrence (sequence information) enhance the efficiency of a classifier on sequence data?
• Can a better grouping of data that preserves sequentiality be achieved by incorporating sequence information?
Thus, we empirically established the following two facts:
• Sequence information is important when performing classification and clustering of sequence data.
• Apart from sequence information, content information is also important.
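The distinction between content and order information can be illustrated with a small sketch. This is not the code used in the thesis experiments; the function names and the toy sessions are hypothetical, and the order-aware score (longest common subsequence length divided by the longer sequence length) is an assumption chosen only for illustration. It is not one of the four measures listed above.

```python
from collections import Counter
from math import sqrt


def sliding_windows(seq, size):
    """Return all contiguous subsequences of length `size` (stride 1)."""
    return [tuple(seq[i:i + size]) for i in range(len(seq) - size + 1)]


def jaccard(a, b):
    """Content-only similarity: overlap between the sets of symbols."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Content-only similarity on symbol-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[s] * cb[s] for s in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programme)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_similarity(a, b):
    """Order-aware similarity (illustrative assumption): LCS length / longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    # Hypothetical system-call sessions: same calls, opposite order.
    session_a = ["open", "read", "write", "close"]
    session_b = ["close", "write", "read", "open"]
    print(sliding_windows(session_a, 2))
    print(jaccard(session_a, session_b), cosine(session_a, session_b))
    print(order_similarity(session_a, session_b))
```

On these two toy sessions the content-only measures both report a similarity of 1.0, while the order-aware score is 0.25, which is the kind of gap that motivates using both kinds of information together.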
Based on the results of the hypothesis testing, a new similarity measure, S³M, was designed. It considers both the order of occurrence and the content information when computing the similarity between two sequences. Better classification accuracy and clustering quality were achieved with the newly devised measure. In the clustering of sequences, the goodness of the clusters was measured using the Average Levenshtein Distance. A new partitional clustering algorithm for sequence data, SeqPAM, was proposed. SeqPAM differs from PAM in medoid selection as well as in the optimization function. The superiority of the proposed algorithm over the PAM clustering algorithm was demonstrated. Further, a new indiscernibility-based rough agglomerative hierarchical clustering algorithm for sequential data was proposed. Here, the indiscernibility relation has been extended to a tolerance relation, with the transitivity property being relaxed. Initial clusters are formed using a similarity upper approximation. Subsequent clusters are formed using the concept of constrained-similarity upper approximation, wherein a condition of relative similarity is used as the merging criterion. The results of the proposed approach were compared with those of the traditional hierarchical clustering algorithm using vector coding of sequences.

In general, my interest is in solving data mining problems by combining techniques from traditional areas such as machine learning, statistics and databases, as well as the areas of data analysis and mining. My aim is to develop automated and novel algorithms and methods that extract useful information from massive databases arising in commercial, information retrieval, scientific and life sciences applications, with the aim of extending scientific knowledge. Working at the forefront of a new discipline requires an understanding of the scientific problems, a firm grasp of existing algorithmic and computational techniques, and the ability to adapt and develop innovative analytic and computational techniques for the specific application domain. This is the kind of work I have always enjoyed and excelled at. As stepping stones toward the solutions, my plan is first to extend the applications of existing algorithms. In the course of this work, novel problem domains will be identified, with a focus on continuously developing new techniques to address existing and new application areas. My research would also include a study of the current state of the art in its application areas. My intention is to provide guidelines to policy makers in the application areas on which my research focuses. It would be a pleasure for me to collaborate with other research groups and contribute my expertise to multi-disciplinary projects.

List of Publications from the thesis

Journals
1. Kumar, P., Krishna, P. R., Bapi, R. S. and De, S. K.: Rough clustering of sequential data, Data & Knowledge Engineering (doi:10.1016/j.datak.2007.01.003), 2007.
2. Kumar, Pradeep, Radha Krishna, P. and Raju, B. S.: SeqPam: A clustering algorithm for sequential data. In International Journal of Data Warehousing and Mining 3(1), pp. 29-53, 2007.
3. Kumar, Pradeep, Radha Krishna, P. and Raju, B. S.: A new measure for sequential Data, Pattern Recognition Letters.
4. Kumar, Pradeep, Radha Krishna, P. and Raju, B. S.: Towards a new approach to Classification and Clustering of Sequential Data, Pattern Recognition.

Book Chapter
5. Kumar, Pradeep, Radha Krishna, P., Raju, B. S. and Padmaja, T. M.: Advances in Classification of Sequence Data, Advances in Data Warehousing and Mining, IGI Publisher, 2007.

International Conferences
6. Kumar, Pradeep, Radha Krishna, P., Bapi, R. S. and De, S. K.: Clustering using Similarity Upper Approximation. In IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2006, Vancouver, Canada, July 2006, pp. 4230-4235.
7. Kumar, Pradeep, Rao, M. V., Radha Krishna, P. and Bapi, R. S.: Using Sub-sequence Information with kNN for Classification of Sequential Data. In Proceedings of International Conference on Distributed Computing and Internet Technology (ICDCIT), LNCS Springer Verlag, Vol 3816, Bhubaneswar, India, December 2005, pp. 536-546.
8. Kumar, Pradeep, Rao, M. V., Radha Krishna, P., Bapi, R. S. and Laha, A.: Intrusion detection system using sequence and set preserving metric. In Proceedings of IEEE International Conference on Intelligence and Security Informatics, IEEE-ISI 2005, LNCS Springer Verlag, Vol 3495, Atlanta, Georgia, May 2005, pp. 498-504.
9. Kumar, Pradeep, Radha Krishna, P., Bapi, R. S. and De, S. K.: Web Usage Mining Using Rough Agglomerative Clustering. In Proceedings of 7th International Conference on Enterprise Information Systems (ICEIS (2)), Miami, USA, May 2005, pp. 315-320.

