
Data Mining Information Visualisation – Beyond Charts and Graphs

Nigel Robinson
Faculty of Informatics, University of Ulster, Jordanstown
[email protected]

Mary Shapcott
Faculty of Informatics, University of Ulster, Jordanstown
[email protected]

Abstract

Data Mining has become a major academic research area over the last ten years. However, in the transition from academic prototypes to commercial products there have been few successes, with commercial Data Mining applications failing to make any significant impact in the marketplace. In this paper it is argued that most Data Mining applications concentrate on the model-building phase of the Data Mining process and rarely engage the user in other stages. The paper reviews the challenges of producing successful Data Mining applications and in particular the role of information visualisation. Visualisation in Data Mining tends to be used to present final results, rather than playing an important part throughout the entire process. We argue that an immersive environment may provide the user with a more suitable interface than is commonly offered. We present a Virtual Data Mining Environment which attempts to integrate a Data Mining application interface and information visualisation in a seamless manner, using the concept of ‘liquid data’.

Index Terms – Data Mining, Information Visualisation, Virtual Environments, Immersive Interfaces

1. Introduction

Data Mining (or Knowledge Discovery in Databases) is defined as ‘the exploration and analysis by automatic or semi-automatic means of large quantities of data in order to discover meaningful patterns or rules’ [1]. In effect, Data Mining exploits the abundant supply of data accumulated in the day-to-day functioning of a system in order to build models of a domain of interest. Such models can then be utilised to make predictions about the domain.

Clearly, there are potential commercial benefits from employing Data Mining. The use of mined information can improve business processes or identify new business opportunities. Notable examples of the successful application of Data Mining include the following:

• Daimler-Chrysler used Data Mining to identify unusual data densities in their warranty claim data, allowing quality engineers to quickly identify problems caused by external usage or internal process variations [2].

• Farmers Insurance Group exploited Data Mining by identifying sports car owners who owned a second conventional car as being a low risk group among the sports car owner population. They then targeted them with specially priced policies [3].

2. The Development of Data Mining

Data Mining was first recognised as a field in its own right ten years ago, when researchers from a number of different fields started to explore the possibilities of extracting information from the large quantities of data held in databases. Several workshops titled “Knowledge Discovery in Databases” were held, and associated newsletters, conferences and journals followed [4][5].

Data Mining is now moving from academic research to application in mainstream information technology. There are now over one hundred commercial Data Mining software (siftware) products in the marketplace. It is considered that the software is in the early adoption phase but is expected to mature. It has been predicted that within the next few years Data Mining and Knowledge Discovery will be an integral part of enterprise information technology [6].

Despite the general interest in Data Mining and the potential business rewards of a successful Data Mining strategy, siftware products have struggled in the marketplace. In 1998 there were over seventy companies producing siftware but, of these, all but one or two were estimated to be losing money on their Data Mining software alone. Many relied on consulting services to survive. Large companies (e.g. IBM, SGI) used Data Mining software as a loss leader, the software attracting revenue from hardware sales and consulting [7].

The vendors of Data Mining products themselves [8] recognise that Data Mining software still has some way to go before reaching sufficient maturity to be accorded ‘mainstream’ status. Vendors interviewed on this subject have agreed that the academic origins of some packages are obvious, with only a thin layer of applications software built on the core technology. The resulting products are suitable only for statisticians and expert analysts. Hipp [9] and Thearling [10] agree that Data Mining software still has too much of the focus of academia, i.e. on the underlying algorithms. They argue that end users are most interested in integrating Data Mining within a domain, with the goal of solving business problems.

Proceedings of the Sixth International Conference on Information Visualisation (IV’02) 1093-9547/02 $17.00 © 2002 IEEE

3. Data Mining Application Failings

In the hypothesis testing approach to Data Mining there are essentially six main phases [11]:

• Identifying business problems and postulating potential solutions or explanations. These constitute hypotheses. The Data Mining process is conducted with a view to testing these hypotheses.

• Extracting the data set for mining from the source database.

• Pre-processing the data, i.e. dealing with missing values or noisy data.

• Building the required model from the data.

• Analysing the model against the hypotheses.

• Actioning the model, i.e. modifying the domain process based on the information provided by the Data Mining.

Applications which are designed around the core Data Mining technology (e.g. a neural network or classification tree algorithm) are often focused on only one phase of the overall process, i.e. the model building phase, and as a result support only a subset of the user tasks. It has been argued convincingly that the core algorithm should only be a small part of the overall application (perhaps only 10% of the whole) [10]. Such applications fail because they do not support the user throughout the whole process.

The common definitions of Data Mining imply an automatic, machine driven process and as a result the trend amongst researchers and commercial developers has been to design products which perform everything automatically. The exclusion of user interaction is not always feasible when considered in the context of the overall Data Mining process presented at the start of this section. While the application of the algorithms may be automated, the other phases certainly imply a requirement for user involvement [12]. For example, the formulation of the hypotheses to be tested must be performed by the human Data Miner, as must the assessment of the actioning of the information.

Current Data Mining tools provide little support for interaction and as a result provide limited, if any, possibilities of incorporating domain knowledge. If Data Mining is to incorporate domain knowledge then application design must be more user (Data Miner) centred than function centred. The users in question are those from the domain where the Data Mining is being applied, as these are the individuals with the prerequisite domain knowledge. Data Mining applications should not be focussed on expert analysts but on domain based Data Miners.

4. Visualisation and Data Mining

4.1 Requirement for Visualisation

As outlined in the previous section, there are a number of challenges for designers in providing suitable Data Mining applications, i.e. applications which:

a) Support the whole of the Data Mining process.
b) Are usable by domain resident Data Miners who are not expert analysts.
c) Support Data Miners who are from a diverse range of backgrounds.
d) Allow interaction with the Data Mining tool.

Improved visualisation facilities have been suggested as the means to meet the challenges [13][14]. It is argued that visualisation will aid the understanding by non-expert analysts of the information extracted by the Data Mining.


Data Mining searches for ‘hidden information’, which contrasts with standard database retrieval operations. In standard retrieval operations the user is presented with material they already know resides within the database. Data Mining, by definition, produces information that the user did not know or had only hypothesised about. The assimilation of such information is not always intuitive. Comprehension of the output of Data Mining is considered essential, as the Data Miner must understand the models constructed through Data Mining before they can assess the usefulness of a model and how it can be actioned in the domain.

With understanding comes trust. Trust is essential to encourage Data Miners to act on the model. Presenting information with no insight into how it has been derived, i.e. leaving the Data Mining tool totally ‘black box’, precludes the Data Miner from explaining the logic of the model to colleagues and from building confidence in the output. Consequently, it can be argued that the Data Miner should understand not only the model but also how the model was derived. While the detailed inner workings of the underlying Data Mining tool do not have to be visible, the visualisation of the information should provide a degree of semantics with respect to the tool from which it was derived.

Understanding can be further improved by putting information in context. This can be accomplished by allowing the user to interact with the visualisation to answer ‘what if’ questions and to supply domain knowledge.

4.2 Current Status of Data Mining Information Visualisation

While most vendors of Data Mining products have recognised the need for visualisation and many offer it as a feature of their products, the visualisations on offer in many cases [15] are conventional 2D charting facilities or 3D versions of their 2D counterparts. Some applications (notably Mineset by SGI [23]) offer 3D visualisation with interactive exploration. In all cases the visualisations are embedded within a standard Windows-style GUI with the conventional artefacts of buttons, scroll bars etc. Visualisation is, in the main, currently employed in Data Mining at the end of the process as a means of presenting the results. Given the failure of conventional Data Mining products noted earlier, it is proposed that the use of visualisation within the Data Mining process should go beyond its current employment.

4.3 Proposed Development of Information Visualisation in Data Mining

It is proposed that visualisation should go beyond the charts and graphs which present information at the end of the Data Mining process. Data Mining should employ visualisation throughout the entire Data Mining process. Data Mining is, after all, about identifying patterns in data: the greater the employment of visualisation, the greater the opportunity to use our own pattern recogniser, the human eye-brain system. If visualisation were to be applied throughout the Data Mining process, intuition suggests that such an approach could only be successful if the visualisation was consistent across phases, with the boundaries between the phases transparent. It is clear that with this approach to visualisation the level of integration of the user interface and the visualisation is high, perhaps to the point where the visualisation is the interface and vice versa.

A review of Data Mining products indicated that the majority offered limited scope for interaction [22]. Most of the interaction and interface control was restricted to the traditional ‘Windows’ GUI elements. This constrains the mental model required for understanding to that which can be constructed from the components of a Windows tool kit such as buttons, menus, sliders etc. It is possible that interfaces which immerse the user in the interface and visualisation could prove more effective.

The modern games culture provides evidence to support this approach. Games players are immersed in unfamiliar environments often based on fantasy rather than reality; however, they rapidly learn to navigate and solve problems within these environments. Shneiderman [16] argues that the highly interactive and challenging environment of a computer game is suited to the field of entertainment but would not be relevant to application development. Application users focus on their task and do not want to encounter unpredictable events or results.
This is certainly true when using a word processor to type a letter but, as noted earlier, Data Mining is very much an exploratory process and the user cannot predict the results. Could it be that Data Miners may find it acceptable to explore data in search of information while immersed in a virtual world constructed from the domain database, in the same manner as a games player explores the labyrinths of a fantasy world in search of points?

Advances in graphics card technologies, high level 3D APIs and games/modelling languages allow immersive virtual environments to be created on the desktop. The user views a conventional display but the design model is based on the user ‘being in’ rather than ‘looking at’ the interface. There has been some research into the use of such virtual environments for information visualisation in the context of Data Mining [17]. The focus has been on presenting data in a reasonably conventional manner, e.g. scatter plots or surfaces, but using the third dimension in conjunction with shape, colour etc. to display the additional attributes of the data.

The hypothesis being proposed in this paper, as touched upon in previous sections, is that an effective Data Mining application can be produced by integrating the information visualisation with the interface design within a virtual environment. The information visualisation should be presented in the context of the metaphor framework on which the interface is based. The chosen metaphor framework should be accessible to a range of users. There is evidence that such an approach would be successful, as similar techniques have been employed in other fields:

• Kahn [18] developed ToonTalk, originally as an animated programming environment for children. The programming environment maps abstract programming concepts to concrete metaphors. Object oriented programs are constructed using the interface metaphors in what is essentially a video game.

• Hutzler [19] suggests an imaginative use of virtual environments for visualisation. A garden metaphor is used for the real time visualisation of complex systems. Data from the underlying system is mapped to the “Data Garden”, and the evolution of the garden over time is representative of the state of the underlying system.

The Data Mining community has recognised the potential of a metaphor-based design with respect to virtual environments, and research into the formalisation of such design is being undertaken [20].

5. Virtual Data Mining Environment Design

5.1 Proposed Approach

The previous section proposed the hypothesis that a virtual environment, integrating application interface and information visualisation in a seamless manner, could provide the basis for an effective Data Mining application. It is proposed that this approach may prove successful where more conventional approaches have failed.

5.2 Evaluation of the Approach

In order to test the hypothesis it was considered that a Virtual Data Mining Environment (VDME) should be designed and constructed. This could then be tested against more conventional Data Mining applications. As there are many Data Mining tools (e.g. classification, clustering) and a number of database models, it was considered necessary to set finite bounds on the scope of the Virtual Data Mining Environment. To this end it was decided to select a single tool and database format. The tool selected was classification, as it is perhaps the most widely employed Data Mining technique. The specific implementation is a decision tree building algorithm (ID3). The source database selected was one built on the relational model.
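The paper names ID3 as the classifier but gives no implementation detail. As an illustration only, the following minimal Python sketch shows the standard ID3 procedure: at each node the attribute with the highest information gain is chosen and the data is partitioned on its values until each subset is pure. It is not the VDME code. The sample records, the attribute names (`car`, `second_car`, `risk`) and the tuple-based tree representation are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy (in bits) of the target attribute over the rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Reduction in entropy obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder


def id3(rows, attributes, target):
    """Return a leaf (class label) or a node (attribute, {value: subtree})."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # pure subset: stop with a leaf
        return labels[0]
    if not attributes:             # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return (best, branches)


def classify(tree, row):
    """Follow the decision nodes until a leaf label is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[row[attribute]]
    return tree


# Hypothetical training records, invented for illustration.
ROWS = [
    {"car": "sports", "second_car": "yes", "risk": "low"},
    {"car": "sports", "second_car": "no", "risk": "high"},
    {"car": "sports", "second_car": "no", "risk": "high"},
    {"car": "conventional", "second_car": "yes", "risk": "low"},
    {"car": "conventional", "second_car": "no", "risk": "low"},
    {"car": "conventional", "second_car": "no", "risk": "low"},
]

TREE = id3(ROWS, ["car", "second_car"], "risk")
```

In terms of the design described later in the paper, each tuple node would correspond to a filter station and each leaf label to a leaf data tank.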

5.3 VDME Design

5.3.1 Selection of the Metaphor Framework

The first step in the design process was to analyse the Data Mining tasks from the perspective of a domain based Data Miner (not an expert analyst). This was essentially the decomposition of the operations performed in each of the phases identified in Section 3. The next, and perhaps the most difficult, step in the design process was to establish a suitable overall metaphor theme or framework which would map the Data Mining tasks (database tasks, model building tasks and information analysis tasks) to interface tasks in a transparent manner. Additionally the selected metaphor framework had to seamlessly integrate data and information visualisations.

In the course of examining techniques employed for database visualisation, Shneiderman’s filter flow model was discovered [21]. The model employs the metaphor of a liquid to represent data. Filters are used to select data by decreasing the flow to the required dataset. Essentially queries are executed by setting the filters to represent AND (sequential flow) or OR (parallel flow) queries (or a combination of both). The ‘liquid data’ metaphor led on to the concept of building a Data Mining application interface as a ‘processing plant’ of pipes, tanks and filters in which the raw data is ‘refined’ into information. Illustrations of the visualisations/interface elements which can be derived from this metaphor framework are outlined in the subsequent section.
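The AND/OR behaviour of the filter flow model can be illustrated with a short Python sketch. This is a simplified reading of the metaphor, not code from the VDME or from [21]: filters placed in series pass only rows that satisfy every filter (AND), while filters placed in parallel pass rows that satisfy at least one branch (OR). The function names, the sample `POLICIES` tank and its attributes are all hypothetical.

```python
def in_series(*filters):
    """Sequential flow: a row must pass every filter in turn (AND)."""
    return lambda row: all(f(row) for f in filters)


def in_parallel(*filters):
    """Parallel flow: a row gets through if any branch passes it (OR)."""
    return lambda row: any(f(row) for f in filters)


def flow(tank, station):
    """Let the contents of a tank flow through a filter station."""
    return [row for row in tank if station(row)]


# Hypothetical source tank (one relation), invented for illustration.
POLICIES = [
    {"id": 1, "car": "sports", "second_car": "yes", "age": 30},
    {"id": 2, "car": "sports", "second_car": "no", "age": 45},
    {"id": 3, "car": "sports", "second_car": "no", "age": 25},
    {"id": 4, "car": "conventional", "second_car": "yes", "age": 50},
]


def is_sports(row):
    return row["car"] == "sports"


def has_second_car(row):
    return row["second_car"] == "yes"


def is_40_or_over(row):
    return row["age"] >= 40


# Roughly equivalent to the SQL:
#   SELECT * FROM policies
#   WHERE car = 'sports' AND (second_car = 'yes' OR age >= 40)
STATION = in_series(is_sports, in_parallel(has_second_car, is_40_or_over))
DATASET = flow(POLICIES, STATION)
```

Here `DATASET` holds the rows that would arrive in the ‘dataset tank’ after passing the filter station; tightening any filter reduces the flow.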


5.3.2 Interface Elements and Visualisation

The visualisation elements, interface objects and task mappings from which the Virtual Data Mining Environment is composed are described in this section. The source database is represented as a series of interconnected tanks (Figure 1). Each tank represents a relation. The interconnection of the tanks can be used to present the relational model.

Figure 1 – Source Database – Presented as tanks with pipes and query filter stations

Data pre-processing can be supported by the appropriate visualisation of the data within each tank. For example, data with missing values could lie at a different level in the tank from the ‘clean data’, in the same way that oil floats on water. Associated colour coding to indicate the boundaries between the layers affords the Data Miner the opportunity to ‘drain off’ or ‘distil’ part of the data to remove noisy or missing values.

Extraction of a dataset for mining is normally performed in the task domain through the execution of queries (e.g. in SQL). The execution of queries is mapped to the configuration of filters in the VDME domain. Filter stations are placed ‘in line’ between tanks. Adjusting the filter parameters at any given filter station alters the flow of data, effectively running a query on the source database (Figure 2).

Figure 2 – Inside a Filter Station

All the pipes join to a ‘dataset tank’ into which the data passing through the filters, i.e. the dataset required for the data mining, flows. If the dataset is large then the ‘fill level’ of this tank can be adjusted to obtain a sample, or training set, of the source data.

In order to classify the data the system builds a decision tree based on the target classification. The decision tree is presented as a series of filter stations (the decision nodes) and data tanks (the leaf nodes), as illustrated in Figure 3. The leaf data tanks contain the (ideally) ‘pure’ liquid data. This is pure in the sense that the data in any given leaf tank has the same value of the classification attribute. The filter stations and data tanks are essentially a visualisation of the decision tree produced by the classifier.

Figure 3 – Decision Tree – Presented as tanks and query filter stations
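The two tank operations described in this section, separating incomplete records into their own layer and adjusting the fill level to obtain a sample, can be sketched in Python as follows. This is an assumption-laden illustration rather than the VDME implementation: missing values are represented by `None`, the fill level is taken to be a fraction between 0 and 1, sampling is uniform at random, and the function names `settle` and `fill_to_level` are hypothetical.

```python
import random


def settle(tank):
    """Split a tank into two layers: clean rows and rows with missing values."""
    clean = [row for row in tank if all(v is not None for v in row.values())]
    incomplete = [row for row in tank if any(v is None for v in row.values())]
    return clean, incomplete


def fill_to_level(tank, level, seed=0):
    """Draw a random sample whose size is the given fraction of the tank."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0 and 1")
    size = round(len(tank) * level)
    return random.Random(seed).sample(tank, size)


# Hypothetical tank contents: ten rows, two of which have a missing age.
TANK = [{"id": i, "age": None if i in (3, 7) else 20 + i} for i in range(10)]

CLEAN, INCOMPLETE = settle(TANK)          # 'drain off' the incomplete layer
TRAINING_SET = fill_to_level(CLEAN, 0.5)  # half-full dataset tank
```

A fixed seed is used only so that the sample is repeatable; an interactive environment would more plausibly resample as the user adjusts the level.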

The tree visualisation can be built entirely automatically, or alternatively the Data Miner can be allowed to add and configure the filter stations at each level. The key feature to note is that the result visualisation is metaphorically based, and is consistent with the earlier phases of the process, i.e. it is the same representation as was used to extract the source data. The user is familiar with the visualisation and can explore the filter stations (the split nodes) and understand their meaning in terms of partitioning and selecting the data from the source data set. The various combinations of split nodes from the dataset tank to the leaf node data tanks effectively constitute the set of rules used to classify data, i.e. the queries which have to be run to partition unclassified data into predicted classifications.

The visualisation can be further enhanced by colour coding the leaf node data tanks and similarly coding the data flowing through the pipes. The ‘purity’ of the data as it flows through each filter can be determined by observing how close it is to the colour coding of a classifier attribute value. The data colour is determined by the percentage of the data which corresponds to each classifier attribute value. This allows the Data Miner to determine where the tree can be pruned and which attributes serve to influence classification the most.
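One way to read the colour coding described above is as a weighted blend: each class value is assigned a colour, and the colour of the data at any point is the average of those colours weighted by the class proportions. The paper does not specify the blending rule, so the following Python sketch is only one plausible interpretation; the RGB palette, the class labels and the function names are assumptions.

```python
from collections import Counter

# Hypothetical palette: one RGB colour per value of the classification attribute.
CLASS_COLOURS = {"low": (0, 0, 255), "high": (255, 0, 0)}


def class_proportions(rows, target):
    """Fraction of the rows holding each value of the target attribute."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return {label: n / total for label, n in counts.items()}


def blend(proportions, palette):
    """Average the class colours, weighted by the class proportions."""
    channels = [0.0, 0.0, 0.0]
    for label, share in proportions.items():
        for i, component in enumerate(palette[label]):
            channels[i] += share * component
    return tuple(round(c) for c in channels)


def purity(proportions):
    """Share of the majority class; 1.0 means the data is 'pure'."""
    return max(proportions.values())


# Hypothetical flows: a mixed pipe and a pure leaf tank.
MIXED = [{"risk": "low"}, {"risk": "low"}, {"risk": "low"}, {"risk": "high"}]
PURE = [{"risk": "low"}, {"risk": "low"}]

MIXED_COLOUR = blend(class_proportions(MIXED, "risk"), CLASS_COLOURS)
PURE_COLOUR = blend(class_proportions(PURE, "risk"), CLASS_COLOURS)
```

Under this reading, a filter station whose outflows are barely purer in colour than its inflow is a candidate for pruning.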

6. Implementation

The implementation of the environment is still work in progress. Upon completion it is intended that it will be used by novice Data Miners and that its usability will be benchmarked against more conventional Data Mining and Visualisation applications.

The application is being implemented using Dark Basic. Dark Basic is a rapid application development tool for games programmers and is distributed with an extensive library of 3D models. Although sometimes maligned by games programming purists, it allows standalone interactive 3D environments to be created very quickly. The applications are built upon the DirectX 7.0 API but the details of the API are hidden from the developer by the Dark Basic language. A recent Dark Basic enhancement has provided the facilities to integrate custom DLLs. This means the language can be extended via user functions packaged in DLLs, which affords the opportunity to add database and network access facilities as required.

The interface as indicated in Figures 1 to 3 runs comfortably (with no signs of latency) on a PII 350 MHz with 128 MB RAM and a 4 MB 3Dfx card. Current PC technology is well in advance of this specification, suggesting that it is capable of rendering much more sophisticated environments if required. Should large databases prove to be a problem, the network interface provides the opportunity to distribute the graphical interface, the database access and the mining tools across several machines.

7. Conclusions

A literature survey indicated that there is a significant challenge in making Data Mining accessible to Data Miners across a wide range of domains. Visualisation has been suggested as a means of meeting this challenge, but visualisations as offered by the current batch of Data Mining applications do not appear to be satisfactory. Visualisation has to go beyond charts and graphs and become an integral part of the interface in a manner consistent with the metaphor framework around which the interface has been designed. It is proposed that a Virtual Data Mining Environment based on the metaphor of data being a liquid which can be filtered and purified could provide a suitable Data Mining application interface. A Virtual Data Mining Environment based around this metaphor is currently under construction. While it may yet prove difficult to expand this environment to large datasets or datasets which have attributes with numerous potential values, benchmarking the usability of this environment against more conventional approaches would give an indication as to the correctness of the hypothesis.


REFERENCES

[1] M. Berry, G. Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, Chapter 1, Wiley Computer Publishing, 1997.

[2] Information Discovery Inc., Case Studies, 2002. [http://www.datamining.com/casestudies.htm]

[3] E. Booker, “Insurer Mines Data on Drivers”, Internetweek.com, June 1999. [http://www.internetweek.com/case/study062199-1.htm]

[4] G. Piatetsky-Shapiro, Knowledge Discovery Nuggets. [http://www.kdnuggets.com/]

[5] ACM, SIGKDD Charter, 1999. [http://www.acm.org/sigkdd/charter]

[6] U. Fayyad, Editorial, SIGKDD Explorations, Issue 1, pp. 1-3, 1999.

[7] R. Kohavi, “Crossing the Chasm: from Academic Machine Learning to Commercial Data Mining”, International Conference on Machine Learning, July 1998.

[8] D. Howlett, “Data Mining”, Information Week, Issue 28, 1998.

[9] J. Hipp et al., “Data Quality Mining”, DMKD2001 Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001. *

[10] K. Thearling, “Thought on the Current State of Data Mining Software Applications”, DSStar, Vol 2, 1998. [http://www.tgc.com/dsstar/]

[11] P. Smyth, “Breaking Out of the Black-Box: Research Challenges in Data Mining”, DMKD2001 Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001. *

[12] M. Ankerst, “Human Involvement and Interactivity of the Next Generation’s Data Mining Tools”, DMKD2001 Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001. *

[13] K. Thearling et al., “Visualising Data Mining Models”, Information Visualisation in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001.

[14] R. Kohavi, “Data Mining and Visualisation”, NAE US Frontiers of Engineering Symposium, 2000.

[15] M. Goebel, L. Gruenwald, “A Survey of Data Mining and Knowledge Discovery Software Tools”, SIGKDD Explorations, Vol 1, pp. 20-33, 1999.

[16] B. Shneiderman, Designing the User Interface, Addison-Wesley, Chapter 6, 1998.

[17] Henrik et al., “Methods for Visual Mining of Data in Virtual Reality”, Proceedings of the International Workshop on Visual Data Mining (VDM@PKDD2001), 2001. +

[18] K. Kahn, “ToonTalk – An Animated Programming Environment for Children”, Journal of Visual Languages and Computing, 7(2), pp. 197-217, 1996.

[19] Hutzler et al., “Grounding Virtual Worlds in Reality”, Proceedings of Virtual Worlds First International Conference, pp. 274-284, 1998.

[20] S. Dimoff, “Towards the Development of Environments for Designing Visualisation Support for Visual Data Mining”, Proceedings of the International Workshop on Visual Data Mining (VDM@PKDD2001), 2001. +

[21] B. Shneiderman, Designing the User Interface, Addison-Wesley, Chapter 15, 1998.

[22] N. Robinson, “MPhil/DPhil Transfer Report”, UUJ, 2001.

[23] Silicon Graphics, “Mineset”, 2000. [http://sgi.com/software/mineset]

* [http://www.cs.cornell.edu/johannes/dmkd2001.htm]
+ [http://www.informatik.uni-freiburg.de/~ml/ecmlpkdd/WSProceedings/w03/index.html]
