J. WEB INFOR. SYST. VOL. 1, NO. 1, MARCH 2005


Ontology Based Query Expansion Framework for Use in Medical Information Systems

D. Wollersheim, Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia, [email protected]
W. J. Rahayu, Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia, [email protected]

Received: January XX 2005; revised: November XX 2005

Abstract— This paper presents a framework which combines data and text retrieval techniques to exercise and evaluate ontology based query expansions. We prepare by using linguistic techniques to identify query and document concepts, locating them in an ontologically defined semantic space. Expansions originate from the identified query concepts, with success determined by matching in the relevant document set. We identify three orthogonal dimensions that can affect query expansion success: relationship source, success measure technique, and query expansion technique. Expansion technique is further divided into six different categories: simple pruning, complex probability, voting, directional, semantic propagation, and multiple source concept. We describe each technique and show examples where it would be useful. The system architecture facilitates plugging in various expansion and evaluation routines, and flowing results from one method to the next. The system is useful for microanalysis of query expansion, discovering which components of ontologically derived knowledge most influence query expansion success. In this work, we apply our framework to the medical domain.

Index Terms— Information Retrieval, Query Expansion, Ontologies

I. INTRODUCTION

Information retrieval is the process of taking a query and retrieving documents relevant to that query. Current practice, for example in most search engines, works at a lexical level, retrieving only documents containing the words from the query (Sherman 2002). We call this targeted text retrieval, because the user must target their query with existing words. The alternative situation, in which the words from the query do not exist in the relevant documents, is called imprecise retrieval. Query expansion (QE) addresses imprecise retrieval by modifying the query, adding in words related to the original query words. There are many possible relationships. This paper describes a system which determines the characteristics of good query expansion by discovering and describing the general features of this network of valuable relationships.

Query expansion is useful because imprecise queries cannot retrieve the entire set of relevant documents. From the mere words in a query, we cannot tell the exact meaning that a searcher has in mind, as automated language comprehension does not yet exist. Yet we do have resources that give us more information than the query words provide alone.

The current situation presents a rich opportunity to explore ontology based query expansion. Information is becoming more explicitly interconnected, and therefore semantically richer. Ontological resources are increasing in number, complexity and quality; for example, Wordnet (Wordnet 2004), Cyc (Lenat 1995), and UMLS (Unified Medical Language System) (McCray, Aronson et al. 1993). Also, the query expansion task provides another use for these resources, and a way to test them against real world information needs. QE is a simple, readily evaluated application of domain specific knowledge to a discrete problem.

The rise of ontologies gives us a wealth of semantic connections, describing relationships between words. This resource is merely potential because the use of ontological semantic knowledge to perform query expansions is not fully explored. Because of the number of connections in an ontology, there are many possibly fruitful expansions and expansion strategies. This is only exacerbated by the increase in ontological richness. Current thinking about the use of ontological intraconnections is primitive, especially regarding their use in query expansion. State of the art query expansion uses hierarchical relationships only, expanding in the directions of parents and children. In spite of specific requests for detailed evaluation of query expansion relationships (Hersh, Hickam et al. 1994; Jones, Gatford et al. 1995), there has been little systematic research into the characteristics of good expansion relationships beyond unidirectional hierarchical expansions.

We start by performing linguistic analysis of both the query and document set. This anchors the items in semantic space, and allows us to directly evaluate the query expansions.
In this way, we can precisely assess success or failure of individual concept expansions by whether they exist in the relevant document set. This makes it easy to test a wide variety of expansion techniques against a range of input query descriptive variables. In other words, if we think of documents and queries as points in an information space, we want to determine where the documents relevant to a query exist in the semantic neighbourhood around a query, and find which extensions to that neighbourhood best describe the intent of the query. An ontology is a conceptualisation of that information space, providing reference points, which we will call concepts, and the links between them.

The purpose of our system is to create a framework that can help determine which expansion parameters, for a given query, best generate the ontological path from query concepts to relevant document concepts within a semantic neighbourhood. This paper describes the expansion techniques used in that system.

II. EXISTING WORK IN QUERY EXPANSION

QE needs a source of relationships to provide the connections between the query words and the relevant documents. As evaluation of these relationships plays a central role in our work, it behooves us to review the various QE schemes from the point of view of relationship source. There are two main categories: 1) unstructured relationships derived from a document corpus analysis, and 2) hard coded relationships from human sources, such as thesauri or ontologies. Figure 1 shows the methods we will cover in this section. Corpus based QE generally provides more accurate expansions than the more universal pre-existing relationship based QE. This follows from the fact that corpus based methods derive their relationships from the actual corpus upon which the query retrieves; those relationships are not necessarily accurate in the general case.

Fig. 1. Types of algorithmic query expansion.

A. Corpus based QE

Corpus QE has three general streams: relevance feedback, local analysis, and global analysis. Relevance feedback uses user interaction to determine a few relevant documents, and then retrieves more like these. Local analysis does a similar thing automatically, looking for relationships from among the documents initially retrieved. Global analysis obtains interterm relationships from the entire corpus, and expands the query from these.
In relevance feedback (Salton and Buckley 1990; Harman 1992) documents are retrieved according to one of the standard IR methods, and then the user chooses relevant documents (from among the top 10 or 20 document


query. This reformulated query is then rerun, and the process continues. The bias can either expand the query by adding in terms, reweight query terms, or both. The methodology used to modify the query to make it accord with the identified relevant documents can be based on either the vector or the probabilistic model. In its favour, this algorithm provides a recognition phase that involves the user. The user identifies documents which would contain useful keywords, and the computer derives the keywords from the documents. Relevance feedback is good for retrieval from within large collections, the 'needle in a haystack' search.

Similar to relevance feedback, local analysis expands a query based on the context established by the set of documents initially retrieved as a result of a query (Attar and Fraenkel 1977). This is done by retrieving the top ranking documents for the current query, and then modifying and rerunning the current query based on this retrieved content. This technique can also be used to refine expansions generated by pure thesaurus or co-occurrence based relationships, by looking at the local context of the current query. Xu and Croft (1996) refined local analysis techniques by incorporating ideas from global analysis into a local analysis algorithm. This included retrieving and searching for passages, not documents, and comparing the similarity of candidate expansion terms to the entire query rather than to individual query terms.

In automatic global analysis, the relationships that are used to modify the query come from the entire corpus. We discuss two variants. The first is based on a similarity thesaurus (Qiu and Frei 1993). This thesaurus is not derived from a co-occurrence matrix as used in the early QE work. Instead, the similarity thesaurus, while still arising from the document corpus, uses a distance metric which is based on the terms being concepts in a concept space.
Innovatively, the concept space is indexed by the documents in which the terms appear. Additionally, this algorithm expands the query by choosing terms close (in concept space) to the centroid of the entire query, rather than those terms close to the individual query terms. The second form of global analysis joins terms by first clustering the documents into tight clusters, and then choosing the low frequency, high discrimination terms from within these clusters (Crouch and Yang 1992).

B. Relationship Based QE

Whereas the relationships in corpus based QE are derived from the collocation of terms within documents, relationship based QE uses relationships from outside sources. We divide these sources into two areas: thesauri and ontologies. Thesauri are sets of words that are identified as being related to each other. At their simplest, they nominate synonymous relationships. Some thesauri contain information about other types of relationship. While the distinction is not clear cut, ontologies tend to specify relationships in a more formal manner, and they tend to be used in more of a knowledge retrieval context. As such, the constraints on ontology based IR are more rigid, and the results more 'correct', than those of thesaurus based QE.

The area of automatic medical query expansion has been especially fruitful. This is due to the existence of


text, and high user demand. The premier thesaurus resource in the medical area is UMLS. There has been much interest in using UMLS to facilitate query expansion, with mixed results. Early UMLS based QE showed that such a system, combined with retrieval feedback, returned results significantly above baseline statistical retrieval (Srinivasan 1996; Aronson and Rindflesch 1997), showing the promise of using the rich resource provided by UMLS. Hersh et al. (Hersh, Price et al. 2000) found that unconstrained UMLS based synonym and hierarchical query expansion on the Ohsumed collection degraded aggregate retrieval performance, but some specific instances of QE improved individual query performance. They noted that there was a role for investigation of specific cases where QE is successful.

Overall, there is a wide variety of possible relationships usable for expansion. Basic categories include synonymy, hierarchy, and associative type relationships. Within these gross categories, there are many different types. Green et al. (Green 2001) point out that some categories probably have an enumerable, finite number of types (hierarchy, synonymy), but the set of associative type relationships is open, limited only by the ability of the human mind to make connections.

Relationship based QE, like corpus based QE, has been found to be useful, and the area has been widely explored. The UMLS, too, has been found to be a useful source of QE relationships. Even given this broad examination, there are places where more work is needed. The first point concerns the types of relationships used by QE. The possible set of useful QE relationships is large and growing, and there will be no simple categorisation of relationships that will yield perfect query expansion, because the human mind is more complex than this. Notwithstanding this, there are possibly more complex categorisations than are currently being used, and this complexity is unexplored.
There has been little systematic study of topicality prediction in terms of relationship type. Existing QE systems use a conservative set of expansion relationships, and ontology based systems especially so, accepting only strict (i.e. subsumption based) relationships. A broad survey of the relative effectiveness of different types of relationships, in relation to the other thesaurus and query variables, has not been done. This concern is a dynamic one, in that as relationship sources continue to grow and become more complex, they will need continued study.

In the QE field, relationship based QE has not been as successful as corpus based QE. We hypothesize that this is partly due to the fact that relationship based QE has not been evaluated at a fine level, which motivates our research. Our motivation is enhanced by opportunism; there exists rich metadata associated with thesaurus and ontology relationships, which gives us more statistical raw material on which to base our analysis. This contrasts with the lack of metadata for corpus based expansions.

III. QUERY EXPANSION FRAMEWORK DESIGN

The framework into which our query expansion techniques fit is shown in Figure 2. We generate our expansions based on the concepts and relationships from the UMLS medical


document test collection (Ohsumed). The UMLS is a database put out by the US National Institutes of Health, aiming at unifying different sets of medical terminology. We use UMLS as a source for document concept identification and query expansion. UMLS is wide ranging, has been widely used, and is of unparalleled richness, being an amalgamation of many different vocabularies, with much metadata. It is ripe for data discovery. There is no other candidate that could provide such a resource.

We use Ohsumed to determine query expansion effectiveness, by looking at which documents are relevant to a query. Ohsumed contains documents, a set of queries, and relevance judgements connecting them. Ohsumed was developed by Hersh in 1994 (Hersh, Hickam et al. 1994) to test the efficacy of different medical information retrieval strategies. The collection consists of 355013 documents, being the titles and abstracts of Medline (a medical journal database) from years 1987 to 1991. The queries come from real life situations, and the relevance judgements were provided by domain experts. While there are other medical test collections, we choose Ohsumed for several reasons. It is one of the largest available medical test collections, and because of the statistical nature of our evaluation, size is important. More importantly, because query expansion has been previously studied with Ohsumed (Hersh, Price et al. 2000), we will be able to compare the effectiveness of our method.

Our evaluation necessitates knowing where Ohsumed resources exist in the UMLS defined semantic space. Specifically, we will need to identify the UMLS concepts in both the Ohsumed queries and documents. Thanks to advances in document concept identification, e.g. (Nadkarni, Chen et al. 2001), this can be done automatically. After we have set up the structure, we then generate and evaluate expansions. An overview of the query expansion evaluation framework can be seen in Figure 2.
In preparation, the documents from the Ohsumed collection (1) are categorised in step 2 using the UMLS concepts. The end result of this step is the set of documents associated with their constituent concepts (3). A second, similar process takes place with the Ohsumed test query collection (4). The queries are also categorised (5), albeit manually, and associated with their constituent UMLS concepts. After the data preparation stage, we are ready to judge the various query expansion techniques. For each concept identified query (9), we expand it using one of a set of expansion techniques (10). In the evaluation step (12), the expanded query set (11) is matched against the relevant document concept set (7), leading to a success measure for this query and expansion technique.

The purpose of this paper is to describe the different methods of query expansion (10), and give examples in the medical domain of where these methods would be specifically useful. We also list the UMLS and Ohsumed specific variables which will influence query expansion success.

A. A Filter Based Approach

This section details the query expansion framework implementation. We model the query expansions as operations on a graph, which generate a set of neighbouring concepts

Fig. 2. Query expansion evaluation framework.

expansion is a graph based on the entire ontology, a source concept starting point (or, in the case of an entire query, a set of concepts), and a method used to select neighbouring concepts from the ontology. The framework generates the actual expansions by running an increasingly fine grained succession of source concept centric filters against the ontology. A filter is an operation on a graph which either prunes or scores nodes. Successive filters are used for reasons of modularity, and as a way to manage the size of the input space. Some of the expansion methods do not scale well, and cannot use the entire ontology as an input. Because of this, the first level filters often do drastic pruning. This makes logical sense, because successful expansions will probably be from the immediate ontological neighbourhood of a source concept. The output of the final filter is a set of concepts predicted to be good expansions of the source concept.

The sets of concepts generated by each expansion for a single query are unioned, and this result is then passed to evaluation modules, which judge retrieval success. The result answers the question, "how well does this type of expansion expand this set of query terms?" The rationale for such a system to generate and judge expansions is that there are many possibly fruitful query expansions. We want to contrast expansion schemes, judging utility. The query expansions are evaluated on a statistical basis. We compare retrieval rates for the different methods, looking at which methods, and in what situations, are the best performers. This leads to a set of rules that determine what situations benefit from which methods.

B. Measures for Evaluating Expansion Success

There are three dimensions of expansion success. The first dimension is precision. This is a simple measure that takes a set of expanded concepts and compares it against the set of concepts that exist in the relevant documents: the number of true positives divided by the number of concepts retrieved. Recall, another standard information retrieval measure, measures how well the expansion covers the entire relevant concept set. It is not a useful measure here, because of the sheer size of the target concept set, being all the concepts from all the relevant documents. Recall will always be a very low number because, unlike the case in document retrieval, we are not trying to retrieve the entire relevant concept set. We are merely trying to generate fruitful alternative concepts.

Our precision measure can be further refined by the observations that 1) most query expansion methods generate an expansion set that includes the source concept itself, and 2) the source concept is highly likely to be present in relevant documents. This


more effective than larger expansions. In large expansions, the source concept set is relatively smaller, and has less effect. To combat this problem, we propose an alternative precision measure, called expanded precision. With this, we only count query concepts that have been expanded into. This is a truer picture of query expansion success, because such a measure is more clearly measuring the success of the expansion process alone.

Our second dimension of retrieval success is an amendment of the basic precision measure. Instead of merely counting concept hits and misses, we sum a weight calculated by a standard information retrieval inverse document frequency measure (Harman 1992). We call this measure IDF, and it is useful because a commonly occurring concept has lower value than a rare one. The formula for this weight is: $W_{concept} = \log\frac{(N_{total} - N_{doc}) + 0.5}{N_{doc} + 0.5}$, where $N_{total}$ is the total number of documents in the collection, and $N_{doc}$ is the number of documents containing this concept.

The final evaluation dimension is related to the rarity of the target set. When documents are retrieved via word statistical means, they are given a score, which in turn determines their ranking among the entire document population with respect to this query. In this regard, we can say that word statistical methodology does a good job of retrieving the high scoring relevant documents, while it does a poor job of retrieving the low scoring relevant documents. Given this dichotomy, it is useful to see which query expansions are better at discovering concepts from low ranked relevant documents. We measure this by dividing the relevant document set into four quartiles according to word statistical score, and measuring precision in each quartile. This has the effect of giving us four different target sets of varying value to test our expansion strategies against.

IV. CORPUS AND ONTOLOGY DESCRIPTIVE VARIABLES

When contrasting query expansions, there are two categories of variable: descriptive and manipulable. The descriptive variables are inherent in a query, pre-existing any expansion, while the manipulable variables are those that we can vary to alter query expansion output. With the descriptive variables, we hypothesise that their presence will influence expansion method success. These variables apply at either the document or concept level. Instances include document term frequency, query type, concept semantic type, and concept depth. See Table 1 for a summary.

Document term frequency is a standard information retrieval measure corresponding to the frequency of a term across the entire document collection. It is of interest because it has been shown to be useful in standard IR. The rarer a concept, the more value it has when it is found. This variable is also used in the calculation of concept weight, which is one of the alternative success measures.

An interesting feature in the area of medical IR is the advent of medical query types (Berrios, Kehler et al. 1998). These are a set of general templates that fit over specific medical queries; for example, "What is the definition of X?" and "Does X cause Y?" Table 2 shows a complete list. They were conceptualised as a way to categorise document content according to the type of question it answers. We hypothesise that query type,


TABLE I. DESCRIPTIVE VARIABLES INHERENT IN THE SOURCE CONCEPT OR QUERY WHERE WE HYPOTHESISE AN INFLUENCE ON EXPANSION SUCCESS.

Variable Name            Applies to  Description
Document term frequency  Concept     The frequency of this concept in the document collection.
Query type               Query       The type of standard medical query template that the query fits into.
Semantic type            Concept     The UMLS specific semantic type of a concept.
Semantic type depth      Concept     The number of edges from this semantic type to the top of the semantic type tree.
Concept depth            Concept     The number of edges from this concept to the top of the metathesaurus tree.
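
The success measures of Section III-B, including the concept weight that depends on the document frequency listed in Table I, can be illustrated with a short sketch. This is a minimal example and not the framework's own code: the function names, the representation of concepts as Python sets, the toy concept identifiers, and the reading of expanded precision as "precision with the source concepts removed" are all assumptions made for illustration. The IDF score here simply sums the weights of the hits, without any normalisation.

```python
import math

def precision(expanded, relevant):
    """Fraction of expanded concepts found in the relevant document concepts."""
    if not expanded:
        return 0.0
    return len(expanded & relevant) / len(expanded)

def expanded_precision(expanded, relevant, source):
    """Precision over newly added concepts only (source concepts removed).
    Assumed reading of the 'expanded precision' measure."""
    return precision(expanded - source, relevant)

def idf_weight(n_total, n_doc):
    """W_concept = log(((N_total - N_doc) + 0.5) / (N_doc + 0.5))."""
    return math.log((n_total - n_doc + 0.5) / (n_doc + 0.5))

def idf_score(expanded, relevant, doc_freq, n_total):
    """Sum of IDF weights over the hits, so rare concepts count for more."""
    return sum(idf_weight(n_total, doc_freq[c]) for c in expanded & relevant)

# Hypothetical toy data: concept ids and document frequencies are invented.
source = {"c1"}
expanded = {"c1", "c2", "c3", "c4"}
relevant = {"c1", "c2", "c9"}
doc_freq = {"c1": 500, "c2": 10, "c3": 40, "c4": 5, "c9": 3}

p = precision(expanded, relevant)                     # 2 hits out of 4 concepts
ep = expanded_precision(expanded, relevant, source)   # 1 hit out of 3 new concepts
score = idf_score(expanded, relevant, doc_freq, n_total=1000)
```

In this toy case the source concept c1 occurs in half of the collection, so its weight is zero and the whole IDF score comes from the rarer concept c2, which is the behaviour the weighting is meant to produce.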

TABLE II. MEDICAL QUERY PROTOTYPES.

Code  Defining Question
1     What is the definition of X?
2     What are the risk factors for X?
3     What is the aetiology of X?
4     Can X cause Y?
4.1   What is the differential diagnosis of Y?
5     What distinguishes X from Y?
6     How can X be used in the evaluation of Y? (including diagnosis and follow-up)
6.1   How can Y be evaluated?
7     How can X be used in the treatment of Y?
7.1   What are the treatments for Y?
8     How can X be used in the prevention of Y?
8.1   How can Y be prevented?
9     What are the performance characteristics of X in the setting of Y?
10    How does X compare with Y in the setting of Z?
11    Is X contraindicated by Y?
12    What are the sequelae and prognosis of X?
13    What are the physical properties of X?

an X or a Y) will affect expansion strategy success. To this end, the Ohsumed queries have been categorised by a medical domain expert with one or more query types.

In a similar vein, we believe that concept semantic type will have an effect on expansion success. Concept semantic type is defined to be the semantic type of a query concept within the UMLS semantic network. Examples include Organic Chemical and Tissue. There are 134 distinct semantic types among the concepts identified in the Ohsumed documents. As with query type, a UMLS concept can be classified with multiple semantic types.

Many ontological relationships can be thought of as having some type of hierarchy, and this can be used to assign a depth for each concept. There are several ways to calculate depth. One is to use distance from the top or bottom of the ontology. This is the distance from the furthest node in an upward or downward direction, specifically, continuous travel via parent or child type relationships. In the case of UMLS, there is another alternative. In UMLS, every concept is classified by semantic type, which has a precalculated depth, defined to be the number of nodes separating the semantic type from the top of the semantic type hierarchy. Many of


TABLE III. MRREL BASE RELATIONSHIP TYPES.

Code  Relationship Type
PAR   Parent
CHD   Child
QB    Qualified by
AQ    Allowed qualifier
RB    Broader relationship
RN    Narrower relationship
RL    Synonym or like relationship
RO    Other relationship
SIB   Sibling relationship
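
The PAR relationship of Table III is the link that an upward concept depth calculation would follow. As an illustrative sketch only, assuming the parent links have already been loaded into a dictionary and contain no cycles (the toy hierarchy, the `PARENTS` mapping, and the function name `upward_depth` are hypothetical and are not taken from UMLS), the distance to the furthest node in the upward direction can be computed recursively with memoisation:

```python
from functools import lru_cache

# Hypothetical toy hierarchy: concept -> tuple of parents (PAR edges).
# Concept "d" has two parents at different depths.
PARENTS = {
    "top": (),
    "a": ("top",),
    "b": ("top",),
    "c": ("a",),
    "d": ("c", "b"),
}

@lru_cache(maxsize=None)
def upward_depth(concept):
    """Length of the longest chain of parent edges from concept to a top node.
    Assumes the parent links are acyclic; a cycle would recurse forever."""
    parents = PARENTS.get(concept, ())
    if not parents:
        return 0
    return 1 + max(upward_depth(p) for p in parents)

depth_of_d = upward_depth("d")  # longest upward path is d -> c -> a -> top
```

Each concept is visited once because results are cached, which fits the observation that upward paths are cheaper to follow than downward ones. For a very deep hierarchy an iterative traversal would avoid Python's recursion limit.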

is in the nature of subsumption that the number of children is greater than or equal to the number of parents. This makes calculating the maximum downward distance more expensive than calculating the maximum upward distance, as there are more downward paths to follow. Also, the upward distance is a more reliable indicator of depth; it will tend to converge, while the downward path has no such limit.

V. MANIPULABLE VARIABLES INFLUENCING QUERY EXPANSION SUCCESS

While inherent variables are set in stone, manipulable variables are those factors that we have influence over. In our case, this encompasses all the different ways we can choose query expansions. There are two orthogonal variables that can be manipulated to vary query expansion outcome: relationship source and type of expansion generation technique. Relationship source refers to the choice of relationship pool from which we obtain our expansions. Expansion generation technique is a larger source of variation. This refers to the way we choose actual expansions, given a source of relationships.

A. Relationship Source

Relationship source is the source of the relationship information that we will use to traverse the ontological graph. While this may seem a moot point, there is ample complexity here. Within UMLS itself, there are three major sources of relationship information: the metathesaurus itself, the co-occurrence tables, and the semantic network. The metathesaurus consists of a large number of medical concepts, the relationships between them, and descriptive information for each. The concepts themselves are points in semantic space, which map to a set of synonymic strings. Within the metathesaurus itself, the relationships are stored in the MRREL file, and have several attributes. There are eight major types of relationship, based on broad semantic categories (see Table 3). Additionally, the relationships derive from various source vocabularies.
We will test both of these variations to see if they have an influence on expansion success. The UMLS co-occurrence data is another source of relationships between the metathesaurus concepts. We will compare the expansion functionality of the co-occurrence relationships with that provided by the thesaurus relationships. Each concept in the thesaurus belongs to one or more categories in the semantic network. The semantic network then defines a set of relationships between the semantic


network was constructed by the overseers of UMLS, while the metathesaurus was agglomerated from multiple data sources), the relationships in the semantic network are more principled. A straightforward way to generate expansions using these relationships would be to use the semantic network alone. While this idea has value due to the principled nature of the semantic network relationships, there are other factors that make it infeasible. The problem is that, while each UMLS concept is in a semantic class, and each class is connected to one or more semantic network classes, not every concept in a semantic network category has that same identified relationship with every other concept in the related semantic network class. For example, the semantic network has the relationship Gene or Genome ISA Fully Formed Anatomical Structure, but it would not be correct to say that a gene is a kidney. As such, expansion solely using the semantic network would be at best a blunt instrument.

As an alternative, we will instead use these relationships to constrain metathesaurus relationship expansion. In this part of the experiment, we expand only via relationships that join concepts of the same semantic type. This idea has been used in other work based on the UMLS semantic network (Berrios, Kehler et al. 1998). This has the potential to restrain and therefore improve thesaurus based expansion by avoiding the large semantic jumps that could arise from unrestricted expansion.

B. Expansion Generation Techniques

In comparison to the relationship source, the choice of expansion technique is a larger domain of experimental variation. In this section, we will explain the breadth of method possibilities. There is a wide potential variety of query expansion techniques. General guidelines include: 1) a filter should never prune the source node, as we can assume that the user knows what they are searching for, and 2) when a filter assigns a score to a concept, it should assign the highest score to the source concept.

Because the expansions are modelled as filters on the ontology, expansions can be used successively, each acting on the subset ontology output by the previous filter. This is important because some of the expansions are of relatively high computational complexity and/or traverse the entire ontological graph, both of which would be prohibitively expensive without some preparatory pruning.

In total, we describe ten different expansion techniques, which, to aid comprehension, we classify into six categories. The categories are not entirely orthogonal, but they group the methods according to a primary attribute. We name the categories simple pruning, complex probability, voting, directional, semantic propagation, and multiple concept. Simple pruning methods are the simplest methods possible, using no direct semantic information. They are the building blocks upon which we build the other methods. Complex probability methods feature the use of node degree in their distance calculation. This is in comparison to the voting class, which uses link degree. The directional class uses basic semantic information from the ontology to determine direction. The semantic propagation methods selectively reward certain


TABLE IV: Expansion method feature set summary. Distance Measure is the way each technique measures the distance between one node and the next. The Uses Semantics column tells whether a technique uses ontology specific semantic information in its processing.

| Category | Method Name | Distance Measure | Uses Semantics | Summary |
|---|---|---|---|---|
| 1. Simple Pruning | Basic depth | Number of edges | N | |
| 2. Complex Probability | Basic probability | Probability | N | Uses node degree |
| 2. Complex Probability | Subtractive Average probability | Probability | N | Subtracts a proportion of parent node probability at each succeeding distance |
| 2. Complex Probability | Average Probability | Probability | N | Takes into account destination node degree |
| 3. Voting | Simple voting | Probability | N | Counts number of edges only |
| 3. Voting | Semantic voting | Probability | Y | Uses semantic weights calculated from basic probability |
| 4. Directional | Central tendency | Number of edges | Y | Goes one direction (ancestor/descendent) only |
| 5. Semantic Propagation | Pagerank | Pagerank | N | Uses a distance measure to assign initial probabilities |
| 5. Semantic Propagation | Log Multiple | Log Multiple | N | Uses a distance measure to assign initial probabilities |
| 6. Interconnected Source Query Concept | Interconnected Query Concept | Many | Y/N | Uses all query concepts with above methods |

Source

concept methods get value from the additional information available when there is more than one concept in a query. Table 4 summarises the methods, which are then explained in more detail below.

C. Simple Pruning Methods

The most straightforward expansion methods are simple pruning types; here, each edge is said to have equal value, and nodes are eliminated from the output graph when they exceed some distance measure from the source query node. The simplest distance measure is depth, the number of links between concepts. Most previous work in query expansion uses this method; for example, see (Hersh, Price et al. 2000). Depth based query expansion is simple and easy to conceptualise. Its problem is that it treats all concepts, and all relationships, the same. Using a depth method, a connection of length 2 that traverses a concept with many edges (for example, 'drug') has exactly the same semantic distance as a length 2 connection that traverses a concept with only two edges. This is clearly simplistic. Also, because node degree is not evenly dispersed throughout the network, highly connected nodes ensure that large portions of the ontology will be covered even by a filter of low depth; there is insufficient pruning. Even given these defects, there are situations where depth based pruning performs well. Figure 3 shows a depth pruning expansion using RO relationships. This particular example starts with the concept fibromyalgia, which was extracted from the Ohsumed query "fibromyalgia/fibrositis, diagnosis and treatment". As this is a medical query of the form "What is the differential diagnosis of Y?", we speculate that RO relationships are good at answering this query type, with RO relationships being most effective at distance 1. We attack the problems of the depth method by incorporating the idea of node degree, that is, the number of edges leaving a node. We use the idea that a highly connected node transfers less of its semantic weight to each of its neighbours than a more isolated node. Where depth stops at a certain distance from the source concept,

Classical query expansion

Simple examples of these alternatives can be seen in Figure 4. Diagram A shows a linear distance measurement, with the distance being the minimum number of edges between the current node and the source node. Diagram B shows a simple probabilistic measurement, which takes into account outgoing node degree. The probability of a node is equal to the probability of the upstream node (i.e. the node linked to this node that is closer to the source node), divided by the number of links leaving the upstream node. We call the first probability method simple probability. It is defined as the probability that one would stumble on this node by accident if one simply chose an outgoing edge at random from the upstream node, given the number of outgoing edges from that node. For example, if there are 4 edges impinging on the source node, the probability for each of these edges is 1/4. This probability can then be used as a distance measure in the pruning above. In general terms, $p(n_i) = \frac{p(n_{i-1})}{|n_{i-1}|}$, where $p(n)$ is the probability of node $n$, $|n|$ is the number of edges leaving node $n$, and $n_i$ is a node at depth $i$. Figure 5 shows another example. Here, note the three nodes connected in a line, A, C, and F. A is the source node, C is at depth 1, and F is at depth 2. The semantic weights are calculated by $p(A) = 1$, $p(C) = 1/|A|$, and $p(F) = p(C)/|C|$. In Figure 5, note that node H has the same probability as node F. This shows a weakness of the simple probability method; it does not subtract any probability at each traversal. A solitary neighbour $n_{i+1}$ of node $n_i$ has the same probability as $n_i$, even though it is one edge more distant. A simple example of where probability based methods provide better expansions than depth based methods is the case of vancomycin, an antibiotic drug. 
Because UMLS places vancomycin adjacent to antibiotic, and the latter is connected to 1600 other concepts, a 2 level expansion from vancomycin has a fan-out of a little less than 30,000 nodes. A probability based expansion would prune antibiotic, leaving a more focused set. Figure 6 shows another example, derived from the Ohsumed test query ”chronic inflammatory demyeli-


| Key | Expansion Concept |
|---|---|
| A | Fibromyalgia |
| B | Arthritis |
| C | Chronic Fatigue Syndrome |
| D | Chronic pain |
| E | Fatigue |
| F | Myalgia |
| G | Myopathy |
| H | Rheumatic Diseases |
| I | Soft tissue rheumatism |
| J | Temporomandibular Joint Dysfunction Syndrome |
| K | Back Pain |
| L | Fibromyositis, NOS |
| M | Fibrosis |
| N | Leukopenia |
| O | Myofasciitis |
| P | Myofibrositis |
| Q | Musculoskeletal problems NOS |
| R | Myalgia or myositis NOS |
| S | Normocytic anemia |
| T | Parascapular Fibromyositis |

Fig. 3. First 20 concepts generated by a depth pruning method on RO relationships only, starting with the UMLS concept fibromyalgia. Circles indicate concepts which exist in the relevant Ohsumed documents for the query.
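As a rough illustration of the depth pruning behind Figure 3, the following minimal sketch performs a breadth-first walk and keeps every concept within a fixed number of edges of the source. The function name `depth_prune`, the adjacency-dictionary representation, and the toy links are assumptions made for this example; they are not taken from the framework or from actual UMLS RO relationships.

```python
from collections import deque

def depth_prune(graph, source, max_depth):
    """Breadth-first walk keeping every concept within max_depth edges of source."""
    depths = {source: 0}          # the source node is never pruned
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue              # do not expand past the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    return depths

# Hypothetical toy links, not actual UMLS RO relationships.
toy_graph = {
    "fibromyalgia": ["myalgia", "chronic pain"],
    "myalgia": ["myopathy"],
    "chronic pain": ["back pain"],
}

depth_one = depth_prune(toy_graph, "fibromyalgia", 1)
depth_two = depth_prune(toy_graph, "fibromyalgia", 2)
```

Every edge counts the same here, which is exactly the limitation the probability methods are meant to address.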

Fig. 4. Diagrams contrasting basic query expansion distance measurements. Diagram A shows a linear distance measurement, diagram B shows simple probabilistic distance measurement.

dashed lines show paths that were excluded due to criteria specific to this expansion method. This diagram demonstrates

how a probability based expansion is able to generate deeper expansions than depth limited methods.
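The contrast with depth limited pruning can be sketched as follows. This is a minimal illustration of simple probability pruning, assuming a directed adjacency dictionary, a hypothetical threshold parameter `min_prob`, and the additional assumption that a node reachable by several paths keeps its highest probability. The toy links (including the `glycopeptide` and `teicoplanin` nodes) are invented for the example; only the fan-out of 1600 echoes the vancomycin discussion in the text.

```python
import heapq

def probability_prune(graph, source, min_prob):
    """Keep nodes whose simple probability p(parent) / |parent| is >= min_prob."""
    best = {source: 1.0}                  # the source always has the top score
    heap = [(-1.0, source)]               # max-heap via negated probabilities
    while heap:
        neg_p, node = heapq.heappop(heap)
        p = -neg_p
        if p < best.get(node, 0.0):
            continue                      # stale heap entry
        out_edges = graph.get(node, ())
        if not out_edges:
            continue
        child_p = p / len(out_edges)      # each outgoing edge gets an equal share
        if child_p < min_prob:
            continue                      # fan-out too large: prune below this node
        for child in out_edges:
            if child_p > best.get(child, 0.0):
                best[child] = child_p
                heapq.heappush(heap, (-child_p, child))
    return best

# Hypothetical toy graph: one hub with 1600 neighbours, one narrow chain.
toy_graph = {
    "vancomycin": ["antibiotic", "glycopeptide"],
    "antibiotic": [f"drug_{i}" for i in range(1600)],
    "glycopeptide": ["teicoplanin"],
}

expanded = probability_prune(toy_graph, "vancomycin", min_prob=0.01)
```

In this sketch the hub itself survives with probability 0.5, but none of its 1600 neighbours do, while the narrow chain is followed to depth 2.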


# A B C D E F G H I J K L M N O P Q R S T

Concept Name Chronic inflammatory polyneuropathy inflammatory polyneuropathy, unspecified polyneuropathy polyneuropathy in disease nos peripheral nervous system diseases nervous system diseases System disorders of the nervous system peripheral neuropathy neuromuscular diseases myopathy neurologic Disease diseases (mesh category) physical disorders disorder by body system or organ function health and disease nervous system disorders, general and nec epilepsy, absence nervous system illness, nos

Fig. 6. Probability based expansion using broader type relations starting at chronic inflammatory polyneuropathy. Circled expansions exist in the relevant document set. Dotted links represent expansion paths excluded due to the method's criteria.

D. Complex Probability Methods A feature of probability methods is that they take into account the outgoing link degree when deciding which path to traverse next. An upgrade on this idea is to incorporate the outgoing link degree of the next node into the calculation. This idea supposes that target nodes of high degree will be less valuable than those of low degree, because they are too easy to get to. We call this method average probability,

average probability between the source node and target node. The formula is $p(n_0) = \frac{1}{|n_1|}$, where $n_0$ is the source node, and $p(n_i) = \frac{p(n_{i-1})}{|n_i|}$ for all $i > 0$. A basic flaw in the simple probability method, mentioned above, is that no probability is subtracted for traversing a node that does not have multiple outgoing links. This can be fixed through a simple modification. We call this modification subtractive probability; it is laid out so that every link consumes semantic weight. The


TABLE V: Concepts generated starting at Hypoaldosteronism using the Subtract Average Probability expansion method. Relationship types are defined in Table 3.

Fig. 5. Simple probability example. Node A is the source node, with nodes B..E at distance 1, F and G at distance 2, and H at distance 3. Note that H has same probability as F, which is a deficiency of the basic probability method.
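The deficiency noted in this caption, and the effect of a damping factor on it, can be checked with a small sketch. The tree below is only an assumed layout consistent with the caption (A with children B to E, C with children F and G, and F with the single child H); the real Figure 5 may differ. The function name `propagate` and the damping value 0.2 are likewise illustrative assumptions.

```python
def propagate(tree, source, damping=0.0):
    """Assign each child an equal share of its parent's weight, minus a damped part.

    With damping=0.0 this is the simple probability method; a positive damping
    value subtracts that proportion of the share at every step.
    Assumes a tree: every node has exactly one parent.
    """
    weights = {source: 1.0}
    stack = [source]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        for child in children:
            share = weights[node] / len(children)
            weights[child] = share - damping * share
            stack.append(child)
    return weights

# Assumed layout, loosely following the Figure 5 caption.
fig5_like = {
    "A": ["B", "C", "D", "E"],
    "C": ["F", "G"],
    "F": ["H"],
}

simple = propagate(fig5_like, "A")
damped = propagate(fig5_like, "A", damping=0.2)
```

Under simple probability H inherits the full weight of F, whereas with damping H scores strictly lower than F, although siblings still receive identical scores.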

formula we use is: $p(n_i) = \frac{p(n_{i-1})}{|n_{i-1}|} - c \cdot \frac{p(n_{i-1})}{|n_{i-1}|}$, where $n_{i-1}$ and $n_i$ are nodes, $p(n)$ is the semantic weight of node $n$, and $|n|$ is the number of edges leaving node $n$. The damping factor, $c$, ranges between 0 and 1, and denotes the proportion of the parent node value that is lost at every step away from the source node. In and of itself, this modification is not different from the simple probability method. The reason is that it, like the simple probability method, is fair, in that all children of a node have the same probability. This means that a basic subtraction method generates expansions in exactly the same order as the simple method, because they both treat all outgoing edges from a node the same. With subtract, the final probability is lower, due to the incorporation of a damping factor, but the order in which the expansions are generated is the same. Subtract probability only shows a difference when incorporated into methods that generate different scores for each edge leaving a node, where the damping subtraction can make a difference in output order. Therefore, we examine the subtract modification via incorporation into the average probability method. Table 5 shows the first 20 expansions of a Subtract Average Probability expansion starting with the concept Hypoaldosteronism. The graph for this expansion is similar to Figure 3, because this expansion also only chooses concepts adjacent to the source concept. This is because, in the absence of other expansion restrictions, there are numerous adjacent concepts, and so expansion does not progress past the first level. This method differs from standard depth limited expansion because it ranks the choices. Because of the simplicity of the probability methods (they are O(n), where n = number of output nodes), they are ideal first candidates in multistage expansions. They can reduce the size of the search space, and make the task of the more sophisticated methods manageable.

E. Voting Methods

It is the nature of relevance that there is no absolute right and wrong. What is a relevant document for one person's query might be an irrelevant document for the same query of another person. It is so too with ontologies; there are many possible correct relationships between concepts, depending on information need. For example, the relationship between the concepts Hysteria (UMLS# C0020701) and Dissociative Disorders (UMLS# C0012746) is classified as no fewer than six separate types: child, parent, relation broader, relation narrower, relation other, and sibling. While this implies that

| Connected Concept | Relationship | Is Relevant? |
|---|---|---|
| Acquired Immunodeficiency Syndrome | RO | Y |
| Addison's Disease | SIB | Y |
| Adrenal Gland Neoplasms | SIB | Y |
| Adrenal hypertrophy or hyperplasia | SIB | Y |
| Aldosterone | RO | Y |
| Cushing Syndrome | SIB | Y |
| Hyperaldosteronism | SIB | Y |
| Hyperreninemic hypoaldosteronism | CHD | Y |
| Acidosis, Renal Tubular | RO | N |
| Adrenal Gland Diseases | PAR;RB | N |
| Adrenal Gland Hyperfunction | SIB | N |
| Adrenal gland hypofunction | PAR | N |
| Adrenal insufficiency due to adrenal metastasis | SIB | N |
| Adrenoleukodystrophy | SIB | N |
| Congenital hypoplasia of adrenal gland | SIB | N |
| Iatrogenic adrenal insufficiency | SIB | N |
| Post-adrenalectomy adrenal insufficiency | SIB | N |
| Type I Renal Tubular Acidosis | RO | N |
| Virilism | SIB | N |

between the concepts, the exact nature of that relationship is uncertain. While this variety can be a source of frustration, it can also serve as a source of usable meaning. A set of variously labelled links between two concepts simply has a different meaning than a single link, or even a set of links with the same label. We propose to use this variety for query expansion. The idea is to conceptualise each link as a vote for a semantic relationship between terms. This extends the redundancy exploration work done by (Bodenreider 2003). Using this idea, we will evaluate several alternative experiments which use link degree within the basic probability query expansion framework. We call the first such method simple voting. It treats each link as having equal value. If there are 4 links between A and B, and 1 link between A and C, and p(A)=1, then p(B)=4/5 and p(C)=1/5. In simple voting, and for that matter in all the methods considered until now, we have been democratic; all relationships are created equal. With some relationship sources, such as co-occurrence data, this is all that is possible; we have little other relationship information. But most ontological sources are richer, containing much ancillary relationship data. Even within the limited UMLS metathesaurus framework, there are 9 major categories of relationship. In the second voting method, semantic voting, we use this wealth of relationship information to weight our votes according to how much semantic information they transfer. The problem with this is that, while we can surmise that the different relationships correspond to differing semantic distances, we still need to quantify this distance. We bootstrap this method using values calculated with a previous methodology, evaluating each relationship type on its own, and seeing how


TABLE VII

TABLE VI: Precision of each of the major UMLS relationship types, as a proportion of the total precision, over the first 30 concepts generated. For example, AQ is more than six times more likely to find a relevant expansion than RN. Relationship types are defined in Table 3.

| Relationship Type | Precision |
|---|---|
| AQ | 22.20% |
| CHD | 7.30% |
| PAR | 10.50% |
| QB | 3.50% |
| RB | 15.00% |
| RL | 12.20% |
| RN | 6.50% |
| RO | 13.20% |
| SIB | 9.50% |

Fig. 7. Example of weighted voting calculation. Given simplified weights of relationships as shown (RB=0.4, PAR=0.2, SIB=0.2), then P(B) = P(A) * W(B)/W(total).
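Both voting variants follow the same calculation, P(B) = P(A) * W(B)/W(total), and differ only in the weight given to each link. The sketch below is a minimal illustration under that reading. The function name `vote`, the `(neighbour, relationship_type)` link representation, and the specific link assignments are assumptions for the example; only the simplified weights (RB=0.4, PAR=0.2, SIB=0.2) come from the Figure 7 caption.

```python
def vote(links, weights=None, p_source=1.0):
    """Split the source probability among neighbours in proportion to their votes.

    links   : list of (neighbour, relationship_type) pairs leaving the source
    weights : optional {relationship_type: weight}; None means one vote per link
    """
    totals = {}
    for neighbour, rel in links:
        w = 1.0 if weights is None else weights[rel]
        totals[neighbour] = totals.get(neighbour, 0.0) + w
    grand_total = sum(totals.values())
    return {n: p_source * t / grand_total for n, t in totals.items()}

# Simple voting: 4 links from A to B and 1 link from A to C (types are hypothetical).
simple_links = [("B", "RO"), ("B", "SIB"), ("B", "PAR"), ("B", "RB"), ("C", "SIB")]
simple = vote(simple_links)

# Semantic voting with the simplified weights quoted in the Figure 7 caption.
fig7_weights = {"RB": 0.4, "PAR": 0.2, "SIB": 0.2}
semantic_links = [("B", "RB"), ("B", "PAR"), ("C", "SIB")]   # hypothetical links
semantic = vote(semantic_links, fig7_weights)
```

With simple voting B receives 4/5 and C receives 1/5, matching the example in the text; with the weighted variant B receives 0.6/0.8 of the source probability.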

first generate expansions, drawing exclusively from each of the 9 UMLS relationship types. The generation method uses the simple probability method with a minimum probability of 0.001, limited to the first 30 expansions. The expansions are then evaluated as to how likely they are to find a concept that exists in a relevant document, and we use the results to calculate semantic weighting scores for each of the 9 relationship types; each score is the normalised average precision that the relationship has when performing expansions on its own. The precision ratios generated can be seen in Table 6. Semantic voting makes use of this calculated semantic weighting, making the vote of a link equal to a semantic weight determined by the relationship type. It is this formulation that allows us to use UMLS relationship type information. This idea is shown in Figure 7. Here, semantic voting uses simple probability as its base model, but it could also incorporate ideas from both the subtractive and average probability models. Table 7 shows the expansions generated by the voting method based on the concept angiotensin converting enzyme inhibitor (ACEI) from the Ohsumed query looking for an ACEI review article. This query falls under query type 8, "How can X be used in the prevention of Y?" Again, because of the lack of other restrictions on the expansion, the concepts are all adjacent to the source concept.

F. Directional Methods

A directional method is another method that uses ontology

TABLE VII: First 20 concepts generated by a weighted voting method starting at the UMLS concept Angiotensin-Converting Enzyme Inhibitors. Relationship is the type of relationship between ACEI and the connected concept, and Is Relevant? denotes whether this concept exists in the relevant document set. Relationship types are defined in Table 3.

| Connected Concept | Relationship Type | Is Relevant? |
|---|---|---|
| Adrenergic beta-Antagonists | SIB | Y |
| Antihypertensive Agents | PAR, RB, RO | Y |
| Captopril | CHD, RN, RO | Y |
| Cilazapril | CHD, RN | Y |
| Enalapril | CHD, RN, RO | Y |
| Enalaprilat | RN, RO | Y |
| Lisinopril | CHD, RN, SIB | Y |
| Perindopril | CHD, RN | Y |
| Renin-Angiotensin System | RO | Y |
| Saralasin | RN, SIB | Y |
| Benazepril | CHD, RN | Y |
| Quinapril | CHD, RN | Y |
| Anti-Arrhythmia Agents | PAR, SIB | N |
| Fosinopril | CHD, RN | N |
| Protease Inhibitors | PAR, RB | N |
| Ramipril | CHD, RN | N |
| Tissue Inhibitor of Metalloproteinases | SIB | N |
| Moexipril | CHD, RN | N |
| Trandolapril | CHD, RN | N |

of outside knowledge which imposes a structure on the relationship types. It says that there are certain relationships that point upwards (PAR, RB), and others that point downwards (CHD, RN). These directions are then used by an expansion scheme called central tendency. The idea behind central tendency is that the depth of a concept in the ontology will influence its successful expansion strategy; specifically, we say that concepts will more likely expand towards the centre of the ontology. The method we use is: 1) where h is the depth function, if h(n) > average(h(n)), expand upwards, and 2) if h(n)

weights. The former is in fact a conceptualisation equivalent to the basic probability depth measure, explained above. The first semantic propagation method we examine is the pagerank method, due to its proven worth and unique usefulness. But there are several problems with this choice. Pagerank does not have the concept of a source node; the rank of the pages is determined by iterating through all the pages until the entire system arrives at an equilibrium. In our case, we have one source of semantic value in the entire graph, but the pagerank method generates rank spontaneously. There would be a high probability that our source node would not end up as the highest ranking node in the graph. Pagerank is also inappropriate in our case because, in pagerank, hypertext links, in a sense, vote for the page that they point to. In an ontology, the semantic meaning of a link flows both ways. Even though pagerank is not a suitable vehicle for calculating semantic neighbourhood, it is a prototype for a class of methods that have such functionality. One exemplar we call log multiple. The log multiple method derives from the observed fact that, often within the UMLS metathesaurus, where there are multiple paths between two nodes, these nodes are more highly related than the mere sum of their probabilities would imply. An example of this motivation comes from Ohsumed, in the relationship between the concepts Adrenergic beta-Antagonists and Propanolamine, two classes of organic chemical. The former is a concept from query 19, while the latter is a concept from the documents said to be relevant to that query. While UMLS does not note any direct relationship between these two concepts, there are 160 distance two relationships between them. We surmise there is a small semantic distance between these 2 concepts, even though there is no distance=1 link. We want a method that will reward a concept when there are multiple paths between it and a source concept. 
We do this by calculating the probability of arriving at a node $n$ to be $\sum P_{incoming} \cdot (1 + \log(\max(in - out, 0)) \cdot w)$, where


TABLE VIII: First 20 concepts generated by a log multiple method starting at the UMLS concept Malignant Neoplasms. Relationship is the type of relationship between the source concept and the connected concept, and Is Relevant? denotes whether this concept exists in the document set relevant to the Ohsumed query. Relationship types are defined in Table 3.

| Connected Concept | Relationship | Is Relevant? |
|---|---|---|
| Adenocarcinoma | RO;SIB | Y |
| Carcinoma | CHD;RN;RO;SIB | Y |
| Lymphoma | SIB | Y |
| Lymphoma, Non-Hodgkin's | SIB | Y |
| Etiology | RB | Y |
| Carcinoma, Renal Cell | RO;SIB | N |
| Hydatidiform Mole | SIB | N |
| Infection | SIB | N |
| Laryngeal Cancer | RN;RO;SIB | N |
| Leukemia | CHD;SIB | N |
| Leukemia, Erythroblastic, Acute | SIB | N |
| Leukemia, Hairy Cell | RO;SIB | N |
| Leukemia, Lymphocytic | SIB | N |
| Liver neoplasms | CHD;SIB | N |
| Malignant neoplasm of endometrium | RN;RO;SIB | N |
| Melanoma | RO;SIB | N |
| Myeloid Leukemia | SIB | N |
| Skin Cancer | CHD;RN;RO;SIB | N |
| Thyroid Cancer | RO;SIB | N |

method, that of initial sample set generation method type, and set size. Table 8 shows the first 20 concepts generated by the log multiple method, starting from the concept malignant neoplasm, from an evaluation type medical query. The actual query is "sigmoidoscopy in preventive care, whether the recommended frequency of sigmoidoscopy is effective and sensitive in detecting cancer." The graph is similar to that shown in Figure 4.
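The log multiple reward described in this section can be sketched for a single node as below. This is only one reading of the formula: the natural logarithm is assumed, the reward is assumed to be zero whenever in - out is not positive (so that such a node merely receives the sum of its incoming probabilities), and the name `log_multiple_score` and the example values are hypothetical.

```python
import math

def log_multiple_score(incoming_probs, out_degree, w=1.0):
    """Sum the incoming probabilities, then reward a surplus of incoming links.

    incoming_probs : probabilities arriving over each incoming link
    out_degree     : number of outgoing links of the node
    w              : weighting factor controlling the size of the reward
    """
    in_degree = len(incoming_probs)
    surplus = max(in_degree - out_degree, 0)
    # Assumption: no reward when the surplus is zero (log(0) is undefined).
    reward = math.log(surplus) * w if surplus > 0 else 0.0
    return sum(incoming_probs) * (1.0 + reward)

# Three incoming links and no outgoing links: full benefit of the reward.
well_connected = log_multiple_score([0.1, 0.1, 0.1], out_degree=0)

# Two incoming links but three outgoing links: just the sum of the incoming values.
promiscuous = log_multiple_score([0.1, 0.1], out_degree=3)
```

Note that under this reading a surplus of exactly one link also earns no reward, since log(1) = 0.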

1

$P_{incoming}$ is the probability of each of the incoming links, in and out are the numbers of incoming and outgoing links respectively, and w is a weighting factor that sets the reward given to multiple links. We use log(n) to reduce the reward for larger numbers of links, while the max(in − out, 0) ensures that rewards flow only to relatively well source-connected nodes, and not to merely promiscuous ones. In our method, we do not repeatedly traverse all the edges of the graph a la pagerank; instead, we only propagate the semantic meaning outwards from the source node. More formally, p(N1) directly affects p(N2) iff there is a link between N1 and N2, and d(N1) <= d(N2), where d is the number of links distant from the source node. Note that if the connected nodes are at the same distance, they affect each other. The problem with log multiple is that it is impracticable to traverse the entire graph for each expansion to determine the scores relative to each source node. For this reason, we calculate the log multiple weights on a subset of the entire graph, chosen via one of the simple subset methods above. After the log multiple scores are calculated, we perform our final selection of expansion candidates according to those scores. This gives rise to a second parameter for the log multiple

H. Interconnected Source Query Concept Methods (ISQC)

Many queries are composed of multiple concepts, representing various semantic neighbourhoods, for example "Heart disease and digoxin" or "Stroke and prophylaxis". The previously explained schemes focus on expanding each individual concept from a query, and then evaluating a union of the multiple result sets against the relevant document concept set. While this is how many query expansion strategies work, it ignores the potential synergy that comes from expanding concepts in tandem. We theorise that expansions will be more successful when they are expanded in the semantic direction of the other concepts in a query. The concepts that are common neighbours, albeit possibly not proximal neighbours, of our query concepts are likely to be related to the query. This applies too to the concepts that are in the neighbourhood of the common neighbours. The former is shown in Figure 8, and the latter in Figure 9. An expansion that uses multiple concept information from a single query is as follows. First, find the minimum distance ontological path between the source concepts. This is done by expanding each concept outwards


Fig. 8. Multiple query concepts method assumes that the concepts labelled A will be a better expansion for the query than B.
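One way to find the common neighbours that Figure 8 labels A is sketched below: expand each query concept outwards to a maximum distance, intersect the reached sets, and rank the shared concepts by their total distance to the query concepts. The function names, the ranking rule, and the toy links are assumptions made for illustration; the links are not taken from UMLS.

```python
from collections import deque

def reachable(graph, source, max_dist):
    """Breadth-first distances from source, limited to max_dist edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def common_neighbours(graph, query_concepts, max_dist):
    """Concepts within max_dist of every query concept, closest first."""
    reach = [reachable(graph, q, max_dist) for q in query_concepts]
    shared = set(reach[0]).intersection(*reach[1:]) - set(query_concepts)
    return sorted(shared, key=lambda c: (sum(r[c] for r in reach), c))

def undirected(edges):
    """Build a symmetric adjacency dictionary from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph

# Hypothetical links between concepts named in the anorexia/bulimia example.
toy_graph = undirected([
    ("anorexia", "appetite disorders"),
    ("bulimia", "appetite disorders"),
    ("anorexia", "appetite depressants"),
    ("bulimia", "gastrointestinal diseases"),
])

near = common_neighbours(toy_graph, ["anorexia", "bulimia"], max_dist=1)
wide = common_neighbours(toy_graph, ["anorexia", "bulimia"], max_dist=3)
```

Raising the distance threshold admits concepts attached to only one query concept, but the shared neighbour still ranks first because its summed distance is smallest.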


Fig. 10. A second multiple query concepts method. Here, query concepts CE (and A, by extension) are theorised to be better expansion candidates than UE.

VI. CASE STUDY

Fig. 9. Building on Figure 8, we theorise that concepts labelled A’ will be better expansions for the query than B’.

reached a maximum distance threshold. Then, we expand in the neighbourhood of that path. The distance threshold for the secondary expansion will range upwards from zero. An alternate variety of multiple query concept expansion uses the idea that query concepts which are interconnected are more likely to generate fruitful expansion concepts than concepts that are not so connected. The rationale for this is that the connected concepts are likely to be closer to the semantic centre of the query than the unconnected concepts. In the example in Figure 10, we theorise that the set of concepts (CE) that are expansions emanating from the set of interconnected query concepts (C) (interconnected via A) are more likely to be good expansions than the expansion set (UE) connected to the set of concepts (U) that are not interconnected with the other query concepts. The variables in this sort of expansion are the maximum distance between the interconnected concepts, and the maximum expansion distance from the set of interconnected concepts. A problem with ISQC type methods is that we do not know how much each individual concept contributes to the overall success, and therefore cannot directly compare them to the other methods. A second problem is that, to use ISQC, a query must have more than 1 concept, and there must be a discoverable connection between these concepts. Figure 11 shows a simple example of ISQC based on the Ohsumed test query, "complications and management of anorexia and bulimia". Using the first 5 expansions generated from a probability based method, we see the concepts connected to both anorexia (A) and bulimia (F), are more likely relevant than those connected

Section 5 described the variables we could manipulate to generate different types of query expansion, and also showed specific Ohsumed test document instances where each of these techniques would be useful. In this section, we outline a scenario where we choose between various query expansion techniques, given a single input query. In this example, we examine a query consisting of a single concept, prostate cancer. This query is ambiguous, and could reflect a variety of information needs. Firstly, we consider the case where the user searches for basic information, fitting into the medical query type "What is the definition of X?". This category of query implies that the user does not have extensive knowledge about the subject. Because learning happens best in a bootstrap fashion, when the user is given information close to what they already know, we want to generate query expansions in the area of the user's current knowledge. This leads us to the idea of manipulating relationship source as a way to fine tune the query expansion process. There are three gross classes of ontological relationship: more specialised 'child' links, less specialised 'parent' links, and other 'sibling' links. Because our definitional query desires information that is somewhat connected to what the user already knows, this rules out the more specialised child type links and the off-topic sibling links. These classes of information are useful expansions after the basic knowledge has been mastered. This means that the appropriate expansions for a definitional query would be broader type links. In the case of UMLS, this translates into an exclusive selection of Parent (PAR) and Relation Broader (RB) relationships. Additionally, in terms of expansion method, a definitional query such as this demands close relationships. Due to this, we use the semantic voting expansion method, preferentially weighting broader type relationships. This expansion proceeds as follows. First, we find all PAR and/or RB links which connect to prostate cancer. For each of the connected concepts, we note the number and nature of the links connecting this concept to prostate cancer, summing the total weighted link score for each. From the semantic weighting scale in Table 6, we see that the weight for PAR links is 10.5 points each, and RB links are about 43% higher at 15 points each. In the case of our example, prostate cancer has thirteen broader type neighbours, with three of those neighbours connected by multiple PAR type links. The semantic voting


| # | Concept |
|---|---|
| A | Anorexia |
| B | Anorexia Nervosa |
| C | Appetite Disorders |
| D | Appetite Depressants |
| E | Mental disorders |
| F | Bulimia |
| G | Gastrointestinal Diseases |
| H | Complication |
| I | Tracheostomy complication, NOS |
| J | Open wound of auditory canal with complication |
| K | Open wound of ear drum with complication |
| L | Open wound of ossicles with complication |

Fig. 11. Interconnected Source Query Concept example. Nodes A, F and H are concepts from the Ohsumed test query "complications and management of anorexia and bulimia"; circled nodes exist in the documents relevant to that query.

their connectedness. This fits well with the definitional aim of our query, as the three multiply connected neighbours have high semantic generality in relation to the source concept. The multiply connected concepts are (with link degree in brackets): malignant tumor of male genital organ (3), diagnosis/diseases component (3), and male genital system including breast (2). This means that these concepts would be ranked highest by such an expansion. The entire list of concepts generated by this expansion can be seen in Table 9. In contrast to the definitional query, if the user was seeking information fitting the query type ’What is the differential diagnosis of prostate cancer?’, broader type concepts would not be appropriate. Here, we recognise that the user already has base knowledge, and is now seeking more detail. In this case, we generate concepts connected by the specialised child type relationships, using the log-multiple method. Figure 12 shows the resulting expansion tree. This log-multiple expansion proceeds as follows. All the nodes directly connected to prostate cancer are given an equal proportion of that node’s value. As we deem the source node to have a value of 1, each connected node gets a value of 0.14 from this link. For the nodes that have only this connection, this is all the weighting that they accrue.

TABLE IX: Concepts generated by a semantic voting expansion method using PAR and RB relationships only, starting from the concept Prostate Cancer. This is useful for a query seeking basic definitional information.

| Parents of Prostate Cancer |
|---|
| adult solid tumor |
| clinical classification of neoplasms of the genitourinary system |
| Diagnosis/diseases component |
| Genitourinary |
| male genital system including breast |
| male genital system, nos |
| male reproductive cancer |
| malignant neoplasm of genitourinary organ nos |
| malignant tumor of male genital organ |
| prostate cancer |
| prostatic diseases |
| prostatic neoplasms |
| urologic cancer |

which are pointed to by both the source node and by nodes downstream of the source node: Carcinoma in situ of prostate, and adenocarcinoma of the prostate. In the case of the former,


so, it gains no advantage from the log multiple reweighting factor, and merely receives the sum of weights of it’s two parents. Adenocarcinoma, on the other hand, with its three incoming links and no ongoing links, receives full benefit from the log multiple reweighting. As a result, the log-multiple algorithm appropriately rewards the multiply connected concepts: adenocarcinoma and carcinoma in situ have the highest scores (apart from the source concept). Next highest weights belong to the are the remaining source concept direct neighbours. Interestingly, the next highest weights belong to Prostatic Intraepithelial Neoplasia and Carcinoma in situ of prostatic ducts. This is because of their multiply connected parent. This weighting assignment supports our query purpose, being more related to diagnosis than the other distance two nodes. VII. D ISCUSSION In this paper, we offer three sources of expansion relationship information. While this offers broad coverage of UMLS based ontological expansion, it would also be interesting to compare these results with those from corpus based methods. While it has been reported that there is little overlap between successful corpus and thesaurus based expansion candidates, this would be interesting to quantify, shedding light on the strengths of each relationship source. This type of evaluation would well suit our evaluation framework. Because a large part of our work describes expansion function, we now discuss implementation details. A problem common to all expansion methods relates to the magnitude of ontological interconnection. Highly interconnected ontologies present difficulty for any work in this space, and the UMLS is no exception; for example, the maximum number of UMLS MRREL distinct relationships attached to a single concept is 16088. 
For the semantic propagation techniques, the magnitude of interconnection causes a run time problem, which we solve by working with a subset of the ontology, derived via one of the other methods. The other methods are O(N) and, because they start at the semantic location of the source concepts, have only the problem of generating too many concepts. We address this by setting a maximum expansion set size.

This work is a first step in the creation of a system implementing finely tuned query expansions. Future work will describe the implementation and evaluation of these methods. This framework is designed to fit into a self-adjusting system, in that we will implement and test various QE methods, and then use the results to refine and redesign the methods. While this first set is a broad extension based on the current literature, it is merely a beginning; there is more to be realised from a detailed analysis of the QE field.

Ontological relationship exploration is a huge area, with many controlling variables. Each of the current methods can be extended and enhanced in response to the results arising from evaluation. For example, ISQC is a method with high potential because it uses semantic information from more than one source concept. From this there are many possible variations; for example, we could expand by choosing concepts that minimise the distance to all of the source concepts. Even though this type of refinement is possible for each of the other


not explore any method in depth, because first we need to look at the initial evaluation results to prove a method’s potential.

A further exploration area is ontology semantics. While our work uses some semantics, ontologies support a much broader range. Full categorisation of the range of relationships will need a deeper look into these semantic features. One reason for this is our use of UMLS, which does not provide broad relationship semantics. Even so, UMLS is the premier resource in the medical field, rich in metadata, but the metadata is not so precise that it precludes grouping of results. Notwithstanding this, it would be interesting to examine other ontologies, and compare their query expansion performance at such a fine grained level. The problem with using an alternative ontology is that it demands recategorisation of both the documents and queries. Comparison would be problematic as well, due to the different relationship types.

VIII. CONCLUSION

In this paper, we describe a framework for generating and evaluating candidate query expansion methods, which will be used for corpus exploration resulting from imprecise queries. The core work of this paper sets out five classes of expansion methods. We also describe the system into which these methods fit, consisting of 1) query characteristics which potentially influence expansion success, 2) medical relationship sources and 3) success measures. This fits into an innovative concept centric evaluation structure which dramatically increases system usefulness. The framework has been implemented in the Java 1.4 language using an XML based expansion method specification, with the data stored in an Oracle 9i database. Implementation details, evaluation, and statistical analysis of the results follow in later work.
Query expansion success is related to the concept of topicality, the range of ways that the human mind can find meaning. Our work provides a framework for exploring the general characteristics of this making of meaning, by data mining a document test collection using ontological relationships as our prospecting tool. This range of meaning is large, and because of this, it is useful to systematically identify and describe the factors which can influence it, prior to measuring their influence.

REFERENCES

[1] Aronson, A. R. and T. C. Rindflesch (1997). Query expansion using the UMLS(R) Metathesaurus(R). 1997 AMIA Annual Fall Symposium. A Conference of the American Medical Informatics Association, Philadelphia, PA, USA, Hanley & Belfus.
[2] Attar, R. and A. S. Fraenkel (1977). Local Feedback in Full-Text Retrieval Systems. Journal of the ACM 24(3): 397-417.
[3] Berrios, D. C., A. Kehler, et al. (1998). Automated Text Markup for Information Retrieval from an Electronic Textbook of Infectious Disease. AMIA 98 Annual Symposium.
[4] Bodenreider, O. (2003). Strength in Numbers: Exploring Redundancy in Hierarchical Relations Across Biomedical Terminologies. Proc. AMIA Symposium.
[5] Brin, S. and L. Page (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1-7): 107-117.
[6] Crouch, C. J. and B. Yang (1992). Experiments in automatic statistical thesaurus construction. Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval,


Fig. 12. Log multiple expansion for child concepts of the term Prostate cancer. The numbers on the links are the weights transferred by this link, while the numbers in brackets are the weights associated with this node.

[7] Green, R. (2001). Relationships in the Organization of Knowledge: An Overview. In C. A. Bean and R. Green (eds.), Relationships in the Organization of Knowledge. New York, Kluwer Academic Publishers. 2: 3-18.
[8] Harman, D. K. (1992). Ranking Algorithms. In W. B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures and Algorithms. Pearson Education: 363-392.
[9] Harman, D. K. (1992). Relevance feedback and other query modification techniques. In W. B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures and Algorithms. Pearson Education: 241-263.
[10] Hersh, W., S. Price, et al. (2000). Assessing thesaurus-based query expansion using the UMLS metathesaurus. Journal of the American Medical Informatics Association Suppl. S 2000: 344-348.
[11] Hersh, W. R., D. H. Hickam, et al. (1994). A performance and failure analysis of SAPHIRE with a MEDLINE test collection. Journal of the American Medical Informatics Association 1(1): 51-60.
[12] Jones, S., M. Gatford, et al. (1995). Interactive thesaurus navigation: intelligence rules ok? Journal of the American Society for Information Science 46(1): 53-59.
[13] Lenat, D. B. (1995). Cyc: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38(11): 33-38.
[14] McCray, A. T., A. R. Aronson, et al. (1993). UMLS Knowledge for Biomedical Language Processing. Bulletin of the Medical Library Association 81(2): 184-194.
[15] Nadkarni, P., R. Chen, et al. (2001). UMLS Concept Indexing for

Informatics Association 8: 80-91.
[16] Qiu, Y. and H.-P. Frei (1993). Concept based query expansion. Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press.
[17] Salton, G. and C. Buckley (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science 41(4): 288-97.
[18] Sherman, C. (2002). Google’s Gaggle of New Goodies, Jupitermedia Corporation. 2004.
[19] Srinivasan, P. (1996). Query expansion and MEDLINE. Information Processing & Management 32(4): 431-43.
[20] Wordnet (2004). Web Wordnet 2.0. 2004.
[21] Xu, J. and W. B. Croft (1996). Query expansion using local and global document analysis. Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, Zurich, Switzerland, ACM Press.

Dennis Wollersheim Dennis Wollersheim is an associate lecturer at La Trobe University in Melbourne, Australia, and is completing a PhD in medical informatics. Prior to that, he worked as a computer systems manager for


Wenny J Rahayu Wenny Rahayu is currently an associate professor at the Department of Computer Science and Computer Engineering, La Trobe University, Australia. Her major research is in the area of databases, semantic web and ontology. She has published more than 70 papers in international journals and conferences. She has edited three books which form a series in web applications, including web databases, web information systems and web semantics. She has recently published a book in the area of Object-Oriented Oracle.
