Using Papers Citations for Selecting the Best Genomic Databases

Daniel Lichtnow1,2, Ronnie Alves1, José Palazzo Moreira de Oliveira1, Ana Levin3, Oscar Pastor3, Ignacio Medina Castello4, Joaquin Dopazo4

1 Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil
2 Centro Politécnico, Universidade Católica de Pelotas, Pelotas, RS, Brazil
{dlichtnow, ralves, palazzo}@inf.ufrgs.br
3 Centro de Investigación en Métodos de Producción de Software (PROS), Universitat Politècnica de València, València, Spain
{alevin, opastor}@pros.upv.es
4 Bioinformatics Dept./Functional Genomics Node, INB, Centro de Investigación Príncipe Felipe (CIPF), València, Spain
{imedina, jdopazo}@cipf.es

Abstract— Selecting the right data is an essential activity in genomic-related information systems. This work analyzes whether it is possible to select the best genomic databases from a catalog using information about the citations of the papers related to these databases. The motivation for using citation information is that it is not easy to obtain proper metadata about these databases. Thus, in this work, information related to paper citations is used for measuring three distinct data quality dimensions: believability, timeliness, and relevancy. Believability is evaluated through the number of citations. The variation of the number of citations over time is useful for determining the recency of a database and is related to the timeliness dimension. Regarding relevancy, the keywords of papers are useful to indicate the main context of application of these databases.

Keywords: database selection; database catalogs; quality indicators

I. INTRODUCTION

The motivation of the present work is mainly related to the difficulty of selecting a proper genomic database. The aim is to measure the overall quality of genomic databases stored in a database catalog prior to their selection for further analysis. Database catalogs assist users by indicating a set of possibly useful databases. However, in general, these catalogs do not contain detailed information about the databases, especially about their quality, nor do they provide a quality-based ranking of databases for users [1]. Besides, it is very difficult to keep these catalogs updated, and the extraction of metadata is not a trivial task.

Although the main focus of this work is on database catalogs, the approach can also be used to guide the database integration process. Database integration systems usually consider the overall quality of databases in the integration process. However, how to measure the overall quality is an open question. In the majority of cases, users of database integration systems have to explicitly indicate their preferences for specific databases. Considering database integration systems, where the basic tasks are (i) selecting the databases to search, (ii) searching the selected databases, and (iii) merging the results, the approach presented here can help in the first task.

This work is focused on genomic databases. The term genomic database refers to databases storing data related to genetic diseases, polymorphisms, proteins, nucleotide sequences, etc. In general, these genomic databases have a paper (sometimes several papers) describing their characteristics and content. Indeed, some information about papers, like the number of citations, is already used as a quality indicator for evaluating papers, journals and even researchers (e.g. the h-index) [2]. Thus, in the proposed strategy, we analyze the possibility of using information about the citations of the papers describing genomic databases in the database selection process. Basically, we analyze how one could effectively explore citation information in order to establish the overall quality of a particular database.

In a first experiment, we analyzed the use of the number of paper citations as a quality indicator. We considered the number of paper citations as a believability indicator and compared it with other quality indicators. Given the good results obtained in this first experiment, and the difficulties in extracting other database metadata, we decided to use other kinds of information about paper citations for measuring other quality dimensions. Thus, in the present work, the emphasis is on how to use citation information for evaluating aspects related to the timeliness and relevancy of genomic databases. Timeliness and, principally, relevancy are quality dimensions related to the context of use, which was almost ignored in the first experiment.

The paper is organized as follows. Section II presents related work. Section III describes the proposed approach and the architecture of a system based on it. Section IV discusses the use of paper citations to measure believability aspects. Section V discusses how the analysis of the variation of the number of citations over time can be used for determining the recency of a database. Section VI discusses the use of the keywords of papers for measuring relevancy aspects. Finally, Section VII presents final remarks.

II. RELATED WORK

There are several catalogs of biological resources on the Web. These catalogs are important for users, since mediation systems take care of a limited number of databases due to the complexity of mapping between local and mediator schemas. As a consequence, sometimes the wrapped databases do not fit users' needs and users need to access other ones [3]. One example of catalog is the Nucleic Acids Research (NAR) list of databases. This catalog has been published every year since 1996, and the 2011 version includes 1,330 databases [4]. It is important to note that the size of this catalog has been increasing constantly (the 2010 version had 1,230 databases [5]). Another similar catalog is the BioMed Central Databases catalog (http://databases.biomedcentral.com/home). In general, these catalogs do not contain much information about the databases.

There are some initiatives related to the creation of richer database catalogs. One example is the BioRegistry, a database catalog that is generated from the NAR database catalog with some complementary metadata. In the BioRegistry, part of the metadata is related to quality aspects: entry revision (manual or automatic), update and release frequency, existence of documentation, and cross-references to other databases. Besides, in the BioRegistry the databases are also characterized by MeSH terms (see Section VI). These terms are extracted from the papers describing the databases. A similar initiative is the CASIMIR Database Description Framework, where a set of metadata is defined to describe databases [6]. In the CASIMIR initiative, three levels of conformance are defined for each quality dimension considered. Thus, with respect to the timeliness quality dimension (currency in CASIMIR), a database can be:
- Level 1. Closed legacy database or last update more than a year ago;
- Level 2. Updates or versions more than once a year;
- Level 3. Updates or versions more than once a month.
In CASIMIR the data is captured manually; for each quality dimension it is necessary to assign a level of conformance.

The lack of genomic database metadata is an issue for users. Recently, a group of experts started an initiative, called BioDBcore, to create a community-defined, uniform, generic description of the core metadata of biological databases. The proposal aims to "gather the necessary information to provide a general overview of the database landscape, and compare and contrast the various resources" [1]. In this sense, the researchers responsible for maintaining the NAR database catalog are asking the researchers responsible for the databases referenced in the catalog to provide supplementary data about their databases, taking into account the core attributes of BioDBcore [1].

Summarizing, database catalogs do not provide rankings based on quality aspects, and the extraction of metadata for elaborating quality indicators is a problem as well. Beyond database catalogs, database selection is also an issue for database integration systems, where database quality aspects are frequently ignored [7]. In the database integration systems where quality aspects are taken into account, quality is measured solely based on users' explicit ratings or using a small piece of the database content as a sample [8] [9].

III. DESCRIPTION OF THE APPROACH

In the present work, the aim is to identify the best databases from a set of previously identified databases stored in a catalog. This database selection process involves the calculation of the overall quality of a database. Quality must be measured by taking into account a set of quality dimensions or factors. However, there is no consensus about which quality dimensions or factors must be considered to measure or represent the distinct data quality aspects [10]. Considering this lack of consensus, we adopted the definitions of [11] for the quality dimensions. We focus on three distinct quality dimensions [11]:
- Believability/Reputation. "The extent to which data is accepted or regarded as true";
- Timeliness. "The extent to which the age of data is appropriate for the task";
- Relevancy. "The extent to which data are helpful for the task".
Some quality dimensions are considered context dependent, i.e. the "fitness for use" determines the degree of quality in relation to the quality dimension. Timeliness and relevancy are examples of quality dimensions categorized as contextual, while believability is categorized as an intrinsic quality dimension. The intrinsic quality dimensions are independent of the user's context, emphasizing that data have quality in their own right [11].

The focus of the present work is to measure the overall quality of genomic databases. The content, aims, shapes and usages of these databases are very distinct [12]. Some of these databases contain data about polymorphisms, others data about genetic diseases, proteins, etc. In the present work, the context is represented by database categories (NAR categories) [4] and generic users' tasks (e.g. a population genetic study). A task represents users' needs and requires specific data (e.g. information about the geographical origin of data for a population genetic study). Each quality dimension of a genomic database can be quantified by one or more metrics and indicators. However, the extraction process of these quality indicators remains a challenge [6]. The difficulties with metadata extraction and the results of some previous experiments [13] serve as motivation for analyzing, in a more effective way, the use of information about the citations of papers related to genomic databases as quality indicators. Many genomic databases are described in scientific papers, and researchers cite these papers as an indication that such databases have been used in their experiments and publications. We consider that information related to the papers describing the genomic databases, and also information about the papers citing these papers, can be used to guide the selection of the best databases from a catalog. More specifically, we investigate the possibility of measuring believability, timeliness, and relevancy aspects using information about paper citations.

A. Illustrating the Approach

Fig. 1 shows a simplified architecture of a system based on our approach.

Figure 1. System Architecture

In this system, there are modules responsible for specific types of metadata extraction. These metadata are extracted from distinct Web sources. There is a module responsible for classifying new databases. These databases can be new, unclassified databases included by administrator users. Besides, the database classifier module can be integrated with a crawler designed for discovering new genomic databases on the Web [14]. The system has a repository of genomic databases where a set of genomic database metadata is stored (the catalog). Part of this metadata consists of quality indicators used to calculate the overall quality of each database. The system architecture also defines an administrator module used to complement or update the catalog.

Ordinary users use the ranking module to obtain the ranking of databases. Basically, the user can indicate a task, a database category, or a set of terms related to his task. Next, the user receives a set of genomic databases ordered according to their overall quality. In the following sections, we present some techniques that can be useful to generate the database ranking using information about paper citations. We also discuss how paper citation information can be useful for the database classifier module (we evaluate the relevancy of a database for a specific category).

IV. MEASURING BELIEVABILITY USING PAPER CITATION INFORMATION

Following [8], we consider believability and reputation, "The extent to which data are trusted or highly regarded in terms of their source or content" [11], as the same quality dimension. Believability is especially related to authorship [15], is based on the opinion of others [16], and can be measured by users' ratings [8]. On the Web, it is quite difficult to clearly identify the author of content or to obtain explicit ratings about a particular source. Still, there are some quality indicators or metrics that can be used to measure the believability or reputation of a Web source. Examples of believability indicators for Web pages and Web sites are PageRank [17] and the number of Web links pointing to a Web page (inlinks) [18]. Some of these quality indicators can be considered for evaluating the quality of Web genomic databases. Besides, as databases have associated papers, the number of citations of the papers related to a genomic database can be considered a believability indicator. It is possible to infer how many times a database is used: the more reputation a database has, the higher the number of citations of the database's papers.

Believability is considered a context-independent quality dimension [11]. However, we believe that genomic databases in some categories tend to be more cited than databases in others. With respect to the use of paper citations as quality indicators, Hirsch observes that h indices (a metric based on paper citations) tend to be higher in some specific areas than in others [2].

It is possible to evaluate whether the number of paper citations is a useful quality indicator by comparing it with other quality indicators. In a first experiment, we focused on three distinct quality dimensions: believability, timeliness, and accessibility. For each quality dimension we considered a set of quality indicators. The number of citations of related papers, the inlinks (the number of Web links pointing to the database's homepage) and the PageRank are believability indicators. The creation date and the last update of a database are timeliness indicators. The existence of Web services, the possibility to download the data and the use of exportation formats are accessibility indicators [13]. In this experiment, the number of inlinks was obtained using the Google API. The number of citations was extracted from Google Scholar using Web Harvest (http://web-harvest.sourceforge.net/). The creation date and the last update of a database were manually extracted from the database's homepage. We identified the presence of accessibility indicators by exploring the database facilities. Afterwards, we calculated the age and the recency of a database, the average number of inlinks and paper citations per year, and how many accessibility indicators a database has. Finally, we aggregated the metric values to generate a single value for each quality dimension. The result is a vector that represents the quality degree of a database, where each element contains a numeric value, between 0 and 1, assigned to each quality dimension.

For the experiments, we used databases classified as General Polymorphism Databases (http://www.oxfordjournals.org/nar/database/subcat/8/32) in [4]. We generated a unified ranking of 18 of these databases based on the average of 3 experts' ratings. We used SAW (Simple Additive Weighting) to generate the ranking [8]. Besides, we also evaluated some indicators individually. We compared the experts' ranking with the generated rankings using the Spearman correlation [19]. The range of the Spearman correlation is [-1, 1] (1 indicates perfect correlation). All correlations (Table I) are significant for p < 0.05. Table I shows the best results. Details about this experiment are described in [13].
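As a rough illustration of this comparison step, the sketch below aggregates normalized indicator values with SAW and compares the resulting ranking with an experts' ranking using the Spearman correlation. It is only a minimal example: the indicator values, weights and expert scores are invented placeholders, and SciPy is assumed to be available; it is not the exact script used in the experiment.

```python
# Minimal sketch (hypothetical data): SAW aggregation of quality indicators
# followed by a Spearman comparison against an experts' ranking.
from scipy.stats import spearmanr

# Normalized indicator values in [0, 1] per database (placeholder numbers).
indicators = {
    "db_A": {"believability": 0.9, "timeliness": 0.7, "accessibility": 0.6},
    "db_B": {"believability": 0.4, "timeliness": 0.9, "accessibility": 0.8},
    "db_C": {"believability": 0.7, "timeliness": 0.3, "accessibility": 0.5},
}
weights = {"believability": 0.4, "timeliness": 0.3, "accessibility": 0.3}

# SAW: weighted sum of the normalized indicator values.
saw_score = {
    db: sum(weights[dim] * value for dim, value in dims.items())
    for db, dims in indicators.items()
}

# Experts' scores for the same databases (placeholder numbers).
expert_score = {"db_A": 0.8, "db_B": 0.6, "db_C": 0.5}

names = sorted(indicators)
rho, p_value = spearmanr([saw_score[n] for n in names],
                         [expert_score[n] for n in names])
print(f"Spearman rho = {rho:.4f} (p = {p_value:.4f})")
```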

TABLE I. RANKING EVALUATION

Dimensions/Indicators       | Spearman
All quality dimensions      | 0.6511
Inlinks (average by year)   | 0.6182
Citations (average by year) | 0.7978

This experiment indicates that the number of paper citations is useful for measuring aspects related to believability. Besides, the experiment also shows that it is possible to measure the overall quality of a genomic database using only the number of citations related to it. This is an interesting observation, since it is quite difficult to extract some quality indicators from the Web sites of genomic databases. Often even simple quality indicators, like the last update of a database, are not available on the genomic database Web sites. We emphasize that, in the first experiment, with the exception of inlinks, PageRank and the number of citations, the other quality indicators were manually extracted.

In the first experiment, we extracted the number of citations from Google Scholar. In another experiment, we decided to extract the number of paper citations from a more specialized source: PubMed (http://www.ncbi.nlm.nih.gov/pubmed/). PubMed is a free search interface to a database of papers (especially MEDLINE) on life sciences and biomedical topics. We decided to analyze the data retrieved from PubMed because it is easier to extract citation data from PubMed than from Google Scholar (PubMed has a set of utilities that make this process easier). Besides, PubMed has a strict relationship with genomic databases; many genomic databases provide links to references in PubMed. The problem is that the number of citations present in PubMed is smaller than the number of citations present in Google Scholar. Thus, we repeated the previous experiment using PubMed. Table II shows the results of this new experiment. In this evaluation, only citations of papers published between 2000 and 2010 (11 years) were considered.
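A minimal sketch of the kind of extraction involved is shown below, using Biopython's Entrez module and the NCBI E-utilities "cited in" link to list the PubMed records citing a database paper. The PMID and the e-mail address are illustrative assumptions; the paper only states that PubMed utilities were used.

```python
# Minimal sketch (assumptions noted above): count PubMed papers that cite a
# given database paper via the E-utilities "pubmed_pubmed_citedin" link.
from Bio import Entrez

Entrez.email = "someone@example.org"  # required by NCBI; placeholder address

def citing_pmids(pmid):
    """Return the PMIDs of PubMed records that cite the given PMID."""
    handle = Entrez.elink(dbfrom="pubmed", db="pubmed",
                          linkname="pubmed_pubmed_citedin", id=pmid)
    record = Entrez.read(handle)
    handle.close()
    pmids = []
    for linkset in record:
        for linksetdb in linkset.get("LinkSetDb", []):
            pmids.extend(link["Id"] for link in linksetdb["Link"])
    return pmids

# Hypothetical PMID of a paper describing a genomic database.
database_paper_pmid = "10592272"
citations = citing_pmids(database_paper_pmid)
print(f"{len(citations)} citing papers found in PubMed")
```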

TABLE II. NEW RANKING EVALUATION

Source         | Citations (total) | Citations (average by year) | Spearman
Google Scholar | 3,836             | 348.73                      | 0.7874
PubMed         | 2,536             | 230.55                      | 0.6553

The new results confirm that the number of paper citations is a good quality indicator, even when using another data source (PubMed) with a smaller number of registered citations. However, there are some problems and limitations related to the use of paper citations as quality indicators:
- The number of citations only indicates the overall quality of a database. The only context aspect considered in the experiments is the database category;
- Old genomic databases (some of them outdated) can have a larger number of citations than new ones;
- New genomic databases can have a smaller number of citations than old ones, despite their quality.
Taking these facts into account, and given that it is not easy to obtain metadata about the databases, we decided to explore the use of information about paper citations for measuring timeliness and relevancy.

V. MEASURING TIMELINESS USING PAPER CITATION INFORMATION

Section III presented a definition of timeliness that relates timeliness aspects to fitness for use. In general, for any task, the more up to date the data, the better; but some aspects must be considered when measuring timeliness. The timeliness of a data unit depends on two factors: currency and volatility. Currency refers to the age of data. Volatility refers to how long the item remains valid [20]. In some contexts, the age of data does not matter (e.g. historical data like the date of arrival of Columbus in America). In the context of genomic databases, a researcher in general wants the most up-to-date data. One exception is the molecular biologist who wants to reproduce an experiment. For this researcher, a database that keeps historical versions of data has a higher degree of timeliness [21].

In the present work, we aim to evaluate the overall quality of a genomic database. Thus the most appropriate definition for timeliness could be "timeliness is the average age of the data in a source" [22]. However, it is quite difficult to calculate this average because it is necessary to access the database, and this process involves a set of database integration problems (schema mapping, tuple similarity, etc.) [23]. We therefore opted to calculate the timeliness degree considering the creation date and the last update of a set of databases [13]. The problem is that this metadata is not always available on the Web pages of genomic databases, which is easy to check by accessing some genomic database Web sites. Taking into account these difficulties in identifying the date of the last update of a genomic database, we decided to explore information about paper citations for determining timeliness aspects. In this sense, we verified that some of the most updated genomic databases (e.g. OMIM, http://www.ncbi.nlm.nih.gov/omim, and dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/) tend to have newer paper citations.

For the temporal paper citation analysis, we selected 20 databases classified as Protein Structure (http://www.oxfordjournals.org/nar/database/subcat/4/14) in the NAR catalog, for which we could more easily identify the date of the last actualization. We also used this set of databases because it is larger than the set of General Polymorphism databases used in the previous experiments. One difficulty for this experiment is that an updated catalog contains mostly updated databases (the NAR catalog is updated every year); it would be more adequate to use older databases. In the experiment, we considered a database outdated when it was impossible to access it (the Web link to the database is broken) or when the database Web site contains information about the last update indicating that the last actualization was made before 2010. Conversely, if the database Web site contains information about the last update indicating that the last actualization was made after 2009, we considered the database updated.

In order to measure the degree of actualization of a database using paper citations, we extracted the year of publication of the papers that cite the database paper using PubMed utilities. Next, for each database, we counted the number of citations per year. It is important to note that the main goal is to analyze aspects related to timeliness, not believability. Thus, we did not analyze the total number of citations, but the distribution of the number of citations along the years. It is important to mention that some old, outdated databases can have a large number of citations. As an illustration of such behavior, data about 6 genomic databases are shown in Table III, with the number of citations by year between 2000 and 2010.

Besides, Fig. 2 shows the variation of citations by year for the databases of Table III.

Figure 2. Citations by Year - Normalized.

Regarding Fig. 2, since our aim is to analyze timeliness aspects, all values related to the number of paper citations were normalized to the range [0, 1], to allow a better view of the variation of citations. This normalization consists in dividing each yearly citation count by the highest yearly citation count of the corresponding database. We also analyzed the standard deviations of the number of citations. We calculated the standard deviations using the normalized values of paper citations, considering the 20 databases classified as Protein Structure databases (7 out of 20 databases are outdated). Next, we verified whether the most updated databases keep a more constant number of paper citations. We verified that the 4 databases with the highest standard deviation are outdated. These standard deviation values are high because the outdated databases have either a decreasing number of citations or only occasional citations (see Table III). However, we also found that many updated databases have higher standard deviation values than outdated databases. This occurs because new databases tend to have a small number of paper citations in the first years of their existence (e.g. database C – Table III). The results of the experiments are preliminary. However, they demonstrate that the distribution of paper citations can be used as a timeliness indicator. In fact, databases with an increasing number of citations are usually updated (e.g. database C – Table III).
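The normalization and dispersion analysis described above can be sketched as follows; the citation counts are placeholder values and the per-database maximum normalization is the only part taken from the text.

```python
# Minimal sketch: normalize yearly citation counts by each database's maximum
# and compute the standard deviation of the normalized series.
from statistics import pstdev

# Hypothetical citations-by-year series for two databases (2005-2010).
citations_by_year = {
    "db_outdated": [31, 43, 42, 57, 5, 1],   # decreasing / occasional citations
    "db_updated":  [1, 10, 14, 23, 24, 25],  # increasing citations
}

for name, counts in citations_by_year.items():
    peak = max(counts)
    normalized = [c / peak for c in counts]  # values in [0, 1]
    print(f"{name}: std dev of normalized citations = {pstdev(normalized):.3f}")
```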

TABLE III. CITATIONS BY YEAR

Database | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | Updated
A        |      |      |      |      |      |      | 5    | 0    | 0    | 0    | 1    | NO
B        |      |      |      |      |      |      | 2    | 0    | 1    | 0    | 0    | NO
C        |      |      |      |      |      | 1    | 10   | 14   | 23   | 24   | 25   | YES
D        |      |      |      |      |      |      |      |      |      | 1    | 3    | YES
E        |      |      |      |      |      |      | 5    | 0    | 0    | 0    | 1    | NO
F        | 6    | 9    | 14   | 21   | 35   | 31   | 43   | 42   | 57   | 56   | 42   | YES

VI. MEASURING RELEVANCY USING PAPER CITATION INFORMATION

Perhaps the most prominent contextual quality dimension is relevancy. A user of genomic databases looks for databases that contain relevant data for his work. Sometimes the most relevant database is not the most updated one and may have a smaller degree of believability when compared with other databases. Measuring relevancy is also important for reducing problems of information overload. One example is a search engine, where documents are sorted according to their relevancy for a user query, using information retrieval methods [24] [25] and Web link analysis [17]. Information retrieval methods work with textual documents, considering a document relevant if query terms appear frequently in it. Besides the presence of the terms in a document, some aspects related to the document structure can also be considered (e.g. a term that appears in the title of a paper usually has more relevancy than terms that appear in the body of the paper).

Although we do not scrutinize the database content, it is important to have a better understanding of the content of a genomic database for measuring its overall quality, especially for measuring relevancy aspects. Thus, considering information retrieval methods, we explored the possibility of evaluating the relevancy of a genomic database for a task using a specific kind of metadata: the terms of the papers related to these genomic databases. More specifically, we used the sets of keywords used to characterize these papers. These keywords are MeSH terms. MeSH - Medical Subject Headings (http://www.ncbi.nlm.nih.gov/mesh) is a controlled vocabulary thesaurus used for indexing the papers of PubMed. The motivation for using this type of information is also related to the fact that we observed, for example, that the MeSH terms Gene Frequency and Genetics, Population are among the most used terms in papers that cite papers related to the ALFRED database (http://alfred.med.yale.edu/), a database about allele frequencies in human populations. In the same way, the MeSH term Genetic Diseases, Inborn is among the most used terms in papers that cite papers related to OMIM, a well-known database about genetic diseases. Thus, in the following experiments, we used the MeSH terms extracted from the papers that describe each database and from the papers that cite these databases.

In these experiments, each database is characterized by a set of terms extracted from papers. Regarding database catalogs, two observations can be made:
- The number of online databases has been increasing constantly. Thus, the catalogs do not contain all databases and it is necessary to assign new databases to the catalog taxonomy;
- The catalog taxonomy can be insufficient to characterize databases.
The first observation is relevant especially to the catalog's managers, who have to decide how to classify a new database. Besides, there are some initiatives to create online database catalogs automatically, where it is necessary to automatically assign databases to a specific category [14]. These aspects are related to the database classifier defined in the system architecture (Section III). Regarding the second observation, some of the experts who helped us in the first experiment indicated that subcategories could be created for some categories of the NAR taxonomy. Thus, we performed some experimental analysis to evaluate whether MeSH terms can be used to:
- Search for the most relevant genomic databases in a catalog using terms;
- Classify new databases into a previously defined taxonomy.
In all experiments, we identified the papers that describe each database by reading the documentation on the database Web sites. Next, we accessed these documents in PubMed to extract the PMID (a unique number assigned to each PubMed record). We used PubMed utilities to identify the papers that cite each database paper. Finally, the MeSH terms were extracted, also using PubMed utilities and some programs specially designed for this purpose, as sketched below.
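A minimal sketch of this MeSH extraction step, assuming Biopython is available, is shown below; the PMIDs are hypothetical examples and the MEDLINE "MH" field parsing is our own simplification of the "programs specially designed" mentioned above.

```python
# Minimal sketch: fetch MEDLINE records for a list of PMIDs and collect their
# MeSH terms from the "MH" field.
from Bio import Entrez, Medline

Entrez.email = "someone@example.org"  # required by NCBI; placeholder address

def mesh_terms(pmids):
    """Return the list of MeSH terms attached to the given PubMed records."""
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                           rettype="medline", retmode="text")
    terms = []
    for record in Medline.parse(handle):
        # Qualifiers such as "/genetics" are attached with "/"; keep the heading only.
        terms.extend(term.split("/")[0].lstrip("*") for term in record.get("MH", []))
    handle.close()
    return terms

# Hypothetical PMIDs of papers citing a database paper.
print(mesh_terms(["10592272", "11125122"]))
```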

A. Using an Information Retrieval System for Measuring the Relevancy of Databases for a Task

In this experiment, we evaluate the possibility of retrieving the most relevant databases of a specific category using information retrieval techniques. The motivation for using these techniques is related to the fact that two of the experts who evaluated the set of databases classified as General Polymorphism databases in the first experiment (Section IV) made observations indicating that some of these databases should be classified in another way. Basically, these experts suggested that databases classified as General Polymorphism databases could be split into three distinct classes:
- Medical Clinical databases. These databases contain data about diseases related to mutations (e.g. the OMIM database);
- Functional databases. These databases contain data about gene and protein functions and interactions related to variations (e.g. the F-SNP database, http://compbio.cs.queensu.ca/F-SNP/);
- Population Genetic databases. These databases contain data about the frequency and interaction of alleles and genes in populations (e.g. the ALFRED database).
These subcategories represent three distinct tasks that a user can perform using the genomic databases classified as General Polymorphism databases in the NAR catalog. An alternative to creating these subcategories could be to use information retrieval techniques to measure the relevancy of each database. By employing information retrieval techniques, a user of a database catalog could explore a database category by providing a term or set of terms related to the databases of interest. In this case, the search result is a set of databases ordered by relevancy, according to the information retrieval techniques.

An information retrieval system is basically a system used for finding, in a collection, the textual documents that satisfy an information need [26]. This information need is expressed by a term or a set of terms. In this experiment we used a set of textual documents representing the genomic databases classified as General Polymorphism databases in the NAR catalog. We created 18 textual documents, one for each database considered in the believability experiment (Section IV). Next, using PubMed's utilities, we retrieved the MeSH terms related to the database papers, i.e. the MeSH terms of the papers that cite the papers describing these databases. The reference to a database in a paper indicates the use of that database. Thus, the MeSH terms related to these papers can be a potential indication of the relevancy of a database for a specific task or information need. A total of 1,865 distinct MeSH terms were extracted. These MeSH terms were stored in the textual documents created to represent the 18 databases. We indexed these documents using a search engine called Zettair (http://www.seg.rmit.edu.au/zettair/). Zettair implements a set of information retrieval models like the cosine model [24] and Okapi BM25. We used the Okapi BM25 ranking function because it is one of the most important ranking functions in information retrieval, considered a baseline for evaluating new ranking functions [27]. The aim is to use the search engine for retrieving the databases (represented by the documents) most adequate for a specific task.

Considering a population genetic study as a task, we used the term population (a term related to population genetic tasks) as the query argument for searching the collection of documents indexed in the search engine. In the same way, we used the term disease for searching the collection and retrieving databases (represented by documents) relevant for disease-related studies. We evaluated the results using precision, an evaluation measure commonly used for information retrieval systems: the fraction of retrieved documents that are relevant for a given query [26]. We evaluated the precision up to the 5th position (based on 2 experts' evaluation). Table IV shows the principal results of this preliminary experiment. The results for disease-related studies were worse than for population genetic studies. However, it is worth mentioning that for disease-related studies the first database retrieved was OMIM, which is one of the most used genetic disease databases.

TABLE IV. SEARCH RESULTS

Tasks                      | Term       | Precision at 5th
Disease-related studies    | Disease    | 0.6
Population genetic studies | Population | 0.8
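An analogous retrieval step can be sketched in Python; this is only an illustration of BM25 ranking over term documents and of precision in the top results, assuming the third-party rank_bm25 package instead of Zettair, with invented document contents and relevance judgments.

```python
# Minimal sketch (hypothetical data): rank databases represented by MeSH-term
# documents with Okapi BM25 and compute precision over the top results.
from rank_bm25 import BM25Okapi

# Each "document" is the bag of MeSH terms of papers citing a database paper.
documents = {
    "ALFRED": ["gene frequency", "genetics population", "alleles"],
    "OMIM":   ["genetic diseases inborn", "mutation", "phenotype"],
    "F-SNP":  ["polymorphism single nucleotide", "protein interaction maps"],
}
names = list(documents)
bm25 = BM25Okapi([documents[n] for n in names])

scores = bm25.get_scores(["genetics population"])          # one-term query
ranking = sorted(zip(names, scores), key=lambda x: x[1], reverse=True)

relevant = {"ALFRED"}                # hypothetical expert judgment for the query
top = [name for name, _ in ranking[:5]]
precision = sum(1 for name in top if name in relevant) / len(top)
print(ranking, precision)
```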

B. Using Papers' Keywords for Classifying Databases

The number of genomic databases has been increasing constantly and the database catalogs do not keep perfect track of all available databases. New databases can be found by users using typical search engines like Google or Yahoo (depending on when they were released, chances are that they have not been cataloged yet). In the near future, focused crawlers could be used for locating these databases [14]. Taking these facts into account, sometimes it will be necessary to classify these new databases into a database taxonomy like the NAR taxonomy. Besides, focused database crawlers need a set of terms related to the domain of the databases in order to identify the databases related to a specific domain. In [14], for example, a module called page classifier is defined, which predicts whether a given Web page (a database Web form) belongs to a specific topic or not. To do so, the page classifier uses a set of terms related to a particular domain. A similar module (the database classifier) is present in the architecture of the proposed system (Section III). One possibility for identifying the relevant terms required for the classification process is to use the MeSH terms used for indexing papers related to these databases. Thus, we describe one experiment whose aim is the automatic classification of databases into the NAR categories. Afterwards, we describe another experiment whose aim is to identify the MeSH terms that better characterize a specific category of databases.

1) Classifying databases automatically. We tested the possibility of using the MeSH terms of papers for the classification of databases into NAR categories. We extracted the terms of paper citations from two distinct NAR categories: General Polymorphism databases and Protein Structure databases. In this experiment, we used WEKA (the Waikato Environment for Knowledge Analysis), a collection of machine learning algorithms and data preprocessing tools [28]. First, we extracted the MeSH terms from PubMed papers using PubMed utilities. We extracted MeSH terms from the papers describing the databases and from the papers citing these papers, and compared whether the best result is reached using terms extracted from the database papers or using terms from the papers that cite them. All MeSH terms were stored in a text file, where for each database there is a set of MeSH terms organized according to the WEKA format (a string attribute). Next, we applied a WEKA filter (StringToWordVector, http://weka.sourceforge.net/doc/weka/filters/unsupervised/attribute/StringToWordVector.html) to convert the string attribute into a set of attributes that represent word occurrence. Then, we used WEKA classifiers (e.g. J48 and DMNBtext) to perform the experiments, with the 10-fold cross validation method (each dataset was broken into 10 parts, 9 parts were used for training and 1 was used as a test) [29]. Table V shows the best results with the papers that describe the databases (Kappa = 0.9663), and Table VI the best results with the papers that cite the papers describing the databases (Kappa = 0.8961). The best result was obtained using the DMNBtext classifier.
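The same kind of pipeline can be sketched in Python with scikit-learn as an analogue of the WEKA setup (word-occurrence features plus a classifier under cross validation); the documents and labels below are invented placeholders and multinomial naive Bayes stands in for DMNBtext.

```python
# Minimal sketch (hypothetical data): bag-of-words features from MeSH terms
# and cross-validated classification into NAR categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# One space-joined MeSH-term "document" per database, with its NAR category.
documents = [
    "polymorphism_single_nucleotide gene_frequency genetics_population",
    "genetic_diseases_inborn mutation phenotype",
    "protein_conformation databases_protein models_molecular",
    "protein_folding protein_structure_secondary internet",
]
labels = ["Polymorphisms", "Polymorphisms", "Protein", "Protein"]

pipeline = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
scores = cross_val_score(pipeline, documents, labels, cv=2)  # cv=10 in a real run
print("accuracy per fold:", scores)
```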

TABLE V. CLASSIFICATION RESULTS – DATABASE PAPERS

Database Category | Correctly classified | Incorrectly classified | Total
Polymorphisms     | 17 (94%)             | 1 (6%)                 | 18 (100%)
Protein           | 97 (100%)            | 0 (0%)                 | 97 (100%)
Total             | 114 (99.13%)         | 1 (0.87%)              | 115 (100%)

TABLE VI. CLASSIFICATION RESULTS – PAPERS CITATIONS

Database Category | Correctly classified | Incorrectly classified | Total
Polymorphisms     | 16 (89%)             | 2 (11%)                | 18 (100%)
Protein           | 81 (98.8%)           | 1 (1.2%)               | 82 (100%)
Total             | 97 (97%)             | 3 (3%)                 | 100 (100%)

We repeated this experiment using only the 18 Polymorphisms databases and considering the sub-categories Medical Clinical, Functional and Population Genetic databases. Table VII (Kappa = 0.3333) and Table VIII (Kappa = 0.3333) present the results. In this experiment, the best result was obtained using the J48 classifier.

TABLE VII. CLASSIFICATION RESULTS – DATABASE PAPERS

Database Category | Correctly classified | Incorrectly classified | Total
Population        | 4 (66.67%)           | 2 (33.33%)             | 6 (100%)
Medical Clinic    | 2 (33.33%)           | 4 (66.67%)             | 6 (100%)
Functional        | 4 (66.67%)           | 2 (33.33%)             | 6 (100%)
Total             | 10 (55.56%)          | 8 (44.44%)             | 18 (100%)

TABLE VIII. CLASSIFICATION RESULTS – PAPERS CITATIONS

Database Category | Correctly classified | Incorrectly classified | Total
Population        | 4 (66.67%)           | 2 (33.33%)             | 6 (100%)
Medical Clinic    | 2 (33.33%)           | 4 (66.67%)             | 6 (100%)
Functional        | 4 (66.67%)           | 2 (33.33%)             | 6 (100%)
Total             | 10 (55.56%)          | 8 (44.44%)             | 18 (100%)

The results of the first experiment (classification into the two NAR categories) were better than those of the second (classification into the sub-categories). One possible reason is the size of the training set. Besides, it is important to note that, between the two experts who manually defined this classification, there was initially no consensus for 6 databases. As one can observe, the final results were quite similar for both datasets. Such behavior can be partially due to the fact that this approach usually takes into account single-word term frequencies rather than the co-occurrence of words. Furthermore, these categories also share a good number of terms. These observations guided us to explore the power of co-occurrences of correlated terms in both categories.

2) Identifying the best database descriptors. In this experiment, the aim is to find sets of MeSH terms that describe the databases of a specific category. When searching for a particular database to solve a particular biological task, one would be advised to explore combinations of words in order to obtain a database that could be more interesting for this task. Therefore, we set up another machine learning experiment to explore the discrimination power one could achieve by using the co-occurrence of MeSH terms, as illustrated by the sketch below. We used a frequent pattern mining approach to highlight these co-occurrence patterns in databases related to the Protein Structure category of the NAR catalog. We again extracted MeSH terms from the papers describing the databases and from the papers citing these papers. Even if the sets of databases were pretty much the same, they were quite different with respect to the sets of MeSH terms obtained: the former has 180 terms and the latter 3,966. The two most frequent terms in both datasets were Databases-Protein and Proteins. However, going deeper in the search for more specialized terms, we observe that the latter provides more power to characterize protein databases than the former. Table IX contains the top-5 most frequent terms (or sets of terms, like Databases-Protein=>Protein) in papers describing databases, and Table X contains the top-5 terms in papers citing these papers.
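The frequent pattern analysis was done with the R arules package; the sketch below shows an analogous computation in Python with the mlxtend package, on invented MeSH-term transactions, producing frequent itemsets (as in Tables IX and X) and association rules with support, confidence and lift (as in Tables XI and XII).

```python
# Minimal sketch (hypothetical data): frequent MeSH-term itemsets and
# association rules, analogous to the arules analysis reported in the paper.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction is the set of MeSH terms of one citing paper (placeholders).
transactions = [
    ["Databases-Protein", "Proteins", "Internet"],
    ["Databases-Protein", "Proteins", "Protein Conformation"],
    ["Proteins", "Models-Molecular", "Protein Conformation"],
    ["Databases-Protein", "Proteins", "Internet", "Models-Molecular"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)

itemsets = apriori(onehot, min_support=0.5, use_colnames=True)   # Tables IX/X style
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(itemsets)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```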

TABLE IX. TOP-5 MESH TERMS – DATABASE PAPERS

MeSH Terms (Sets)             | Support
Databases-Protein             | 0.9175258
Proteins                      | 0.6701031
Databases-Protein => Proteins | 0.6391753
Internet                      | 0.6288660
Databases-Protein => Internet | 0.6082474

TABLE X. TOP-5 MESH TERMS – PAPERS CITATIONS

MeSH Terms                                | Support
Proteins                                  | 0.8902439
Databases-Protein                         | 0.8902439
Protein Conformation                      | 0.8414634
Proteins => Models-Molecular              | 0.8414634
Databases-Protein => Proteins => Internet | 0.8048780

In Tables XI and XII, we present a few of the rules (co-occurring terms in databases) obtained in each experimental study. For this analysis we used the statistical R language with the arules package (http://www.r-project.org/).

TABLE XI. RULES – DATABASE PAPERS

Rules                                             | Support | Confidence | Lift
{Protein Conformation} => {Databases-Protein}     | 0.43    | 0.87       | 0.95
{User-Computer Interface} => {Databases-Protein}  | 0.54    | 0.98       | 1.06

TABLE XII. RULES – PAPERS CITING DATABASES

Rules                                               | Support | Confidence | Lift
{Sequence Homology-Amino Acid} => {Proteins}        | 0.52    | 0.97       | 1.09
{Protein Folding} => {Protein Structure-Secondary}  | 0.52    | 0.82       | 1.38

We observed that the co-occurrence of MeSH terms in the latter dataset (using paper citations) provides finer word terms for a better indexation of new databases related to the Protein Structure category. In fact, by taking such information into account, a new database could be ranked as a hot hit when more specialized MeSH terms are used.

VII. FINAL REMARKS

Data quality is an essential aspect for the correct management of genome-based information. We have shown how important the information about paper citations is in the context of measuring distinct quality dimensions of genomic databases. This work is part of an ongoing research effort aiming to define a set of quality indicators to guide the selection of the best genomic databases.

In the future, we intend to go further in the analysis of the users' tasks. In this sense, the results of the experiments using MeSH terms are promising for deducing the proper databases for specialized biological tasks. We are also going to deepen the analysis of the co-occurrence of MeSH terms. Besides, we intend to consider a larger number of database categories and to identify a larger set of experts to evaluate the results. How to aggregate the distinct quality dimensions in the evaluation process is another issue that we plan to explore in the future. Although this experimental study is preliminary, to the best of our knowledge, regarding database selection from catalogs, the proposed approach is one of the first works to perform an empirical analysis comparing experts' evaluation with an automatic evaluation approach.

ACKNOWLEDGMENT

This work is partially supported by CNPq, Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil, and CAPES, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil. The work of A. Levin and O. Pastor has been developed with the support of MICINN and GVA under the projects PROS-Req TIN2010-19130-C02-02 and ORCA PROMETEO/2009/015, and co-financed with ERDF. R. Alves was supported by the national postdoctoral program (PNPD) from CAPES-Brazil.

REFERENCES

[1] P. Gaudet, A. Bairoch, D. Field, S. Sansone, C. Taylor, T. Attwood, A. Bateman, J. Blake, C. Bult, J. Cherry et al., "Towards BioDBcore: a community-defined information specification for biological databases," Nucleic Acids Research, vol. 39, no. suppl 1, 2011, p. D7.
[2] J. E. Hirsch, "An index to quantify an individual's scientific research output," Proc. of the National Academy of Sciences of the United States of America, vol. 102, no. 46, 2005, pp. 16569–16572.
[3] M. D. Devignes, P. Franiatte, N. Messai, A. Napoli, and M. Smail-Tabbone, "Bioregistry: automatic extraction of metadata for biological database retrieval and discovery," Proc. of the 10th International Conference on Information Integration and Web-based Applications & Services, ser. iiWAS '08. New York, NY, USA: ACM, 2008, pp. 456–461.
[4] M. Galperin and G. Cochrane, "The 2011 nucleic acids research database issue and the online molecular biology database collection," Nucleic Acids Research, vol. 39, no. suppl 1, 2011, p. D1.
[5] G. Cochrane and M. Galperin, "The 2010 nucleic acids research database issue and online database collection: a community of data resources," Nucleic Acids Research, vol. 38, 2010, pp. D1–D4.
[6] D. Smedley, P. Schofield, C. Chen, V. Aidinis, C. Ainali, J. Bard, R. Balling, E. Birney, A. Blake, E. Bongcam-Rudloff et al., "Finding and sharing: new approaches to registries of databases and services for the biomedical sciences," Database: the journal of biological databases and curation, vol. 2010, 2010, p. baq014.
[7] R. Balakrishnan and S. Kambhampati, "Sourcerank: relevance and trust assessment for deep web sources based on inter-source agreement," Proc. of the 19th International Conference on World Wide Web, ser. WWW '10. New York, NY, USA: ACM, 2010, pp. 1055–1056.
[8] F. Naumann, U. Leser, and J. C. Freytag, "Quality-driven integration of heterogenous information systems," Proc. of the 25th International Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 447–458.
[9] S. Cohen-Boulakia, O. Biton, S. Davidson, and C. Froidevaux, "Bioguidesrs: querying multiple sources with a user-centric perspective," Bioinformatics, vol. 23, no. 10, 2007, pp. 1301–1303.
[10] B. Pernici and M. Scannapieco, "Data quality in web information systems," Proc. of the 21st International Conference on Conceptual Modeling. London, UK: Springer-Verlag, 2002, pp. 397–413.
[11] R. Y. Wang and D. M. Strong, "Beyond accuracy: what data quality means to data consumers," J. Manage. Inf. Syst., vol. 12, no. 4, 1996, pp. 5–33.
[12] F. Bry and P. Kruger, "A computational biology database digest: data, data analysis, and data management," Distributed and Parallel Databases, vol. 13, no. 1, 2003, pp. 7–42.
[13] D. Lichtnow, A. Levin, R. Alves, I. M. Castello, J. Dopazo, O. Pastor, and J. P. M. de Oliveira, "Using metadata and web metrics to create a ranking of genomic databases," Proc. of the IADIS WWW/Internet 2011 Conference, in press.
[14] L. Barbosa, S. Tandon, and J. Freire, "Automatically constructing a directory of molecular biology databases," Proc. of the 4th International Conference on Data Integration in the Life Sciences, ser. DILS'07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 6–16.
[15] F. Naumann and C. Rolker, "Do metadata models meet iq requirements," Proc. of the International Conference on Information Quality (IQ), 1999, pp. 99–114.
[16] J. Golbeck and J. Hendler, "Accuracy of metrics for inferring trust and reputation in semantic web-based social networks," Engineering Knowledge in the Age of the Semantic Web, 2004, pp. 116–131.
[17] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Comput. Netw. ISDN Syst., vol. 30, no. 1-7, 1998, pp. 107–117.
[18] B. Amento, L. Terveen, and W. Hill, "Does "authority" mean quality? predicting expert quality ratings of web documents," Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2000, pp. 296–303.
[19] S. Siegel, Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Inc., 1956.
[20] D. Ballou, R. Wang, H. Pazer, and G. Kumar, "Modeling information manufacturing systems to determine information product quality," Management Science, 1998, pp. 462–484.
[21] P. Buneman, A. Chapman, and J. Cheney, "Provenance management in curated databases," Proc. of the 2006 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '06. New York, NY, USA: ACM, 2006, pp. 539–550.
[22] F. Naumann, Quality-driven Query Answering for Integrated Information Systems. Springer Verlag, 2002.
[23] H. H. Do, S. Melnik, and E. Rahm, "Comparison of schema matching evaluations," Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems. London, UK: Springer-Verlag, 2003, pp. 221–237.
[24] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18, no. 11, 1975, pp. 613–620.
[25] C. Bizer, Quality-Driven Information Filtering in the Context of Web-Based Information Systems. Saarbrücken, Germany: VDM Verlag, 2007.
[26] C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[27] S. Robertson, "On the history of evaluation in IR," J. Inf. Sci., vol. 34, 2008, pp. 439–456.
[28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, 2009, pp. 10–18.
[29] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
