2016 Open Data Research Symposium, 5 October 2016, Madrid, Spain

Enabling dengue outbreak predictions based on open data

Juan Pane∗, Julio Paciello†, Verena Ojeda‡, Natalia Valdez§
Facultad Politécnica, Universidad Nacional de Asunción
P.O. Box 2160 SL, San Lorenzo, Central, Paraguay
Email: ∗[email protected], †[email protected], ‡[email protected], §[email protected]

Abstract—Dengue is one of the fastest growing diseases in the world and has become an increasing problem both geographically and in the number and severity of reported cases, with nearly half of the world's population at risk. According to the National Health Surveillance Department of Paraguay, dengue has been endemic in Paraguay since 2009. Currently, the data needed to enable research and to develop applications for understanding and managing disease epidemics, such as dengue, are highly variable, difficult to reuse and not standardized. Like many other Latin American countries, Paraguay does not have an automated predictive tool that can be used to foresee dengue disease outbreaks. Furthermore, to the best of our knowledge, there is no research that analyzes the correlation of external variables, such as weather and population, with the dynamics of dengue outbreaks in Paraguay. This paper presents a new data model and the creation of an open source early warning model for dengue outbreaks. The proposed data model is based on the analysis of supply and demand of dengue-related data. Additionally, the early warning system implemented can detect disease outbreaks one week in advance. Our preliminary results show that the model was able to predict dengue outbreaks for the next week with an accuracy of 94.78% and an F-measure of 90.47%.

Keywords—Data Models, Data mining, Dengue, Classification, Information management, Medical conditions, Medical Expert systems, Medical information systems, Monitoring, Open data, Public healthcare, Standards development, Surveillance

I. INTRODUCTION

Dengue is one of the fastest growing diseases in the world and has become a growing problem both in the number of reported cases and in its geographical distribution. The Dirección General de Vigilancia de la Salud (DGVS) is the National Health Surveillance Department of Paraguay and is responsible for the prevention and control of epidemiological diseases in the country. According to the DGVS, dengue has been an endemic disease in Paraguay since 2009 (Cabello et al., 2014).

Currently, there is an increasing trend towards the use of data mining techniques to discover correlations and possibly perform classifications or predictions based on the analysis of available data (Lowe, Bailey, Stephenson, Graham, et al., 2011; Lowe, Bailey, Stephenson, Jupp, et al., 2013; Ahmad Tarmizi et al., 2013). These techniques rely on high quality historical data. Open data initiatives1 at national and international levels play a crucial role in providing such data in reusable formats, fostering the creation of predictive models based on data mining tools.

In Paraguay, the DGVS publishes morbidity information2 in the form of graphs and historical trends3 by region, or heat maps, in its weekly epidemiological reports (Dirección General de Vigilancia de la Salud de Paraguay, 2016a). However, this information is presented in a static format (PDF, JPG) and without a standard data model, which prevents automatic processing of the information. This model of publication is similar in many other countries that were studied. The lack of a standard model is a burden for the creation of tools that can automatically import and analyse current and historical data related to dengue. The DGVS does not have an automated predictive tool that can be used to foresee dengue disease outbreaks. Furthermore, there is no open source tool for predicting dengue disease outbreaks that could be easily adapted to the DGVS's needs. Finally, to the best of our knowledge, there is no research that analyzes the correlation of external variables, such as weather and population, with the dynamics of dengue outbreaks in Paraguay.

This paper presents a standard data model for reporting dengue cases based on an analysis of the supply and demand of data related to dengue. Based on the study of all dimensions and variables correlated with dengue, the model includes the basic information that is needed to study dengue outbreaks. The model is based on open data principles, seeking to minimize the effort of publishing data while maximizing the ability to use and reuse this data. As a proof of concept of the usefulness of the proposed data model, we obtained the historical data from the DGVS and built an open source tool that can aid in the publication of the data. With this data, we trained a classification model of dengue cases, also using other related variables (or co-variables) such as demographic, climatic and geographic data, to determine the occurrence of a dengue outbreak one week in advance. According to the World Health Organization (WHO), an outbreak is defined as “the occurrence of cases of a disease in excess of what would normally be expected in a defined community, geographical area or season” (World Health Organization, 2016). The open source tool allows the user to browse and download the data and to visualize the risk maps for Paraguay at the level of departments and cities for each epidemiological week that is loaded using the proposed data model. The tool also integrates the trained classification model, showing the prediction for the following week, based on the epidemiological week selected.

2 Morbidity, in this context, is the relative incidence (i.e., number of cases and other characteristics for a given geographical area) of a disease.
3 Proportion of people falling ill in a place and time.

1 Iniciativa Latinoamericana por los Datos Abiertos (idatosabiertos.org), Global Open Data Initiative (globalopendatainitiative.org)


This work aims to promote the use of open data principles and standards, encouraging the reusability of data for the health domain, specifically for dengue, based on current needs and issues encountered when trying to analyse historical data and to generate new knowledge using data mining techniques. The standard data model for reporting morbidity information related to dengue can be useful for international organizations such as the WHO and the Pan American Health Organization (PAHO), research groups, government agencies and civil society alike. The publication of dengue data in open data format can enable collaboration, research and innovation based on data from official sources such as the DGVS. This collaboration could foster the creation of innovative tools that could be reused and adapted to several countries or regions. This reusability of tools is important for countries with fewer research and development resources, which can benefit from the incremental development of reusable open source tools. The use of the classification model based on the published data is one of the first steps towards generating an early warning system for dengue in Paraguay. The classification model aims to be a support tool in the process of dengue control and prevention. In addition, this classification model provides insights about the correlation between the variables and co-variables used and reveals the importance of each variable and co-variable when trying to predict the outbreak of the disease.

This paper is organized as follows:
• Section 2 gives a brief description of the state of the art relating to the problem.
• Section 3 describes the proposed solution for this work, which includes: the standard data model, the classification model and the prototype web application.
• Sections 4 and 5 show the datasets and the evaluation methodology used.
• Sections 6 and 7 describe the results of the evaluation.
• Finally, sections 8 and 9 present the conclusions and future work.

II. STATE OF THE ART

This work is based on the analysis of the available data related to dengue and the use of this data for the creation of a classification model for predicting dengue outbreaks, as well as the creation of reusable tools that show the information to end users in an intuitive and interactive manner. The following subsections analyze the state of the art on these issues.

A. Report and publication of dengue cases

Dengue is a notifiable disease as defined by the WHO in the International Health Regulations (World Health Organization, 2006). All notifiable diseases are required to be reported to government authorities and, from there, to international organizations such as the WHO and PAHO. In the Americas region (the Americas and the Caribbean), 17 out of the 30 countries considered do not publish morbidity information about dengue cases on their official websites. The other countries publish information in highly summarized formats, not standardized even across different periods in the same region. PAHO encourages its member countries to report their data. For the specific case of dengue, it defines a unified reporting format using Excel (Pan American Health Organization, 2016). However, the information collected by the central office is not published consistently nor at the granularity level reported by member countries. Furthermore, even if this information were available, it might be insufficient for specific research areas such as the creation of early warning systems, given the variability in the specificity of the data reported between regions. National organizations such as the DGVS publish information related to dengue cases as weekly newsletters in PDF format or as images, which does not allow automated processing (Dirección General de Vigilancia de la Salud de Paraguay, 2016a).

B. Data analysis of dengue

There is a tendency toward using data mining techniques to gain new knowledge from specific data sets. Classification and prediction models are techniques commonly used in data mining for purposes similar to the ones sought in this work. Some methods normally used in the literature to train these models are linear regression, such as generalized linear models (Lowe, Bailey, Stephenson, Graham, et al., 2011), Bayesian networks (Lowe, Bailey, Stephenson, Graham, et al., 2011; Lowe, Bailey, Stephenson, Jupp, et al., 2013), decision trees (Ahmad Tarmizi et al., 2013), neural networks (Ahmad Tarmizi et al., 2013) and fuzzy logic (Buczak et al., 2012), among others. A study published in 2012 (Buczak et al., 2012) describes an innovative prediction method using fuzzy association rule mining. The study involves clinical, meteorological and socioeconomic variables. Another study, published in 2013 (Lowe, Bailey, Stephenson, Jupp, et al., 2013), aimed to predict epidemics of dengue in the Brazilian cities where matches of the World Cup were played. The study was based on previous research that showed the existence of significant associations between dengue and climate variations, enabling the creation of early warning systems based on climate; furthermore, it involves non-climate co-variables that affect the behavior of the disease. In the study by (Ahmad Tarmizi et al., 2013), the authors used data related to the occurrence of dengue cases, such as week, number of cases, geographical location and others, to make a comparison among different classification techniques. In this study the use of decision trees showed results comparable to other studies found in the literature. Furthermore, it was noted that classifiers were useful to determine the occurrence of outbreaks, using data similar to the data proposed in the standard data model. Finally, there are different types of analysis performed once dengue data is obtained. Maps are a widely used tool; for example, in (Dicker et al., 2006; Munasinghe, Premaratne, and Fernando, 2013) the information is presented in graphical form. However, most of the maps are images and very few are interactive and intuitive tools, which can be attributed both to the lack of data and to the lack of a standard publication format that supports the development of such tools.

C. Main research related to dengue in Paraguay

In Paraguay, two research groups tackled the issue of dengue from different perspectives, both using Information and Communication Technologies (ICT). The Predictive Model of dengue foci applied to Geographic Information Systems (González, 2014) proposes a model that predicts possible foci of dengue risk based on the analysis of the geographical

spread of the dengue vector. The Geographic Information System in real time based on USSD messages for the Identification of Epidemiological Risk Zones (Ochoa, Talavera, and Paciello, 2015) proposed a model to streamline the process of collecting geographic data of suspected or confirmed cases of the disease, using mobile technology (Unstructured Supplementary Service Data). These works differ from this research in that they have neither a standard reporting model for dengue cases nor a prediction model of dengue outbreaks based on historical morbidity data.

D. Online tools

There are several online tools that display information about the historical advance of dengue, mostly interactive maps that show the geographic expansion of an indicator such as risk, incidence, presence or number of dengue cases. For example, the risk map of the “National Education Program Map”4 page of Argentina shows the environmental risk of dengue in the Argentine territory, for the current year, with the option of adding more layers like “Possible dengue transmission days” and “Educational establishments”. Another type of map is presented in DengueMaps of HealthMap5; this map is based on online news publications about dengue to indicate the degree of presence of the disease in each country in the world. It is worth emphasizing that none of the tools found has the functionality to visualize the spread of the disease over time, for example by epidemiological week, nor publishes the data set used. This does not allow other institutions or researchers to reproduce the results or use the data to develop new tools.

III. PROPOSED SOLUTION

This section describes the main components of the proposed solution. The data analysis related to dengue and the proposed standard data model are described first. Then, the algorithms selected to train the classification model are described and, finally, a prototype web application, developed as a proof of concept, is presented.

A. Standard data model

The data model proposed in this section is based on the analysis of supply and demand for data related to dengue.

1) Data demand: Some research projects and applications were selected to analyze the need for data in various dimensions. The variables were divided into two sets: i) epidemiological variables and ii) co-variables. The set of epidemiological variables comprises those describing aspects that allow identifying and understanding the epidemiological phenomenon, namely: i) the affected population, ii) the place, iii) the time in which the phenomenon develops (Dicker et al., 2006) and iv) the characteristics of the disease or cases. The resulting dimensions of these aspects are:
• the affected population (Demographic dimension)
  ◦ Age: the age group of those affected. Different studies and systems use different groupings;
  ◦ Sex.
• the place (Geographical dimension)
  ◦ Region;
  ◦ Country;
  ◦ Administrative Division6 Level 1;
  ◦ Administrative Division Level 2;
  ◦ Administrative Division Level 3;
  ◦ Altitude.
• the time in which the phenomenon develops (Temporal dimension)
  ◦ Year;
  ◦ Month;
  ◦ Epidemiological week.
• the characteristics of the disease or cases
  ◦ Serotype: defines the circulating virus serology (e.g., DEN-1, DEN-2, DEN-3, DEN-4);
  ◦ Clinical classification: Dengue Fever (DF), Dengue Hemorrhagic Fever (DHF) and Dengue Shock Syndrome (DSS).

The set of co-variables includes the variables used for the study and analysis of the epidemic that directly or indirectly affect the increase or decrease in dengue cases, for example: population density, relative poverty, access to running water, health services, vegetation index, larval infestation sites, etc.

2) Available data: Data availability was analyzed from 3 different perspectives:
• data published online by countries, the WHO and PAHO;
• data collected by national health surveillance systems using mandatory notification forms to comply with the International Health Regulations for notifiable diseases. This data is available at each monitoring organization; it is not normally published, but could be published given that it already exists in each country;
• data reported by countries to PAHO.

The study of the available data, using these three perspectives, covered the 30 countries that are part of the Region of the Americas according to the WHO (San Martín and Brathwaite Dick, 2006), integrating countries in the Americas and the Caribbean. The research is based on information found on the official websites of the Ministries and Institutions of Health Surveillance in each country. For each of the 30 countries an inventory was made including7:
• the website of the Ministry of Health or equivalent institution,
• the website of the direction, department or Health Surveillance unit,
• the website where dengue epidemiological data is published,
• the format of the published data,
• the mandatory notification form that workers of the health institutes have to fill in for patients diagnosed with dengue.

It was found that not all Ministries and Surveillance Organizations have official websites and most of the published data are in weekly epidemiological newsletters in PDF format, the United States being the only country that uses automatically machine-processable formats: CSV8, JSON9, XLS10 and

6 Generic term for territorial divisions that countries have at any level. For example, in Paraguay level 1 corresponds to “Departments”, level 2 to “Districts” and level 3 to “Localities”.
7 See Table 3 at http://bit.ly/DengueTables
8 https://tools.ietf.org/html/rfc4180
9 http://www.json.org/
10 http://www.alegsa.com.ar/Dic/xls.php

4 http://www.mapaeducativo.edu.ar/mapserver/aen/socioterritorial/dengue riesgo/index.php
5 http://www.healthmap.org/dengue/en/


XML11. The set of variables taken into account in the analysis of the data published online12 is grouped in dimensions as follows:
• Epidemiological dimension: number of cases; total amount; confirmed, discarded, suspected and under-study cases; number of deaths by dengue; serotype and clinical classification,
• Temporal dimension: year, month, epidemiological week and day,
• Demographic dimension: sex and age group,
• Geographical dimension: type of area, region, country, administrative levels 0, 1 and 2, coordinates,
• Other: source type, occurrence identifier.

Analyzing this information, we can see that there is great variability in how countries and even international organizations report morbidity information about dengue. In most cases only information with a high level of aggregation is published, for example: number of cases and deaths by country, number of cases by year and, in very few cases, by epidemiological week. Furthermore, even when information about the number of cases by epidemiological week is found, it is not disaggregated by clinical classification, which prevents analysing the information dynamically along several dimensions.

We analyzed the data collected by national health surveillance systems13 using notification forms to determine which variables are published and which are collected but not published. Among the collected variables are: variables of epidemiological significance such as clinical classification or serotype; geographical variables such as locality or district; temporal variables such as year, month or epidemiological week; clinical variables like symptoms, conducted studies, laboratory tests or initiated treatments; variables identifying the declaring establishment; and variables that identify the patient. Comparing the quantity, quality and granularity of the data collected with the published data, significant differences can be observed. In Paraguay, for example, only 12 of the 34 variables collected are published, and in Argentina 16 of the 42 variables collected are published.

Finally, the data reported to PAHO by member countries was analyzed. The data is reported weekly in XLS spreadsheets that have three tables14. The first contains variables related to the number of cases reported at administrative level one, that is, state, department or province. It includes the number of cases of dengue and severe dengue15 occurring in the week, the number of deaths, as well as the number of accumulated cases, the cumulative incidence, the number of accumulated cases confirmed by laboratory, the ratio of severe dengue, the accumulated deaths by dengue, lethality, circulating serotypes and the population at risk. The second table presents a summary of cases by sex and age group. It includes the name of the state, department or province, and the distribution, as male or female, of the number of dengue and severe dengue cases by age group. The third table contains a summary of the

above tables, indicating the week, a description of the outbreak, the number of cases reported and the serotypes identified. Since the data about different dimensions are reported independently in different tables, a dynamic analysis using several dimensions at the same time is not possible with this reporting model.

3) Proposed data model: The data model proposed in Table I is an effort to include all the dimensions and variables that would be needed in order to enable the analysis of morbidity data using multiple dimensions (time, geographical, demographic, disease characteristics) at the same time. When defining the data model, we considered not only variables already published, but also variables that are collected but not yet published. This would allow most of the countries in the Americas Region to adopt the data model, considering that they potentially already have the data. With this analysis, we intend to close the existing gap between the demand for data and the minimal effort required from potential data publishers.

Based on the variables discussed in section III-A1, two sets of variables were obtained: epidemiological variables and related variables or co-variables. In the set of epidemiological variables four important dimensions were identified: temporal, geographical, demographic and case characteristics. The temporal dimension contains year, month, day and epidemiological week. The geographic dimension includes several scales, from region, country and administrative division level 1, through administrative division level 2, up to administrative division level 3. The use of administrative divisions reflects the fact that not all countries use the same administrative division scheme; for example, in Paraguay the first administrative divisions are named departments while in Argentina they are named provinces. The demographic dimension includes the variables sex and age. For the age variable, we used the age groups as they are used in the reports sent to PAHO. The case characteristics group includes origin, final status, clinical classification and serotype. The variable “origin” is of utmost importance to non-endemic countries since it specifies whether the case originated in the country (native) or outside of it (imported). The variable “final status” indicates whether the case was suspected, confirmed, discarded or fatal (death). The variable “serotype” takes its value from the list of possible circulating dengue serotypes: DEN-1, DEN-2, DEN-3, DEN-4 and DEN-5. The variable “clinical classification” can take values from the two existing WHO classifications: 1997 and 2009. According to the 1997 classification the possible values are: DF (Dengue Fever), DHF (Dengue Hemorrhagic Fever) and DSS (Dengue Shock Syndrome). According to the 2009 classification the possible values are: group a) dengue without warning signs, group b) dengue with warning signs, and group c) severe dengue. A comprehensive comparison of both classifications and their compatibility can be found in (Tsai et al., 2013; Barniol et al., 2011). Once the measurable variables to be published are defined, the quantification itself is the number of cases specified by the variable “quantity”. Finally, the variable “source” is intended to identify the institution from which the report or data comes and must contain the name of the reporting source or a URL to the document, report or data set from which the original data was obtained.

Each row of the report is an aggregation of cases that meet the values of the other variables.

11 https://www.w3.org/TR/xmlschema-2/
12 For the complete list of published variables please refer to Table 1 available at http://bit.ly/DengueTables
13 See Table 2, containing 285 variables, at http://bit.ly/DengueTables
14 We obtained the format of the table, with no data, from DGVS for analysis.
15 According to the WHO: a threatening complication that causes plasma extravasation, fluid accumulation, difficulty breathing, severe bleeding or organ failure.



| Group | Name | Label | Description | Data Type | Restrictions |
|---|---|---|---|---|---|
| Temporal | year | Year | Year | xsd:gYear | |
| Temporal | month | Month | Month of the year | xsd:gMonth | Values between 1 and 12 |
| Temporal | day | Day | Day of the year | xsd:gDay | Values between 1 and 31 |
| Temporal | week | Epidemiological week | Standardized variable used by surveillance systems | xsd:decimal | Values between 1 and 53; starts Sunday and ends Saturday |
| Geographic | region | Region | Continent or continent area | AdministrativeArea | |
| Geographic | country | Country | Country | AdministrativeArea | |
| Geographic | adm1 | Administrative division level 1 | E.g., in Paraguay it corresponds to department, in Argentina to province | AdministrativeArea | |
| Geographic | adm2 | Administrative division level 2 | E.g., in Paraguay it corresponds to district, in Argentina to department | AdministrativeArea | |
| Geographic | adm3 | Administrative division level 3 | E.g., in Paraguay it corresponds to neighborhood, in Argentina to township | AdministrativeArea | |
| Demographic | age | Age group | Age group of those affected | xsd:string | <5, 5-9, 10-19, 20-59, >=60 |
| Demographic | sex | Sex | Group of people with the same organic condition | xsd:string | Female, Male |
| Case characteristics | origin | Origin | Whether the disease was contracted within the reporting national territory or outside of it | xsd:string | Imported, Native |
| Case characteristics | status | Final status | Case determination | xsd:string | Confirmed, Suspected, Discarded, Dead |
| Case characteristics | classification | Clinical classification | Clinical classification of the manifestations of the virus according to WHO | xsd:string | {DF, DHF, DSS} WHO'97 or {A, B, C} WHO'09 |
| Case characteristics | serotype | Serotype | Subpopulation of the microorganism; known so far are 1, 2, 3, 4 and 5 | xsd:string | DEN-1, DEN-2, DEN-3, DEN-4, DEN-5 |
| | quantity | Quantity or amount of cases | Sum of the number of cases grouped by the other variables | xsd:decimal | |
| | source | Reporting source | Institution from which the data come | xsd:string | |

Table I: Standard report model for dengue cases.
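The restrictions column of Table I lends itself to mechanical validation before publication. As a minimal illustration (a hypothetical helper, not part of the proposed standard), a single report row could be checked as follows:

```python
ALLOWED = {
    "sex": {"Female", "Male"},
    "age": {"<5", "5-9", "10-19", "20-59", ">=60"},
    "origin": {"Imported", "Native"},
    "status": {"Confirmed", "Suspected", "Discarded", "Dead"},
    "classification": {"DF", "DHF", "DSS", "A", "B", "C"},
    "serotype": {"DEN-1", "DEN-2", "DEN-3", "DEN-4", "DEN-5"},
}

def validate_row(row: dict) -> list:
    """Return a list of problems found in a single report row (empty list = valid)."""
    problems = []
    if not 1 <= int(row.get("week", 0)) <= 53:
        problems.append("week must be between 1 and 53")
    if "month" in row and not 1 <= int(row["month"]) <= 12:
        problems.append("month must be between 1 and 12")
    for field, allowed in ALLOWED.items():
        value = row.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    if int(row.get("quantity", 0)) < 0:
        problems.append("quantity must be non-negative")
    return problems
```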

For example, a row of the report would be the number of cases with the status “confirmed”, in the age group “<5”, sex “female”, origin “autochthonous”, with serotype “DEN-1”, in the country “Paraguay”, in the department “Central”, in the capital “Asunción”, in the neighborhood “Las Mercedes”, in the year 2015, month 1, day 1, week 1.

The data type of each variable is defined using the primitive data types of the XML Schema standard created by the W3C. A new class was defined for the variables of the geographical dimension: AdministrativeArea. This class contains three attributes:
• name: name of the geographical location. This attribute is required.
• id: geographic location identifier according to the coding system defined in the attribute ref. This attribute is of type xsd:string. This attribute can also take URI values that can be de-referenced to enrich the report information with other variables of an administrative division, such as its geographical coordinates, shape or population, among others. This attribute is not required.
• ref: URL or reference to the standard or coding system used in id. This attribute is not required.

The use of international standards to reference geographical locations enables the integration and interoperability of the reported data. For the variables “country” and “adm1” the use of the ISO-3166 standard is proposed. For example, the country variable for the “United States” based on ISO-3166 could be defined as follows:

country.id: US
country.name: United States
country.ref: http://data.okfn.org/data/core/country-list

The values of administrative divisions 2 and 3 (adm2 and adm3) are more dynamic than country and administrative level 1, and they are not part of ISO-3166. To refer to these administrative levels, local standards maintained by each country or administrative level 1 (province, state, etc.) could be used.

Related variables or co-variables are not included in the proposed data model since they can be derived from the reported data. For example, access to running water can be extracted from the annual statistics, given that the year and the region are known (at some scale). Similarly, climatological variables such as temperature, precipitation and humidity can be extracted from other services available on the internet. The more accurate the reported data related to the time and geographical dimensions, the more accurate the dependent co-variables will be. By including several dimensions in the proposed data model, off-the-shelf tools such as Business Intelligence tools or other web based tools can be used to quickly analyse the data. This can help streamline the analysis and understanding of dengue disease dynamics.

4) Personal information protection: The data collected by the Health Surveillance Systems contain essential information that can be used by researchers, health policymakers, experts or early warning systems. However, before selecting the data to be published, privacy issues must be considered. Two legal frameworks that serve as data privacy examples are: i) the Health Insurance Portability and Accountability Act (HIPAA)16 from the United States and ii) the Directive for Individual Data Protection17 from the EU. In both cases, the publication or transmission of any information related to an identified or identifiable natural person is forbidden. While HIPAA applies specifically to information relating to the state of health of a person, the EU Directive has a wider application that, among other things, includes data related to people's health. HIPAA allows the publication of health reports for the purpose of disease monitoring.

16 http://www.hhs.gov/ocr/privacy/index.html
17 http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX:31995L0046

2016 Open Data Research Symposium, 5 October 2016, Madrid, Spain as long as de-identification (or anonymization or privacypreserving) methods are applied before any release of data. The process of de-identification defined by HIPAA aims to prevent individual identification of persons on published data18 . HIPAA defines two methods: i) formal determination of data privacy by a qualified expert, and ii) removal of personal identifiers or other knowledge that would allow published data, either alone or in conjunction with other data, to identify people. The proposed model in section III-A3 considers the following methods of anonymization, in the following order: 1) Deleting variables that identify people, for example: names, birth dates, number identifiers such as social security number or passport number or identity, among others. 2) Generalization of variables: i) using predefined age groups, ii) generalizing the dates of reported cases in epidemiological weeks. 3) Data aggregation: cases are grouped by the remaining variables and a row of information is published for each group, with the sum of the number of cases per group. The application of these anonymization’s techniques will allow publishing data on the standard data model defined considering the protection of personal information. The proposed model thus prevents the publication of individual cases preserving the privacy and protecting personal information. 5) Data serialization: Reusability and interoperability of data depends on the format in which they are published. Open formats “tend to promote a wide range of uses” (Sunlight Foundation, 2016). These data are easily readable, searched and managed by machines and when distributed properly can maximize access, use and quality of published information (Sunlight Foundation, 2016). Some open data formats are JSON, CSV, XML (structured), HTML (semi-structured). The five-star standard promoted by Tim Berners Lee (Berners-Lee, 2016) defines a classification level to determine the quality of the format in which data are published. The proposed model aims to publish data at least three stars using two structured data formats: CSV and JSON. Listing 1 shows the structure of the CSV file with a row of sample data. CSV columns correspond to the variables proposed in the previous section. Note that Listing 1 includes only the columns “*.name” for the types AdministrativeArea, while the other attributes, id and ref, are optional.

5) Data serialization: The reusability and interoperability of data depend on the format in which they are published. Open formats “tend to promote a wide range of uses” (Sunlight Foundation, 2016). Such data are easily read, searched and managed by machines and, when distributed properly, can maximize the access, use and quality of the published information (Sunlight Foundation, 2016). Some open data formats are JSON, CSV, XML (structured) and HTML (semi-structured). The five-star scheme promoted by Tim Berners-Lee (Berners-Lee, 2016) defines a classification level to determine the quality of the format in which data are published. The proposed model aims to publish data at the three-star level at least, using two structured data formats: CSV and JSON.

Listing 1 shows the structure of the CSV file with a row of sample data. The CSV columns correspond to the variables proposed in the previous section. Note that Listing 1 includes only the “*.name” columns for the AdministrativeArea types, while the other attributes, id and ref, are optional.

Listing 1: CSV format example.
year,week,...,country,adm1.name,...,quantity
2015,1,...,"PARAGUAY","CENTRAL",...,25

Listing 2 shows the structure of the JSON file. As for CSV, the JSON attributes correspond to the variables proposed in the previous section.

Listing 2: JSON format example.
[ {
  "year": "2015",
  "week": "1",
  "...": "...",
  "country": "PARAGUAY",
  "adm1.name": "CENTRAL",
  "...": "...",
  "quantity": "25"
} ]

The examples shown in Listings 1 and 2 represent the minimum effort required to publish geographic data. However, this minimal effort implies that the data may have semantic ambiguities. For example, the name of one place can be written in different languages (chile (es) vs cile (it)) or even different alphabets. Another problem is that many places have the same name; for example, there are several cities named Roma. These semantic ambiguities can be resolved with the use of the other attributes of the AdministrativeArea class. Listings 3 and 4 show examples of how the data of the variable “country” could be published based on ISO 3166. In CSV format, each variable of type AdministrativeArea is divided into 3 sub-variables: name, id and ref.

Listing 3: CSV format example using AdministrativeArea.
year,week,...,country.name,country.id,country.ref,...,quantity
2015,1,...,"PARAGUAY","PY","http://example.com/countries",...,25

In JSON format, each variable of type AdministrativeArea is composed of a sub-JSON with 3 attributes (name, id and ref), as shown in Listing 4.

Listing 4: JSON format example using AdministrativeArea.
[ {
  "year": "2015",
  "week": "1",
  "...": "...",
  "country.name": "PARAGUAY",
  "country.id": "PY",
  "country.ref": "http://data.okfn.org/data/core/country-list",
  "...": "...",
  "quantity": "25"
} ]

B. Classification model

The knowledge discovery process is an iterative process of five phases: data cleaning and integration, data selection and processing, data mining, evaluation of the patterns in the data and presentation of knowledge. Data mining consists of extracting patterns from data. In this phase, a classification model is trained with a classification algorithm. In this work we used a decision tree algorithm, including two sub-phases: training and validation (Han, Pei, and Kamber, 2011). In the training phase, the classification algorithm is responsible for finding correlations between the data and generating from them a classification model based on a target variable and a set of input (sorting) variables. Specifically, the decision tree algorithm is executed on a certain percentage of the whole dataset; this partition is called the training set, which normally accounts for 60% of the whole dataset D. In the validation phase, the generated classification model is tested. This phase is executed on the remaining data, that is, 40% of the dataset D, which is called the validation set. Finally, for each tuple, the value obtained by the classification model is compared against the true value so as to determine whether the classification is correct or not.
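The training and validation phases of this work are implemented as KNIME workflows (next subsection). Purely as an illustration of the same 60/40 procedure, a rough scikit-learn sketch could look as follows; the file name, column names and encoding step are assumptions, not the actual pipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical flat file with one record per (district, epidemiological week),
# following the attributes of Table II; categorical columns are one-hot encoded.
data = pd.read_csv("dengue_training_set.csv")
target = data.pop("outbreak_t+1")                 # YES/NO outbreak in the following week
features = pd.get_dummies(data)                   # encodes district codes, outbreak_t, etc.

# 60% training set, 40% validation set, as in the KNIME partitioning component.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, train_size=0.6, random_state=0)

model = DecisionTreeClassifier(criterion="gini")  # Gini index as attribute selection method
model.fit(X_train, y_train)

predicted = model.predict(X_test)
print(confusion_matrix(y_test, predicted))        # hits and errors of the classifier
print(classification_report(y_test, predicted))   # precision, recall, f-measure per class
```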


1) Data mining tools: For the development of the training and validation phases, the open source tool KNIME19 was selected. KNIME provides, among others, implementations of the decision tree and decision tree ensemble algorithms. Each mining process (data selection, training and validation) is defined as a workflow in KNIME. In what follows, we describe each algorithm, its components and the configuration parameters used.

Decision tree: The decision tree is a classification model that predicts the value of a target variable according to various input variables. A decision tree has the structure of a flow chart, where each internal node denotes a test on a variable, each branch represents a test result and each leaf node (or terminal node) has a class label (Han, Pei, and Kamber, 2011). The decision tree algorithm has a greedy approach in which trees are built recursively from top to bottom applying a “divide and conquer” strategy. The data set is partitioned into smaller sets as the tree is constructed. Figure 1 presents the workflow designed for the execution of the training and validation phases. The components and configuration parameters are described below.

Figure 1: Decision tree workflow (KNIME).

• Database reader: this component (figure 1-1) is responsible for obtaining data from a specific data source. In our case, the data source contains the data we obtained from the DGVS and published using the proposed data model, together with other co-variables such as humidity, temperature and rainfall.
• Partitioning: this component (figure 1-2) is responsible for creating two data partitions, one to be used as the training set and the other as the validation set. The partitioning component must specify, among its configuration parameters, the sampling method (or selection method) by which the tuples that make up each partition are selected. The available sampling methods are20:
  ◦ Take from top sampling
  ◦ Linear sampling
  ◦ Random sampling
  ◦ Stratified sampling
• Decision Tree Learner: this component (figure 1-3) is responsible for executing the algorithm (decision tree) on the training set and for creating a classification model. The input is the training set and the output is the classification model generated by the algorithm. The learning component must specify, among its configuration parameters, the input variables, the target variable and the additional parameters described below.
  ◦ Attribute selection method: the method by which the best attribute for the partition is selected. The available selection methods are: Gain ratio and Gini index.
  ◦ Reduced Error Pruning: a simple pruning method executed in the post-processing stage. It traverses the tree beginning at the leaves, and replaces each node by its most popular class if the prediction accuracy does not decrease.
• Decision Tree Predictor: this component (figure 1-4) is responsible for applying the classification model to the validation set and obtaining a classification of the target variable for each tuple of the set. The input is the validation set and the output is the same set with an additional column indicating the predicted value.
• Scorer: this component (figure 1-5) is responsible for validating the classification obtained by the Decision Tree Predictor, comparing it against the true value of the target variable in the validation set. The input is the validation set with the classification made by the Decision Tree Predictor and the output is a confusion matrix indicating the number of hits and errors of the classifier.

Decision tree ensemble: The Decision Tree Ensemble is a classification method that creates a set of decision trees and selects the result of the classification by a simple majority vote. Each decision tree is created from a different set of records and/or a different set of attributes. This model also provides statistics on the attributes used in the created trees. These statistics help to understand which attributes are critical for optimal solutions (Han, Pei, and Kamber, 2011). Figure 2 presents the workflow of the training and validation phases of the decision tree ensemble. The Database reader, Partitioning and Scorer components are the same and serve the same purpose as in the decision tree workflow. However, the components corresponding to the algorithm have different characteristics; these components are described below.

Figure 2: Decision tree ensemble workflow (KNIME).

• Tree Ensemble Learner: this component (figure 2-3) is responsible for executing the algorithm (decision tree ensemble) on the training set and creating a classification model. The input is the training set and the output is the classification model generated by the algorithm. The learning component must specify the input (sorting) variables, the target variable, and the additional parameters described below:
  ◦ Attribute selection method: the method by which the best attribute for the partition is selected. The available selection methods are: Gain ratio and Gini index.
  ◦ Number of models: the number of decision trees to generate.
  ◦ Attribute Sampling: the sampling of the data rows and attributes for each individual tree.
  ◦ Depth limit: number indicating the depth limit to which the decision trees can grow.
• Tree Ensemble Predictor: this component (figure 2-4) is responsible for applying the classification model to the validation set and obtaining a classification of the target variable for each tuple in the set. The input is the validation set and the output is the same set with an additional column indicating the predicted value.

19 https://www.knime.org/
20 https://www.knime.org/files/nodedetails/ manipulation row row transform Partitioning.html

C. Application Prototype

The availability of standardized data enables the development of tools that improve the management of potential epidemiological risks. An application that relies on the use of anonymized data was developed as a proof of concept. The anonymized data was obtained experimentally from the DGVS, through a research agreement between the DGVS and the Facultad Politécnica de la Universidad Nacional de Asunción (FPUNA).

Figure 3: Application Prototype.

The open source prototype, called “DengueMaps”, is composed of a server21 and a client22 module with the following features:
• List of notifications per year, epidemiological week, date of notification, department, district, sex, age and result.
• Data dictionary, which includes: attributes, description, data type and value domain.
• Download of data in CSV and JSON formats per year, according to the available variables described in the previous item.
• Load of historical data based on the proposed data model defined in section III-A.
• Load of data for outbreak alert generation one week in advance.
• Interactive maps that present the data in different ways.

A detailed description of each of the features is given below.

21 https://gitlab.com/opendata-fpuna/denguemaps-server
22 https://gitlab.com/opendata-fpuna/denguemaps

Figure 4: District-level details of a department.

1) Interactive maps: Interactive maps allow the user to visualize the temporal and geographical behavior of the disease, thus acting as an analysis tool with the help of extra information such as the quality of health services or the socioeconomic level of the geographical areas. The developed interactive maps include:
• Risk map by epidemiological week at the level of departments, districts and Asunción's neighborhoods (see figure 3).
• Heatmap based on the number of notifications of dengue cases by epidemiological week at the departmental level, with filtering by sex and classification.
• Incidence map by epidemiological week at the level of departments, districts and Asunción's neighborhoods.
• Alert map at the level of departments, districts and Asunción's neighborhoods.

All maps provide the ability to change or add transparency to the background layer, which allows displaying features such as rivers, roads, buildings, etc. For the risk, incidence and alert maps, the user can see the details at the level of districts or Asunción's neighborhoods by clicking on a department (figure 4). Using maps to present the information allows the user to see the geographical spread of the disease per epidemiological week. By doing this, the user can intuitively identify the place or region where outbreaks usually begin and how they spread to adjacent regions or travel to other regions as time progresses. This feature is very useful for focusing prevention efforts on areas that are known to have a higher risk of dengue outbreaks each year. The variety and specificity of the filters applicable to all maps are directly related to which epidemiological variables are reported following the proposed data model. Furthermore, with the dynamic exploration of the various types of maps, we intend to streamline the current manual process at the DGVS for publishing weekly risk maps related to dengue. The tool is ready to consume the data in the standard format and to automatically generate the maps based on this data.

2) Data publication: The user can download the data based on the proposed data model in CSV and JSON format. Furthermore, the tool contains a functionality for visualizing the raw data in tabular form, allowing users to filter this


data using any of the available variables, e.g., epidemiological week, year, administrative area. The complete description of the variables of the data set is available under the “Diccionario” menu option, which contains the description of the standard data model. Finally, the upper left corner of the page indicates that all provided data is licensed under the Creative Commons CC-BY-SA (Attribution-ShareAlike) license23. This license allows data reuse provided it is attributed to the original source.

3) Predictions: The application provides two types of predictions: individual prediction and multiple predictions. These predictions use the previously trained and validated model. The classification model requires a minimum set of variables to perform the prediction, which are: department, district, year, week, incidence of the current week, number of cases of the current week, number of cases of the previous week, number of cases two weeks before, and the outbreak status of the current week. The individual prediction displays a form in which the previously mentioned data must be loaded; then the prediction is presented as a YES or NO result on the screen. Multiple prediction, in turn, can make predictions for more than one department, district and neighborhood in a given week and year; in order to load the required data, a CSV file can be uploaded. The structure of this CSV file can be downloaded from the application. The uploaded data set can be selected on the alerts map to display the resulting prediction.

4) Load case reports: In order to update the data, the “Datos” option is available. This allows the user to upload a CSV file with new cases reported or collected by the surveillance systems. The CSV file must have the structure of the standard reporting model proposed by this work.

Utility and possible extensions of the prototype: As mentioned previously, different prevention and reaction tasks can be carried out when such tools are available. The tasks can be done in specific areas to use the limited resources more efficiently (especially in developing countries). Tasks may include fumigation work, or cleaning vacant lots or areas where there is an accumulation of garbage. Although the initial maps correspond to Paraguay, the tool could easily be extended to consume data from other countries and show the same types of maps. Several other extensions of the tool are possible; for example, new types of maps and graphs to analyze the data could be implemented, among others (Criscioni, 2011):
• viral circulation map by serotype,
• incidence rate map (per 10,000 or 100,000 inhabitants),
• map of endemic corridors per serotype,
• map of existence of dengue cases per range of time, usually the last three weeks,
• charts of the number of cases by geographic region and different time scales, serotype and type of laboratory result (suspected vs. confirmed),
• percentage of change in the number of cases reported by epidemiological week,
• graphics of hospitalized and death cases per year, epidemiological week, geographic region, age range and clinical classification.

IV. PREPROCESSING, PROCESSING AND DATA INTEGRATION

In order to build the training and validation data sets, historical data sets of dengue cases, climate variables, population and Paraguay's cartography were integrated in the tool.

A. Base data sets

This section describes the datasets used for the experiments and the training of the classification model.

1) Historical data of dengue cases in Paraguay: The historical records of dengue cases were obtained from records collected by the DGVS. The initial dataset was obtained in Excel format for the years 2009 to 2015 (until week 43). We first performed a cleaning and unification of the values corresponding to administrative levels one (department), two (district) and three (neighborhood), sex and age groups. After this, we converted the data using the proposed data model.

2) Cartography: The cartographic data24, including shapefiles with the administrative divisions of the country (departments, districts and localities), was obtained from official maps managed by the Statistics, Surveys and Censuses Department.

3) Population: The demographic data25 was obtained from the data collected in the 2002 Census by the Statistics, Surveys and Censuses Department. The data represents the projected population by department, district and neighborhood from 2002 until 2020.

4) Climate: The dataset of climate variables was obtained from daily measurements by the Meteorology and Hydrology Department26. The dataset includes measurements of temperature (°C), humidity (%) and precipitation (mm) for each climate measuring station available in Paraguay. The data were interpolated to obtain measurements for all points of the geographic regions of Paraguay, and then zoning techniques were applied to obtain measurements for each zone (departments, districts and localities).

5) Quartiles: Based on the literature, two methodologies are used to determine the occurrence or non-occurrence of outbreaks: the epidemic curve and the endemic corridor (also called endemic channel) (de la Salud, 2002). The second methodology is adopted in this paper based on consultations with experts in the area and the DGVS. An endemic corridor is a representation of the frequency of a disease. Specifically, it describes the frequency distribution of the disease over a period of one year based on the observed behavior of the disease during previous years. In this paper, the values for the boundary curves of the endemic corridor are represented by quartiles, with a couple of variations added to the original methodology: every year of study was included instead of only the previous years, and the third quartile was used as the threshold for the occurrence or non-occurrence of an outbreak. The computation was made based on the confirmed cases in the years 2009, 2010, 2011, 2012, 2014 and 2015 (2013 was excluded for being a highly epidemic year in Paraguay). The purpose of these quartiles is to specify whether a given week had an outbreak of dengue or not. A district has an outbreak if the number of cases in the week exceeds the value of quartile Q3 (the third quartile). Example: for week 1 (one) of the year, in Asunción, 20 cases were recorded. If quartile Q3 for Asunción is 25 cases, then there is no outbreak.
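A minimal sketch of this labelling rule under the stated assumptions (confirmed cases already aggregated per district, year and epidemiological week; 2013 excluded from the quartile computation; column names are illustrative):

```python
import pandas as pd

def label_outbreaks(cases: pd.DataFrame) -> pd.DataFrame:
    """cases: one row per (district, year, week) with the number of confirmed cases."""
    # Endemic-corridor style threshold: third quartile (Q3) of the confirmed cases
    # observed for each (district, week) across the reference years, excluding 2013.
    reference = cases[cases["year"] != 2013]
    q3 = (reference.groupby(["district", "week"])["confirmed_cases"]
                   .quantile(0.75)
                   .rename("q3")
                   .reset_index())

    labelled = cases.merge(q3, on=["district", "week"], how="left")
    # Outbreak in a given week if the number of cases exceeds Q3 for that district/week.
    labelled["outbreak"] = (labelled["confirmed_cases"] > labelled["q3"]).map(
        {True: "YES", False: "NO"})
    return labelled
```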

23 https://creativecommons.org/licenses/by-sa/3.0/
24 http://geo.stp.gov.py/user/dgeec/datasets
25 http://www.dgeec.gov.py/
26 http://www.meteorologia.gov.py/


B. Training data set

From the base data sets, a set of training and validation data was generated. The most representative variables of each dataset were chosen and integrated considering the time and geographical dimensions of each set. Table II shows the variables of the final dataset; for each variable, the name, description and data type are indicated. In order to have a more complete representation of a case in a record, each variable has values for the current week (t), the previous week (t-1), two weeks before (t-2) and, for the climate variables, the prediction for the following week (t+1). With (t) being the current week, the value to be predicted is an outbreak in the following week, i.e., (outbreak t+1). Except for the outbreak attribute, all other attributes were integrated into the final data set as described in the previous section. The outbreak attribute was computed using the quartiles methodology described in the previous section. For the whole historical dataset, for each epidemiological week (t), department, city and neighborhood, we computed whether there was an outbreak (YES) or no outbreak (NO). The task of the classification model is to predict, given a particular record for t with its associated data, whether in the following week t+1 there will be an outbreak or not.
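As an illustration of how such records could be assembled (a sketch under assumed column names, not the actual integration pipeline), the lagged and lead values can be built with grouped shifts over the weekly series:

```python
import pandas as pd

def add_lags(weekly: pd.DataFrame) -> pd.DataFrame:
    """weekly: one row per (district, year, week), already labelled with 'outbreak'."""
    weekly = weekly.sort_values(["district", "year", "week"]).copy()
    by_district = weekly.groupby("district")

    # Values for the previous week (t-1) and two weeks before (t-2).
    for col in ["quantity", "incidence", "temperature", "humidity", "precipitation"]:
        weekly[f"{col}_t-1"] = by_district[col].shift(1)
        weekly[f"{col}_t-2"] = by_district[col].shift(2)

    # Climate values for the following week (t+1); in production these would come
    # from a meteorological forecast rather than from the observed series.
    for col in ["temperature", "humidity", "precipitation"]:
        weekly[f"{col}_t+1"] = by_district[col].shift(-1)

    # Target variable: outbreak (YES/NO) in the following week.
    weekly["outbreak_t+1"] = by_district["outbreak"].shift(-1)
    return weekly.dropna(subset=["outbreak_t+1"])
```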

Using this information, the classification performance is evaluated in terms of precision, recall, f-measure and accuracy. Precision is a measure of the relevance of the outcome, whether positive (YES) or negative (NO), while the recall is a measure of completeness, i.e., how many of the expected results were found. A system with high precision and low recall have the opposite behavior, returns very few positive results, but most of their classified labels are correct. A system with high recall and low precision returns many positive results, but most of their classified labels are incorrect. An ideal system with high precision and recall will return many positive results, all labeled (classified) correctly. P recision =

Recall =

TP TP + FP

TP TP + FN

(1)

(2)

The accuracy rate indicates the class correctly classified in general, that is, including positive and negative classes.

V. E VALUATION M ETHODOLOGY This section describes the metrics used to evaluate the results and the methodology used for testing.

Accuracy =

TP + TN TP + TN + FP + FN

(3)

The f-measure is the weighted average of precision and recall. An f-measure reaches its best value 1 and worst value 0 and is used to measure the overall performance of the different categories. Given a training set with two categories (YES, NO) as possible results, the general formula of f-measure is defined as:

A. Performance evaluation The classification model was trained over a subset of the whole dataset, which contained the outbreak target variable with (YES, NO) for the following week. Once the model was trained, we evaluated the performance of this model using f-measure to analyse the performance of the classification model. For each record in the evaluation dataset, we compared the result of the predicted value pv as computed by the classification model, with respect to the expected value ev as computed using the quartiles methodology over the whole dataset (the ground truth). The results of the classification model with respect to the expected values can be organized as follows (see table III): • TP (True Positive): indicates the number of records correctly classified as positive, i.e., the model predicted an outbreak when in fact there was one (pv = Y ES, ev = Y ES). • TN (True Negative): indicates the number of classes correctly classified as negative, i.e., the model predicted no outbreak when in fact there was no outbreak (pv = N O, ev = N O). • FP (False Positive): indicates the number of classes incorrectly classified as positive, i.e., the model predicted an outbreak when in fact there was no outbreak (pv = Y ES, ev = N O).. • FN (False Negative): indicates the number of classes incorrectly classified as negative, i.e., the model predicted no outbreak when in fact there was an outbreak (pv = N O, ev = Y ES).

F -measure = 2 ×

P recision × Recall P recision + Recall

(4)
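For reference, the four metrics in equations (1)-(4) can be computed directly from the confusion counts; a small sketch (function and variable names are ours, not from the paper):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, accuracy and f-measure as defined in equations (1)-(4)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f-measure": f_measure}

# Example with hypothetical counts from a validation run
print(classification_metrics(tp=120, fp=10, fn=15, tn=800))
```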

B. Configuration parameters and classification algorithms

In order to assess which attributes are more or less important for the construction of the decision tree, more than 30 combinations of attributes were evaluated (see table IV). For each execution of the algorithm the entire dataset was used, which has 8017 records for districts and 5027 records for Asunción's neighborhoods. For the tests with districts, the overall dataset was divided into districts of endemic departments of Paraguay (Capital, Central, Alto Paraná, Cordillera, Paraguarí, Caaguazú, Canindeyú, Amambay and Concepción) and districts of non-endemic departments (Boquerón, Alto Paraguay, Presidente Hayes, San Pedro, Ñeembucú, Misiones, Guairá, Caazapá and Itapúa) (Dirección General de Vigilancia de la Salud de Paraguay, 2016b). The combinations of attributes were generated from combinations of the different dimensions that comprise the dataset: temporal, geographic, climatic, demographic and quantitative. For the temporal dimension we used: current week, last week, the week before last and next week. For the quantitative dimension we used: quantity, acceleration and incidence.

Attribute | Description | Data Type
week | epidemiological week | number
administrative division 1 | department code | string
administrative division 2 | district code | string
administrative division 3 | neighborhood code | string
population | estimated population for the current week: department, district, city and year | number
population t-1 | estimated population for the last week: department, district, city and year | number
population t-2 | estimated population for the week before last: department, district, city and year | number
quantity t | quantity of cases for the current week | number
quantity t-1 | quantity of cases for the last week | number
quantity t-2 | quantity of cases for the week before last | number
incidence t | incidence for the current week (quantity * 10000 / population) | number
incidence t-1 | incidence for the last week (quantity * 10000 / population) | number
incidence t-2 | incidence for the week before last (quantity * 10000 / population) | number
temperature t | average temperature for the current week | number
temperature t-1 | average temperature for the last week | number
temperature t-2 | average temperature for the week before last | number
temperature t+1 | average temperature for the next week (meteorological prediction) | number
humidity t | average humidity for the current week | number
humidity t-1 | average humidity for the last week | number
humidity t-2 | average humidity for the week before last | number
humidity t+1 | average humidity for the next week (meteorological prediction) | number
precipitation t | cumulative rainfall for the current week | number
precipitation t-1 | cumulative rainfall for the last week | number
precipitation t-2 | cumulative rainfall for the week before last | number
precipitation t+1 | cumulative rainfall for the next week (meteorological prediction) | number
acceleration t | acceleration for the current week compared to the last, using incidence | number
acceleration t-1 | acceleration for the last week compared to the week before last, using incidence | number
outbreak t | whether or not there was an outbreak in the current week; if quantity > Q3 then YES, otherwise NO | string
outbreak t+1 | whether or not there will be an outbreak in the next week; if quantity > Q3 then YES, otherwise NO | string

Table II: Attributes of the training and validation dataset.

1) Decision Tree: Different combinations of configuration parameters were evaluated for the decision tree algorithm using KNIME. These parameter variations correspond to the partitioning method used to select the training data (Random Sampling, Linear Sampling, Take from Top Sampling), the method of selection of partition attributes (Gain Ratio and Gini Index) and the use or not of pruning methods on the generated tree. In all cases, 60% of the data set was used for training and the remaining 40% was used for validation. In total, twelve parameter settings for the decision tree were used (see tables V and VI). Considering these parameters, we tried to identify the combinations that would produce the best results in terms of f-measure and accuracy. The same metrics were used for finding the combinations of attributes that produced the best results.

2) Decision Tree Ensemble: In the tests with the decision tree ensemble, all attributes were used in order to allow the algorithm to generate the greatest possible variety of trees at each run. In this phase, Linear Sampling was used as the partitioning method and the Gini Index was used as the method of selection of partition attributes; these two techniques were selected based on the results of the decision tree. The same set of attributes was used for generating each tree node. There were also variations in the number of models, ranging from 100 to 500 models at intervals of 50. The attribute sampling method used was an absolute value of 5, meaning that in each generation at least five of the available attributes were used. The value 5 was chosen based on the results of the decision tree.
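The paper's experiments were run in KNIME; a rough scikit-learn analogue of the decision-tree setup is sketched below, under explicit assumptions: scikit-learn offers 'gini' and 'entropy' (information gain) as split criteria but has no gain-ratio option, and pruning is approximated here with cost-complexity pruning (ccp_alpha) rather than KNIME's post-pruning. The label value "YES" follows the outbreak encoding used in this paper.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

def train_decision_tree(X, y):
    """Approximate analogue of the KNIME decision tree runs (not the original workflow)."""
    # 60% training / 40% validation, as in the paper
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=0.6, random_state=0)

    # criterion="gini" mirrors the Gini Index option; ccp_alpha is an
    # assumed stand-in for KNIME's pruning step
    clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.001, random_state=0)
    clf.fit(X_train, y_train)

    pred = clf.predict(X_val)
    return clf, {"accuracy": accuracy_score(y_val, pred),
                 "f-measure": f1_score(y_val, pred, pos_label="YES")}
```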

VI. EVALUATION OF THE STANDARD REPORT MODEL

The standard report model combines a set of desirable features analyzed from the supply and demand for data related to dengue in countries of the Region of the Americas. The following general criteria for evaluating data models are defined in (Simsion and Witt, 2004): completeness, non-redundancy, data reuse, application of business rules, stability and flexibility, elegance and communication. The standard report model presented here meets all of these criteria. Regarding completeness, the model includes all the minimum variables, based on the supply of data related to dengue. Regarding non-redundancy, the model restricts the addition of duplicate records. Regarding the application of business rules, the model meets all the requirements necessary given its purpose, which is the publication of standardized dengue data on the web. As mentioned in previous sections, the model was designed to be reusable and extensible, so these two criteria are also satisfied. Finally, since all variables in the model are well defined in terms of data type and domain of values, it meets the criteria of communication and elegance.

In many cases, these criteria conflict with each other; that is, fully satisfying one of them may imply not meeting others. An elegant data model can be difficult to communicate; a data model that tries to satisfy too many business rules can become unstable if any of the rules change; a data model that is easy to understand may not be reusable or interoperable. The main objective is to provide a data model with the best balance between satisfied and unsatisfied criteria.

VII. RESULTS AND EVALUATION OF THE CLASSIFICATION MODEL

In this section we present the results obtained with the two different classification methods, decision tree and decision tree ensemble, described in III-B.

A. Decision Tree

Table V shows the parameter combinations with the corresponding f-measure values obtained in the executions for districts of endemic departments, districts of non-endemic departments and Asunción's neighborhoods. Table VI shows the values obtained for the accuracy measure.

No. | Combination | Att. amount
1 | All variables | 28
2 | Without outbreak | 27
3 | Without administrative divisions | 26
4 | Without population | 25
5 | Without administrative divisions and population (without geographical dimension) | 23
6 | Without week | 27
7 | Without week and administrative divisions | 25
8 | Without week, administrative divisions and population | 22
9 | Without climate variables | 16
10 | Without temperature (on the grounds that the temperature could be a distractor) | 24
11 | Without quantities | 25
12 | Without incidence | 25
13 | Without acceleration | 26
14 | Without quantities and population | 22
15 | Without quantities, population and administrative divisions | 20
16 | Without incidence and acceleration | 23
17 | Only climate variables, quantity and outbreak | 16
18 | Only administrative divisions, quantity and outbreak | 6
19 | Only week, quantity and outbreak | 5
20 | Only week, administrative divisions, quantity and outbreak | 7
21 | Only climate variables, incidence and outbreak | 16
22 | Only administrative divisions, incidence and outbreak | 6
23 | Only week, incidence and outbreak | 5
24 | Only week, administrative divisions, incidence and outbreak | 7
25 | Only climate variables, acceleration and outbreak | 15
26 | Only administrative divisions, acceleration and outbreak | 5
27 | Only week, acceleration and outbreak | 4
28 | Only week, administrative divisions, acceleration and outbreak | 6
29 | Only climate variables, incidence, acceleration and outbreak | 18
30 | Only administrative divisions, incidence, acceleration and outbreak | 8
31 | Only week, incidence, acceleration and outbreak | 7
32 | Only week, administrative divisions, incidence, acceleration and outbreak | 9
33 | Only current week data | 12
34 | Without variables of the week before last | 22
35 | Without variables of the next week | 25
36 | Variables of the current and next week | 15
37 | Variables of the current and last week | 19

Table IV: Attribute combinations.

Configuration | Description | Districts of endemic dep. | Districts of non-endemic dep. | Asunción's neighborhoods
Config 1 | Gini index, without pruning, random partitioning | 76.93% | 59.46% | 74.32%
Config 2 | Gini index, without pruning, linear partitioning | 77.26% | 55.30% | 72.28%
Config 3 | Gain ratio, without pruning, take from top partitioning | 76.67% | 61.17% | 74.50%
Config 4 | Gain ratio, without pruning, random partitioning | 78.83% | 56.31% | 73.65%
Config 5 | Gain ratio, without pruning, linear partitioning | 78.86% | 57.00% | 72.51%
Config 6 | Gini index, without pruning, take from top partitioning | 80.07% | 58.47% | 76.54%
Config 7 | Gini index, with pruning, random partitioning | 79.92% | 66.93% | 74.98%
Config 8 | Gini index, with pruning, linear partitioning | 80.50% | 59.14% | 74.02%
Config 9 | Gini index, with pruning, take from top partitioning | 83.76% | 62.61% | 83.02%
Config 10 | Gain ratio, with pruning, random partitioning | 80.34% | 67.47% | 74.76%
Config 11 | Gain ratio, with pruning, linear partitioning | 81.16% | 57.89% | 75.05%
Config 12 | Gain ratio, with pruning, take from top partitioning | 83.29% | 60.84% | 83.02%

Table V: Decision Tree Results: F-Measure.

Figure 5: Comparison of Decision Tree Results: F-Measure.

Figure 6: Comparison of Decision Tree Results: Accuracy.

As shown in figures 5 and 6, the best parameter configurations are configuration 9 (Gini index, with pruning and take from top partitioning) and configuration 12 (Gain ratio, with pruning and take from top partitioning), comparing both f-measure and accuracy. Moreover, with the dataset of endemic departments better results were obtained than with the dataset of non-endemic departments. The main cause of this difference is the quality of the data set, given that endemic departments have a larger and more varied historical record.

Once the best parameter settings were found, the best combination of attributes was sought. To accomplish this, and considering that the dataset of endemic departments produces better results, it was decided to limit the tests to the following departments: Asunción, Central, Alto Paraná and Itapúa. The inclusion of Itapúa responds to the fact that in 2014 the city of Encarnación was also declared endemic, and it has one of the highest population densities in the country after Central and Alto Paraná. Regarding attribute combinations, a comparison in terms of f-measure and accuracy was performed using the best parameter configuration.

Configuration | Description | Districts of endemic dep. | Districts of non-endemic dep. | Asunción's neighborhoods
Config 1 | Gini index, without pruning, random partitioning | 88.50% | 83.79% | 85.67%
Config 2 | Gini index, without pruning, linear partitioning | 88.73% | 86.75% | 85.92%
Config 3 | Gain ratio, without pruning, take from top partitioning | 87.50% | 87.67% | 82.78%
Config 4 | Gain ratio, without pruning, random partitioning | 88.39% | 87.62% | 84.98%
Config 5 | Gain ratio, without pruning, linear partitioning | 89.14% | 86.75% | 86.13%
Config 6 | Gini index, without pruning, take from top partitioning | 88.14% | 86.59% | 83.68%
Config 7 | Gini index, with pruning, random partitioning | 89.45% | 89.71% | 86.93%
Config 8 | Gini index, with pruning, linear partitioning | 90.20% | 87.71% | 87.13%
Config 9 | Gini index, with pruning, take from top partitioning | 90.61% | 87.25% | 86.93%
Config 10 | Gain ratio, with pruning, random partitioning | 89.62% | 88.79% | 87.03%
Config 11 | Gain ratio, with pruning, linear partitioning | 90.20% | 87.86% | 87.32%
Config 12 | Gain ratio, with pruning, take from top partitioning | 90.30% | 84.64% | 87.81%

Table VI: Decision Tree Results: Accuracy.

Figure 7: F-Measure for attribute combinations 1 to 19.

Figure 8: F-Measure for attribute combinations 20 to 37.

Figure 9: Accuracy for attribute combinations 1 to 19.

Figure 10: Accuracy for attribute combinations 20 to 37.

As shown in figures 7, 8, 9 and 10, the distribution of results between both metrics follows the same pattern but takes a different range of values. For the f-measure metric, results vary between 81% and 90%, while for the accuracy metric values range from 91% to 95%. Based on these results, it is observed that:
• With combinations 6, 7, 8, 17, 21 and 29 the lowest results are obtained, 81.56% for f-measure and 90.83% for accuracy. Combination 6 does not include the week attribute, combination 7 does not include administrative divisions and week, and combination 8 does not include administrative divisions, week and population. Combination 17 uses climate variables and number of cases only, while combination 21 uses climate variables and acceleration only. Finally, combination 29 includes only climate variables, incidence and acceleration.
• Combination 9 obtains the highest result, 90.47% for f-measure and 94.78% for accuracy. This combination includes all attributes except those of the climate dimension (temperature, humidity and precipitation of the current week, last week, the week before last and next week).
• Combinations 22 and 30 are the second best combinations, with 90.28% for f-measure and 94.72% for accuracy. Combination 22 uses administrative divisions, incidence and outbreak only, while combination 30 uses the same combination adding the acceleration attribute.

It is important to note that, for a given combination with n attributes, not all of them are used in the decision tree. The attributes of administrative divisions, for example, are in the best combinations but are not used in the decision trees generated for those combinations. Combinations 22 and 30 are equal because they generate the same decision tree, i.e., the acceleration attribute is not used in the tree, but it is present in combination 30.

The worst combinations used attributes of the climate dimension, and the best result explicitly removes these attributes. Analyzing these results, it can be concluded that the climatic attributes of the previous two weeks act as a distractor. This could be attributed to the fact that Paraguay presents optimal climatic conditions for the development and transmission of the disease during most of the year. The best combination of attributes and configuration (Gini index with pruning, take from top partitioning and all attributes except climate) generates the decision tree in figure 11.

Figure 11: Decision Tree: Best attributes combination.

Analyzing the tree in figure 11 shows that the model assumes that if there are no outbreaks in the current week there will be no outbreaks in the following week. Through the branch where the outbreak is "YES", the attributes taken into account are: incidence (t), quantity (t-1) and population (t-2). It can be summarized as follows (see the sketch after this list):
• If the outbreak is "YES" and incidence (t) is greater than 2.477, the following week will have an outbreak.
• If not, if incidence (t) is less than or equal to 2.477 and quantity (t-1) is greater than 48, the following week will have an outbreak.
• If not, if quantity (t-1) is less than or equal to 48, the attribute population (t-2) is observed, and if it exceeds a given threshold, the following week will have an outbreak.
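Read literally, the branch conditions listed above can be written as a small rule function. This is only our reading of the reported tree, not generated code from the model, and the population threshold is not given in the paper, so it is left as a parameter:

```python
def predicts_outbreak_next_week(outbreak_t: str, incidence_t: float,
                                quantity_t1: float, population_t2: float,
                                population_threshold: float) -> bool:
    """Hand-coded reading of the branches of the tree in figure 11 (threshold assumed)."""
    if outbreak_t != "YES":
        # The model assumes no outbreak this week implies no outbreak next week.
        return False
    if incidence_t > 2.477:
        return True
    if quantity_t1 > 48:
        return True
    # Remaining branch: decided by the population two weeks before (t-2)
    return population_t2 > population_threshold
```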

To understand why the classifier considers the attribute population (t-2) instead of population (t), two tests were performed: the first removing the attribute population (t-2) and the second using only population (t). Both tests returned the same result as the original test (f-measure: 90.47%, accuracy: 94.78%). From this result, it was concluded that the three attributes have the same gain ratio and therefore population (t-2) is selected indiscriminately over population (t). Given that all departments used in the tests are endemic except Itapúa, which is not endemic but has an endemic district (Encarnación), the same test was performed with the four departments but including only the data corresponding to the district of Encarnación. An f-measure of 91.36% and an accuracy of 95% were obtained, again demonstrating the importance of discriminating data by quality and similarity. The decision tree generated in this test is shown in figure 12.

Figure 12: Decision Tree: Execution for Asunción, Central, Alto Paraná and Encarnación.

Comparing the tree in figure 12 with the tree in figure 11 shows that the tree in figure 12 does not consider the population attribute. Given that districts with low population density usually do not have a large number of cases, we can say that the population attribute provides geographic information about the cases, and that the tree in figure 11 uses this attribute in an attempt to discriminate the cases that belong to the non-endemic districts of Itapúa from the other cases. We can also note that, unlike in the tree in figure 11, the quantity attribute provides more useful information than incidence and is used at the first level. The temporal relevance of the attributes is also observed: the tree first tests the attributes of the current week (t), then the quantity of the previous week (t-1), and finally the incidence of the week before last (t-2).

Based on the tests and subsequent analysis of the results we can conclude that:
• The most important attributes are the quantity and those derived from the number of cases, outbreak and incidence, with outbreak being the most important. That is, a significant presence of the disease in the current week is a major criterion used by the model to classify.
• The week, administrative division 1 and administrative division 2 attributes do not influence the classification model.
• The population attribute influences the model when non-endemic districts are included. This shows that it is important to discriminate between endemic and non-endemic districts.
• The endemicity condition of a department influences the quality of the data and therefore the quality of the results obtained. Endemic departments have more data and a better variety of data than non-endemic departments, and this influences the quality of the model.
• The attributes of the climate dimension acted as a distractor: using them in the model does not improve the results, given Paraguay's weather conditions.
• Finally, we observe that the definition used for the outbreak is not adequate, because it does not provide enough information for the model to identify situations where an outbreak starts. This conclusion was reached because the model could not find a pattern in which a current week with no outbreak is followed by an outbreak in the next week. This condition occurs in only 5.6% of the records used, which may not be enough to find a pattern in the data.

Attribute sampling size | Models amount | F-Measure | Accuracy
6 | 350 | 85.36% | 92.78%
8 | 200 | 86.73% | 93.33%
10 | 250 | 86.18% | 93.00%
12 | 300 | 84.84% | 92.33%

Table VII: Models amount vs. attribute sampling size for districts.

Attribute | Root | Candidate
outbreak t | 59 | 59
incidence t | 50 | 66
incidence t-1 | 30 | 52
quantity t | 21 | 64
incidence t-2 | 10 | 53
quantity t-1 | 9 | 53
acceleration t | 7 | 70
quantity t-2 | 5 | 55
acceleration t-1 | 3 | 61
administrative division 2 | 2 | 50
humidity t | 2 | 57
temperature t-2 | 1 | 57
humidity t+1 | 1 | 56

Table VIII: Number of times each attribute was selected as root in the districts tests.

B. Decision Tree Ensemble

We used data from the Asunción, Central, Alto Paraná and Itapúa departments to train and test the decision tree ensemble. We used combinations of 6, 8, 10 and 12 attributes and produced 100 to 500 models at intervals of 50.
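The KNIME tree-ensemble node used in the paper is roughly analogous to a random forest; the sketch below reproduces the same parameter ranges with scikit-learn, under stated assumptions: max_features plays the role of the attribute sampling size (though scikit-learn samples attributes per split rather than per tree, as KNIME does here), n_estimators plays the role of the number of models, and the label "YES" follows this paper's outbreak encoding.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

def sweep_tree_ensembles(X_train, y_train, X_val, y_val):
    """Approximate analogue of the KNIME decision tree ensemble runs (not the original workflow)."""
    results = []
    for n_attrs in (6, 8, 10, 12):               # attribute sampling sizes from the text
        for n_models in range(100, 501, 50):     # 100 to 500 models at intervals of 50
            forest = RandomForestClassifier(n_estimators=n_models,
                                            max_features=n_attrs,
                                            random_state=0)
            forest.fit(X_train, y_train)
            pred = forest.predict(X_val)
            results.append({
                "attribute sampling size": n_attrs,
                "models": n_models,
                "f-measure": f1_score(y_val, pred, pos_label="YES"),
                "accuracy": accuracy_score(y_val, pred),
            })
    return results
```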

Figure 13: F-measure as a function of the number of models, varying the attribute sampling size, for districts.

In figure 13 we can see that combinations with an attribute sampling size over 12 return an f-measure below 85%. We observed no pattern or apparent correlation between the attribute sampling size and the size of the combination. Unlike what one would expect, increasing the number of produced models does not necessarily result in greater accuracy. Table VII shows that the highest results are produced with 200 models, reaching 86.73% for f-measure and 93.33% for accuracy, while the lowest scores are produced with 300 models, reaching 84.84% for f-measure and 92.33% for accuracy. The tree ensemble algorithm uses no pruning method, which can result in overfitting. However, this can be compensated by the fact that the prediction is based on the vote of different trees and not on only one.

Table VIII shows statistics indicating which attributes are more important, i.e., those selected as the root partition or as attributes in the first few levels of the tree. The results for 200 models with a sampling size of 8 attributes indicate that the most important attributes are those related to the number of cases: outbreak, incidence, acceleration and quantity. The candidate column indicates how many times the attribute was part of a training set. The outbreak attribute, for example, was selected as the root of the tree every time it was a candidate, i.e., 59 times.

C. Comparison: Decision Tree and Decision Tree Ensemble

The decision tree allows a comprehensive analysis of the attributes or combinations of attributes that are more or less relevant in the generation of the predictive model. Each combination can be evaluated and the generated tree can be examined individually. However, it requires extra effort in the preparation of consistent combinations of attributes to be used. An alternative is to try all possible combinations of attributes, but the latter requires more effort in terms of time and computing power.

The decision tree ensemble allows the user to work without worrying about finding the right combination of attributes, because the algorithm generates many trees (using the configured parameters) and chooses the best ones automatically. This method minimizes the effort regarding the preparation of combinations of attributes and does not require a high level of knowledge of the data set. However, it is impossible to generate all possible combinations, because these are limited by the attribute sampling size. Furthermore, as the algorithm generates a tree for each model, the greater the number of models to generate, the more complex it is to analyze the generated trees to assess the importance of each attribute.

Considering our selected performance metrics, f-measure and accuracy, we can observe that the decision tree obtained results of 90.47% for f-measure and 94.78% for accuracy in the best case, while the decision tree ensemble obtained results of 86.73% for f-measure and 93.33% for accuracy. The difference in accuracy is 1.45%, and for f-measure it is 3.74%; in both cases, the decision tree is better. This might be due to the fact that the decision tree ensemble does not generate trees with good enough performance given the parameters that we used.

From the results obtained with the decision tree, we can observe that attributes such as the outbreak were more relevant for the generation of the tree. These results are consistent with the statistics regarding the attribute selection at the top of the trees generated by the decision tree ensemble, as seen in table VIII, where the most commonly used attributes were the outbreak, the quantity and the incidence. It is noteworthy that all the attributes that resulted as the most important for the prediction model are included in the standard report model described in this work, which increases the relevance of and the need for adoption of the proposed model.
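Statistics like those in table VIII can be gathered from a fitted ensemble by inspecting the root split of each tree. The sketch below shows how this could be done against the scikit-learn forest used in the previous sketch (not against the KNIME workflow itself):

```python
from collections import Counter

def root_attribute_counts(forest, feature_names):
    """Count how often each attribute is chosen as the root split across the ensemble's trees."""
    counts = Counter()
    for tree in forest.estimators_:
        root_feature = tree.tree_.feature[0]  # index of the feature used at the root node
        if root_feature >= 0:                 # negative values mean the root is a leaf
            counts[feature_names[root_feature]] += 1
    return counts.most_common()
```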

VIII. CONCLUSIONS

The international adoption of the standard report model would allow tackling the problem of diversity in the data publication mechanisms used by different countries. This is of great value not only for international organizations such as the WHO and PAHO, but also for research groups, government agencies and civil society.

The publication of dengue data in open data format enables collaboration, research and innovation based on data from official sources. There are several challenges for the adoption of the proposed model, such as: i) a legal framework regarding data privacy might not allow the publication of data with the defined model; ii) the adoption by different countries of a single clinical dengue classification that allows the aggregation of historical data; iii) the lack of mechanisms to collect epidemiological data and of systems to manage these data at the country level; iv) the lack of appropriate technical infrastructure (people and hardware) for creating data publishing tools that can keep published data up to date; v) the lack of political will for releasing the data, since it could give a poor image (a region in epidemic) that could affect tourism and other economic activities in the affected regions.

Regarding the classification model, the analysis of the results showed that certain attributes, such as the existence or not of an outbreak in the past (given the quartile definition), the number of cases and the incidence, are more relevant than others for training the most accurate models. The classification model also showed that the definition of outbreak currently used as the target variable is not the most appropriate choice, since it cannot be applied to non-endemic departments, as it does not provide enough information to identify the onset of an outbreak. However, considering f-measure and accuracy, the model obtains good results, reaching values of 90.47% for f-measure and 94.78% for accuracy for endemic regions.

As a proof of concept, a prototype open source web application that makes use of the data in the proposed data model was developed. The prototype displays maps and dynamic lists that replicate the maps currently available as images manually generated by the DGVS. With this, we demonstrate that although an additional effort is required for publishing the data in standard formats, this effort is later paid back when generating the maps automatically and by enabling the training of predictive models. The prototype can be easily extended to consume data from any publication source and may be further extended with new features such as new types of maps, graphs and the generation of early warnings based on other trained models. Another major advantage of this kind of open and collaborative tool is that countries that cannot invest in the creation of such tools can benefit from them. Since the tool also allows the user to browse the data in tabular form and to download it as open data, researchers around the world can use it as a source for dengue research, with direct benefits for all the regions affected by the disease.

We can summarize the contributions of this work as follows:
• A standard report model for the publication of dengue cases based on open data standards with international scope.
• A classification model that determines the occurrence of dengue outbreaks based on historical data of dengue cases, climate variables, and geographic and demographic variables in Paraguay.
• An analysis of the correlation of climatic variables with the occurrence of outbreaks in Paraguay.
• A prototype of an open source web application, reusable and extensible, for the publication of the data on the web.

IX. FUTURE WORKS

The standard model currently includes information about time, place, demographics and case characteristics for dengue. The model could be easily extended to include case characteristics of other diseases such as Zika and chikungunya, which are transmitted by the same vector as dengue. Furthermore, the data model could also be used to describe the characteristics of other vector-borne and non-vector-borne diseases. If needed, the model could also be extended to include clinical variables describing the medical conditions of the disease.

Another important issue is the automation of the preprocessing of data, so that institutions that do not use the standard model can, without much effort, transform their data into the standard model. In addition, data collection for variables and co-variables such as weather, geographic or demographic data could be automated.

Given the issue with the definition of "outbreak" as the target variable for the classification model, we need to analyze the use of different target variables such as the number of cases, the incidence or the increase rate. That is, instead of trying to predict outbreaks, we could predict the number of cases directly. Moreover, we could apply other data mining techniques, such as neural networks, Bayesian networks, regression, and others. Furthermore, the granularity of the model could be improved by using daily data (instead of weekly) for the quantity and climatic variables. Finally, more co-variables, such as a socioeconomic index, a vegetation index, an oceanic index, or variables of the entomological dimension (vector population, larval infestation sites, among others), may be included.

REFERENCES

Ahmad Tarmizi, N. D. et al. (2013). "Malaysia Dengue Outbreak Detection Using Data Mining Models." In: Journal of Next Generation Information Technology 4.6.
Barniol, J. et al. (2011). "Usefulness and applicability of the revised dengue case classification by disease: multi-centre study in 18 countries". In: BMC Infectious Diseases 11.1, p. 1.
Berners-Lee, T. (2016). 5-star Open Data. URL: http://5stardata.info/en/ (visited on 04/10/2016).
Buczak, A. L. et al. (2012). "A data-driven epidemiological prediction method for dengue outbreaks using local and remote sensing data". In: BMC Medical Informatics and Decision Making 12.1, p. 1.
Cabello, Á. et al. (2014). Plan de Acción para la Prevención y el Control del Dengue. http://www.vigisalud.gov.py/.
Criscioni, I. A. (2011). II Muestra Nacional de Epidemiología. URL: http://www.vigisalud.gov.py/images/documentos/muestra/2da muestra epidemiologica/4%20Salon%20Convenciones/Jueves 24/Dengue%20en%20Perspectiva.pdf.
De la Salud, O. P. (2002). Módulo de Principios de Epidemiología para el Control de las Enfermedades.
Dicker, R., F. Coronado, D. Koo, and R. G. Parrish (2006). Principles of Epidemiology in Public Health Practice, 3rd Edition. CDC. URL: https://www.cdc.gov/ophss/csels/dsepd/ss1978/ss1978.pdf.


Dirección General de Vigilancia de la Salud de Paraguay (2016a). Boletines semanales epidemiológicos. URL: http://vigisalud.gov.py/index.php/boletin-epidemiologico/ (visited on 04/10/2016).
Dirección General de Vigilancia de la Salud de Paraguay (2016b). Dengue. Sitio web de la DGVS de Paraguay. URL: http://vigisalud.gov.py/index.php/dengue/ (visited on 04/10/2016).
González, M. B. (2014). Modelo predictivo de focos de dengue aplicado a Sistemas de información geográfica.
Han, J., J. Pei, and M. Kamber (2011). Data mining: concepts and techniques. Elsevier.
Lowe, R., T. C. Bailey, D. B. Stephenson, R. J. Graham, et al. (2011). "Spatio-temporal modelling of climate-sensitive disease risk: Towards an early warning system for dengue in Brazil". In: Computers & Geosciences 37.3, pp. 371–381. URL: http://www.arca.fiocruz.br/bitstream/icict/3862/1/Spatio-temporal%20modelling%20of%20climate-sensitive%20disease%20risk.pdf.
Lowe, R., T. C. Bailey, D. B. Stephenson, T. E. Jupp, et al. (2013). "The development of an early warning system for climate-sensitive disease risk with a focus on dengue epidemics in Southeast Brazil". In: Statistics in Medicine 32.5, pp. 864–883.
Munasinghe, A., H. Premaratne, and M. Fernando (2013). "Towards an Early Warning System to Combat Dengue". In: International Journal of Computer Science and Electronics Engineering 1.2, pp. 252–256.
Ochoa, S. N., J. Y. Talavera, and J. M. Paciello (2015). "Applying a geospatial visualization based on USSD messages to real time identification of epidemiological risk areas in developing countries". In: eDemocracy & eGovernment (ICEDEG), 2015 Second International Conference on. IEEE, pp. 1–5.
Pan American Health Organization (2016). Dengue, datos estadísticos y epidemiología. URL: http://www.paho.org/hq/index.php?option=com_topics&view=readall&cid=3274&Itemid=40734&lang=es (visited on 04/10/2016).
San Martín, J. L. and O. Brathwaite-Dick (2006). "Delivery issues related to vector control operations: a special focus on the Americas". In: Geneva: World Health Organization.
Simsion, G. and G. Witt (2004). Data modeling essentials. Morgan Kaufmann.
Sunlight Foundation (2016). Lineamientos para políticas de datos. URL: http://sunlightfoundation.com/opendataguidelines/es/ (visited on 04/10/2016).
Tsai, C.-Y. et al. (2013). "Comparisons of dengue illness classified based on the 1997 and 2009 World Health Organization dengue classification schemes". In: Journal of Microbiology, Immunology and Infection 46.4, pp. 271–281.
World Health Organization (2016). Disease outbreaks. URL: http://www.who.int/topics/disease_outbreaks/en/ (visited on 04/10/2016).
World Health Organization et al. (2006). Reglamento sanitario internacional (2005). Ginebra: Organización Mundial de la Salud.

