CTrace: Semantic Comparison of Multi-Granularity Process Traces
Qing Liu
Kerry Taylor
Xiang Zhao
Intelligent Sensing and Systems Lab CSIRO, Australia
Information Engineering Lab CSIRO, Australia
University of New South Wales Australia
Xuemin Lin
Corne Kloppers
Information Engineering Lab CSIRO, Australia
University of New South Wales Australia
Intelligent Sensing and Systems Lab CSIRO, Australia
Geoffrey Squire
ABSTRACT
in fine-grained form may contain a large amount of very low-level information that rarely suits the purposes of analysis. Second, different users may require different levels of understanding. For example, a data analyst may require a more detailed process trace to analyse performance than an auditor who wants to check whether some regulations are followed. Furthermore, traces could be generated by heterogeneous systems in which data and processes are captured and/or represented at different granularities using different semantics. Without an appropriate understanding of both semantics and granularity, it is hard for users to compare and make use of process traces for their purposes. The existing solutions to process trace representation and similarity measurement are not sufficient. A typical solution is to present process traces at multiple granularities. However, the granularities are determined by user-specific heuristics. In other words, different users could define different granularities for the same process trace. As a result, a specific multi-granularity process trace may not be understood by others and cannot be generalised for use with other applications. Furthermore, to compare two process traces, the existing solutions only examine the structural similarity; the semantic similarity embedded within process traces is completely ignored. To address the above issues, we develop the CTrace system, which has the following novelties and core capabilities:
A process trace describes the processes taken in a workflow to generate a particular result. Given many process traces, each with a large amount of very low-level information, it is a challenge to make process traces meaningful to different users. It is even more challenging to compare two complex process traces that are generated by heterogeneous systems and have different levels of granularity. We present CTrace, a system that (1) lets users explore the conceptual abstraction of large process traces at different levels of granularity, and (2) provides semantic comparison among traces in which both the structural and the semantic similarity are considered. These functions are underpinned by a novel notion of multi-granularity process trace and efficient multi-granularity similarity comparison algorithms.
Categories and Subject Descriptors: H.4.m [Information Systems Applications]: Miscellaneous
Keywords: Similarity; Multi-granularity; Semantics.
1. INTRODUCTION
Scientific research often involves in-silico experiments in which a complex set of analyses is conducted to answer a particular science question. Provenance is a record that describes the people, data, and processes involved in producing, influencing, or delivering a result. A process trace is the part of provenance that describes the processes taken and their temporal relationships. Making sense of process traces is critical to verifying data, enabling its re-use and making appropriate decisions. Furthermore, understanding the similarities and differences between process traces can lead scientists to identify the traces that generate “better” results for advanced scientific discovery. Process trace similarity has been investigated for many other applications, such as provenance storage size reduction, trustworthiness measurement and clone detection of executable code. Several challenges remain to make process traces more understandable and usable. First, process traces captured
• Multi-Granularity Process Trace Generation: Given a process trace, its corresponding multi-granularity trace can be generated dynamically to represent different levels of conceptual abstraction. This is underpinned by a Process Trace Ontology that specifies the different levels of conceptual structure.
• Semantic Comparison: Two multi-granularity process traces can be compared semantically. The similarity measurement computes not only the structural similarity but also the semantic similarity embedded in the traces. To cope with the computational complexity involved in semantic comparison, CTrace incorporates non-trivial generalisations of our previous work [3] and [4], selected according to the sizes of the traces to be compared. We devise additional computation sharing and filtering strategies by exploiting
This paper is authored by employees of CSIRO and is copyrighted by the Government of Australia. All rights reserved. SIGMOD’13, June 22–27, 2013, New York, New York, USA. Copyright 2013 CSIRO or Crown of Australia ACM 978-1-4503-2037-5/13/06.
the semantic relationships between levels. In addition, we present a user-friendly front-end that allows users to select and view a process trace at different granularities using various graph layouts, and to visualise the similarity or difference between two traces.
2. CTrace OVERVIEW
CTrace comes with a novel concept of semantic multi-granularity comparison. The system is implemented in Java, with its architecture shown in Figure 1. The basic infrastructure of CTrace is built on top of [1], in which provenance can be harvested from distributed systems. In particular, customised Harvesters are developed to extract provenance from heterogeneous systems, and the provenance is then modelled as JSON documents. These documents are pushed into the RESTful Data Conversion Service and assembled into RDF based on the Process Trace Ontology (PTO). All provenance is then stored in the RDF Provenance Store. The Multi-granularity Trace Generator extracts process traces from the RDF store and generates multi-granularity traces. Process traces can be compared at various granularities using the Semantic Comparison component. The results are rendered by the visualisation tools. Next we introduce the details of the three key components in the architecture.
Figure 2: Example of Process Trace Ontology
which the vertices are labelled by the executed processes and edges describe the temporal relationships. Given a PTO and a process trace, its multi-granularity process trace is defined as a list of graphs, $M = (I, C_I^d, C_I^{d-1}, \ldots, C_I^0)$, where the first element $I$ is the original process trace and the remaining elements are the concept traces with depth decreasing from $d$ to 0. Here $d$ is the maximum depth of the classes in the PTO to which instances in the process trace can be mapped. A concept trace is a directed graph in which the vertices are labelled by classes in the PTO; the labels describe the semantics of the executed processes and capture the embedded conceptual sub-structures. A concept trace at depth $i$ ($0 \le i \le d$) is generated by traversing the concept trace at depth $i + 1$ (or the process trace) and replacing the label of every vertex with its parent class at depth $i$ in the PTO. Neighbouring vertices are converged if they share the same label. Therefore, all the vertices in a concept trace at depth $i$ have labels whose depths are less than or equal to $i$.
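The generation step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the CTrace implementation (which is in Java): the toy class hierarchy, the `parent` and `depth_of` maps and both function names are hypothetical, and two vertices are treated as "neighbouring" when an edge joins them.

```python
def ancestor_at_depth(cls, depth, parent, depth_of):
    """Climb the (hypothetical) PTO hierarchy until the class depth is <= depth."""
    while depth_of[cls] > depth:
        cls = parent[cls]
    return cls


def concept_trace(labels, edges, depth, parent, depth_of):
    """Build the concept trace at `depth` from a finer trace.

    labels: dict vertex id -> PTO class; edges: set of directed (u, v) pairs.
    """
    lifted = {v: ancestor_at_depth(c, depth, parent, depth_of)
              for v, c in labels.items()}

    # Union-find: merge vertices joined by an edge that now share a label.
    rep = {v: v for v in labels}

    def find(v):
        while rep[v] != v:
            rep[v] = rep[rep[v]]  # path halving
            v = rep[v]
        return v

    for u, v in edges:
        if lifted[u] == lifted[v]:
            rep[find(u)] = find(v)

    new_labels = {find(v): lifted[v] for v in labels}
    # Edges inside a merged vertex disappear from the coarser trace.
    new_edges = {(find(u), find(v)) for u, v in edges if find(u) != find(v)}
    return new_labels, new_edges


# Toy hierarchy loosely modelled on Figure 2 (an assumption, not the real PTO).
parent = {"PreProcessing": "Process", "Analysis": "Process",
          "PostProcessing": "Process", "FetchData": "PreProcessing",
          "CleanData": "PreProcessing", "HydroModel": "Analysis",
          "TransferResult": "PostProcessing"}
depth_of = {"Process": 0, "PreProcessing": 1, "Analysis": 1,
            "PostProcessing": 1, "FetchData": 2, "CleanData": 2,
            "HydroModel": 2, "TransferResult": 2}

trace_labels = {1: "FetchData", 2: "CleanData", 3: "HydroModel", 4: "TransferResult"}
trace_edges = {(1, 2), (2, 3), (3, 4)}

labels_d1, edges_d1 = concept_trace(trace_labels, trace_edges, 1, parent, depth_of)
labels_d0, edges_d0 = concept_trace(labels_d1, edges_d1, 0, parent, depth_of)
```

In this toy run the depth-1 trace keeps three vertices (the two pre-processing steps are merged) and the depth-0 trace collapses to a single "Process" vertex. Whether CTrace merges only directly connected vertices is not stated in the text, so that choice belongs to this sketch.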
Process Trace Ontology
2.2
Figure 3: Example of Multi-granularity Process Trace
Since an ontology is a specification of a conceptualisation, it can be used as a tool for abstraction. We define a Process Trace Ontology (PTO) that extends the class opmo:Process from the Open Provenance Model Ontology (OPMO) [2]. In this paper, we use Figure 2 to depict a running example of the PTO incorporated by CTrace. Note that the ontology to be demonstrated for our real use case is more complex. Sub-classes (e.g., rectangles in Figure 2) are defined to describe different levels of (1) semantics of executed processes, and (2) conceptual abstractions embedded in process traces. For example, “Pre-processing” could be a conceptual abstraction of data retrieval (e.g., “FetchData” in Figure 2) and quality control (e.g., “CleanData”). Each executed process in a process trace is an instance (e.g., circles in Figure 2) of the corresponding class. An annotation property, hasDepth, is defined for the opmo:Process class and its sub-classes to describe the granularity level of abstraction. The smaller the depth, the coarser the abstraction. opmo:Process always has depth 0, its direct children have depth 1, and so on.
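As a small illustration of the hasDepth rule, the following Python sketch derives depths from a sub-class hierarchy. The `parent` map is a hypothetical toy hierarchy loosely modelled on Figure 2, and `assign_depths` is a name invented for this example; the real PTO stores the depth as an annotation property rather than computing it this way.

```python
def assign_depths(parent, root="opmo:Process"):
    """Return class -> depth, with the root at depth 0 and each child one deeper."""
    depth = {root: 0}

    def depth_of(cls):
        # Memoised walk towards the root of the hierarchy.
        if cls not in depth:
            depth[cls] = depth_of(parent[cls]) + 1
        return depth[cls]

    for cls in parent:
        depth_of(cls)
    return depth


# Hypothetical sub-class relations (child -> parent), not the real ontology.
parent = {
    "PreProcessing": "opmo:Process",
    "Analysis": "opmo:Process",
    "PostProcessing": "opmo:Process",
    "FetchData": "PreProcessing",
    "CleanData": "PreProcessing",
}

depths = assign_depths(parent)
max_depth = max(depths.values())
```

Here `max_depth` corresponds to the largest depth available in this toy hierarchy, which bounds the number of concept traces that can be generated from it.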
Figure 1: CTrace System Architecture
2.1
Example of Multi-granularity Process
Example 1. Figure 3 shows a multi-granularity process trace $M = (I, C_I^2, C_I^1, C_I^0)$ generated based on the PTO in Figure 2. Figure 3(a) shows a process trace, whose vertices represent the executed processes. By mapping the executed processes to the corresponding classes in the PTO, Figures 3(b), (c) and (d) show the concept traces at depths 2, 1 and 0 respectively. The conceptual sub-structure “FetchData → CleanData” in $C_I^2$ is encoded in “Pre-Processing” in $C_I^1$. Similarly, “Pre-Processing → Analysis → Post-Processing” in $C_I^1$ is represented by the conceptual abstraction “Process” in $C_I^0$. All the vertices in $C_I^2$ have labels with depth ≤ 2.
This example shows that a multi-granularity process trace can describe a process trace at various levels of granularity. The larger the depth, the more detailed the information a process trace presents; the smaller the depth, the more abstract the information. The vertices in a concept trace represent the semantic building blocks of a process trace and may capture the embedded conceptual sub-structures.
2.3 Semantic Comparison of Process Traces
Multi-granularity Trace Generator
A process trace can be described by a directed graph in
is captured by its weight. Therefore, $S_{C^1} = \frac{0+1}{2+2-(0+1)} = \frac{1}{3}$. This similarity indicates that the two traces share semantic similarity at a finer granularity even when their structures differ at the current depth.
The similarity between process traces, $S_I$, is computed under $\omega(v) = 0$. The semantic similarity defined above guarantees that the similarity of multi-granularity process traces is monotonically non-decreasing as the depth decreases. This property gives users a better understanding of the coverage and the trace structure. We exploit two existing algorithms, each in a non-trivial manner, to compute the similarity of weighted multi-granularity process traces at different depths. For small traces, we present a best-first search based exact algorithm generalised from [3]. It runs much faster than alternative exact algorithms. To handle large traces, which may involve 30+ vertices, we extend the approximation algorithm in [4] by incorporating the vertex weights available in the multi-granularity process traces. In particular, we design tailored anchor selection and mapping procedures to achieve good matching results. It runs in polynomial time and provides results comparable to the optimal. Furthermore, a semantic matrix is developed to minimise the number of similarity computations required for the upward comparisons. It explores the semantic relationships among process traces at different depths and filters out unnecessary computations that cannot generate new MCES edges. When a similarity computation cannot be avoided, a sharing strategy is applied in which only the non-MCES edges are considered for upward MCES computations.
Given two multi-granularity process traces, $M_1 = (I_1, C_{I_1}^{d_1}, \ldots, C_{I_1}^{0})$ and $M_2 = (I_2, C_{I_2}^{d_2}, \ldots, C_{I_2}^{0})$, their similarity is defined as a list of values, $S = (S_I, S_{C^d}, S_{C^{d-1}}, \ldots, S_{C^i}, \ldots, S_{C^0})$, where $d = \min(d_1, d_2)$, $S_I = \mathrm{Similarity}(I_1, I_2)$ and $S_{C^i} = \mathrm{Similarity}(C_{I_1}^i, C_{I_2}^i)$ ($0 \le i \le d$). By definition, two concept traces are only comparable at the same depth. Since each trace is a vertex-labelled graph, a straightforward way to measure the trace similarity is to compute the Maximum Common Edge Subgraph (MCES), a subgraph consisting of the largest number of edges common to both graphs. The similarity is defined as:
$$S_{C^i} = \frac{|E_{MCES(C_{I_1}^i, C_{I_2}^i)}|}{|E_{C_{I_1}^i}| + |E_{C_{I_2}^i}| - |E_{MCES(C_{I_1}^i, C_{I_2}^i)}|},$$
where $|E_{MCES(C_{I_1}^i, C_{I_2}^i)}|$ is the number of edges of $MCES(C_{I_1}^i, C_{I_2}^i)$ at depth $i$ and $|E_{C_{I_j}^i}|$ is the number of edges in $C_{I_j}^i$.
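To make the structural measure concrete, here is a brute-force Python sketch that computes the MCES size of two tiny vertex-labelled directed graphs and plugs it into the formula above. It is not the best-first exact algorithm of [3] or the approximation of [4]; it simply tries every injective vertex mapping, so it is only usable for a handful of vertices. The function names and the two toy traces (loosely modelled on Figure 4) are assumptions, and the MCES is not required to be connected here, which the text does not specify.

```python
from fractions import Fraction
from itertools import permutations


def mces_size(labels1, edges1, labels2, edges2):
    """Largest number of label-preserving common edges over all injective mappings."""
    # Map the smaller vertex set into the larger one.
    if len(labels1) > len(labels2):
        labels1, edges1, labels2, edges2 = labels2, edges2, labels1, edges1
    small = list(labels1)
    best = 0
    for image in permutations(labels2, len(small)):
        m = dict(zip(small, image))
        # Count edges whose endpoints keep their labels and whose image is an edge.
        common = sum(
            1 for u, v in edges1
            if labels1[u] == labels2[m[u]]
            and labels1[v] == labels2[m[v]]
            and (m[u], m[v]) in edges2
        )
        best = max(best, common)
    return best


def structural_similarity(labels1, edges1, labels2, edges2):
    """|E_MCES| / (|E_1| + |E_2| - |E_MCES|) for two concept traces at one depth."""
    common = mces_size(labels1, edges1, labels2, edges2)
    union = len(edges1) + len(edges2) - common
    # The 0/0 case (two edgeless traces) is not covered by the text;
    # returning 1 is this sketch's own convention.
    return Fraction(common, union) if union else Fraction(1)


# Depth 2: the two traces share the edge FetchData -> CleanData.
a2 = ({1: "FetchData", 2: "CleanData", 3: "HydroModel"}, {(1, 2), (2, 3)})
b2 = ({"x": "FetchData", "y": "CleanData", "z": "TransferResult"},
      {("x", "y"), ("y", "z")})
# Depth 1: after abstraction no common edge remains.
a1 = ({1: "PreProcessing", 2: "Analysis"}, {(1, 2)})
b1 = ({"x": "PreProcessing", "y": "PostProcessing"}, {("x", "y")})

s2 = structural_similarity(*a2, *b2)
s1 = structural_similarity(*a1, *b1)
```

In this toy case the structural score drops from 1/3 at depth 2 to 0 at depth 1, even though the coarser traces are conceptually no less alike.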
However, the above equation only describes the structural similarity between two traces and ignores the semantic similarity involved, even though each vertex in a concept trace represents a conceptual abstraction and may have conceptual sub-structures embedded. An ideal measure evaluates not only structural but also semantic similarity. Furthermore, as traces become more abstract at smaller depths, we expect the similarity measure to be non-decreasing as the depth decreases when comparing multi-granularity process traces. Thus, the above similarity does not suffice in our scenarios. To tackle this issue, we need to capture the “shadow edges”, which have contributed to the MCES of the compared concept traces at any depth ≥ i but have been encapsulated by the converged vertices. Therefore, we construct weighted multi-granularity process traces in which the weight of a vertex is the number of “shadow edges” encapsulated by the vertex. Given two weighted multi-granularity process traces, the semantic similarity is defined as:
$$S_{C^i} = \frac{|E_{MCES(C_{I_1}^i, C_{I_2}^i)}| + \sum_{v \in C_{I_1}^i} \omega(v)}{|E_{I_1}| + |E_{I_2}| - \left(|E_{MCES(C_{I_1}^i, C_{I_2}^i)}| + \sum_{v \in C_{I_1}^i} \omega(v)\right)},$$
where $|E_{MCES(C_{I_1}^i, C_{I_2}^i)}|$ is the number of edges of $MCES(C_{I_1}^i, C_{I_2}^i)$ at depth $i$, $|E_{I_k}|$ is the number of edges in process trace $I_k$, and $\omega(v)$ is the weight of vertex $v \in C_{I_1}^i$ as defined above (note that $\sum_{v \in C_{I_1}^i} \omega(v) = \sum_{v \in C_{I_2}^i} \omega(v)$).
3.
3.1
Figure 4 panels: (a) $MCES(C_{I_1}^2, C_{I_2}^2)$ = (FetchData → CleanData); (b) $MCES(C_{I_1}^1, C_{I_2}^1)$ = NULL.
Introduction and Datasets
The first part of the demonstration will be an overview of recent research on multi-granularity process traces, similarity computation and their extensive applications. To illustrate the capabilities enabled by CTrace, our demo runs through a scenario involving data collected from the Hydrological Forecasting and Warning System (FEWS) operated by the Bureau of Meteorology, Australia. CTrace harvests 193,452 traces from the configuration files in FEWS and puts them in the triple store.
By the above equation, the structural similarity is captured by the MCES and the conceptual sub-structure similarity is reflected by the accumulated vertex weights. Therefore, we are able to semantically compare two process traces.
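The semantic measure can be evaluated with a few lines of Python once the MCES size and the accumulated vertex weights are known. The sketch below is illustrative only: `semantic_similarity` and its parameter names are invented here, the edge counts follow the worked example with two process traces of two edges each, and the depth-2 call assumes that no shadow edges exist yet at that depth.

```python
from fractions import Fraction


def semantic_similarity(mces_edges, shadow_edges, edges_i1, edges_i2):
    """(|E_MCES| + sum of weights) / (|E_I1| + |E_I2| - (|E_MCES| + sum of weights)).

    edges_i1 and edges_i2 count edges of the original process traces, so the
    denominator's base stays fixed across depths. Fraction raises
    ZeroDivisionError if both process traces have no edges.
    """
    shared = mces_edges + shadow_edges
    return Fraction(shared, edges_i1 + edges_i2 - shared)


# Depth 2: one common edge (FetchData -> CleanData), assumed no shadow edges.
depth2 = semantic_similarity(mces_edges=1, shadow_edges=0, edges_i1=2, edges_i2=2)
# Depth 1: no common edge, but the encapsulated edge survives as a vertex weight.
depth1 = semantic_similarity(mces_edges=0, shadow_edges=1, edges_i1=2, edges_i2=2)
```

Both calls give 1/3, so the score does not fall when the common edge is absorbed into an abstract vertex, which is the behaviour the vertex weights are meant to provide.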
DEMONSTRATION
In the demonstration, we will (1) motivate the multi-granularity process trace problem and the semantic comparison problem in the context of scientific in-silico experiments; (2) showcase pairwise multi-granularity comparison; and (3) demonstrate the performance of the proposed algorithms.
3.2 Demonstrating Semantic Multi-granularity Comparison
Since the RDF representation of FEWS provenance is very large, Figure 5 shows only a small portion of it. The rectangles represent the processes involved and the circles describe the data. The directed edges represent the property relationships. Given such a graph, it is very difficult for users to understand the semantics of process traces, let alone compare them. We will shortly see how CTrace alleviates the problem. Given a collection of process traces, CTrace first carries out a similarity join over the dataset, providing a candidate set of similar trace pairs. Thus, users only select the pair
Figure 4: Example of Semantic Similarity
Example 2. In Figure 4(a), the edge “FetchData → CleanData” contributes to the MCES at depth 2. However, the similarity at depth 1 is 0 (see Figure 4(b)) if only structural similarity is considered. The semantic similarity “FetchData → CleanData” encapsulated in vertex “Pre-Processing”
Figure 5: RDF Visualisation of FEWS Provenance
of similar traces for further analysis, which saves time in discovering promising results. After users choose a process trace pair (they are free to select traces other than the join results) and a similarity algorithm from the menu bar, CTrace dynamically produces the trace graphs at the specified depth together with their similarity. Figure 6 shows a screenshot of a comparison between two concept traces at depth 4. A control panel on the bottom left includes: (1) Trace Granularity, (2) Zoom, (3) Graph and Ontology Layout, and (4) Similarity Score. The PTO is on the bottom right. When examining process meanings in concept traces, users can identify the corresponding class in the ontology based on the colour. Moreover, selecting a class in the ontology highlights all corresponding vertices in both traces. This also works the other way around when users select a vertex in a trace. At the top of Figure 6, the two concept traces at depth 4 are partially presented on the screen according to the Zoom scale selected. Compared with Figure 5, all the processes are extracted, enabling users to see trace structures clearly. The vertices (e.g., AWRAL in Trace 1 vs. CCAM and AWBM in Trace 2) that do not belong to the MCES are enlarged to show the structural difference between the two traces.
Figure 7: Concept Trace Comparison at Depth 2
their structures are the same, which implies that although the two traces differ at a finer level, their semantic intentions are the same. The similarity at depth 2 improves from 0.63 to 0.98 but does not reach 1. It effectively reflects the conceptual difference between the two original process traces. When the mouse hovers over an abstract vertex, it shows the actual vertex identifiers encapsulated (e.g., thin yellow bar under ConventionModule). Users will be invited to try the “Pry into Semantics” function, by which a substructure of the process trace encapsulated by an abstract vertex can be examined. This helps users understand the “importance” of the vertex and comprehend why the similarity increases while the structures shrink.
3.3 Demonstrating System Performance
We will demonstrate the effectiveness and efficiency of the semantic comparison of multi-granularity process traces through three aspects: (1) the monotonic property of similarity; (2) the performance of the MCES exact algorithm, the approximate algorithm and McGregor’s algorithm; and (3) the effect of ontology depth, which may increase the number of MCES computations.
Acknowledgement. The ISSL is assisted by a grant from the Tasmanian Government which is administered by the Tasmanian Department of Economic Development, Tourism and the Arts. X. Lin is supported by ARC DP0987557, DP110102937, DP120104168, NSFC61232006 and NSFC61021004.
4. ADDITIONAL AUTHORS
Additional authors: Richard Miller (Intelligent Sensing and Systems Lab, CSIRO).
5. REFERENCES
[1] Q. Liu, Q. Bai, C. Kloppers, P. Fitch, Q. Bai, K. Taylor, P. Fox, S. Zednik, L. Ding, A. Terhorst, and D. McGuinness. An ontology-based knowledge management framework for a distributed water information system. Journal of Hydroinformatics, 2012, doi:10.2166/hydro.2012.152.
[2] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, and J. Van den Bussche. The open provenance model core specification (v1.1). Future Generation Computer Systems, 27(6):743–756, 2011.
[3] X. Zhao, C. Xiao, X. Lin, and W. Wang. Efficient graph similarity joins with edit distance constraints. In ICDE, pages 834–845, 2012.
[4] Y. Zhu, L. Qin, J. Yu, Y. Ke, and X. Lin. High efficiency and quality: large graphs matching. The VLDB Journal, pages 1–24, 2012, doi:10.1007/s00778-012-0292-8.
Figure 6: Concept Trace Comparison at Depth 4
It may not be easy for users to compare the traces at depth 4. Figure 7 shows the comparison of the two traces at a coarser level, depth 2. Neighbouring vertices at depth 4 are converged if they share the same semantics at depth 2 in the PTO. For example, the vertices CCAM and AWBM are both abstracted as SimulationModule at depth 2 and are therefore converged and represented by SimulationModule. From the visualisation, users can see that for both traces, the four data sources are retrieved, cleaned and transformed as the inputs of the simulation model. This gives users an abstract view of complex trace structures. Through examining the two traces, users can see that