Design and Optimization of Scientific Workflows

By

DANIEL ZINN
Dipl.-Inf. (Ilmenau University of Technology, Germany) 2005
M.S. (University of California, Davis) 2008

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER SCIENCE

in the

OFFICE OF GRADUATE STUDIES

of the

UNIVERSITY OF CALIFORNIA
DAVIS

Approved:

Professor Bertram Ludäscher (Chair)

Professor Todd J. Green

Professor Zhendong Su

Committee in Charge 2010


Daniel Zinn January 2010 Computer Science

Design and Optimization of Scientific Workflows

Abstract

This work considers the problem of design and optimization of scientific workflows. Progress in the natural sciences increasingly depends on effective and efficient means to manage and analyze large amounts of data. Scientific workflows form a crucial piece of cyberinfrastructure, which allows scientists to combine existing components (e.g., for data integration, analysis, and visualization) into larger software systems to conduct this new form of scientific discovery.

We propose VDAL (Virtual Data Assembly Lines), a dataflow-oriented paradigm for scientific workflows. In the VDAL approach, data is organized into nested collections, much like XML, and flows between components during workflow execution. Components are configured with XQuery/XPath-like expressions to specify their interaction with the data. We show how this approach addresses many challenges that are common in scientific workflow design, thus leading to better overall designs.

We then study different ways to optimize VDAL execution. First, we show how to leverage parallel computing infrastructure by exploiting the pipeline, task, and data parallelism exhibited by the VDAL paradigm itself. To this end, we compile VDAL workflows into several MapReduce tasks, executed in parallel. We then show how the cost of data shipping can be reduced in a distributed streaming implementation.

Next, we propose a formal model for VDAL, and show how static analysis can provide additional design features to support the scientist during workflow creation and maintenance, namely, by displaying actor dependencies, previewing the structure of the results, and explaining how output data will be generated from input data. Consequently, certain design errors can be detected prior to the actual workflow execution.

Finally, we investigate the fundamental question of how to decide equivalence of VDAL workflows. We show that testing the equivalence of string polynomials, a new problem, reduces to workflow equivalence when an ordered data model is used. Here, our preliminary work defines several normal forms for approximating the equivalence of string polynomials.

To my family.


Acknowledgments

Being a Ph.D. student at UC Davis was a wonderful experience. I want to express my deepest gratitude to all the people who made this chapter of my life as fantastic as it was.

I want to thank my major advisor Professor Bertram Ludäscher. He not only provided excellent guidance and inspired me, but also helped to organize, refine and structure my ideas. He taught me how important it is to present your ideas well, and how to do so. I deeply appreciate that even under tight schedules, Bertram always had time for me: be it for discussing research ideas, revising papers, or even providing guidance in non-research-related fields of my life. Bertram truly lives up to all expectations associated with a Doktorvater (doctoral father): he has not only become my role model for being a scientist, but also a very good friend. I am deeply thankful for all his advice, patience and help.

I further want to thank the members of my dissertation committee, Prof. Zhendong Su and Prof. Todd J. Green. Zhendong has not only provided valuable feedback for this work, but was always open for my questions. Special thanks to Zhendong for his extra support after I broke my elbows. I want to thank T.J. for his detailed feedback on this work as well as his support while working on Chapter 7. Technical discussions with T.J. are not only enjoyable but also very fruitful. I thank him for his patience, especially while listening to my sometimes very nice, but wrong, "proofs".

I further want to thank Shawn Bowers and Timothy McPhillips for their valuable help, discussions and suggestions regarding this dissertation. In particular, I want to thank them for their work on Comad, which this dissertation builds on and extends. Further, I want to thank Xuan Li for his great help on the Kepler-PPN project in Chapter 5. I also want to thank Prof. Michael Gertz for co-advising me in the early stages of my Ph.D.

In addition, I want to thank my advisors during my wonderful time at Google: Rebecca Schultz during my first internship with the platforms group, and Jerry Zhao and Jelena Pjesivac-Grbovic during my second one with the MapReduce team. It was an honor for me to work with such nice and smart people.

The experience interacting with the systems staff at the Department of Computer Science, most notably Babak Moghadam and Ken Gribble, could not have been better. It is great to work with such approachable, knowledgeable, and open-minded people. Without their help, the experimental evaluations would not have been possible. I also want to thank the staff in the CS department office. Their friendliness made every single visit to the office enjoyable. Special thanks go to Mary Reid, our first CS graduate advisor, for her fantastic help in my early phases as a Ph.D. student.

A very important part of my graduate life was interacting with fellow graduate students. I am very lucky to have found a best friend and collaborator, Michael Byrd. Thank you for the always great times we had! Our lunches at Sophia's, Raja's and almost every other restaurant within walking distance will be unforgettable to me. I can only say: pants, pants, pants! I am further very thankful to all the people of the database research group, who not only are very nice people and became good friends, but also provided valuable feedback during presentations and practice talks. I especially want to thank Zijie Qi for trying to teach us some Chinese; Sven Köhler, not only for being a great collaborator in fighting Hadoop for Chapter 3, but also for being a great friend for more than 10 years. I thank Dave Thau for helping me a lot with my talk at the EDBT Ph.D. Workshop; and Manish Anand for, among other things, saving my life in Portland with delicious cookies. Also, thank you, David Welker, for some legal advice; and Haifeng Zhao, whose friends showed me around in Shanghai. For always inspiring interactions, I want to thank Earl T. Barr and Jedidiah R. Crandall. I also want to thank fellow graduate students outside the Database lab. These are all great people and I am very grateful for my very pleasant interactions with them: James Shearer, Yuan Niu, Ananya Das, Till Stegers, and Jim Bosch.

I further want to thank Prof. Kai-Uwe Sattler, my thesis advisor during my Master's studies in Ilmenau. Without his support, I would not have been able to come to Davis. Special thanks also go to Prof. Horst Salzwedel: without his generous offer to live at his house in Palo Alto during an internship, I would not have been infected with the California virus. I also thank Colin K. Mick and Ulla Mick for their help and support during the first 5 months I spent in California.

I am also deeply thankful for the many friends I made here in Davis: Abhijeet Kulkarni, Tina Schütz, Francesca Martinez, Jeff Stuart, Tony Bernadin, Dan Alcantara, Mauricio Hess, Jay Jongjitirat, Jeff Wu, and many more. I especially thank Zach Grounds for being such a good friend and apartment-mate. I also thank Conny Franke, who was up for the adventure to apply for a Ph.D. program in California and accompanied me in the early phases of my Ph.D.

Finally, and most importantly, I have deep gratitude for my family. I thank my mum, Christel Zinn, for her deep love, dedication, open-mindedness, and all she did for me throughout my life. Without her extensive care during my early childhood, I would not have lived to see my first day at school. She encouraged me to pursue higher education and stay in school, since "there is no hurry to get out of school, you will have to work afterwards for the rest of your life". I am also deeply thankful to my dad, Gerhard Zinn, who, besides many other things, introduced me to computers and the joy of math. I am very thankful for my dad's patience and excellence in explaining technical and logical matters, even when he helped me take my first steps programming in BASIC. I also thank his wife Ursel Zinn, especially for being brave enough to read over my Master's thesis written in English. I further thank my wonderful brother, Enrico Zinn, for being the best brother ever. I also thank Ines Greiner-Hiero for being a wonderful friend since I was eleven years old.

Moreover, I want to thank my loving grandparents, Karl and Lonny Sesselmann. My grandpa was a magnificent person who was not as fortunate as me to have had the possibility of a good education. He nevertheless was a great teacher for me. During the last two years, I was able to talk with my grandma almost every day. I will miss her support, and I wish she could have lived to see me accomplish this goal.

Last, but not least, I would like to thank my amazing girlfriend, Tu Anh Ngoc Huynh. Her understanding, encouragement, patience, and love are my endless sources of energy and happiness.

Contents

List of Figures   xi
List of Listings   xiii
List of Tables   xiv

Structure and Contributions   1

1 Introduction   2
   1.1 Problem Statement   3
   1.2 Script-based Approaches   5
   1.3 Scientific Workflow Approach   7
       1.3.1 Examples   8
       1.3.2 Scientific Workflow Terminology   11
       1.3.3 Advantages   12
       1.3.4 Limitations   14
   1.4 Collection-Oriented Modeling and Design (Comad)   17
       1.4.1 Advantages   19
       1.4.2 Limitations   22
   1.5 Towards Virtual Data Assembly Lines   23
       1.5.1 Research Questions   24
   1.6 Detailed Description of Contributions   25

2 Improving Scientific Workflow Design with Virtual Data Assembly Lines   30
   2.1 Workflow Design Challenges   31
       2.1.1 Parameter-Rich Functions and Services   32
       2.1.2 Maintaining Data Cohesion   34
       2.1.3 Conditional Execution   37
       2.1.4 Iterations over Cross Products   39
       2.1.5 Workflow Evolution   40
   2.2 Virtual Data Assembly Lines (VDAL)   41
       2.2.1 Inside VDAL   42
       2.2.2 VDAL Components and Configurations   45
       2.2.3 Example: VDAL Actor Configurations   50
   2.3 Design Challenges Revisited   50
       2.3.1 Parameter-rich Black Boxes   50
       2.3.2 Maintaining Data Cohesion   52
       2.3.3 Conditional Execution   54
       2.3.4 Iterations over Cross Products   55
       2.3.5 Workflow Evolution   56
   2.4 Related Work   56
   2.5 Summary   58

3 Optimization I: Exploiting Data Parallelism   59
   3.1 Introductory Example   60
   3.2 MapReduce   62
   3.3 Framework   64
       3.3.1 XML Processing Pipelines   65
       3.3.2 Operations on Token Lists   67
       3.3.3 XML-Pipeline Example   69
   3.4 Parallelization Strategies   70
       3.4.1 Naive Strategy   70
       3.4.2 XMLFS Strategy   72
       3.4.3 Parallel Strategy   80
       3.4.4 Parallel Strategy in Detail   81
       3.4.5 Summary of Strategies   84
   3.5 Experimental Evaluation   85
       3.5.1 Comparison with Serial Execution   87
       3.5.2 Comparison of Strategies   89
   3.6 Related Work   91
   3.7 Summary   94

4 Optimization II: Minimizing Data Shipping   96
   4.1 ∆-XML: Virtual Assembly Lines over XML Streams   97
       4.1.1 Types and Schemas   98
       4.1.2 Actor Configurations   101
       4.1.3 Type Propagation   104
   4.2 Optimizing ∆-XML Pipelines   105
       4.2.1 Cost Model   106
       4.2.2 X-CSR: XML Cut, Ship, Reassemble   106
       4.2.3 Distributor and Merger Specifications   110
   4.3 Implementation and Evaluation   111
       4.3.1 Experimental Setup   112
       4.3.2 Experimental Results   113
   4.4 Related Work   118
   4.5 Summary   119

5 Implementation: Light-weight Parallel PN Engine   120
   5.1 General Design Decisions   120
       5.1.1 Workflow Setup   123
       5.1.2 Workflow Run   126
   5.2 VDAL-specific Functionality   127
   5.3 Kepler as a PPN GUI   128
       5.3.1 Architecture   129
       5.3.2 PPN Monitoring Support in Kepler   130
       5.3.3 Communication with Kepler Actors   131
       5.3.4 Demonstration: Movie Conversion Workflow   133
   5.4 Summary and Related Work   134

6 Static Analysis I: Supporting Workflow Design   135
   6.1 Design Use Cases   136
   6.2 Well-formed Workflows   138
       6.2.1 Review: Virtual Assembly Lines   138
       6.2.2 Notions about Well-formed Workflows   139
   6.3 Compilation of VDAL to FLUX   141
       6.3.1 Necessary FLUX Extensions   141
       6.3.2 Rewriting VDAL to FLUX   143
   6.4 Static Analysis for FLUX-compiled VDAL Workflows   145
   6.5 Discussion and Related Work   147
   6.6 Future Work: Workflow Resilience   148
       6.6.1 Input Resilience   149
       6.6.2 Resilience against Workflow Changes   149
       6.6.3 Inserting Actors   149
       6.6.4 Deleting Actors   150
       6.6.5 Replacing Actors   150

7 Static Analysis II: Towards Deciding Workflow Equivalence   152
   7.1 Relation to Conventional Regular Expression Types   154
   7.2 General Notions   155
       7.2.1 Data Model   155
       7.2.2 Core XQuery Fragment XQ   156
       7.2.3 Expressive Power of XQ   157
       7.2.4 Regular Expression Types   159
       7.2.5 Conventional Type Propagation   159
   7.3 General Idea of Possible-Value Types   161
   7.4 Possible-Value Types   162
       7.4.1 Propagation Rules for XQ   168
       7.4.2 From PV-Typings to Query Equivalence   176
   7.5 Equality of String Polynomials   179
       7.5.1 Restriction to a two-letter alphabet   180
       7.5.2 Reduction to Restricted Equivalence   181
       7.5.3 Simple Normal Form   181
       7.5.4 Alternating Normal Form   182
       7.5.5 Collecting Exponents into Big Polynomials   185
       7.5.6 Comparing Lists of Monomials   187
       7.5.7 Towards Distributive Alternating Normalform   188
       7.5.8 Deciding M1^P ≡≥c M2^Q for dist-minimal Mi   191
       7.5.9 Summary of Findings and Future Steps   191
   7.6 Undecidability of Value-Difference for PV-Types   192
   7.7 Undecidability of Query-Equivalence for XQdeep-EQ   195
   7.8 Related Work   196
   7.9 Summary   198

8 Concluding Remarks   200

Bibliography   203

List of Figures

1.1 Promoter identification workflow (from [ABB+03])   9
1.2 Monitoring workflow (from [PLK07])   11
1.3 Snapshot of a Comad execution   18
1.4 Conceptual model of Comad-actor execution   19
1.5 Workflow dependency on input structure (adapted from [MBL06])   20
1.6 Three-layer Comad architecture   23

2.1 Parameter-rich service, record assembly and disassembly   32
2.2 Record-handling shims   35
2.3 Maintaining nesting structure using array tokens and additional loops   36
2.4 Dataflow model for "if (Test1 or Test2) then A"   38
2.5 Conventional cross-products   39
2.6 Architectural differences of conventional versus VDAL   41
2.7 VDAL Actor Anatomy   43
2.8 Dataflow inside VDAL Actor   44
2.9 Example grouping via binding expression in γ   47
2.10 Syntax for FLUX (adapted from [Che08])   49
2.11 Blackbox and VDAL actor configuration   49
2.12 Linear workflows   52
2.13 Hierarchical data used in phylogenetic workflow   52
2.14 Maintaining nesting structure   53
2.15 Localizing if-then-else routing via XML attributes   55
2.16 Cross-products in VDAL   55

3.1 XML Pipeline Example   60
3.2 Example splits and groups   65
3.3 Image transformation pipeline   69
3.4 Processes and dataflow for the three parallelization strategies   71
3.5 Parallel re-grouping   81
3.6 Serial versus MapReduce-based execution   88
3.7 Runtime comparison of compilation strategies   90

4.1 Simple schema with examples for several concepts   101
4.2 X-CSR overview: Standard versus optimized with schemas   108
4.3 X-CSR experiments standard versus optimized   115

5.1 Detailed workflow execution times (collection structures only)   126
5.2 Detailed workflow execution times (collections and data)   126
5.3 General Architecture of Kepler-PPN Coupling   129
5.4 Kepler-PPN Coupling   132
5.5 Demonstrating Communication between Kepler and PPN   132
5.6 Kepler-PPN Coupling in action   133

6.1 Components and dataflow inside VDAL actor   138
6.2 Example for VDAL actor configuration   144
6.3 FLUX-Code corresponding to VDAL actor given in Fig. 6.2   145
6.4 Generating Required-For Relation   147

7.1 Semantics of XQ   158
7.2 Commutativity diagram for PV-Typings   161
7.3 Semantics [[.]]v for XML pv-types without free indexes   165
7.4 Propagation Rules for XQ (constraint-free pv-types)   170
7.5 Lift Algorithm for XML pv-types   177
7.6 Deciding query equivalence with input restrictions for XQ   178
7.7 Algorithm to transform into alt-NF   186
7.8 Helper Algorithms for alt-NF   186

List of Listings

3.1 Split, Map, Reduce for Naive strategy   72
3.2 Split and Map for XMLFS   78
3.3 Split for XMLFS & Parallel   79
3.4 Map and Reduce for Parallel   83
3.5 Group and sort for Parallel strategy   84
4.1 X-CSR algorithm for statically computing distributor specifications   110
5.1 Actor class declaration   122
5.2 Port class declaration   123
5.3 Sample PPN workflow setup script   124
5.4 Sample schema declaration   127
5.5 Sample signature declaration   127
5.6 Sample description of synthetic data   128
5.7 Kepler configuration file of existing PPN actors   130

List of Tables

3.1 Main differences for compilation strategies   85
4.1 X-CSR standard versus optimized: Savings on shipping and execution time   114
6.1 Actor definitions: Reading versus blind   139
6.2 Actor definitions: Feeding versus starving   140


Structure and Contributions

In Chapter 1, we motivate and introduce scientific workflows and provide necessary background information. Chapters 2–5 describe Virtual Data Assembly Lines (VDAL) and present optimization strategies for their execution; Chapters 6 and 7 present the more theoretical part of this work, where we consider static analysis of VDAL workflows.

In particular, Chapter 2 presents the VDAL paradigm. We outline concrete design challenges and show, based on example scenarios, how VDAL workflows address these challenges and are thus easier to design, maintain and evolve than workflows based on plain dataflow network primitives. In Chapters 3 and 4, we describe approaches to enhance execution efficiency: Chapter 3 analyzes how the execution of VDAL workflows can be enhanced by exploiting data parallelism. Here, we show how a MapReduce framework can be used to execute VDAL workflows in a cluster environment. In Chapter 4, we present a type system for workflows operating on ordered collections, and show how we can optimize data shipping by analyzing data dependencies based on information provided by the actors. Chapter 5 describes the workflow execution engine developed for this dissertation. Then, Chapter 6 describes how VDAL workflows can be translated into XML-processing programs written in the XML update language FLUX. We further define concepts related to workflow well-formedness and show how existing type systems for XML languages can be used to answer important design questions. In Chapter 7, we extend existing typing approaches for XML languages by developing a sound and complete type system for a core language for XML processing. We further introduce the problem of "String-Polynomial-Equality" and show that it is at the core of deciding query equivalence for XQuery with an ordered data model. Our current results towards solving this problem conclude this chapter. Chapter 8 summarizes this work and outlines future research opportunities.

This dissertation is based on the following publications: Chapter 1: [Zin08, MBZL09]; Chapter 2: [ZBML09a]; Chapter 3: [ZBKL09]; Chapter 4: [ZBML09b]; Chapter 5: [ZLL09]; and Chapter 6: [ZBL10].


Chapter 1

Introduction

Imagination is more important than knowledge. For while knowledge defines all we currently know and understand, imagination points to all we might yet discover and create.
Albert Einstein

Progress in computing technology has contributed greatly to accelerating scientific discovery [HTT09]. Along with experiments conducted in the field or in the lab, computation and data analysis have been established as new sources of ideas and inspiration, and as vital techniques to validate scientific hypotheses. For example, discovering the relationships of living organisms via phylogenetic analysis is based on large gene- or protein-sequence databases, algorithms for aligning sequences from different species, and sophisticated evolutionary models, which are then solved computationally. Furthermore, computer simulations are used in particle physics to validate hypotheses, and in earth sciences to, for example, forecast our weather. In observational physics, such as astronomy, global virtual observatories are built to collect, process and analyze the massive amounts of data being captured. In short, experiments done in silico have become an asset on par with in vitro and in vivo experimentation. The term e-Science [HT05] is used to describe such data- or computation-intensive sciences. Here, to make scientific discoveries, algorithms need to be created and composed


into larger studies, e.g., analysis pipelines, which then need to be deployed and executed. Consequently, in addition to traditional data management tasks, problems that are common to software engineering, such as writing robust and bug-free code, creating re-usable and easy-to-maintain software, etc., need to be addressed in these application domains.

1.1 Problem Statement

When conducting computational science experiments, or e-Science, we can distinguish two main parts: (1) the scientific challenge of having the right data analysis or simulation idea, and (2) the engineering challenge of going from this idea, or conceptual plan, to an actual implementation. Of course, these two tasks are intertwined: scientists might develop new hypotheses once more data has been analyzed, and then want to test these hypotheses via actual data analysis or simulations, creating more ideas for possibly new hypotheses. The engineering challenge is to transform their ideas into executable programs. In the case of scientific data analysis, this often requires the integration of multiple domain-specific tools and specialized applications for data processing.

This work considers this problem of integrating existing software components to build complex scientific analysis systems. Considering human and machine time as valuable resources, we present approaches that use them as efficiently as possible. In particular, to support scientists and developers, the systems should be easy to build, evolve and maintain, as well as easy to share and re-use. To save computing resources and to provide the scientists with results quickly, the systems should execute efficiently while utilizing available computing resources. This component-integration problem, embedded into an e-Science context, poses the following challenges:

Designing complex systems is hard. Software systems for e-Science are often complicated, large-scale applications. Typically, several complex steps, e.g., sophisticated algorithms, need to be used together. Therefore, problems common to complex software systems in general emerge: in particular, it is not clear how to structure and develop components of such systems in a way that they can be easily re-used, that the dependencies among them are minimized, or that components can be developed independently by different groups.

Managing data is hard. In e-Science, the nature of the data that is processed poses additional challenges. For example, particle physics simulations often generate large amounts of data that need to be monitored or analyzed. In the Center for Plasma Edge Simulation project [CPE], for example, experiments typically produce around 40 GBytes of data to be analyzed every hour. In the Large Synoptic Survey Telescope project [LSS], the telescope under construction will generate on the order of 15 TBytes of data every night! In addition to being large, scientific data can also be semantically rich (such as data in the life sciences) and organized into complex hierarchies. In phylogenetic workflows, for example, phylogenetic trees are inferred from sets of gene sequences of different living organisms; managing these collection structures also poses challenges. Furthermore, data lineage and other provenance information are becoming increasingly important, e.g., to validate results and ensure their reproducibility. Effective and efficient data provenance management poses new challenges [ABML09].

Building distributed systems is hard. E-Science problems exhibit many characteristics that require building the corresponding systems as distributed or parallel applications. They deal with very large amounts of data, or are computationally very expensive, or both. Thus, using clusters or supercomputers is often the only way to keep execution times feasible. Furthermore, the systems themselves are usually inherently distributed. Scientists, located all over the world, want to access and process data gathered at different locations, such as in astronomical observatories, or during experiments conducted on very expensive machinery (e.g., particle accelerators) available at only a few locations in the world. In the biological sciences, access to remote databases containing gene and protein sequences or data about biological pathways is essential. As is well known in computer science, building distributed systems is hard. Since large-scale distributed systems can seldom be built using one single programming language, problems arise due to different representations of data or function-calling conventions. Having separate address spaces makes parameter passing and global variables expensive and complex. Since there are separate processes, there is no common control flow, which then raises the need for synchronization between them. Single components (e.g., cluster hosts) can fail due to hardware errors. The more components a system is built of, the less likely it is that all of the components will work properly. Therefore, distributed systems usually employ special algorithms to tolerate partial failures, which again increases their complexity. Moreover, using different hardware usually results in heterogeneous data representations. Furthermore, problems arise from the communication between components themselves: data transport is more expensive (in time, and possibly also in money) than in a local environment, and due to open communication channels, security issues may also arise.

1.2 Script-based Approaches

A common approach to integrate scientific software is to use scripting languages that "glue" together already existing components. In the Large Synoptic Survey Telescope project [LSS], for example, complex analysis steps are developed in C++, while Python is used to orchestrate components at a higher level. For the bio-sciences, an extensive toolkit of Perl modules called BioPerl [SBB+02, Bio09] has been built for the purpose of making it easy to combine components for data retrieval (e.g., from remote biological databases) with data analysis steps (e.g., via local programs). While scripting languages are broadly applicable, they have shortcomings:

Programming expertise in C++/Perl/Python necessary. Even if the overall software system is very well structured, a high level of expertise in these languages (C++, Perl, or Python) is often necessary to build scientific applications.

No specific component model. General-purpose languages, and scripting languages in particular, do not define specific component models. While a general object-oriented paradigm is usually applied, it is often not clear how to structure the components into classes and what their interactions should be. This ambiguity often prevents scientists who are not trained in the scripting language from extending the libraries or taking full advantage of the framework. A more constrained model of computation could give more guidelines for component creation and re-use. Such a model of computation could also provide more support for designing applications from components, and for application evolution.

No automatic provenance. Since data is handled explicitly by the scientist, it is often hard to track data provenance (also called lineage) within these scripting languages. For a scientist, however, it is very important to record how and from which raw data some final data product has been derived. While such a feature could theoretically be added to a general-purpose scripting language, for example based on Perl's taint-checking mechanism, it is not clear how well such a provenance system would integrate with existing libraries.

Hard to utilize distributed or parallel resources. In scripting environments it is often hard to utilize distributed resources, e.g., when distribution has to be explicitly programmed by the scientist. Even if high-level libraries are used, additional pieces of code clutter the scientific logic and can lead to solutions that are hard to adapt to different resources.

Low-level textual representation. There is no standard on how to write scientific applications in scripting languages. Especially in Perl, with its motto "There's more than one way to do it", equivalent programs can be written and structured in many different ways. While this could be desirable for a general-purpose language, it does


not directly facilitate an easy understanding of the program semantics. Programs that look very different can essentially compute the same products (in fact, program equivalence is undecidable for general-purpose programming languages due to Rice's theorem), which can complicate sharing and peer-reviewing of programs. Furthermore, certain dependencies (e.g., dataflow) are easier to comprehend in graphical than in textual form.

Little or no automated design support. When writing in a scripting language, the system's support for checking the correctness of applications is often very limited. For example, BioPerl provides no compile-time type-checking: errors are therefore found only at run-time, which greatly hampers productivity, especially in complex projects. Scripting languages also provide little help to create new applications from existing ones. In scripting languages, it is tempting (and therefore common) to copy-paste code fragments when evolving the script over time (even in object-oriented programming, copy-and-paste is common [KBLN04]). However, these languages provide few mechanisms to ensure that the pasted code fragment will work in the new environment. Other than checking whether variables are declared prior to their use, no dataflow analysis is performed to detect programming errors resulting from copy-paste.

1.3 Scientific Workflow Approach

In view of these shortcomings, scientific workflow systems [FG06, TDGS07, DGST08, LAB+09] have emerged in recent years as tools for domain scientists to integrate existing components into larger analysis systems. That is, they support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis frameworks from existing building blocks, including algorithms that are available as locally installed software packages or globally accessible web services. Compared to script-based approaches, scientific workflow systems promise to offer a number of advantages, including built-in support for recording and querying data provenance [DBE+07], and for deploying workflows on cluster or Grid environments [Dee05].

In many kinds of scientific workflows, data is of primary concern. Computation is often used to derive new data from old, e.g., to infer phylogenetic trees from gene sequences, or to remove artifacts from astronomical images that can then be used to detect new objects in the sky. Therefore, many scientific workflow systems are data-driven or even adopt dataflow-oriented models of computation to describe scientific applications. Here, computational steps are represented as nodes, and dataflow between these tasks is made explicit using edges between them. A scientific workflow tool then allows the scientist to select certain methods (actors), place them on a canvas, and connect the output of one node with inputs of others. The generated graph then represents the executable application for performing the (complex) scientific analysis.

1.3.1 Examples

We will now have a closer look at two example workflows to describe basic concepts and abstractions in dataflow-oriented workflow systems. Both workflows have been created using the Kepler system [LAB+06].

Promoter identification workflow. Fig. 1.1 shows a screen shot of the Kepler workflow system displaying the Promoter Identification Workflow [ABB+03]. This workflow is used by a biologist to identify likely transcription factor binding sites in a series of genes. The process involves a series of tasks, such that performing the same series manually for each of a few dozen genes can be quite a repetitive and time-consuming process [Kep04a].

The screen shot shows major components of most current scientific workflow systems. In the main area, the components (or actors) are placed on a workflow-design canvas. They are connected with each other via channels through which data objects flow during workflow execution. Each actor can have multiple input and output ports, similar to functions that can have multiple input parameters. Many scientific workflow tools, including Kepler, also support nested workflows: Fig. 1.1 shows the model of the top-level workflow and the model for the actor GeneSequenceProcessing. Nested subworkflows (a.k.a. composite actors) contain formal ports as their defined interface. These are shown in the top-right part of the image: the three black connection-endpoints on the far right are formal output ports, and the one black connection endpoint on the left is a formal input port. Accordingly, there is one actual input port and three actual output ports on the composite actor instance (here: GeneSequenceProcessing) in the main workflow. This approach defines clear interfaces to the rest of the system. The interaction between different actors is performed through data that is flowing through their ports. By typing input and output ports, the actor developer can specify additional restrictions on the data, similar to the types used in function signatures.

On the left-hand side of Fig. 1.1, a library of actors and data sources is shown. From here, the user can select available algorithms and suitable data to be placed onto the canvas. In addition to input ports, actors can also have parameters as a form of configuration. Values for parameters are typically set by double-clicking on the actor instances and are valid for a complete workflow run. Once a workflow has been built, it can be executed to perform the computations (data integration, analysis, visualization, etc.) as defined by the workflow graph and model of computation.

Figure 1.1: Promoter identification workflow (from [ABB+03])

CPES Workflow for processing simulation data and archival.

Fig. 1.2 shows a subworkflow of a larger "monitoring workflow" that is used in the Center for Plasma Edge Simulation for processing simulation data and archival [PLK07]. Among other things, this workflow is used to automate the submission of jobs to a supercomputer. The overall workflow controls data transport from a supercomputing center to other sites, e.g., to scientists' home universities and backup storage. This workflow is used to automate tasks that scientists previously did manually. In contrast to the first workflow, which actually performs data analysis, this "plumbing" workflow orchestrates the simulation and data analysis on a supercomputer and automates data transport [LAB+09].


Figure 1.2: Monitoring workflow (from [PLK07])

1.3.2 Scientific Workflow Terminology

A scientific workflow is a description of a process, usually in terms of scientific computations and their dependencies [LBM09], and can be visualized as a directed graph whose nodes (also called actors) represent workflow steps or tasks, and whose edges represent dataflow and/or control-flow [DGST09]. A basic formalism is to use directed acyclic graphs (DAGs), where an edge A→B means that actor B can start only after A finishes (a control-flow dependency). With such a DAG-based workflow model (e.g., used by Condor/DAGMan [CKR+07]) one can easily capture serial and task-parallel execution of workflows, but other data-driven computations, such as data streaming, pipeline parallelism, and loops, cannot be represented directly in this model. In contrast, dataflow-oriented computation models, like those based on Kahn's process networks [Kah74], incorporate pipeline-parallel computation over data streams and allow cyclic graphs (to explicitly model loops). This model underlies most Kepler workflows and, mutatis mutandis, applies to other scientific workflow systems with data-driven models of computation as well (e.g., Taverna [OGA+02], Triana [TSWH07], Vistrails [BCC+05, FSC+06], etc.). In these dataflow-oriented workflow models, each edge represents a unidirectional channel, which connects an output port of an actor to an input port of another actor. Channels can be thought of as unbounded queues (FIFO buffers) that transport and buffer tokens that flow from the token-producing output port to the token-consuming input port.

For workflow modeling and design purposes, it makes sense to distinguish different kinds of ports: a data port (the default) is used by an actor A to consume (read) or produce (write) data tokens during each invocation (or firing) of A. In contrast, a control port of A is a special input port whose (control) token value is not used by A's invocation, but which can trigger A's execution in the first place. An actor parameter can be seen as a special, more "static" input port from which data usually is not consumed upon each invocation, but rather remains largely fixed (except during parameter sweeps). While actor data ports are used to stream data in and out of an actor, actor parameters are typically used to configure actor behavior, set up connection or authentication information for remote resources, and so forth. A composite actor encapsulates a subworkflow and allows the nested workflow to be used as if it were an atomic actor with its own ports and parameters.

While the Kahn model is silent on the data types of tokens flowing between actors, practical systems often employ a structured data model. However, in practice, when actors implement web service calls or external shell commands, data on the wire is often of type string or file, even if the data conceptually has more structure. Kepler natively employs a model with structured types (inherited from Ptolemy [BLL+08]), including records and arrays. This creates many options when sending data from one actor to another. For example, a list of data can be sent from one actor to another in a single array token, or as a stream of tokens corresponding to the elements of the list. Similarly, large record tokens can be assembled and later broken into smaller fragments. These choices can in fact complicate workflow design (see below), whereas the proper use of a serializable, semistructured model such as XML allows a more flexible and uniform data treatment.
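To make the process-network picture above concrete, the following is a minimal sketch (not taken from Kepler or from the engine developed in this dissertation) of actors connected by unbounded FIFO channels; the actor behaviors and token values are invented purely for illustration.

```python
# Minimal process-network sketch: each actor runs concurrently, repeatedly
# consuming one token from its input channel, firing, and emitting a result
# token downstream. Channels are FIFO queues, so the output order is
# deterministic even though the stages overlap in time (pipeline parallelism).
import queue
import threading

END = object()  # sentinel token that closes the stream

def actor(fire, in_q, out_q):
    while True:
        token = in_q.get()
        if token is END:
            out_q.put(END)
            return
        out_q.put(fire(token))

# Three channels wiring a source, two serially connected actors, and a sink.
c1, c2, c3 = queue.Queue(), queue.Queue(), queue.Queue()
stages = [threading.Thread(target=actor, args=(str.upper, c1, c2)),
          threading.Thread(target=actor, args=(lambda s: s + "!", c2, c3))]
for t in stages:
    t.start()

for tok in ["acgt", "ttga", "ccat"]:   # data tokens entering the pipeline
    c1.put(tok)
c1.put(END)

while (tok := c3.get()) is not END:    # sink: consume results in stream order
    print(tok)
for t in stages:
    t.join()
```

Even this toy version illustrates why no explicit synchronization is required of the workflow author: blocking reads on the FIFO channels are the only coordination between stages.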

1.3.3 Advantages

Scientific workflow systems already provide a number of advantages over purely traditional, e.g., script-based, approaches:


Component model. Using the abstraction of actors, scientific workflow systems provide uniform access to different, already existing software. There are, for example, actors that represent web services, actors that call local applications, as well as actors that invoke R or Matlab scripts. Once the components have been wrapped as actors, complex systems can in principle be built from these by simply placing them on a canvas and defining connections between them. To transport data from one component to the other, only a wire has to be "drawn" in the scientific workflow user interface.

Data source discovery. Scientific workflow systems often support the discovery of relevant data sources. Here, too, unified access to databases, files, or web services can be provided. While such a feature could also be provided for scripting languages by using a separate tool, the integrated environment of a workflow tool can be more convenient for the scientist.

Provenance framework. Since dataflow is modeled explicitly in scientific workflow tools, provenance frameworks attached to the system itself are able to record and provide provenance information. There is no need for the scientist to explicitly record or store provenance information as part of the workflow. Instead, this functionality is completely provided by the workflow system [ABJF06, BML+06, BML08, ABML09, ABL09].

Semantic types. Workflow systems often provide additional features to support the scientist in building workflows. An approach for adding semantic types to workflow systems has been proposed and implemented in the Kepler system [BL04]. This helps to integrate data sources and actors based on semantic information as opposed to only structural types such as strings or arrays of integers. Users can leverage semantic type information by checking whether actors are compatible with each other, or by finding actors that operate on certain data in a large library.

Parallelism. Since a dataflow paradigm is employed, workflows naturally exhibit pipeline and task parallelism. Lee et al. show in [LP95] that the process network model of computation is deterministic with few synchronization requirements. The system can therefore execute the workflows with a high level of parallelism without requiring the scientist to think about synchronization.

1.3.4 Limitations

Despite the fact that workflow systems provide the mentioned advantages over script-based solutions, several problems and challenges remain. In the following, we present workflow requirements that are not, or only partially, met by current workflow systems. These are based on several years of experience with real-world use cases (e.g., see [KLN+07, LAB+06, PLK07]). For a more detailed explanation of them, see [MBZL09].

Little Support for Data Modeling

Scientific workflows usually operate on large amounts of often structured and nested data. The various formats which are in use for storing structured data (NetCDF [Net], Nexus [MSM97], XML, etc.) reflect this observation. Therefore, a scientific workflow tool should support nested data types and should assist the scientist with data modeling. In conventional workflow systems, such as Kepler [AJB+04] and Triana [TSWH07], however, capabilities to model scientific data are rather limited. Simple, basic types are often used on the channels between actors. To represent, for example, a list of gene sequences, one can create an array token in Kepler which contains the gene sequences. Not only is an actor needed to create the array token, but it is also necessary to introduce other special actors once a particular operation is to be performed on each element of the array. The workflow presented in Fig. 1.1, for example, makes use of this technique. Low-level components such as the SequenceToArray and ArrayToSequence actors "clutter" the workflow. These data assembly and disassembly actors are not on the same level as "scientific actors" such as RunClustalW. Workflows mixing in such data-management actors therefore tend to lose their self-explanatory character. In many applications, however, it is necessary to group together one sort of data for providing it as a whole to the next component: to infer a phylogenetic tree, for example, a list of aligned sequences is needed.

Scientists familiar with the specific domain should be able to quickly grasp which tasks and methods are used within a workflow. Furthermore, self-explanatory workflows could be used as a means of communication between scientists. Just as UML is used as a "unified" way to communicate about object-oriented design, self-explanatory workflow graphs could be used to support communication about data-driven scientific procedures. In fact, web-based repositories, like myExperiment [GDR07], already provide a place to store, discuss and share scientific workflows.

Workflow Designs may be Brittle

Scientific workflows should tolerate certain changes in the structure of their input data, i.e., they should exhibit a certain degree of input resilience. A workflow that was, for example, created to work on a single data set of type T should also be usable if the scientist wants to apply the workflow to a series of data sets of type T. Also, a scientist might have many of these data sets, which could themselves be structured by the projects they belong to, by the methods that were used to derive them, or by any other criteria. In practice, the directory structure on the hard disk represents such an organization. Here, a reasonable desideratum for a scientific workflow tool is to take such a whole structure as input and perform the workflow on the data sets without destroying the organizational structure.

Furthermore, scientific workflows should be easy to modify. Adding new components, removing (possibly non-vital) components, or replacing components by structurally equivalent ones should be possible and easy to do for the user. The workflow tool should be able to tolerate certain changes and predict consequences of other changes that invalidate the workflow design. In the workflows shown in Fig. 1.1 and Fig. 1.2 it is, for example, hard to determine where to add new components or which components are not vital for the workflow run (e.g., the Display actor) due to the complex wiring. Unfortunately, all wires in the workflow are carefully placed and necessary once it is modeled at this level of abstraction. We will therefore argue for raising the level of abstraction of the workflow graph in order to allow easier modifications.

Optimization is not Performed Automatically

The workflow system should be able to optimize workflow execution performance. Much of the impetus for developing scientific workflow systems derives from the need to carry out expensive computational tasks efficiently using available and often distributed resources. Workflow systems are used to launch, distribute and monitor jobs, move data, manage multiple processes, and recover from failures. One approach often taken today is to specify these tasks within the workflow itself, as shown in Fig. 1.2. The result is that scientific workflow specifications can become cluttered with job-distribution constructs that hide the scientific intent of the workflow. Workflows that confuse systems management with scientific computation are difficult to design in the first place and extremely difficult to re-deploy on a different set of resources. Even worse, requiring users to describe such technical details in their workflows excludes many scientists who have neither the experience nor the interest in playing the role of a distributed operating system. Systems should not require scientists to understand and avoid concurrency pitfalls (e.g., deadlock, data corruption due to concurrent access, race conditions) to take full advantage of available parallel computing infrastructure. Rather, workflow systems should safely exploit as many concurrent computing opportunities as possible, without requiring users to understand them. Ideally, workflow specifications would be abstract and employ principles and metaphors appropriate to the domain rather than including explicit descriptions of data routing, flow control, and pipeline and task parallelism. As we will see in Chapters 3 and 4, the approach presented in this dissertation can satisfy this requirement.

1.4 Collection-Oriented Modeling and Design (COMAD)

Collection-Oriented Modeling and Design (Comad) [MB05, MBL06], a special way of developing scientific workflows, has been proposed to address many of the shortcomings described in the previous section. Since our approach will extend Comad, we now briefly describe the Comad idea, its advantages and drawbacks.

As mentioned in Section 1.3, the majority of scientific workflow systems represent workflows using dataflow languages. The specific dataflow semantics used, however, varies from system to system [YB05]. Not only do the meanings of nodes and of connections between nodes differ, but also the assumptions about how an overall workflow is to be executed given a specification can vary dramatically. Kepler makes an explicit distinction between the workflow graph on the one hand, and the model of computation used to interpret and enact the workflow on the other, by requiring workflow authors to specify a director for each workflow. It is the director that specifies whether the workflow is to be interpreted and executed according to a process network (PN), synchronous dataflow (SDF), or other model of computation [LSV98]. Most Kepler actors in PN or SDF workflows are data transformers. Such actors consume data tokens and produce new data tokens on each invocation; these actors operate like functions in traditional programming languages. Other actors in a PN workflow can operate as filters, distributors, multiplexors, or otherwise control the flow of tokens between other actors; however, the bulk of the computing is performed by data transformers.

Assembly-line metaphor. In Comad, the roles of actors and of connections between actors are different from those in PN or SDF. Instead of assuming that actors consume one set of tokens and produce another set on each invocation, Comad is based on an assembly-line metaphor: Comad actors (coactors or simply actors below) can be thought of as workers on a virtual assembly line, each contributing to the construction of the workflow product(s). In a physical assembly line, workers perform specialized tasks on products that pass by on a conveyor belt. Workers only "pick" relevant products, objects, or parts

Figure 1.3: Snapshot of a COMAD execution. An intermediate snapshot of a run of a Comad phylogenetics workflow: (a) the logical organization of data at an instant of time during the run; and (b) the tokenized version of the tree structure, in which three modules (i.e., actors) are being invoked concurrently on different parts of the data stream. In Comad, nested collections are used to organize and relate data objects that instantiate domain-specific types (e.g., denoting DNA sequences S, alignments A, and phylogenetic trees T). A Proj collection containing two Trial sub-collections is used here to pipeline multiple sets of input sequences through the workflow. In Comad, provenance events for data and collection insertions, insertion dependencies, and deletions (from the stream) are added directly as metadata tokens to the stream (b), and can be used to induce provenance data-dependency graphs (a). From [MBZL09].

Coactors work analogously, recognizing and operating on data relevant to them, adding new data products to the data stream, and allowing irrelevant data to pass through undisturbed (see Fig. 1.3). Thus, unlike actors in PN and SDF workflows, Comad actors are data preserving. Data flows through serially connected coactors rather than being consumed and produced at each stage.

Streaming nested data collections. One advantage of adopting an assembly-line approach is that one can put information into the data stream that could otherwise be represented only with great difficulty. For example, Comad embeds special tokens within the data stream to delimit collections of related data tokens. Because these delimiter tokens are paired, much like the opening and closing tags of XML elements, collections can be nested to arbitrary depths, and this generic collection-management scheme allows actors to operate on collections of elements as easily as on single data tokens. Similarly, annotation tokens can be used by coactors to represent metadata for collections or individual data tokens.

Figure 1.4: Conceptual model of Comad-actor execution.

The result is that coactors effectively operate not on isolated sets of input tokens, but on well-defined, information-rich collections of data organized in a manner similar to the tree-like structure of XML documents.

Fig. 1.4 shows the conceptual model of a Comad actor. The input, a structured tree (i.e., XML data), is shown on the left. Determined by the actor's read scope, certain parts of the input are selected and used for the scientific computation performed by the actor. Parts of the input that are not within the read scope are simply ignored by the actor and passed on. After the scientific computation has been done, the results are merged back into the stream; here, the write scope configuration determines where the results are placed in the output stream.
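
To make the delimiter-token encoding concrete, the following minimal sketch (in Python; an illustration only, not Comad code) shows how a flat token stream with paired collection-opening and collection-closing tokens, written here as "<Label" and ">", can be folded back into a nested collection:

    # Illustration only: paired delimiter tokens encode nesting, much like SAX
    # events encode an XML tree; ordinary data tokens are plain strings.
    stream = ["<Proj", "<Trial", "S1", "S2", ">", "<Trial", "S3", ">", ">"]

    def parse(tokens):
        stack = [("root", [])]
        for t in tokens:
            if t.startswith("<"):              # collection-opening delimiter
                stack.append((t[1:], []))
            elif t == ">":                     # collection-closing delimiter
                label, children = stack.pop()
                stack[-1][1].append({label: children})
            else:                              # ordinary data token
                stack[-1][1].append(t)
        return stack[0][1]

    print(parse(stream))
    # [{'Proj': [{'Trial': ['S1', 'S2']}, {'Trial': ['S3']}]}]

A coactor that streams over such tokens can thus reconstruct exactly as much of the nesting as it needs, while forwarding all other tokens unchanged.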

1.4.1 Advantages

The current Comad framework, which is also implemented in Kepler, employs these principles to encourage building self-explanatory and re-usable workflows. In Comad workflows, coactors can work on different parts of the input stream. For this purpose, each actor is configured via a read scope expression. The read scope specifies what data is picked from the stream.

Example 1.1 (COMAD-workflow resilience to input data changes). In the following example, we show how a simple workflow modeled as a conventional dataflow network, as a Comad workflow, and as a functional program has to be modified to cope with changes in the input structure (see Fig. 1.5). We included the functional representation (1) to show the type signatures the actors exhibit for the specific input structures, and (2) to contrast the graphical workflow versions with a textual representation.

Figure 1.5: Workflow dependency on input structure for different modeling approaches (adapted from [MBL06]).

For the sake of presentation, assume a very simple workflow that is composed of two actors A and B, each implementing some scientific functionality. In particular, actor A takes as input tokens of type α and transforms them into tokens of type β, which are then further analyzed by B, yielding the final data products of type γ. From a functional point of view, the actors A and B have the signatures A :: [α] → [β] and B :: [β] → [γ]. Viewing the workflow as the composition of these two functions, the workflow implements a function of type WF = A . B :: [α] → [γ].

As a first variation, not a single list of α tokens is given as input (a), but a collection of lists, each containing α tokens (b).

In the conventional workflow, we would need to add the additional iteration actors labeled with "*" here. The left "*" actor passes each α token to A, collects the β results, and groups them back together into the original collection structure, which is picked up by the right "*" actor that iterates over the β lists. In the functional workflow, this lift is done by the map function. In contrast, since collection opening and closing tokens are kept on the stream and only α and β tokens are "picked up" by A and B, respectively, the Comad workflow does not need to be changed.

Next, imagine that the collection of input data contains not only lists of α tokens, but also lists of γ tokens (c). In the conventional workflow, we need to add special filter actors to ship the α lists to A, whereas the γ lists are routed around the workflow and later merged back into the appropriate places. Note that maintaining the relative ordering of lists either restricts the workflow to processing only one list at a time in between the two routing actors, or requires special delimiting tokens or punctuations. While the first option would reduce the amount of exploitable pipeline parallelism, the second option would call for even more management actors. In the functional world, a custom Switch function would apply either the identity function Id or the workflow from (b), depending on the current list read. Since coactors pass on input data tokens that do not fit their read scope, the Comad workflow can again be applied unchanged.

In the last example (d), the input lists are inhomogeneous, containing α and γ tokens in arbitrary order. While this is similar to case (c), the order of iteration and routing actors is swapped: first, the lists need to be disassembled, and for each element it has to be checked whether it is of the requested type. Again, no change is necessary to the collection-oriented workflow.

Note how this simple example shows that the structure of the input data is usually encoded in the graph structure of conventional workflows: the level of nesting in the input requires an iteration actor, and inhomogeneous input collections require filtering actors.
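
To make the functional column of Fig. 1.5 concrete, the following minimal Python sketch (with hypothetical per-token stand-ins for A and B; not code from this dissertation) shows how the program text has to change for scenarios (a), (b), and (d), which is exactly the kind of restructuring a collection-oriented actor avoids:

    # Hypothetical token representations ("alpha", v), ("beta", v), ("gamma", v)
    # and per-token versions of the scientific steps A and B.
    def a(token):                      # A : alpha -> beta
        _, v = token
        return ("beta", v)

    def b(token):                      # B : beta -> gamma
        _, v = token
        return ("gamma", v)

    def wf(alphas):                    # scenario (a): the pipeline on a flat list
        return [b(a(t)) for t in alphas]

    def wf_nested(collections):        # scenario (b): lift with an explicit map
        return [wf(xs) for xs in collections]

    def wf_mixed(collections):         # scenario (d): a "switch" touches only alpha tokens
        return [[b(a(t)) if t[0] == "alpha" else t for t in xs]
                for xs in collections]

    print(wf_nested([[("alpha", 1)], [("alpha", 2), ("alpha", 3)]]))
    print(wf_mixed([[("alpha", 1), ("phi", "x")]]))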

1.4.2 Limitations

The example given in Fig. 1.5 and the experience gained while modeling in a collection-oriented way indicate that Comad workflows are easier to build, maintain, and evolve. A generalization of the Comad approach, together with a more thorough analysis of this observation and more realistic examples, will be presented in Chapter 2. Furthermore, the current Comad formalization and implementation does not meet the desideratum "Automatic Optimization", and it could also be improved to check workflow "Well-Formedness" and to support workflow design in general.

Comad workflows are executed in a pipelined fashion on a single machine. In Chapter 3 of this work, we investigate more efficient approaches. In particular, we design and evaluate data-parallel execution. Here, we utilize MapReduce [DG08] to make use of distributed resources in a cluster environment. Furthermore, when deploying virtual assembly lines, many data items that are shipped to individual coactors are actually not used by the actor because they are not in its scope. These unnecessary data shipments can be avoided once the workflow system leverages knowledge about the workflow input schema and the actor read and write scopes. We will show in Chapter 4 how a type system can be used to address this shortcoming by automatically transforming an assembly line into a dataflow network with optimized data routing.

Since coactors are only invoked if there is data that matches their scope, it is possible that some actors will actually never execute. From a modeling point of view, it would be nice if the workflow system could warn the scientist about actors that will never execute during a workflow run. This situation might indicate that there is an error in the actor configuration, e.g., a typo when specifying the read scope, or that the data provided as input to the workflow does not conform to the expected schema. By developing a suitable type system, some of these situations of "starving" actors can be detected statically, thus helping scientists to build "well-formed" workflows. We will define concepts that are related to these design desiderata in Chapter 6, and present theoretical work about a precise type system in Chapter 7.

Figure 1.6: Three-layer Comad architecture

1.5 Towards Virtual Data Assembly Lines

Virtual data assembly lines comprise three layers: a workflow graph, the coactor specifications or configurations, and the scientific functions modeled as "black boxes" (Fig. 1.6).

The workflow graph represents a high-level description of the scientific process in the usual dataflow-oriented way: nodes (or actors) represent data creation, processing, or consumption, whereas the edges (called channels) represent data transfer between actors.

The coactor definition layer is specifically designed to deal with data management tasks. Here, the scientist uses a custom language to express read, write, and iteration configurations that bind the existing domain-specific functionality to the data organized in nested collections. In a workflow, this layer is represented as configurations of coactors, for example //ProjA//GeneSequence to select GeneSequence(s) inside a collection called ProjA. As the tokens flowing on the channels can be seen as XML data streamed in a SAX-like manner, we will use methods from XML querying and processing to design this part of the workflow model (see Chapter 6).

Scientific functions wrap already existing scientific software. Here, the large amount of software already available to scientists is modeled as functions that take simple data types, records, or lists of them as input and produce other data as output. The actual data that is necessary for an invocation will be fetched from the coactor's input stream as specified by the coactor layer; similarly, the function's output data will be put back into the stream as specified by the coactor's configuration.
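
For illustration, such a read-scope expression behaves like an ordinary XPath query over the XML view of the stream. The following sketch (using the third-party lxml library and an invented nested collection, purely as an example) selects the GeneSequence items inside ProjA:

    # Illustration only: standard XPath over an invented nested collection.
    from lxml import etree

    stream = etree.fromstring("""
    <Run>
      <ProjA>
        <Trial><GeneSequence>ACGT</GeneSequence></Trial>
        <Trial><GeneSequence>TTGA</GeneSequence></Trial>
      </ProjA>
      <ProjB>
        <GeneSequence>CCCC</GeneSequence>  <!-- outside the read scope -->
      </ProjB>
    </Run>
    """)

    # Read-scope configuration of a coactor: //ProjA//GeneSequence
    matches = stream.xpath("//ProjA//GeneSequence")
    print([m.text for m in matches])       # ['ACGT', 'TTGA']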

1.5.1 Research Questions

The core contributions of this work are to generalize Comad to address the shortcomings described in Sections 1.3.4 and 1.4.2. In particular, we aim at answering the following questions:

• What is a good level of abstraction for data modeling and expressiveness for the configurations? While full XML and XQuery would certainly allow us to move many tasks related to data management into the coactor layer, the semantics of a workflow could become detached from its graph representation. The configurations must be expressive enough to express common domain-specific patterns of computation, but they should still facilitate descriptive and easy-to-understand workflow graphs.

The expressiveness of the configurations also provides constraints and limitations for static analysis. Here, more theoretical questions arise:

• What effect do various language features and different data models have on the static analysis of workflows?

• Is workflow equivalence decidable for an ordered data model?

In addition to ensuring that scientific functions are invoked with the correct data, workflow systems can further assist the scientist in building well-formed workflows. We will show how the following question can be addressed:

• Will this coactor invoke its scientific function, given the workflow input schema and all actor configurations?

Last but not least, questions about optimizations arise:

• Can we optimize the data shipping between coactors?

• How can we detect and exploit data parallelism?


• How can we efficiently deploy the coactors onto a parallel computing environment, such as a cluster or a grid?

Extending the Comad paradigm in scientific workflows raises the level of abstraction at which workflows are built. This is analogous to conventional programming languages: while quite efficient programs can be written in assembly languages, programming at a higher level of abstraction, e.g., in C++, Java or Haskell, is (usually) preferable, since developers can produce more portable and more robust programs in a shorter time. Furthermore, compilers can automatically provide optimizations when programs are written in high-level languages. In scientific workflows, plain PN workflows with hand-wired array-assembly and array-disassembly actors correspond to low-level languages. In virtual assembly lines, this low-level data management is done by the framework, while the user only provides a high-level description of the process.

1.6 Detailed Description of Contributions

The contributions of this dissertation are as follows:

Development of the VDAL paradigm (Chapter 2). We first identify several design challenges (illustrated by examples) common in scientific workflow design. These challenges result from parameter-rich functions; data assembly/disassembly and data cohesion; conditional execution; iteration; and, more generally, workflow evolution. To address these, we propose VDAL (Virtual Data Assembly Lines), a dataflow-oriented approach for building scientific workflows. VDAL was inspired by, and is a formal variant of, Comad by McPhillips et al. [MB05, MBL06]. The key ideas of the VDAL approach are that (1) data is organized in labeled nested collections flowing along a linear pipeline of workflow actors, and (2) workflow actors are wrapped inside a configurable shell that defines how the components interact with the data flowing between them. Using nested collections as a built-in data model removes the need
of ad-hoc records or array structures, and thus completely removes low-level shims (that were used to manage these structures) from the workflow. Deploying declarative configurations has the advantage that data management tasks are controlled by the workflow developer at a much higher level of abstraction than was the case when explicit shim actors were used. Another crucial advantage is that in a VDAL workflow, actors are no longer tightly coupled via explicit wiring; instead, the labeled collection structure provides a level of indirection that makes workflows not only free of data-management shims, but also resilient to changes to the input data and to the workflow itself. As a core contribution, we describe the anatomy of VDAL components and their configurations in detail. We define how the data-management shell selects input data from the input stream, how it then iteratively invokes scientific components with the selected data, and how it places the result data back into the nested collection structure. Moreover, we then show how this approach addresses the design challenges we identified. This work has been published as [ZBML09a].

Exploiting data parallelism (Chapter 3). We investigate possibilities for efficient execution of VDAL workflows. To this end, we first show how to exploit the data parallelism inherent in the VDAL paradigm. We develop strategies to transform a VDAL workflow into an equivalent series of MapReductions that can be executed in parallel on a cluster computer or on a cloud infrastructure. Our strategies differ in the complexity of the algorithms and data structures used. For each of them, we discuss advantages and trade-offs. We further conduct a thorough evaluation using the Hadoop implementation of MapReduce. Here, our experiments confirm that our approach decreases execution time for compute-intensive pipelines (speed-up factor of 20). While even our basic strategy achieves significant speed-ups (up to 17x) for workflows with relatively small collection structures, our most sophisticated strategy scales well even for very large collection structures and clearly outperforms the other
strategies in these cases. These efficiency gains and other advantages of MapReduce (e.g., fault tolerance and ease of deployment) make this approach ideal for executing large-scale, compute-intensive VDAL workflows. I thank Sven Köhler for his great help with the Hadoop implementation. This work has been published as [ZBKL09].

Minimizing data shipping (Chapter 4). In a VDAL workflow, the actor configurations describe which part of the input data is used by the actor; all non-selected data is simply ignored. Although this approach greatly helps during workflow design (see Chapter 2), it introduces unnecessary data shippings when implemented directly. Our contribution in Chapter 4 is to develop a type system and algorithms to address this drawback. In particular, we show how to compile a VDAL workflow into an equivalent (low-level) workflow with additional data-routing components added. This low-level process network avoids unnecessary data shippings. Consequently, scientists can develop workflows using the VDAL abstractions without explicitly defining the data routing; instead, the workflow system itself optimizes data shippings when the workflow is deployed and executed. Our experimental evaluation confirms the effectiveness of our approach, with savings in data shippings of up to 50% and a decrease in execution time of more than 60%. This work has been published as [ZBML09b].

Light-weight parallel workflow engine (Chapter 5). In Chapter 5, we describe our implementation of a light-weight engine for executing process network workflows. This engine was used to perform the experiments presented in Chapter 4. Our contribution here is to demonstrate how a core library can provide a fast and scalable basis for PN, and ultimately VDAL, workflows. The second contribution is to show how such an external engine can be loosely coupled to the Kepler workflow system. In contrast to current approaches, in which data movement is explicitly defined in the workflow (e.g., via scp actors), the details of data movement are handled by the back-end workflow engine in our approach. Consequently, there is a clear separation between the scientific workflow logic and the details of its deployment at runtime. This not only
makes workflow construction easier for the scientist, but also allows optimizations such as automatically choosing host machines as well as appropriate methods for data transfer. I thank my collaborator Xuan Li for implementing the Kepler side of the Kepler-PPN coupling. This work has been presented as [ZLL09].

Supporting Workflow Design (Chapter 6). We designed the VDAL configurations to be expressive enough to allow common data manipulation tasks, but still analyzable by the workflow system. In Chapter 6, we illustrate how static analysis can be used to support the scientist during workflow construction and maintenance. Given a configured VDAL workflow, the workflow system can, among other things, (1) predict the output schema of the workflow, (2) detect actors that will not invoke their inner component, e.g., due to errors in the configurations, and (3) display actor dependencies. Our contribution here is to show how to support these design features. To this end, we translate VDAL workflows to programs in the XML update language FLUX and extend the existing FLUX type system. This work has been published as [ZBL10].

Deciding Workflow Equivalence (Chapter 7). We further investigate the fundamental problem of deciding workflow equivalence. XQuery equivalence under an ordered data model has not been studied well by the database community. As a first step towards solving XQuery equivalence, and ultimately workflow equivalence, we develop a new approach for the static analysis of a for-let-return fragment of XQuery. Our approach is based on the new concept of possible-value types (pv-types). These structures exhibit similarities with conventional regular-expression types; however, they do not approximate query execution (like conventional types do), but capture the query semantics exactly. It is thus feasible to decide query equivalence based on the equivalence of the output pv-types of the respective queries. Our contributions here are to develop pv-types for XML processing, and to define requirements for sound and complete type propagation. We then present a set of type propagation rules and prove their soundness and completeness. Based on these notions, we then show how deciding
query equivalence can be reduced to deciding pv-type equivalence, a problem that turns out to be hard as well. We show that in order to solve pv-type equivalence for an ordered data model, it is necessary to decide the equivalence of string-polynomials (SPE). String-polynomials are "polynomials" over a mathematical structure that has not yet been well studied. Our initial work towards solving SPE provides several normal forms for string-polynomials that allow us to approximate their equivalence. We conjecture the decidability of SPE and plan to continue this line of research as part of our future work.


Chapter 2

Improving Scientific Workflow Design with Virtual Data Assembly Lines

"Controlling complexity is the essence of computer programming." (Brian Kernighan)

Despite an increasing interest in scientific workflow technologies in recent years, workflow design remains a challenging, slow, and often error-prone process, thus limiting the speed of further adoption of scientific workflows. Based on practical experience with data-driven workflows, we identify and illustrate a number of recurring scientific workflow design challenges, i.e., parameter-rich functions; data assembly, disassembly, and cohesion; conditional execution; iteration; and, more generally, workflow evolution. In conventional approaches, such challenges usually lead to the introduction of different types of "shims", i.e., intermediary workflow steps that act as adapters between otherwise incorrectly wired components. However, relying heavily on the use of shims leads to brittle (i.e., change-intolerant) workflow designs that are hard to comprehend and maintain. To this end, we present a general workflow design paradigm called virtual data assembly lines (VDAL).
In this chapter, which is based on [ZBML09a], we show how the VDAL approach can overcome common scientific workflow design challenges and improve workflow designs by exploiting (i) a semistructured, nested data model like XML, (ii) a flexible, statically analyzable configuration mechanism (e.g., an XQuery fragment), and (iii) an underlying virtual data assembly line model that is resilient to workflow and data changes. The approach has been implemented by McPhillips et al. as Comad [MBL06], and applied to improve the design of complex, real-world workflows.

Contributions. Based on practical experiences with the design of data-driven workflows from various domains (e.g., see [LAB+ 09]), we first identify a number of workflow design challenges and illustrate them with examples and use cases (Section 2.1). Specifically, we elaborate on the challenges resulting from parameter-rich functions; data assembly/disassembly and data cohesion; conditional execution; iteration; and, more generally, workflow evolution. We then present a general workflow design paradigm called virtual data assembly lines (VDAL), which has been implemented as Kepler/Comad [MBL06], but which is applicable to other data-driven systems (e.g., Taverna or Triana [TSWH07]) as well (Section 2.2). The crux of VDAL is that shims and complex wiring are minimized by encapsulating conventional black-box functions and services inside a "data selection and shimming" layer that can be manually or automatically configured. We describe in detail the anatomy of VDAL components, i.e., the signatures and effects of the operations inside such components for scoping, binding, iterating over, and placing data. Finally, we show how our approach addresses the workflow design challenges mentioned above (Section 2.3). We discuss other related work in Section 2.4 and close with a short summary in Section 2.5.

2.1 Workflow Design Challenges

Here we identify and describe common scientific workflow use cases and design challenges we have encountered in practice when applying a conventional dataflow modeling approach. We revisit these challenges in Section 2.3 and show how VDAL addresses them.


Figure 2.1: Parameter-rich service, record assembly and disassembly. A Cipres RAxML service is used to infer phylogenetic trees from aligned protein sequences, provided in CharacterMatrix. Besides the scientifically meaningful actor CipresRAxMLService, there are 6 auxiliary actors: ComposeNexus, RAxMLConfigurator, and three String actors are used to assemble the input data; ParseNexusTreesBlock is used to disassemble and convert the result for subsequent steps.

2.1.1 Parameter-Rich Functions and Services

Many scientific functions have a large number of input ports and parameters. For example, DNAML (DNA Maximum Likelihood) from PHYLIP (the Phylogeny Inference Package [Fel04]) takes 10 parameters in addition to the list of DNA sequences it operates on. In the conventional modeling approach, actors that wrap such applications have many ports, each connected to a distinct channel. This quickly leads to complex wiring when several of these components are used. One current solution is to use actor parameters for specifying the input values that do not change during workflow execution. However, this approach leads to less flexible actors, because once an input has been modeled as a conventional parameter, its value cannot be changed at workflow run-time.

Example: CipresRAxML. The CipresRAxML composite actor in the Kepler/pPOD release
[BMR+ 08] expects a CharacterMatrix containing DNA sequences and optionally a WeightVector as inputs. Furthermore, values for the parameters rate categories, model, and initial rearrangement limit have to be specified. Fig. 2.1 shows a screenshot of the inside of the CipresRAxML composite actor. The composite uses three parameters to configure the RAxML service in addition to the two data input ports. While this approach successfully reduces the number of input ports to two, it also reduces flexibility. Since parameter values are fixed during workflow execution, the composite actor is not re-usable in a workflow that, e.g., iterates over the model parameter to find an optimal setting. Here, the actor would need to be modified to have an additional input port for the model input. A CipresRAxML actor that allows all its configuration to be changed during a workflow run would need five input ports (RAxML [SOL05] by itself has 32 command-line parameters, of which only five are modeled; an actor that provided more fine-grained access to the service would exhibit even more parameters or input ports). Although some of the parameters are for optimizing execution only, most of them do in fact influence the scientific result of the workflow. The CipresRAxMLService actor included in the Kepler/pPOD release only makes three of the parameters visible, restricting the service's flexibility even more.

However, even the use of five distinct ports quickly becomes a modeling problem if the service is to be called more than once. A full set of inputs needs to be sent for each invocation, and this quickly results in overly complex designs when loops or conditional execution are involved (see below).

Packaging Inputs and Outputs. To address the problem of having too many ports, components sometimes expect all input data bundled together in a single large record data structure, and similarly output all data produced during a single invocation bundled in a new record. A prominent example is the typical document-style web service that expects one large SOAP message and returns another SOAP message. Although these components have but a single input and a single output port, the general problem is not solved: complex input records must be assembled and outputs must be disassembled, tasks usually performed via shim actors which themselves necessarily have many ports. Consequently, complex wiring still occurs between these shims.


Some scientific workflow tools, including Taverna via its WSDL scavenger [OAF+ 04] (see also http://www.ebi.ac.uk/Tools/webservices/tutorials/workflow/taverna) and Kepler via a generic web-service actor [JAZ+ 05], support the automatic creation of shims for web services based on the operation's WSDL specification. While such features clearly help the workflow designer, the shim actors still need to be connected, and can clutter the workflow. The problem has been recognized, e.g., by the Taverna developers: the tool provides a feature to hide these shims from the workflow graph via a "boring" tag. However, the underlying problem remains: new components added to a workflow must still be connected with the shims to work properly, which can only be done while the shims are not hidden.

The subworkflow inside the RAxML composite actor in Fig. 2.1 uses two different types of shims for record assembly. (1) ComposeNexus is an example of a black-box required record: Nexus is a textual container format that can contain different data such as a character matrix, a weight vector, and sequence data. The RAxML service expects its input data in this specific format, and the designer of the RAxML actor thus has to assemble such a format. (2) The second kind of shim is not required by the given black-box function; instead, an ad hoc record is employed by the workflow designer: here, e.g., RAxMLConfigurator is used to create a custom configuration record to bundle all configuration information into a single token. In the following sections, we will show more examples of such ad hoc record management. While records required by black boxes must be constructed at some point in the process, our approach completely eliminates shims arising from ad hoc data records.

2.1.2 Maintaining Data Cohesion

Individual data items processed by scientific workflows are often related to each other in important ways. When DNA, RNA, or protein sequences are aligned, for example, multiple possible alignments are often computed, each of which can have various quality assessment scores. In an automated version of a workflow computing and comparing multiple alignments, the system must maintain relationships between parts of the input data sets and portions of the workflow output.


Figure 2.2: Record-handling shims. Record disassemblers and assemblers (marked in red ovals) are used to unpack, restructure, and re-pack data in [SCZ+ 07].

In order to maintain this data cohesion, current designs often create ad hoc records and array tokens. This approach has immediate drawbacks: (1) packaging large amounts of data into one array often reduces workflow concurrency; when a large array token is created from an incoming stream of data, earlier arriving data is not sent to the next actor until the whole array is assembled. (2) Records and arrays have to be assembled and disassembled, which leads to shims that clutter the workflow graph and easily lead to complex routings. (3) Workflow designs cannot be adapted easily to changes in the data organization. Consider a record that contains an alignment and a score value: a subworkflow that requires such a record cannot easily process an array of alignments, even if the array contains such records as elements.

Example: Record management. The workflow in Fig. 2.2 uses a specific R model for a bio-informatics sequencing task [SCZ+ 07]. A single array token input at the upper left corner is disassembled into individual element tokens, which are then used to build a custom record. The single record is routed to four record disassemblers that provide the raw data to four R actors.

Figure 2.3: Maintaining nesting structure using array tokens and additional loops. (a) shows four scientific actors, each of which creates a list as output for each input token it receives. (b) shows a workflow in which these four actors are simply chained together: if actors are implemented to output a list of single tokens, associations between multiple input tokens and the output tokens are lost; if the output is in the form of array tokens, then the connections are not well-typed, causing errors either at compile time or at runtime. (c) shows how additional shims are used to address this issue: they unpack and pack the arrays and thus function as a "map" primitive.

Once the subnetwork of R actors has run, a final genotype record is assembled and sent to the output port. Clearly, such low-level data access and re-packing operations distract from the primary functions of the workflow.

Example: Data associations. Fig. 2.3(a) shows four hypothetical actors for genomics research, each of which takes a single data object and creates a list of related results. BLAST is used to find DNA sequences similar to a given DNA sequence; MOTIFSearch is used to detect one or more motifs, i.e., repeated patterns, in a DNA sequence; from a given sequence motif, TFLookup searches for proteins (transcription factors) reported to bind to the motif; finally, via FunctionLookup, the specific biological functions associated with a transcription factor can be obtained. As shown in Fig. 2.3(b), we would ideally like to be able to chain some or all of these actors together, e.g., to predict which functions a particular input DNA sequence is potentially associated with. However, there are problems with this approach: if each of the actors just outputs a token list for each received input token, then the associations between the data are lost. Consider a workflow input that has two DNA sequences, for each of which BLAST will return a list of similar sequences.


However, on the output side of the BLAST actor, the two lists are no longer distinguishable from each other, i.e., the output sequences are not grouped into two lists. One solution to maintain the groupings is to use an array as the output structure at each service. However, if we then want to use the actor MotifSearch to analyze the results of a BLAST search for a given DNA sequence, we need to insert a special looping shim that unpacks the array and sends its elements one after another to MotifSearch, as shown in Fig. 2.3(c). The arrays of motifs returned by MotifSearch are then packaged together to form an array of arrays of motifs. To extend the pipeline further, the actor TFLookup would need two of these looping shims to work properly. Furthermore, consider the case in which we would like to use multiple arrays of DNA sequences as the overall workflow input. Here, we would need to add additional Loop shims around all existing actors, as shown in Fig. 2.3(c).

A solution to this problem is provided by Oinn [OGA+ 02] and implemented in the Taverna workflow system: the workflow system itself is made aware of array structures and automatically inserts these looping constructs if there are type mismatches. The Loop (or map) operations are not made explicit as actors in the system; thus workflows are kept clean and, most importantly, easy to evolve. Our solution is similar to Taverna's insofar as we also enrich the simple data model of dataflow networks. However, instead of nested lists, we will use labeled nested lists with annotations, a data model that corresponds to XML (labels correspond to XML tags; annotations correspond to XML attributes). We can therefore additionally use XML techniques to select the desired data to be fetched from the actor's input.
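
The loss of grouping can be seen directly in a few lines of Python (with trivial stand-ins for the services; not code from any of the systems discussed): a flat result list hides which hits belong to which input sequence, whereas nested lists, which Taverna's implicit iteration and Comad's delimiter tokens effectively maintain, preserve the association at the cost of having to "map" downstream steps over the nesting:

    # Hypothetical stand-ins: each service returns a list of results per input.
    def blast(seq):        return [seq + "_hit1", seq + "_hit2"]
    def motif_search(hit): return [hit + "_m1"]

    inputs = ["dnaA", "dnaB"]

    # Flat output: the association between inputs and hits is lost.
    flat = [hit for s in inputs for hit in blast(s)]

    # Nested output: one list per input; downstream steps must now be mapped
    # over the nesting (the role of the Loop shims in Fig. 2.3(c)).
    nested = [blast(s) for s in inputs]
    motifs = [[m for hit in hits for m in motif_search(hit)] for hits in nested]

    print(flat)    # ['dnaA_hit1', 'dnaA_hit2', 'dnaB_hit1', 'dnaB_hit2']
    print(nested)  # [['dnaA_hit1', 'dnaA_hit2'], ['dnaB_hit1', 'dnaB_hit2']]
    print(motifs)  # [['dnaA_hit1_m1', 'dnaA_hit2_m1'], ['dnaB_hit1_m1', 'dnaB_hit2_m1']]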

2.1.3 Conditional Execution

Conditional execution and data filtering steps are essential in many scientific analyses. Consider, for example, a workflow that infers phylogenetic trees from a set of input sequences and then computes a consensus tree only from those result trees that satisfy certain quality criteria, e.g., that have strong support. Here, only trees with high support should be used as input to the consensus step.


Figure 2.4: Dataflow model for “if (Test1 or Test2) then A”

In current models, many control-flow constructs are encoded into the dataflow network, which leads to workflows with many shim actors and complex wiring [BLNC06]. For example, an If-Then-Else-like filter can be mapped into a dataflow graph as a control-flow actor and two distinct routes. Such control-flow actors, together with their necessary wiring, lead to complex designs even for moderately sized workflows [PLK07].

Example: If-Then-Else. Consider an example workflow in which an analysis A should be applied to incoming data only if at least one of two tests (performed via the actors Test1 and Test2) is successful. A common way to achieve this goal (see Fig. 2.4) is to use the actors Test1 and Test2, route their Boolean outputs to an OR actor that combines the truth values, and route its output to the control input of a BooleanSwitch. The switch then routes the data either to A or to a following Merge actor, which combines the two streams of data into one. To maintain the order of data tokens, the switch sends a control signal to the merge; depending on this signal, the merge reads data from port 1 or port 2, respectively. Note that this solution not only deploys three additional shim actors (or, switch, merge) with intricate wiring, but it will even fail to work properly if actor A outputs more than one token for each token read on its input port.

Note that the workflow in Fig. 2.4 is a very small example, which only implements the simple control flow if (Test1 or Test2) then A. In practice, many more of these conditional executions might be used, which quickly leads to even more complex designs with many shims [BLNC06].
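
For contrast, the per-token behavior that the Switch/Merge wiring of Fig. 2.4 has to emulate is easy to state directly. The following sketch (with hypothetical tests and an A that may emit several tokens per input) illustrates the intended semantics only, not any workflow system's implementation:

    # Hypothetical tests and analysis step; A may return several tokens per input.
    def test1(x): return x % 2 == 0
    def test2(x): return x > 10
    def A(x):     return [x, x * 10]

    def process(stream):
        out = []
        for token in stream:
            if test1(token) or test2(token):   # "if (Test1 or Test2) then A"
                out.extend(A(token))
            else:
                out.append(token)              # pass the token through unchanged
        return out

    print(process([1, 2, 15]))   # [1, 2, 20, 15, 150]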

Figure 2.5: Conventional cross-products. Low-level models explicitly compute the cross-product. Tokens are given in "reading order", i.e., 1,2,3 means that token 1 is sent first, then 2, then 3.

2.1.4 Iterations over Cross Products

Scientific workflows can be used to automate analyses over multiple, independent input data sets; to repeat analyses on the same data set with multiple sets of parameter values; or both. Combining data and parameter sweeps amounts to executing the analysis over a cross-product of some subset of the available data. In conventional dataflow approaches, these cross-products have to be constructed manually via actors that loop over sets of data and iteratively output the data as input to a downstream actor.

Example: Cross-products. Fig. 2.5 shows workflows that invoke an actor A iteratively over combinations of incoming data items. When the A actor has only one input port, see Fig. 2.5(a), the data can be provided directly on the input channel. Here, the FIFO queue semantics of dataflow networks achieves the desired result of multiple invocations of A with the different input data a, b, and c. Now consider an actor A with more than one input port, as in Fig. 2.5(b). If we want to iteratively execute A over a list of inputs to one port while keeping the input to another port constant, then an additional Repeater actor is needed to explicitly construct the cross-product of the input data (1) × (a, b, c) that is routed to A.


First, the Repeater reads the input data 1 into internal memory. Then, upon receiving a signal on the trigger port at the bottom of the actor, it outputs a copy of the stored data. Now consider the case of two lists of inputs (1, 2, 3) and (a, b, c), where we want to invoke A on each element of their cross-product. Now even more control-flow actors and routes are necessary: Fig. 2.5(c) shows a feasible design involving two specialized repeat actors wired together in a complex, crisscrossed pattern that takes considerable thought to design (and even more work to explain to others). Note that this design works in a streaming fashion on the upper input but not on the lower one, i.e., the array [a,b,c] is consumed completely before any data is output, whereas the data on the upper channel can be provided incrementally.

Larger workflow designs are necessary if more than two lists are involved in the cross-product. The design of the cross-product-generating actors could be placed into a composite actor, as shown in Fig. 2.5(d), to hide the details from the modeler. However, different versions of this actor would need to be created (i) for different numbers of input ports, (ii) to realize other features such as streaming, and (iii) to accommodate input in single arrays or as streams.

Loops over input data as shown in the previous examples are very common in scientific workflows, e.g., for multi-dimensional parameter sweeps. Therefore, the workflow system should make it easy for the workflow designer to construct these loops without worrying about the low-level details of data buffering and routing. Our approach will provide a declarative way of specifying these loops. Furthermore, since our data is organized in nested collections, an explicit distinction between array and non-array tokens is not necessary. Thus, the user can concentrate on specifying what data should be involved in cross-products, and the workflow system itself can choose how to compute them.
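
In a general-purpose language, the same cross-product iteration is a one-liner; the sketch below (with a hypothetical two-input actor A) shows the behavior that the Repeater wiring of Fig. 2.5(b)-(d) reconstructs token by token:

    from itertools import product

    def A(x, y):                       # hypothetical two-input actor
        return f"{x}{y}"

    # Fig. 2.5(c): invoke A on every pair of the cross-product (1,2,3) x (a,b,c).
    results = [A(x, y) for x, y in product([1, 2, 3], ["a", "b", "c"])]
    print(results)   # ['1a', '1b', '1c', '2a', '2b', '2c', '3a', '3b', '3c']

    # A multi-dimensional parameter sweep generalizes in the same way.
    sweep = [A(x, y) + str(z) for x, y, z in product([1, 2], ["a", "b"], [0.1, 0.2])]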

2.1.5 Workflow Evolution

In real-world workflows, the sorts of use cases and design patterns described above are interwoven in intricate ways, making workflows not only hard to understand, but also hard to modify and evolve.


Figure 2.6: Architectural differences between the conventional approach and VDAL. In VDAL, data massaging is moved to a configuration layer (σ, γ, ω) with declarative, local shims and minimal wiring complexity, thus facilitating simpler designs and reusable actors.

Fig. 2.6(a) schematically shows a dataflow network comprising three scientifically meaningful actors (A, B, and C) and a number of shim actors (depicted as grey boxes). To insert a new scientific actor into such a network, it is necessary to understand the wiring and the complex interactions between the existing control-flow shims. In the next section, we will show how to overcome these challenges using techniques from XML data processing, allowing scientific workflow designs as in Fig. 2.6(b), where each actor corresponds to a scientifically meaningful task.

2.2 Virtual Data Assembly Lines (VDAL)

We combine ideas from process networks with XML queries, updates, and stream processing. Specifically, our approach, Virtual Data Assembly Lines (VDAL), is characterized by the following three ideas (see Fig. 2.6):

(1) Linear Workflows: A data assembly line always contains a linear workflow graph. That is, each actor has exactly one input and one output port.

(2) Structure-rich channels: The data flowing from port to port is structured as labeled trees, possibly with additional attributes, much like XML data. The data is streamed in a SAX-like manner on the channels, although different execution strategies and optimizations are possible (presented in Chapters 3 and 4). This is in contrast to common approaches where data on channels is of simple types or of custom-made record and array types.

(3) Configuration shell: Scientific components are wrapped in a "white-box" data
selection and shimming layer, which scientists can configure to specify what input data is taken from the input stream and where the results of the component's application are put back into the stream. Here, we devise a domain-specific language to minimize and localize shimming tasks, e.g., those tasks performed by record-assembler and disassembler shims in conventional approaches. Moving the data selection and shimming into a configurable layer around each actor not only reduces the wiring complexity, but also supports a linear workflow layout in which actors are simply placed one after another in an intuitive order.

2.2.1 Inside VDAL

Change-Resilience in Assembly Lines. In a physical assembly line, workers perform specialized tasks on products that pass by on the conveyor belt of a moving assembly line. Specifically, a worker only "picks" relevant products, objects, or parts thereof, letting all irrelevant parts flow through. Since each worker's scope is limited, a worker is unaware of the tasks of other workers and of the overall product being constructed. In particular, this has the advantage that a worker can be "reconfigured" to work on different parts of the object stream, and even moved up or down the assembly line, as long as certain inherent task dependencies are not violated. By limiting work (via a scoping/configuration mechanism) to certain relevant parts of the object stream, and "passing the buck" on irrelevant parts, workers in an assembly line are loosely coupled, and the overall design is modular and resilient to changes. We employ and extend this processing paradigm to data assembly lines of streaming XML-like data.

VDAL data model. The data model for channels in VDAL is nested, labeled, ordered collections with metadata. This data model corresponds to XML. Labeled collections correspond to XML tags containing the collections' data, called base data. Domain-specific types, e.g., PhylogeneticTree or CharacterMatrix, can be represented as CData. The usual general-purpose types such as Integer, Boolean and String can also be used in leaf nodes of the XML tree. Metadata corresponds to XML attributes, which can provide more information, e.g., the score of a sequence alignment, or can be used as simple annotations, e.g., the tag faulty could be attached to data or whole collections. Deploying XML as the data model naturally preserves data cohesion and allows efficient streaming of data when the XML tree is serialized and processed by actors in a SAX-like manner.


Figure 2.7: VDAL Actor Anatomy. In virtual data assembly lines, each black-box actor A is encapsulated between easily configurable, standard components σ, γ, ω that simplify data management and shimming. This allows flexible, localized data unpacking (via σ and γ) and re-packing (via ω), while requiring only one input and one output port through which XML-structured data is streamed.

Moving data selection and shimming into configurations. Assume we want to place an actor A in a process network. If A has many input ports, then these must be wired to other actors (or shims) to describe the data routing explicitly, leading to networks as shown in Fig. 2.6(a). If we instead design actor A to have one input port that expects data bundled in a custom record type α, then it is hard to place A into a network without explicit shims. If A's predecessor produces type τ objects and the successor step requires type τ′ objects:

    τ −→ A : α → β −→ τ′

A conventional approach requires that τ ≺ [α] and [β] ≺ τ′, that is, the input stream consists of a list of α-compatible types, ⟦τ⟧ ⊆ ⟦[α]⟧, and the output stream [β] has to be compatible with τ′, i.e., ⟦[β]⟧ ⊆ ⟦τ′⟧. However, these are very rigid constraints: in general, A might not be able to accept τ instances (but require an adapter to filter the relevant part and/or to assemble the required α structure); similarly, β might not be of the desired result type τ′.


Figure 2.8: Dataflow inside VDAL Actor.

In contrast, in a virtual assembly line, each scientifically meaningful actor A is embedded in a framework of adapters as shown in Fig. 2.7. The data that flows into the actor is structured as an XML tree that maintains data associations. But instead of feeding the XML stream directly into the scientific actor A, configurable components around A select and package the data according to A's input requirements. The inner workings of A are not understood by the workflow system, which simply invokes the component as a black box. The behavior of the components σ, γ, ω, and M is determined by their configurations. With an appropriate formalism for these configurations, the workflow system itself can automatically analyze certain parts of the data flow in a workflow design; the components σ, γ, ω, and M can thus be viewed as white-box actors.

However, the components σ, γ, ω and M need not be realized as explicit actors in the workflow specification. Instead, their functionality can be provided automatically by the workflow system itself. Consequently, actors in VDALs are visually represented as normal actors, each with one input port and one output port, along with configurations for σ, γ, and ω; M does not have any configuration. Each of the components σ, γ, ω, and M is responsible for particular aspects of the data manipulation, as detailed in the following subsection.
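
The division of labor between σ, γ, A, ω, and the merge step M can be summarized in a small sketch. The following Python code is an illustration of the conceptual model only (nested dictionaries stand in for the streamed XML, and all configuration functions are hypothetical), not the Kepler/Comad implementation:

    # Conceptual sketch of one VDAL actor invocation (illustration only).
    # A nested collection is modeled as {"label": ..., "children": [...]} with
    # leaves of the form {"label": ..., "value": ...}.
    def run_coactor(node, in_scope, gamma, A, omega):
        # sigma is modeled by the predicate in_scope; a matched subtree is not
        # searched further, so scope matches never overlap.
        if in_scope(node):
            results = [A(args) for args in gamma(node)]   # gamma stages inputs for the black box A
            return omega(node, results)                   # omega writes the results into the scope
        if "children" in node:                            # M: everything else passes through
            return dict(node, children=[run_coactor(c, in_scope, gamma, A, omega)
                                        for c in node["children"]])
        return node

    # Hypothetical configuration: align all DNASequence leaves below each Trial.
    tree = {"label": "Proj", "children": [
        {"label": "Trial", "children": [
            {"label": "DNASequence", "value": "ACGT"},
            {"label": "DNASequence", "value": "ACGA"}]}]}

    in_scope = lambda n: n["label"] == "Trial"
    gamma    = lambda scope: [[c["value"] for c in scope["children"]
                               if c["label"] == "DNASequence"]]
    A        = lambda seqs: {"label": "Alignment", "value": "|".join(seqs)}
    omega    = lambda scope, res: dict(scope, children=scope["children"] + res)

    print(run_coactor(tree, in_scope, gamma, A, omega))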

2.2.2 VDAL Components and Configurations

Fig. 2.8 shows how data is manipulated as it flows through a VDAL actor. The scope parameter σ determines the parts of the input data that are read and potentially modified by the actor. Then, for each σ-selected subtree di, the input assembler γ stages all necessary input data to invoke A, possibly multiple times. The data to be staged is specified via the configuration of γ and may be given as literal values or via path expressions that describe how to extract the data from the subtree di currently in scope. After the black box A has been invoked on each of the staged sets of input data provided by γ, the write expression ω specifies how the scope di should be modified, usually by inserting the results of A's invocations. The component M inserts the modified subtree di back into the output stream o. Usually, scopes are modified as they flow through the actor, and thus M happens implicitly.

Black-box actor A. The interface between a black-box actor A and the scientific workflow system is the list of the named input and output parameters of A. Each input and output parameter has a name and an associated type description. A type description consists of the name of a BaseType, e.g., Integer or CharacterMatrix, and an optional modifier "*" to indicate either that a list is required as input, or that a list is created as output.

Scope σ. The scope parameter σ selects relevant parts of the input stream τ. As in a physical assembly line, the actor does not read, or even modify, anything in τ that has not been selected via σ. Formally, σ is a function from XML data to a list of relevant read-scope matches [τα] (we use [x] to denote lists of type x):

    σ : τ → [τα]

The scope σ is specified via an XPath expression that uses child and descendant axes. Since we want to ensure that the selected scopes are non-overlapping, we use a first-match semantics for the descendant axis //. That means a breadth-first traversal that checks for scope matches will not traverse into an already found match.
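
As a small illustration of this first-match rule (an assumption-laden sketch using the nested-dictionary encoding from above, not the dissertation's algorithm), a breadth-first search simply stops descending as soon as a subtree matches, so the returned scopes can never overlap:

    # Illustration of first-match scope selection.
    from collections import deque

    def first_matches(root, is_match):
        found, queue = [], deque([root])
        while queue:
            node = queue.popleft()
            if is_match(node):
                found.append(node)                 # do not search inside a match
            else:
                queue.extend(node.get("children", []))
        return found

    tree = {"label": "Proj", "children": [
        {"label": "Trial", "children": [{"label": "Trial", "children": []}]},
        {"label": "Trial", "children": []}]}

    # The nested Trial is skipped: only two non-overlapping scope matches.
    print(len(first_matches(tree, lambda n: n["label"] == "Trial")))   # 2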


While we prohibit general side axes, checking the presence and/or values of attributes attached to nodes along the path is allowed. Prohibiting side axes allows the decision of whether a subtree is a scope match to be made solely based on its ancestor nodes and their attributes. In a streaming implementation, for example, we can decide whether a certain subtree is a scope match as soon as the root node of this subtree is encountered (as all ancestor nodes and attributes have already been seen).

Input assembler γ. According to its configuration, the input assembler γ stages input for one or multiple invocations of the black-box actor A, using either data encountered inside each scope match di or data provided as literals in the configuration. If we represent one set of input values for A as a composite type α, then the input assembler γ is a function from τα to a list of α:

    γ : τα → [α]

Each tuple in [α] is then provided to A, producing the output list [β]. As before, the black box A is characterized as:

    A : α → β

We suggest using a query (or binding expression) for each input port of A. Each binding expression provides data either for a single invocation of A or for multiple invocations. Since black-box functions can expect a list of data per invocation, binding expressions usually select lists of lists. Formally, for each input port pi, the binding expression Bi represents a query that, given the data in the read scope di ∈ τα, produces a list of lists of base data suited for the port pi:

    Bi : τα → [[T]],   with T ∈ BaseType

The black box A is then invoked once for each element of the Cartesian product

    C := B1(di) × B2(di) × · · · × Bn(di),          (×)

that is, each element of C is of type α, which in turn is used to create an output tuple of type β = A(α).


Figure 2.9: Example grouping via binding expression in γ.

that is, each element of C is of type α, which in turn is used to create an output tuple of type β = A(α). We suggest using the standard foreach loop with two XPath expressions to specify the binding expression queries:

foreach $p in XPath1 return XPath2

To make it easy to grab base data, we treat all BaseType leaf nodes as implicitly labeled with their type name; selecting such a node via an XPath expression selects the actual value. Furthermore, in contrast to the usual XQuery semantics, we do not flatten the result sets to form one long output list; instead, the result nodes of XPath2 are grouped by the result of XPath1, i.e., for each new node bound to $p a new group is formed. As an example, consider the XML tree r shown in Fig. 2.9. The C data available in the scope can be grouped in different ways: In Fig. 2.9(a), each Ci is placed in its own group, each of which results in one invocation of the black-box actor. In Fig. 2.9(b), the results are grouped by B, i.e., all C's that are descendants of the same B node end up in a single group; here, the black-box function would be called three times, once with each group as input. Only one group, and thus one invocation, is created in Fig. 2.9(c), whereas in Fig. 2.9(d) the same input data (all four C's) is presented twice to the black-box. A sketch of this grouping semantics follows below.
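The following minimal sketch (an illustration, not the actual VDAL engine) reproduces the groupings of Fig. 2.9 over a plain XML tree; the handling of relative paths is simplified, and the tree shape is an assumption made for the example.

import xml.etree.ElementTree as ET

def bind(scope, xpath1, xpath2):
    """Evaluate 'foreach $p in xpath1 return xpath2' with grouping:
    the nodes selected by xpath2 are grouped per node bound by xpath1."""
    groups = []
    for p in scope.findall(xpath1):
        # xpath2 is evaluated relative to $p if it starts with '$p',
        # otherwise relative to the whole scope (as in Fig. 2.9(d)).
        if xpath2.startswith("$p"):
            found = p.findall(xpath2.replace("$p", ".", 1))
        else:
            found = scope.findall(xpath2)
        groups.append([c.get("name") for c in found])
    return groups

# Assumed tree: X contains two A nodes; B nodes hold the C items C1..C4.
r = ET.fromstring(
    "<X><A><B><C name='C1'/><C name='C2'/></B><B><C name='C3'/></B></A>"
    "<A><B><C name='C4'/></B></A></X>")

print(bind(r, ".//C", "$p"))          # (a) {C1},{C2},{C3},{C4}
print(bind(r, ".//B", "$p//C"))       # (b) {C1,C2},{C3},{C4}
print(bind(r, ".//A", ".//C"))        # (d) all C's once per A match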


Literal values. Besides selecting data from the read scope, literal values can, as in conventional dataflow models, be supplied as parameters. We propose to extend the standard conventions for integers (1, 34, -232), boolean values (true, false), strings ("foo", "bar"), and floating point numbers (0.2, -4.2e-7) with simple range constructs such as 1..10 (the integers from 1 to 10) to facilitate simple parameter sweeps. Groups can easily be described using curly braces. Although it might be tempting to embed a small programming language here, we suggest keeping binding expressions rather simple: using a Turing-complete language would significantly reduce workflow predictability and the effectiveness of static analysis for VDAL workflows.

Write expression ω. The purpose of the write expression ω is to insert the results [β] of the black-box function A into the scope τα, or to perform more drastic changes to the scope in order to produce τβ. Here, an XML update language can be used. Formally:

ω : [β], τα → τβ

We propose to use the XML update language FLUX [Che08] to specify the modification of the scope τα; the FLUX syntax is depicted in Fig. 2.10. To have access to the results of the black-box, a special variable $result can be used in the embedded XQuery expressions. For more flexibility in dealing with the black-box results, we not only provide the result value β in $result, but also allow access to the input parameters α. In particular, each element of the list $result contains an XML tree with a root node labeled tuple and a subtree for each input and output parameter used in an invocation of A. Each subtree is labeled with the name of the parameter and contains the input or output data that was used or created, respectively. It is thus possible to insert the whole $result list somewhere into the read scope, or to iterate over the list and select only specific parameters to be inserted into the scope.

Replacement M. In the last step M, the modified scope τβ replaces τα in the original stream τ to form the output τ′ (see Fig. 2.8). In a streaming implementation, the replacement is implicit, as τα would typically be changed "in place" to form τβ. Formally, M has the following signature:

M : [τβ], τ → τ′


Stmt ::= Upd [WHERE Expr]
       | IF Expr THEN Stmt
       | Stmt ; Stmt
       | LET Var := Expr IN Stmt
       | Stmt

Upd  ::= INSERT (BEFORE | AFTER) Path VALUE Expr
       | INSERT AS (LAST | FIRST) INTO Path VALUE Expr
       | DELETE [FROM] Path
       | RENAME Path TO Lab
       | REPLACE [IN] Path WITH Expr
       | UPDATE Path BY Stmt

Path ::= . | Label | node() | text() | Path/Path | Path AS Var | Path[ Expr ]

Figure 2.10: Syntax for FLUX (adapted from [Che08]), as used in the write expression ω. "Expr" denotes XQuery expressions.

 1  ScientificActor: CipresRAxML
 2  Input:  model of String
 3          cha matrix of CharacterMatrix
 4          weight vec of WeightVector*
 5          rate cats of Integer
 6          init rearr limit of Integer
 7  Output: trees of PhyloTree*
 8  σ: //Nexus
 9  γ: model ← foreach $p in //Model return $p/String
10     cha matrix ← //CharacterMatrix
11     weight vec ← //WeightVector
12     rate cats ← foreach $r in {25},{100} return $r
13     init rearr limit ← 100
14  ω: INSERT AS LAST INTO . VALUE Trees[ $result/trees ]

Figure 2.11: Blackbox and VDAL actor configuration. Lines 1-7 describe the blackbox embedded in the VDAL actor. In lines 8-14, Scope, Bindings, Iteration, and WriteExpression are used to specify how the black-box is fed with data from the incoming XML stream (σ, γ) and how the output of the black-box is placed back into the stream (ω).

2.2.3 Example: VDAL Actor Configurations

In Fig. 2.11, the configuration of a CipresRAxML actor is shown. The black box has five input parameters and produces a list of phylogenetic trees. The actor's scope is //Nexus, so input data is searched for only within subtrees labeled Nexus. The service should be called for each model (a String under a Model collection) in the scope; the CharacterMatrix is also provided somewhere in the scope and selected via //CharacterMatrix. As rate categories, two values, 25 and 100, should be used, and the initial rearrangement limit is set to 100. A cross product of the staged data, over which the actor is invoked, is built from the multiple selected models and the specified rate-category values. The list of resulting trees is inserted into a new subtree labeled Trees inside the current scope match.

2.3 Design Challenges Revisited

Below we show how the VDAL modeling paradigm addresses the challenges presented in Section 2.1.

2.3.1 Parameter-rich Black Boxes

In Virtual Data Assembly Lines, parameters and inputs to black-box functions are neither provided via individual ports nor is a custom input structure necessary. Instead, VDAL extends the approach of regular parameters: input can be specified either as literal values (just like regular parameters) or as path expressions that extract the data from the actor's input stream. VDAL actors thus exhibit only one input and one output port, through which the XML stream of data flows, reducing the necessary wiring to a minimum.


Reduction of workflow graph complexity. Of course, the input for a black-box still needs to be specified somehow; our approach moves the scientifically essential portion of this complexity from the graphical wiring into the actor configurations. Moreover, it completely removes non-essential complexity, i.e., all explicit references from one actor to another, from the model. In a conventional workflow, a wire directly connecting one actor to another expressly indicates that the output of the first actor is to be consumed and processed by the second. In a VDAL workflow, in contrast, a wire directly connecting two actors by no means implies that the downstream actor uses any information transmitted by the upstream actor over that wire. The order of actors in a VDAL workflow merely indicates the order in which actors have access to the data stream; the wires between actors serve only as the channel over which the entire data stream passes and do not indicate direct interactions between the connected actors. A further way in which configurations are superior to explicit routing is that configurations are declarative descriptions: they specify what data to use from where in the stream, in contrast to the operational descriptions given by wires and record-management shims. To select all character matrices inside the current read scope, for example, the XPath expression "//CharacterMatrix" can be used. Not only is this more concise than one or more shim actors for selecting data from record structures, it also makes the actor oblivious to certain changes in the input stream, and consequently makes the behavior and configuration of the actor more resilient to future changes in the effective schema of the incoming data. For example, additional data items or deeper nesting can be accommodated without changing the configuration of the actor (see the sketch following this paragraph).

Example. With the flexible adapters σ and γ inside the actor, the CipresRAxMLService can simply be inserted into the data assembly line (shown in Fig. 2.12). Via a configuration as in Fig. 2.11, the input data for the black box is selected from the incoming XML stream.
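As a small illustration of this resilience (the element names and nesting levels are assumptions made for the example, not taken from the dissertation's data sets), the same descendant-axis selection keeps working when the input schema gains extra levels of nesting:

import xml.etree.ElementTree as ET

SELECT = ".//CharacterMatrix"   # corresponds to the XPath //CharacterMatrix

flat   = ET.fromstring("<Nexus><CharacterMatrix/></Nexus>")
nested = ET.fromstring("<Project><Run><Nexus><CharacterMatrix/></Nexus></Run></Project>")

# The same configuration value selects the data in both schema versions;
# no shim or rewiring is needed when the nesting gets deeper.
print(len(flat.findall(SELECT)), len(nested.findall(SELECT)))   # 1 1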


Figure 2.12: Linear workflows. In Data Assembly Lines, even parameter-rich services are connected with only one input and one output channel. Individual input data are specified as literal values or as an XPath expression that extracts data from the incoming stream. A sample configuration for the CipresRAxML actor is shown in Fig. 2.11.

[Figure 2.13 depicts a nested collection structure with collection labels Project, Input, Nexus, Data, and Consensus; the data nodes are a String, a CharacterMatrix, and several PhyloTrees.]

Figure 2.13: Hierarchical data used in phylogenetic workflow. Data nodes (domain-specific and general-purpose data) are shown as rectangular boxes, collection labels as ovals.

2.3.2 Maintaining Data Cohesion

The XML data model of VDAL can directly be used to maintain relationships between data items. During workflow execution, access to specific parts of the data is provided by the query capabilities of σ and γ. Data associations that are maintained in custom records, or domain-specific file formats, such as FASTA or Nexus files, can be modeled via the XML tree structure. Fig. 2.13 shows how the data processed by a phylogenetic workflow could be organized. Starting from the left side of the figure, the String data contains a URL that points to a Nexus file expected to contain a character matrix. The workflow fetches the file, converts it into a CharacterMatrix domain-type, and stores this data item in a Nexus collection. Via the CipresRAxML actor several PhyloTrees are inferred and placed in the same Nexus collection; in a last step, a consensus tree is computed and stored under a Consensus collection. The PhylipDrawgram actor, which displays phylogenetic trees, could then easily be configured to either draw all the trees in the workflow (σ = //PhyloTree),

[Figure 2.14 shows how the Seqs collection evolves as it passes through the pipeline: the input sequences s1, s2, s3 sit inside S collections; after BLAST each S additionally contains B collections with the derived sequences (e.g., s11, s12); after MotifSearch each B additionally contains M collections with motifs (e.g., m111, m112). The three actor configurations are:
  BLAST:        σ: //S   γ: in ← //Sequence   ω: foreach $s in $result/Sequence do insert B[$s] as last into .
  MotifSearch:  σ: //B   γ: in ← //Sequence   ω: foreach $m in $result/Motif do insert M[$m] as last into .
  (next actor): σ: //M   γ: in ← //Motif      ω: ... ]

Figure 2.14: Maintaining nesting structure.

to draw only the trees inferred via CipresRAxML (σ = //Nexus/PhyloTree), or to draw only the consensus tree (σ = //Consensus/PhyloTree).

Data cohesion for nested lists. Let us reconsider the example presented in Section 2.1.2: the services BLAST, MotifSearch, TFLookup, and FunctionLookup each produce a list of output items whenever one input item is received. To maintain the associations between items produced by actors and the inputs from which they were derived, individual data tokens were wrapped into array tokens. This leads to type mismatches on the input ports of the actors if they are simply chained together; we thus had to introduce Loop actors that essentially perform a map over these lists. In data assembly lines, associations can instead be maintained using nested XML structures, from which the data is selected via configuration parameters. Since these parameters can be specified using the descendant axis, the actor is decoupled from the actual nesting depth of the organizational structure. Fig. 2.14 shows the workflow from Section 2.1.2 modeled as a data assembly line. As input data, we place each sequence si inside an S-collection. The BLAST actor's scope σ selects each of these collections in turn; the input assembler γ selects the one sequence inside the scope and invokes the black-box function to create the list of sequences as output.


Via the write expression ω, we then insert a B-collection for each output sequence into the S-collection. The resulting XML structure is shown over the channel leading from BLAST to MotifSearch. The second actor then analogously creates M-subcollections for each motif associated with the "BLAST-ed" sequences; succeeding actors follow the same design idea. Note that this approach not only removes explicit loop actors, but also enables each actor to select the correct input data independently of how deeply it is nested within the input data stream. It is thus easy to add additional nesting to the input schema: a top-level "Projects" collection could, for example, be introduced to hold multiple "Seqs" collections, each containing, say, sequences from different groups of organisms. During workflow execution, this organizational structure would be kept intact. Furthermore, we could add additional information to the XML tree without disturbing the actors that do not need to access it: since γ selects the relevant data from the input, additional data is ignored without the need for further routing actors.

2.3.3 Conditional Execution

In data assembly lines, conditional execution and the necessary data routing are localized in the adapters σ, γ, and ω. Consider the use case from Section 2.1.3, in which an actor A should be executed on some data d only if at least one of two tests (performed by actors Test1 and Test2) was successful. Instead of using different routes for the data, we use the actors Test1 and Test2 to tag the data items with the result of the test. We then exploit the querying capabilities of σ and γ to select only those data items for which at least one of the test results is positive; the other data is simply ignored and passed down the data assembly line. In a sense, the routing around the actor A is kept local (inside the configuration shell of A), while the information originating from the test actors Test1 and Test2 is attached to the data. A sketch of this tagging pattern follows below.
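The following minimal sketch (an illustration with made-up element names and predicates, not the Kepler/Comad implementation) mimics the pattern of Fig. 2.15: two test actors set an OK attribute on the items they approve, and the downstream actor selects only tagged items.

import xml.etree.ElementTree as ET

def test_actor(project, predicate):
    """σ: //Project, γ: each //a, ω: if the test succeeds, tag the item."""
    for item in project.findall(".//a"):
        if predicate(item):
            item.set("OK", "true")

def actor_A(project):
    """Only processes items that carry the OK tag (σ/γ select //a[@OK])."""
    return [item.get("id") for item in project.findall(".//a[@OK]")]

doc = ET.fromstring("<Project><a id='1' v='3'/><a id='2' v='7'/></Project>")
test_actor(doc, lambda a: int(a.get("v")) > 5)    # Test1
test_actor(doc, lambda a: int(a.get("v")) < 0)    # Test2
print(actor_A(doc))                               # ['2'], only the tagged item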

[Figure 2.15 shows the three actor configurations:
  Test1: σ: //Project   γ: in ← each //a       ω: if ($out) then tag in@OK
  Test2: σ: //Project   γ: in ← each //a       ω: if ($out) then tag in@OK
  A:     σ: //Project   γ: in ← each //a@OK    ω: insert $out after . ]

Figure 2.15: Localizing if-then-else routing via XML attributes. Both Test actors select all data labeled with a. If the test succeeds, they add an annotation "OK" to the data item, i.e., they tag the item. The third actor A then only processes data that has been tagged, effectively implementing "if Test1 or Test2 then A."

[Figure 2.16 shows a Nexus collection before and after the actor invocation: the input Nexus contains a Model collection with the Strings "GTRCAT" and "GTRMIX", a CharacterMatrix, and a WeightVector; the output Nexus additionally contains a Trees collection holding the four result trees T1–T4.]

Figure 2.16: Cross products in VDAL. Performing cross products is a built-in feature of the VDAL configuration layer, so no shims or complex routing are necessary. Here, the black-box component is invoked four times according to the configuration given in Fig. 2.11: two different models and two values for the rate cats parameter. The four output trees are placed under the newly created Trees collection.

2.3.4 Iterations over Cross Products

Via the parameters for the scope and the input assembler, the workflow developer declares which data is used as input for the black boxes. Using the foreach construct, cross products can easily be declared without the need for explicit routing and token repetition. The input data for these multiple black-box invocations can be given as literal values in the binding expressions or selected from the actor's input data stream. Fig. 2.16 shows how the CipresRAxML actor, configured as in Fig. 2.11, transforms an incoming data stream: the binding expressions select the CharacterMatrix and WeightVector inside the Nexus collection. For the model parameter, all Strings under //Model (here "GTRCAT" and "GTRMIX") are selected inside the foreach construct; for the rate cats input, multiple parameters are provided as literal values. The CipresRAxML service is invoked four times, since there are two values for the model ("GTRCAT" and "GTRMIX") and two values for the rate cats parameter (25 and 100). The resulting trees are placed inside a new collection labeled Trees, according to the write expression ω. Since input creation and iterative invocation of the black-box are part of the workflow infrastructure, no explicit loops or repeater shims have to be placed around the actor. A sketch of this cross-product construction follows below.
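A minimal sketch of how the staged invocations could be enumerated (the binding lists are hard-coded stand-ins for what σ and γ would extract, and the function name cipres_raxml is a placeholder, not a real service client):

from itertools import product

# Per-port binding results, as γ would produce them for one //Nexus scope match:
bindings = {
    "model":            ["GTRCAT", "GTRMIX"],     # from foreach $p in //Model
    "cha_matrix":       ["<matrix>"],             # //CharacterMatrix
    "weight_vec":       ["<weights>"],            # //WeightVector
    "rate_cats":        [25, 100],                # literal values {25},{100}
    "init_rearr_limit": [100],                    # literal value
}

def cipres_raxml(**kwargs):          # placeholder black-box invocation
    return f"tree({kwargs['model']},{kwargs['rate_cats']})"

ports = list(bindings)
invocations = [dict(zip(ports, combo)) for combo in product(*bindings.values())]
trees = [cipres_raxml(**inv) for inv in invocations]
print(len(invocations), trees)       # 4 invocations, four result trees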

2.3.5 Workflow Evolution

The linear form of VDAL workflows makes it easy to combine the use cases and design patterns described above in a single workflow. The sample workflow from Fig. 2.6(a), with its three scientific actors, is modeled as a VDAL workflow with just three actors in Fig. 2.6(b). A virtual data assembly line localizes control flow inside well-structured, configurable shells around black-box actors. Data is not broken up and scattered across different wires, but flows as XML-like structures from one VDAL actor to the next. Each actor can locally determine which portions of the incoming data stream are of interest and which can be ignored and passed unprocessed to downstream actors. To insert an actor into a VDAL workflow, the workflow creator primarily needs to know the XML schema of the stream between the actors. These schemas often correspond to folder structures or scientifically meaningful hierarchies, and can thus be very intuitive for the modeler [MBL06]. One can also co-design schema and workflow while automatically propagating the schema through already configured actors, using the techniques described in Chapter 6.

2.4 Related Work

Related work has been done in the context of Kepler [LAB+ 06] and Ptolemy [Pto06], on which Kepler is based. Here, theoretical research has been conducted on comparing


different models of computation [LSV98]. Furthermore, an actor language called CAL has been developed in [EJ03]. Our approach focuses on the integration of existing components rather than on designing a new general-purpose language. There also exist many other scientific workflow tools, such as Taverna [OGA+ 02], Triana [TSWH07], and Vistrails [BCC+ 05, FSC+ 06]. Taverna's data model is list-based and supports nesting. Its workflow language supports an implicit looping construct that is comparable in spirit to the VDAL idea, insofar as the necessary map operations for applying scientific functions to incoming lists of data can be inferred automatically. It is also possible to declare different looping strategies (cross product, dot product) for actors with multiple lists as input [TMG+ 07, HS08]. However, Taverna's data model does not use labels, and data selection is thus not as flexible as in VDAL workflows. In the scientific workflow community, workflow design issues have received comparatively little attention in the past, but their importance is now more fully emerging [BL05, Dee05, HKS+ 05, FSC+ 06, KSC+ 08, LLF+ 09, ABE+ 09, vHHH+ 09]. Hull et al. [HSL+ 04] highlight and classify some of the problems relating to shims and propose to treat them by employing ontologies as semantic types. An approach to infer data-transformation shims from semantic annotations is described by Bowers et al. in [BL04]. The authors of [GRD+ 07] emphasize the use of semantic representations and planning in scientific workflows. Arguably the work closest to ours with respect to the goal of solving the shimming problem is [LLF+ 09], which focuses on type mismatches between the data types of consecutive steps and calls this a shimming problem of "Type I"; the authors describe as "Type II" the problem of mismatching connections of a scientific task that is nested within another, enclosing component that wraps the inner task. Our results extend [LLF+ 09] (e.g., we consider additional design challenges, not just Type I and II problems), and are complementary to, e.g., [HSL+ 04, BL05, GRD+ 07] (e.g., we do not consider semantic mismatches). In Chapter 1 and [MBZL09], some general advantages of a collection-oriented, assembly-line style design were presented, but no concrete examples were discussed.

2.5 Summary

In this chapter, we have presented VDAL, a scientific workflow modeling and design paradigm that aims at minimizing the "shimantic web syndrome" [HSL+ 04], i.e., the proliferation of unnecessarily complex workflow designs that involve large numbers of shim actors and "messy wiring", thus obfuscating the scientific protocol that the workflow should capture in the first place. VDAL borrows ideas from, among others, flow-based programming [Mor94], functional programming (e.g., map γ), and, most importantly, Kepler/Comad [MB05, MBL06]. At its heart lies the idea of a virtual data assembly line, where nested data collections are streamed through a largely linear chain of VDAL actors, each of which has a built-in, easily configurable data access and management layer for selecting a substream of relevant input elements (σ), from which concrete data inputs can be further sub-selected and reorganized (γ) before being fed to the innermost scientific black-box function (A), and for placing A's results at suitable positions in the output stream (ω). This architecture (i) eliminates many "data massaging shims", since their functionality becomes part of the actor configuration, given by the standard operations (σ, γ, ω), and (ii) minimizes non-local wiring:6 e.g., in Fig. 2.4 the actors BooleanSwitch and DetermMerge are directly connected via one channel and at the same time non-locally wired through a subworkflow A. In contrast, a VDAL workflow designer can understand a workflow locally, by inspecting its actor configurations; global effects, in turn, can be inferred using static analysis if necessary; see Chapters 4, 6 and 7.

6 A connected pair of actors A→B has non-local wiring, if there is an alternate, indirect path A→ · · · →B between them.


Chapter 3

Optimization I: Exploiting Data Parallelism

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

In the previous chapter, we showed that the design of scientific workflows can benefit from a collection-oriented modeling paradigm that views scientific workflows as pipelines of XML stream processors. In this chapter, which is adapted from [ZBKL09], we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies targeting the MapReduce framework.

Contributions.

The main contributions of this chapter are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as the implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach well suited for executing large-scale, compute-intensive, XML-based scientific workflows.


[Figure 3.1 shows the five-step XML pipeline with scopes //B, //C, //D, //C, //B (top); sample input data with #B = 4 and #C = 3, where each C subtree holds data nodes D1 ... D100 (bottom left); and the data partitioning into one B-rooted fragment per scope match, as used for Steps 1 and 5 (bottom right).]

Figure 3.1: XML Pipeline Example. XML pipeline (top), where each step is labeled with its scope of work (e.g., //B, //C), shown with sample input data (bottom left) and data partitioning for Steps 1 and 5 (bottom right).

Outline. The rest of the chapter is organized as follows. In Section 3.1, we present a VDAL workflow that we use throughout the chapter. Section 3.2 briefly reviews the MapReduce paradigm with an example that demonstrates the features utilized in our strategies. In Section 3.3, we describe a framework for XML processing pipelines, introduce important notions for their parallel execution, and give several pipeline examples. Section 3.4 presents our three parallelization strategies as well as their advantages and trade-offs. In Section 3.5, we present our experimental evaluation. We discuss related work in Section 3.6 and conclude in Section 3.7.

3.1 Introductory Example

Consider the simple XML processing pipeline shown in Fig. 3.1. This pipeline consists of five steps, each of which (i) receives XML structures from previous steps, and (ii) works over specific XML fragments (subtrees) within these structures. These fragments are determined through XPath expressions that specify the "scope" of a step. Steps are invoked over each scope match (i.e., matching XPath fragment), and steps can perform arbitrary


modifications to matched fragments using general XML processing techniques (e.g., XQuery, XSLT). The modifications made by steps often involve calling built-in (scientific) functions whose outputs are added within the matched fragment or used to replace existing parts of the fragment. The result is a modified (i.e., updated) XML structure that is passed to subsequent steps. As an example, the first step of Fig. 3.1 has the scope "//B", allowing it to perform arbitrary changes on "B"-rooted subtrees, i.e., new data items or subtrees can be inserted anywhere below the "B" node. However, for the middle step with scope "//D", changes may only be performed at the leaves of the structure shown in the bottom left of Fig. 3.1. To exploit data parallelism, we map scope matches (fragments) to "work-pieces" that are then processed in parallel by MapReduce. The bottom right of Fig. 3.1 shows how the data is partitioned for the scope "//B" as used in Steps 1 and 5. A naive adoption of this approach, however, can lead to bottlenecks in the splitting and regrouping phases of the parallel MapReduce execution. For example, from Step 1 to Step 2 the subtrees shown at the bottom right of Fig. 3.1 must be partitioned further; grouping all work-pieces together again only to re-split them for the second task is clearly inefficient. Furthermore, from Step 3 to Step 4, the "D"-rooted trees must be re-grouped to form trees rooted at "C". Performing this grouping in a single global task also adds an unnecessary bottleneck, because all required re-groupings could be done in parallel for each "C"-rooted subtree.

Detailed Contributions.

We describe and evaluate three novel strategies—Naive, XMLFS, and Parallel—for executing XML processing pipelines via the MapReduce framework. The Naive strategy deploys a simple step-wise splitting as outlined above. The XMLFS strategy maps XML data into a distributed file system to eliminate the grouping bottleneck of the Naive strategy. The Parallel strategy further utilizes existing splits to re-split the data in parallel, thereby fully exploiting the grouping and sorting facilities of MapReduce. In general, each of these strategies offers distinct advantages for applying MapReduce to data-parallel processing of XML.


We also present a thorough experimental evaluation of our strategies. Our experiments show a twenty-fold speed-up (with 30 hosts) in comparison to a serial execution, even when the basic Naive strategy is used. We also show that our Parallel approach significantly outperforms Naive and XMLFS for large data and when the cost for splitting and grouping becomes substantial. We consider a wide range of factors in our experiments—including the number of mapper tasks, the size of data and the XML nesting structure, and different computational load patterns—and we show how these factors influence the overall processing time while using our strategies.

3.2 MapReduce

MapReduce [DG08] is a software framework for writing parallel programs. Unlike with PVM or MPI, where the programmer is given the choice of how different processes communicate with each other to achieve a common task, MapReduce provides a fixed programming scheme. A programmer employing the MapReduce framework implements map and reduce functions, and the MapReduce library carries out the execution of these functions over the corresponding input data. While restricting the freedom of how processes communicate with each other, the MapReduce framework is able to automate many of the details that must be considered when writing parallel programs, e.g., check-pointing, execution monitoring, distributed deployment, and restarting individual tasks. Furthermore, MapReduce implementations usually supply their own distributed file systems that provide a scalable mechanism for storing large amounts of data.

Programming model. Writing an application using MapReduce mainly requires designing a map function and a reduce function together with the data types they operate on. Map and reduce implement the following signatures:

map    :: (K1, V1) → [(K2, V2)]
reduce :: (K2, [V2]) → [(K3, V3)]


where all Ki and Vi are user-defined data types. The map function transforms a key-value pair (kv-pair for short) into a list of kv-pairs, possibly of different types. The overall input of a MapReduction is a (typically large) list of kv-pairs of type (K1, V1). Each of these pairs is supplied as a parameter to a map call. Here, the user-defined map function can generate a (possibly empty) list of new (K2, V2) pairs. All (K2, V2) pairs output by the mapper are grouped according to their keys. Then, for each distinct key, the user-defined reduce function is called over the values associated with that key. In each invocation of reduce, the user can output a (possibly empty) list of kv-pairs of the user-defined type (K3, V3). The MapReduce framework divides the overall input data into kv-pairs and splits this potentially large list into smaller lists (so-called input splits). The details of generating kv-pairs (and input splits) can also be specified by the user via a custom split function. After kv-pairs are created and partitioned into input splits, the framework uses one separate map process for each input split. Map processes are typically spawned on different machines to leverage parallel resources. Similarly, multiple reduce processes can be configured to process the distinct keys output by the map processes in parallel.

Example. Assume we want to generate a histogram and an inverted index of words for a large number of text files (e.g., the works of Shakespeare), where the inverted index is represented as a table with columns word, count, and locations. For each distinct word in the input data there should be exactly one row in the table containing the word, how often it appears in the data, and a list of locations that specify where the word was found. To solve this problem using MapReduce, we design the type K1 to contain a filename as well as a line number (to specify a location), and the type V1 to hold the corresponding line of text of the file. When given a (location, text) pair, map emits a (word, location) pair for each word inside the current line of text. The MapReduce framework then groups all output data by words and calls the reduce function over each word and its corresponding list of locations to count the number of word occurrences. Reduce then emits the accumulated data (count, list of locations) for each processed word, i.e., the data structure V3 will


contain the required word count and the list of locations. The MapReduce framework can additionally sort values prior to passing them to the reduce function. The implementation of secondary sorting depends on the particular MapReduce framework. For example, in Hadoop [Bor07] it is possible to define custom comparators for keys K2 to determine the initial grouping as well as the order of values given to reduce calls. In our example above, we could design the key K2 to not only contain the word but also the location. We would define the “grouping” comparator to only compare the word part of the key, while the “sorting” comparator would ensure that all locations passed will be sorted by filename and line number. The reduce function will then receive all values of type location in sorted order, allowing sorted lists of locations to be easily created. In general, MapReduce provides a robust and scalable framework for executing parallel programs that can be expressed as combinations of map and reduce functions. To use MapReduce for parallel execution of XML processing pipelines, it is necessary to design data structures for keys and values as well as to implement the map and reduce functions. More complex computations can also make use of custom group and sort comparators as well as input splits.
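To make the inverted-index example concrete, here is a minimal, framework-free sketch of the map and reduce functions (a pure-Python simulation of the dataflow, not Hadoop code; the whitespace tokenization is a simplifying assumption):

from collections import defaultdict

def map_fn(location, text):
    """(filename, line_no), line of text  ->  [(word, location)]"""
    return [(word.lower(), location) for word in text.split()]

def reduce_fn(word, locations):
    """word, [location]  ->  (word, (count, sorted locations))"""
    return (word, (len(locations), sorted(locations)))

# Simulate the framework: run map over all inputs, group by key, then reduce.
inputs = [(("hamlet.txt", 1), "to be or not to be")]
grouped = defaultdict(list)
for key, value in inputs:
    for word, loc in map_fn(key, value):
        grouped[word].append(loc)

index = dict(reduce_fn(w, locs) for w, locs in grouped.items())
print(index["to"])   # (2, [('hamlet.txt', 1), ('hamlet.txt', 1)])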

3.3 Framework

The general idea behind transforming XML processing pipelines to MapReduce is to use map processes to parallelize the execution of each pipeline task according to the task's scope expression. For each scope match, the necessary input data is provided to the map tasks, and after all map calls have executed, the results are further processed to form either the appropriate input structures for the next task in the pipeline or the overall output data. For example, consider again the pipeline from Fig. 3.1. The partitioning and re-grouping of XML data throughout the pipeline execution is shown in Fig. 3.2: the data in the first row is split into pieces such that at most one complete "B" subtree is in each fragment, which is then processed in parallel with the other fragments. Further splits then occur for the scopes "C" and "D", respectively. Data is later re-grouped to ensure that all elements corresponding to a scope match are available as a single fragment.

[Figure 3.2 shows the token list A[ B[ C[ D[] ... D[] ]C ... C[ D[] ... D[] ]C ]B ... B[ ... ]B ]A once per stage (in, //B, //C, //D, //C, //B, out), with split operations (S) between the first three stages and group operations (G) between the last three, indicating how the fragment boundaries move through the data.]

Figure 3.2: Splits and groups for the strategy "Parallel". For each step in the pipeline the data is partitioned such that all data for one scope match is inside one fragment, while each fragment holds at most one match.

In the following, we define our data model and our assumptions concerning XML processing pipelines. We also characterize the operations that may be performed on single fragments within map calls (i.e., by pipeline tasks) to guarantee safe parallel execution.

3.3.1 XML Processing Pipelines

We assume XML processing pipelines that adopt the standard XML data model corresponding to labeled ordered trees represented as sequences of tokens; namely, opening tags “[T ”, closing tags “]T ”, and data nodes “#d ”. Data nodes typically represent data products whose specific format is understood by the software components implementing pipeline tasks, but not by the XML framework itself, which treats data as opaque CData nodes. For instance, data nodes may represent binary data objects (such as images) or simple text-based data (e.g., DNA sequence alignments). Pipeline tasks typically call “scientific functions” that receive data nodes as input and produce data nodes as output. In addition, tasks are annotated with scopes that define where in the overall XML structure input data is taken from and output data is placed.


Each scope specifies XML fragments within the input structure that represent the context of a task. Pipeline tasks may insert data (including XML tree data) anywhere within their corresponding context fragments or as siblings of the fragments, and remove data or complete subtrees anywhere within their fragments (including removing the fragments themselves). It is often the case that a given XML structure contains multiple matching fragments for a task; in this case, the task is executed over each such match. We assume tasks do not maintain state between executions, thus allowing them to be safely executed in parallel over a given XML structure via the MapReduce framework. More formally, a pipeline consists of tasks T (or actors), each of which updates an XML structure X to form a new structure X′. Further, T = (σ, A), where the scope σ is a (simple) qualifier-free XPath expression consisting of child (/) and descendant-or-self (//) axes, and A is a function over XML structures. A subtree si in an input XML structure X is a scope match if σ(X) selects the root node of si. For nested scope matches, only the highest-level match in X is considered—a common restriction (e.g., [BCF03]) for avoiding nested executions. Formally, σ selects n non-overlapping subtrees si from X:

σ(X) = {s1, . . . , sn}.

The function A is then called on each of these subtrees to produce a new XML fragment, i.e., for each si: s′i = A(si). The output document X′ is then formed by replacing each si subtree in X by the respective output s′i:

X′ = X[s1 ↦ s′1, s2 ↦ s′2, . . . ].

We require that A be a function in the mathematical sense, i.e., a result s′i depends only on its corresponding input si. This implies that s′i can be computed from si independently of data inside other fragments sj or of completely non-matched data in X.2

3.3.2 Operations on Token Lists

During pipeline execution we represent XML data as a sequence (i.e., list) of tokens of the form [T, ]T, and #d. By convention, we use capital letters to denote token lists and lowercase letters to denote trees and (ordered) forests.3 Token lists are partitioned into fragments that are sent to map calls for concurrent processing. Below we characterize the changes the map calls may perform on fragments to avoid jeopardizing overall data integrity. Note that the proposed rules can be followed locally and thus eliminate the need for more involved locking mechanisms.

Definition (Balanced Token List). Consider the following rules to modify token lists:

A #d B    ⇒  A B       A, B ∈ Token List     (3.1)
A [X ]X B ⇒  A B       A, B ∈ Token List     (3.2)

Rule (3.1) deletes any data node, whereas (3.2) deletes matching "open" and "close" tokens if they are adjacent within the sequence and have matching labels. As usual, we write T ⇒* T′ if there exists a sequence of token lists Ti such that T = T1 ⇒ T2 ⇒ · · · ⇒ Tn = T′. A token list T is balanced if it can be reduced to the empty list, i.e., T ⇒* [ ]. Note that ⇒* is normalizing, i.e., if T ⇒* [ ] and T ⇒* T′ then T′ ⇒* [ ]. This means that for a balanced list T, applying the deletion rules (3.1) and (3.2) in any order will terminate in the empty list (by induction on the list length). Also note that an XML forest naturally corresponds to a balanced token list and vice versa. As described above, we want calls to map to compute new forests s′i from existing trees si. In particular, s′i can be computed by performing tree insertion and tree deletion operations in an arbitrary order on si.

2 In essence, we perform a "map A" on the list of scope matches, with map being the standard map function of functional programming languages. We thus require that A is a function in order to parallelize the A invocations.
3 Although the term hedge seems more appropriate, as it implies an order on the list of trees, we conform to most existing literature and use the term forest here to denote an ordered list of trees.


The following operations on token lists correspond to these allowed operations on trees.

Observation (Safe insertions). Inserting a balanced token list I at any position into a balanced token list T corresponds to inserting the forest i into the forest t (where forests i and t correspond to token lists I and T, respectively). In particular, this operation results in a balanced token list T′. We call such an insertion a safe insertion. Proof: The result T′ is trivially a balanced token list, since the newly inserted balanced fragment I can be deleted from T′ via the deletion rules given above, resulting in T, which is balanced. Furthermore, the balanced token list I corresponds to a forest i. Since any location between two tokens in the list T corresponds to a position in the forest t, a safe insertion inserts i at this position.



Note that insertions which merely maintain the "balance" of a sequence, but are not safe, can change the ancestors of already existing nodes. Consider inserting the unbalanced fragment "]A [A" into the middle of the balanced token list "[A #d #d ]A": this results in the balanced list "[A #d ]A [A #d ]A", but the second #d token has changed its parent node without explicitly being touched.

Observation (Safe deletions). Removing a consecutive and balanced token list D from a balanced token list T results in a balanced token list T′. This operation corresponds to deleting the forest d from t. We call such a deletion a safe deletion. Proof: T′ is trivially balanced since "⇒" is normalizing.

Corollary 1 (Safe operations carry over to fragments of token lists). Viewing token-list fragments as parts of the total (balanced) token list, we can perform safe insertions and safe deletions to carry out the desired operations inside the scope of a pipeline task.

Corollary 1 ensures that map calls can change their fragments by performing safe insertions and deletions without interfering with the data of other map calls. Moreover, since the complete scope is inside the fragment received by the map call, each map call is able to delete its scope match, or to perform any "localized" operations on it using all the data inside its scope.
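A minimal sketch of the balance check and of a safe insertion on token lists (tokens are modeled as simple Python tuples; this is an illustration of rules (3.1) and (3.2), not the dissertation's implementation):

def is_balanced(tokens):
    """Apply rules (3.1)/(3.2) implicitly: data tokens are ignored,
    open/close tokens must nest properly; balanced iff reducible to []."""
    stack = []
    for kind, label in tokens:
        if kind == "data":
            continue                      # rule (3.1): #d can always be deleted
        if kind == "open":
            stack.append(label)
        elif kind == "close":
            if not stack or stack.pop() != label:
                return False              # rule (3.2) can never fire here
    return not stack

def safe_insert(tokens, pos, fragment):
    """Insert a *balanced* fragment at any position: the result stays balanced."""
    assert is_balanced(fragment)
    return tokens[:pos] + fragment + tokens[pos:]

T = [("open", "A"), ("data", "d"), ("data", "d"), ("close", "A")]   # [A #d #d ]A
print(is_balanced(T))                                               # True
print(is_balanced(safe_insert(T, 2, [("open", "B"), ("close", "B")])))  # True
# The unbalanced fragment ]A [A from the text would be rejected by the assert.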


[Figure 3.3 shows the image transformation pipeline with read scopes blur (σ: //C), colorize (σ: //C), and montage (σ: //B); the data view is an ⟨A⟩ collection containing ⟨B⟩ collections, each holding ⟨C⟩ collections with one image d (d′ after blurring, d1–d4 after colorizing); sample input and output trees are shown for each step.]

Figure 3.3: Image transformation pipeline. All images are blurred; then, from each, four new images are created by coloring; and finally a big montage is created from all images below each ⟨B⟩.

3.3.3 XML-Pipeline Example

In addition to the simple pipeline introduced in Fig. 3.1, we also consider a common image processing pipeline, shown in Fig. 3.3. This pipeline is similar to a number of (more complicated) scientific workflows that perform image processing, e.g., in functional Magnetic Resonance Imaging (fMRI) [ZDF+ 05] and plasma-fusion data analysis [PLK07]. The pipeline employs the convert and montage tools from the ImageMagick [Ima08] suite to process multiple images organized according to nested XML collections. As shown in Fig. 3.3, a top-level ⟨A⟩ collection contains several ⟨B⟩ collections, each of which contains several ⟨C⟩ collections with an image d inside. This collection structure represents different independent tasks (⟨B⟩ collections); within each task, a varying number of images is processed into one montage (the ⟨C⟩ collections within each ⟨B⟩). The first step of the pipeline blurs the images (via the "convert -blur" command). Since this operation is performed on each image separately, we define the task's scope σ using the XPath expression //C, and its corresponding function A such that it replaces the image within its scope with the modified image resulting from invoking the blur operation. The next step in the pipeline creates a series of four colorized images from each blurred image d′ using the command "convert -modulate 100,100,i" with four different values for i. The last step of the pipeline combines all images under one ⟨B⟩ collection into a single large image using the montage tool. The scope σ of this last task is //B, since all images inside a ⟨B⟩-labeled tree are input to the montage operation. Here the framework groups previously split fragments to provide the complete ⟨B⟩ subtree to the montage task. A sketch of such step functions follows below.
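A minimal sketch of how such step functions could shell out to the ImageMagick tools for a single scope match (the file paths and the concrete blur radius and hue values are assumptions made for the example, not values taken from the dissertation):

import subprocess

def blur_step(image_path, out_path):
    # corresponds to the first //C step: one image in, one blurred image out
    subprocess.run(["convert", image_path, "-blur", "0x2", out_path], check=True)

def colorize_step(image_path, out_prefix):
    # corresponds to the second //C step: four colorized variants per image
    outputs = []
    for i in (60, 120, 180, 240):          # assumed hue values
        out = f"{out_prefix}_{i}.png"
        subprocess.run(["convert", image_path, "-modulate", f"100,100,{i}", out],
                       check=True)
        outputs.append(out)
    return outputs

def montage_step(image_paths, out_path):
    # corresponds to the //B step: combine all images below one <B> collection
    subprocess.run(["montage", *image_paths, out_path], check=True)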

3.4 Parallelization Strategies

We consider three strategies, Naive, XMLFS and Parallel, whose main differences are shown in Fig. 3.4. These strategies use variations of key-value data structures as well as split, map, and reduce operations, and build upon each other to address various shortcomings that arise in large-scale and compute-intensive processing of nested data.

3.4.1 Naive Strategy

The Naive approach corresponds to a straightforward application of MapReduce over XML data. As shown in Fig. 3.4, we cut XML token sequences into pieces for the map calls, perform the task's operation A on its scope, and finally merge the results in the reduce step of the MapReduction to form the final output.4 The Naive approach uses the following data structures for keys and values:

Key:   Integer
Token  := XOpen | XClose | Data
Value: TList := List of Token

For each task in an XML pipeline, we create a MapReduction with split, map, and reduce as shown in listing 3.1. From an XML structure, SplitNaive creates a kv-pair for each match of the task's scope: each pair comprises an Integer as key and a TList as value.

4 This parallelization is a form of a simple scatter-and-gather pattern.

[Figure 3.4 compares the dataflow of the three strategies. Naive: one input file is cut by a single Split, the fragments are processed by parallel Map calls (A), and a single Reduce merges them into one output file. XMLFS: the input is read from and the output written to the Hadoop file system using a naming scheme; the split for task i happens in the map phase and grouping is done implicitly by the file system. Parallel: a combined Map+Split phase produces the split for task i+1 in parallel, and grouping (G) is performed by parallel Reduce tasks.]

Figure 3.4: Processes and dataflow for the three parallelization strategies.

To decide whether the current token opens a new scope match (line 4 of listing 3.1), we use a straightforward technique to convert the qualifier-free, simple XPath expression σ into a deterministic finite-state automaton (DFA) reading strings of opening tokens. The DFA accepts when the read string conforms to σ. Using a stack of DFA states, we keep track of the currently open tags: we push the current state for an open token and reset the DFA to the state popped from the stack when a closing token is read. To prevent nested scope matches, we simply go into a non-accepting state with a self-loop after we encounter a match; note that closing the match "pops" the automaton back to the state before the match. We are able to use this simple and efficient approach for streaming token lists because of the simplicity of the XPath fragment in which scopes are expressed.5 The first pair constructed by SplitNaive contains all XML tokens before the first match, and each consecutive pair contains the matching data, possibly followed by non-matching data. Each pair is then processed by MapNaive. Finally, ReduceNaive merges all data fragments back into the final XML structure. Since our grouping comparator always returns "equal", the single reduce task receives all output from the mappers; moreover, the fragments are received in document order, because the MapReduce framework sorts the values based on the key, which is increasing in document order. The output structure can now be used as input data for the next MapReduce, which executes the next step in the pipeline.

5 In general, this fragment is sufficient for modeling many scientific applications and workflows. Considering more complex fragments of XPath together with available streaming algorithms for them, such as those in [BCG+ 03, GS03], is beyond the scope of this chapter.

 1  SplitNaive: TList input, XPath σ → [(Integer, TList)]
 2    int i := 0; TList splitOut := [ ]
 3    FOREACH token IN input DO
 4      IF (token opens new scope match with σ) AND (splitOut ≠ [ ]) THEN
 5        EMIT (i, splitOut)              // one split for each scope match
 6        i++; splitOut := [ ]
 7      splitOut.append(token)
 8    EMIT (i, splitOut)
 9
10  MapNaive: Integer s, TList val → [(Integer, TList)]
11    val′ := A(val)                      // execute pipeline task
12    EMIT (s, val′)
13
14  ReduceNaive: Integer s, [TList] vals → [(Integer, TList)]
15    TList output := [ ]
16    WHILE vals.notEmpty() DO
17      output.append(vals.getValue())    // collapse to single value
18    EMIT (0, output)

Listing 3.1: Split, Map, Reduce for the Naive strategy
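For illustration, here is a small Python sketch of the same splitting idea for a single-label scope such as //B (it uses a tag-depth counter instead of the full DFA described in the text; the token and key representations are simplified assumptions):

def split_naive(tokens, scope_label):
    """Yield (key, fragment) kv-pairs; a new fragment starts whenever an
    opening token for scope_label is seen outside an existing match."""
    key, fragment, match_depth = 0, [], 0
    for kind, label in tokens:                       # kind in {open, close, data}
        opens_match = kind == "open" and label == scope_label and match_depth == 0
        if opens_match and fragment:
            yield key, fragment                      # one split per scope match
            key, fragment = key + 1, []
        if kind == "open" and (label == scope_label or match_depth > 0):
            match_depth += 1
        elif kind == "close" and match_depth > 0:
            match_depth -= 1
        fragment.append((kind, label))
    yield key, fragment

tokens = [("open", "A"), ("open", "B"), ("data", "d"), ("close", "B"),
          ("open", "B"), ("close", "B"), ("close", "A")]
print(list(split_naive(tokens, "B")))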

Shortcomings of the Naive Strategy

The major shortcoming of the Naive approach is that, although the data is processed in parallel by the calls to map, both splitting and grouping of the token lists are performed by a single task. Split and reduce can thus easily become a bottleneck for the execution of the pipeline.

3.4.2 XMLFS Strategy

The XMLFS strategy removes the bottleneck in the reduce phase of the Naive approach by mapping XML structures to a distributed file system (see Fig. 3.4). Many MapReduce implementations, including Hadoop and Google’s MapReduce, provide a distributed file system that allows efficient and fault-tolerant storage of data in the usual hierarchical manner of directories and files, and this distributed file system is employed in the XMLFS approach as follows. Mapping XML structures to a file system. An XML document naturally corresponds


to a file-system-based representation by mapping XML nodes to directories and data nodes to files. We encode the ordering of the XML data by prepending the XML labels with identifiers (IDs) to form directory and file names. The IDs also naturally ensure that no two elements in the same directory have the same name in the file system, even if they have the same tag. Note that although we do not explicitly consider XML attributes here, we could, e.g., store them in a file with a designated name inside the directory of the associated XML element. Using a file-system-based representation of XML data has many advantages: (1) XML structures can be browsed using standard file-system utilities; the Hadoop software package, e.g., provides a web-based file-system browser for its Hadoop file system (hdfs) [Bor07]. (2) Large amounts of XML data can easily be stored in a fault-tolerant manner; both Hadoop-fs and the Google File System provide distributed, fault-tolerant storage and, specifically, allow users to specify a replication factor to control how many copies of the data are maintained. (3) The file system implementation provides a natural "access index" to the XML structure: in comparison to a naive token-list representation, navigating into a subtree t can be performed using simple directory changes without having to read all tokens corresponding to subtrees before t. (4) Applications can access the "distributed" XML representation in parallel, assuming that changes to the tree and data are made at different locations; in particular, pipeline steps can write their output data s′i in parallel. A sketch of the directory mapping follows below.
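A minimal sketch of the directory naming idea (the ID-to-string format, the gap value, and the example tree are assumptions; real XMLFS keys would additionally carry the leading path and fragment delimiters described below):

def to_paths(node, prefix="", gap=1000):
    """Map an XML node (tag, children_or_data) to file-system paths.
    Directory/file names are 'ID.label' so siblings stay ordered and unique."""
    paths = []
    _, content = node
    for i, child in enumerate(content, start=1):
        ident = str(i * gap)                       # gapped local IDs
        if isinstance(child, tuple):               # element -> directory
            paths += to_paths(child, f"{prefix}/{ident}.{child[0]}", gap)
        else:                                      # data node -> file
            paths.append(f"{prefix}/{ident}.data")
    return paths or [prefix]

tree = ("A", [("B", ["img1", "img2"]), ("B", ["img3"])])
for p in to_paths(tree, "/1000.A"):
    print(p)
# /1000.A/1000.B/1000.data, /1000.A/1000.B/2000.data, /1000.A/2000.B/1000.data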

XMLFS-Write. XMLFS adapts the Naive approach to remove its bottleneck in the reduce phase. Instead of merging the data into one large XML structure, we let each task write its modified data s′i directly into the distributed file system. Since we do not need to explicitly group token lists together to form bigger fragments, we can perform this operation


directly in the map calls. This approach removes the overhead of shuffling data between map and reduce calls as well as the overhead of invoking reduce steps. In particular, the XMLFS strategy does not require the grouping and sorting feature of the MapReduce framework since each task is implemented directly within the map function. In XMLFS, the file system layer performs an implicit grouping as opposed to the explicit grouping in the Naive reduce function. When map calls write the processed XML token list T to the file system, the current path p from the XML root to the first element in T needs to be available since the data in T will be stored under the path p in the file system. We add this information to the key as a list of open tags, called the leading path. IDs for maintaining order among siblings must also be available. Since map calls may not communicate with each other, the decisions about the IDs must be purely based on the received keys and values, and the modifications performed by a task’s operation A. Unfortunately, the received token lists are not completely independent: An opening token in one fragment might be closed only in one of the following fragments. Data that is inside such a fragment must be stored under the same directory on the file system by each involved map call independently. It is therefore essential for data integrity that all map calls use the same IDs for encoding the path from the document root to the current XML data. We now make these concepts more clear, stating requirements for IDs in general, as well as requirements for ID handling in split and map functions. Requirements for Token-Identifiers (IDs).

The following requirements need to be satisfied by IDs:
• Compact String Representation: We require a (relatively small) string representation of the ID to be included in the token's filename, since we must use the ID for storing the XML data in the distributed file system.
• Local Order: IDs can be used to order and disambiguate siblings with possibly equal labels. Note that we do not require a total order: IDs only need to be unique and ordered for nodes with the same parent.


• Fast Comparisons: Comparing two IDs should be fast.
• Stable Insertions and Deletions: Updates to XML structures should not affect already existing IDs. In particular, it should be possible to insert arbitrary data between two existing tokens, and to delete existing tokens without changing the IDs of tokens that have not been deleted.

Choice of IDs. Many different labeling schemes for XML data have been proposed; see [HHMW07] for a recent overview. For our purposes, any scheme that satisfies the requirements stated above could be used; these include the ORDPATHs described in [OOP+ 04] and the DeweyID-based labels presented in [HHMW07]. However, note that many proposed ID solutions (including the two schemes just mentioned) provide global IDs, facilitate navigation (e.g., along parent-child axes), and allow testing of certain relationships between nodes (e.g., whether a node is a descendant of another node). Since we only require local IDs, i.e., IDs that are unique only among siblings, and we do not use IDs for navigation or testing, we adopt a conceptually simpler (though less powerful) labeling scheme. Of course, our IDs could easily be replaced by ORDPATHs or other approaches if needed.

Simple decimal IDs. A natural choice for IDs is a set of objects that forms a totally ordered, dense space, such as the rational numbers: here we can find a new number m between any two existing numbers a and b, and thus do not need to change a or b to insert a new number between them. Using only those numbers that have a finite decimal representation (such as 0.1203, as opposed to 0.3 with a periodic 3), we also gain a feasible string representation. However, there is no reason to stick to base 10; we instead use Maxlong as the base for our IDs. Concretely, an ID is a list of longs, and the order relation is the standard lexicographical order over these lists. As a string representation we add "." between the single "digits". Since one digit already covers a large number of values, long lists can easily be avoided: to achieve short lists, we use a heuristic similar to the one proposed in [HHMW07] that works well in practice. When the initial IDs for a token stream are created, instead of numbering the tokens successively, we introduce a gap between the numbers (e.g., an increment of 1000). Note


that since we only label nodes locally, we can accommodate Maxlong/1000 sibling nodes6 with a one-"digit" ID during the initial labeling pass. With a gap of 1000, e.g., we can also insert a large number of new tokens into existing token lists before we need to add a second "digit"; in our tests, we never had to create an ID with more than one digit.

Splitting input data. Creating key-value pairs for the XMLFS strategy is similar to the Naive strategy, with the exception that we create and maintain IDs for XOpen and Data tokens. The XMLFS strategy uses the following data structures for keys and values:

:= List of Long

IDXOpen := Record{ id: ID, t: XOpen}

Key:

IDData

:= Record{ id: ID, t: Data}

IDToken

:= IDXOpen | IDData | XClose

XKey

:= Record{ start: ID, end: ID, lp: TIDList}

Value: TIDList

:= List of IDToken

In the key, we use lp to include the leading path from the XML root node to the first item in the TIDList stored in the value. As explained above, this information allows data to be written back to the file system based solely on the information encoded in a single key-value pair. Finally, we add the IDs start and end to the key, which denote fragment delimiters that are necessary for independently inserting data at the beginning or end of a fragment by map calls. For example, assume we want to insert data D before the very first token A in a fragment f (a task might want to insert a modified version of its scope before the scope). For the newly inserted D, we would need to choose an ID that is smaller than the ID of A. However, the ID must also be larger than the ID of the last token in the fragment that precedes f. Since the IDs form a dense space, it is not possible to know how close the new ID D.id should be to the already existing ID of A. Instead, we use the start ID in the key, which has the property that the last ID in the previous fragment is smaller than it. Thus, the newly inserted data item can be given an ID that lies midway between start and A.id.
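A minimal Python sketch of these list-of-longs IDs and the mid-point computation (illustrative only; the names and the exact mid-point rule are assumptions, not the dissertation's code):

    GAP = 1000                 # gap used during the initial labeling pass
    MAXLONG = 2**63 - 1        # one "digit" ranges over the longs

    def initial_ids(n):
        """IDs for n siblings created during the initial labeling pass."""
        return [[(i + 1) * GAP] for i in range(n)]

    def midpoint(a, b):
        """Return an ID strictly between a and b (lists of ints, a < b lexicographically)."""
        m = []
        for i in range(max(len(a), len(b))):
            x = a[i] if i < len(a) else 0
            y = b[i] if i < len(b) else MAXLONG
            if y - x >= 2:                 # room at this digit: place the new ID here
                m.append((x + y) // 2)
                break
            m.append(x)
        else:
            m.append(GAP)                  # no room in the shared prefix: add a digit
        assert a < m < b                   # holds for IDs created with gaps, as above
        return m

    # midpoint([0], [1000]) == [500];  midpoint([1000], [1001]) == [1000, 1000]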


Similarly, we store a mid-point ID end for insertion at the end of a TIDList. (When using ORDPATH IDs, we could instead exploit ORDPATH's careting method to generate an ID very close to an existing one; however, this technique increases the number of digits with each such insertion, which is generally not desired.) Listing 3.2 and Listing 3.3 give the algorithm for splitting input data into key-value pairs. We maintain a stack openTags of currently open collections to keep track of the IDs at the various levels of an XML structure as we iterate over the token list. Whenever we split the stream into fragments (line 11), we compute a mid-point of the previous token ID and the current one. The mid-point is then used as the end ID for the old fragment, and will later be the start ID of the fragment that follows. Note that we reset lastTokenID to "[0]" whenever we open a new collection, since our IDs are only local. Moreover, if we split immediately after a newly opened collection, the mid-point ID would be [500] (the middle of [0] and the first token's ID [1000]). It is thus possible to insert a token both at the beginning of a fragment and at the end of the previous fragment.

Map step for XMLFS. As in the Naive strategy, the map function in the XMLFS approach performs a task's operation A on its scope matches. Similarly, safe insertions and deletions are required to ensure data integrity in XMLFS. Whenever new data is inserted, a new ID is created that lies between the IDs of the neighboring sibling tokens. If tokens are inserted as the first child of a collection, the assigned ID lies between [0] and the ID of the next token. Similarly, if data is inserted as the last child of a node (i.e., as the last element of a collection), the assigned ID is larger than that of the previous token. Note that when performing only safe insertions and deletions, opening tokens that are closed in a following fragment cannot be changed. This guarantees that the leading path, which is stored in the key of the next fragment, is still valid after the updates on the values. Also, XClose tokens that close collections opened in a previous fragment cannot be altered by safe insertions and deletions, which ensures that the leading paths of the following fragments maintain their integrity. After data is processed by a map call, the token list is written to the file system.

SplitXMLFS: TIDList input, XPath σ → [(PKey, TIDList)]
  CALL Split(input, σ, [0], [maxlong], [ ])

MapXMLFS: PKey key, TIDList val → [(PKey, TIDList)]
  IF (key.lp / val[0]) matches scope σ
    val' := A(val)
  Store val' in distributed file system
  // No Reduce necessary, Map stores data

Listing 3.2: Split and Map for XMLFS
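The "Store val' in distributed file system" step above materializes the processed token list as directories and files, following the directory-to-XML mapping described in the next paragraph. A minimal Python sketch of this idea (the token attributes and helper names are assumptions, and a local directory stands in for hdfs):

    import os

    def id_str(tid):
        # string representation of an ID: its "digits" joined by "."
        return ".".join(str(d) for d in tid)

    def write_fragment(root, leading_path, tokens):
        """Write a fragment below 'root'; leading_path is a list of (id, label) pairs."""
        cwd = root
        for tid, label in leading_path:            # descend along the leading path
            cwd = os.path.join(cwd, id_str(tid) + "_" + label)
            os.makedirs(cwd, exist_ok=True)
        stack = [cwd]
        for tok in tokens:                         # tok assumed to have kind/id/label/payload
            if tok.kind == "XOpen":                # new directory becomes current directory
                d = os.path.join(stack[-1], id_str(tok.id) + "_" + tok.label)
                os.makedirs(d, exist_ok=True)
                stack.append(d)
            elif tok.kind == "Data":               # data token becomes a file named by its ID
                with open(os.path.join(stack[-1], id_str(tok.id)), "wb") as f:
                    f.write(tok.payload)
            else:                                  # XClose: back to the parent directory
                stack.pop()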

For this write operation, the leading path in the key is used to determine the starting position for writing tokens. Each data token is written into the current directory, using its ID to form the corresponding file name. For each XOpen token, a new directory is created (using the token's ID and label as the directory name) and is set as the current working directory. When an XClose token is encountered, the current directory is changed back to the parent directory.

Shortcomings of the XMLFS Strategy

Although the XMLFS approach addresses the bottleneck of having a single reduce step, splitting the data for the map tasks is still done in a single serial task, which can become a bottleneck for pipeline execution. Further, even the distributed file system can become a bottleneck when all map calls write their data in parallel. Often only a few servers (or even a single server) control the distributed file system's directory structure and metadata. As the logical grouping of the XML structure is performed "on the file system", these servers might not be able to keep up with the parallel accesses. Since both the Google file system and the Hadoop file system hdfs are optimized for handling a moderate number of large files rather than a large number of (small) files or directories, storing all data between tasks in the file system using the directory-to-XML mapping above can become inefficient for XML structures that have many nodes and small data tokens. Additionally, after data has been stored in the file system, it will be split again for further parallel processing by the next pipeline task.

 1  Split: TIDList input, XPath σ, ID startID, ID endID, TIDList lp
 2         → [(PKey, TIDList)]
 3    TIDList openTags    := lp       // list of currently open tags
 4    TIDList oldOpenTags := lp       // leading path
 5    ID lastEnd          := startID  // ending ID of last fragment
 6    ID lastTokenID      := startID  // ID of last token
 7    TIDList splitOut    := [ ]      // accumulator for fragment value
 8
 9    FOREACH token IN input DO
10      IF (openTags / token matches scope σ) AND (splitOut ≠ [ ]) THEN
11        ID newend := midPoint(lastTokenID, token.id)
12        key := NEW PKey(lastEnd, newend, oldOpenTags)
13        oldOpenTags := openTags
14        EMIT key, splitOut          // output current fragment
15        lastEnd := newend; splitOut := [ ]
16      splitOut.append(token)
17      IF token is IDData THEN
18        lastTokenID := token.id
19      IF token is IDXOpen THEN
20        openTags.append(token)
21        lastTokenID := [0]
22      IF token is XClose THEN
23        lastOpenToken := openTags.removeLast()
24        lastTokenID := lastOpenToken.id
25    ENDFOR
26
27    key := NEW PKey(lastEnd, endID, oldOpenTags)
28    EMIT key, splitOut              // don't forget the last piece

Listing 3.3: Split for XMLFS & Parallel


Thus, the file system representation must be transformed back into TIDLists. This work is unnecessary in principle, since the previous task already used a TIDList representation that was split for parallel processing. For example, consider two consecutive tasks that both have the same scope: instead of storing the TIDLists back into the file system, the first task's map function could directly pass the data to the map function of the second task. However, once consecutive tasks have different scopes, or substantially modify their data and thereby introduce new scope matches, simply passing data from one task's map function to the next does not work. We address this problem in the Parallel strategy defined below.
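For the special case of identical scopes just mentioned, chaining two map functions amounts to simple function composition. A minimal sketch (hypothetical function names; it assumes the first task introduces no new scope matches):

    def fuse(map_task1, map_task2):
        """Compose two per-fragment map functions that share the same scope."""
        def fused(key, value):
            out = []
            for k, v in map_task1(key, value):   # key-value pairs emitted by the first task
                out.extend(map_task2(k, v))      # fed directly into the second task's map
            return out
        return fused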

3.4.3

Parallel Strategy

The main goal of the Parallel strategy is to perform both splitting and grouping in parallel, thus providing a fully scalable solution. For this, we exploit the existing partitioning of the data from one task to the next, while still keeping the data corresponding to one scope inside a single key-value pair. Imagine two consecutive tasks A and B. If both tasks have the same scope, the data can simply be passed from one mapper to the next, provided A does not introduce additional scope matches for B, in which case we would need to split the fragments further. If the scope of task B is a refinement of A's scope, i.e., A's σ1 is a prefix of B's σ2, A's mapper can split its TIDList further and output multiple key-value pairs that correspond to B's invocations. However, it is also possible that a following task B has a scope that requires earlier splits to be undone; for example, if task A's scope is //A//B whereas task B's scope is only //A, then the finer-grained split data for A needs to be partially merged before it is presented to B's mappers. Another example is an unrelated regrouping: here, splitting and grouping are necessary to re-partition the data for the next scope. Even in this situation, we want to perform the operation efficiently in parallel. We use MapReduce's grouping and sorting capabilities to achieve this goal. In contrast to the Naive approach, we do not group all the data into one single TIDList; instead, the data is grouped into lists as they are needed by the next task. As we will show, this can be done in parallel.

[Figure: an example token stream shown in four rows: unpartitioned data, grouped by //D, intermediary groups after further splitting, and grouped by //B.]

Figure 3.5: Parallel re-grouping. Example of how to change the fragmentation from //D to //B in parallel. Since the splitting from row two to row three is performed independently in each fragment, this step can be performed in the mapper. The grouping from row three to row four is performed in parallel by the shuffling and sorting phase of MapReduce, such that the merge can be done in the reducers, also in parallel.

3.4.4

Parallel Strategy in Detail

Consider an arbitrarily partitioned TIDList. Fig. 3.5 shows an example of such a grouped list in the second row. Each rectangle corresponds to one key-value pair (or fragment): the value (a TIDList) is written at the bottom of the box, whereas the key is symbolized at the top left of the box. IDXOpen and IDData tokens are depicted with their corresponding IDs as subscripts; XClose tokens do not have an ID. For ease of presentation, we use decimal numbers to represent IDs, with the initial tokens having consecutively numbered IDs. The smaller text line at the top of each box shows the leading path lp together with the ID start. The key's ID end is not shown; it always equals the start ID of the next fragment and is a very high number for the last fragment. The first box in the second row, for example, depicts a key-value pair whose value consists of two XOpen tokens, each of which has the ID 1. The leading path in the key is empty, and the start ID of this fragment is 0.5. Similarly, the second box represents a fragment whose value is only a token [D with ID 1. Its leading path is A1[ B1[ , and the start ID of this fragment is 0.5.


Now, consider that the split shown in the second row of Fig. 3.5 is the result after the task's action A has been performed in the mappers, and assume the next task has a scope of //B. In order to re-fragment an arbitrary split into another split, two steps are performed: a split and a merge operation.

Split-Operation. Inside the mapper, each fragment (or key-value pair) is examined to determine whether additional splits are necessary to comply with the required final fragmentation. Since each fragment has the leading path, a start ID, and an end ID encoded in its key, we can use the algorithm Split given in Listing 3.3 to split fragments further. In Fig. 3.5, for instance, each fragment in the second row is examined to see whether it needs further splits: the first and the fourth fragment will be split since they each contain a token [B. If there is one fragment with many B subtrees, it will be split into many different key-value pairs, just as in the previous approach. Note that this split operation is performed on each fragment independently of the others. We therefore execute Split in parallel inside the mappers, as shown in the dataflow graph on the right side of Fig. 3.4 on page 71. The pseudo-code for the mapper task is given in Listing 3.4; the additional split is performed in line 6.

Merge-Operation. The fragments output by the split operation contain enough split points such that at most one scope match is in each fragment. However, it is possible that the data within one scope is spread over multiple neighboring fragments. In Fig. 3.5, for example, the first B subtree is spread over three fragments (fragments 2, 3, and 4 in the row showing the intermediary groups). We use MapReduce's ability to group key-value pairs to merge the correct fragments in a reduce step. For this, we put additional GroupBy information into the key. In particular, the key and value data structures for the Parallel strategy are as follows:

    GroupBy := Record{ group: Bool, gpath: TIDList }

    Key:   PKey    := Record of XKey and GroupBy
    Value: TIDList := List of IDToken

 1  MapParallel: PKey key, TIDList val → [(PKey, TIDList)]
 2    IF (key.lp / val[0]) matches scope σ
 3      val' := A(val)
 4    List of (PKey, TIDList) outlist
 5    // split according to the scope σ' of the following step
 6    outlist := Split(val', σ', key.start, key.end, key.lp)
 7    FOREACH (key, fragment) ∈ outlist DO
 8      EMIT(key, fragment)
 9
10  ReduceParallel: PKey key, [TIDList] vs → [(PKey, TIDList)]
11    TIDList out := [ ]
12    WHILE (val := vs.next())
13      out.append(val)
14    key.end := val.end   // set end in key to end of last fragment
15    EMIT(key, out)

Listing 3.4: Map and Reduce for Parallel

Fragments that do not contain tokens within the scope simply set the group flag to false and will thus not be grouped with other fragments by the MapReduce framework. In contrast, fragments that contain relevant matching tokens have the group flag set. For these, we use the field gpath to store the path to the node matching the scope. Since there is at most one scope match within one fragment (ensured by the previous split operation), there is exactly one such path. In Fig. 3.5, this part of the key is depicted in the row between the intermediary fragments and the final fragments split according to //B: the first fragment, not containing any token below the scope //B, is not going to be grouped with any other fragment. The following three fragments all contain A1[ B1[ as gpath and will thus be presented to a single reducer task, which in turn assembles the fragments back together (pseudo-code is given in Listing 3.4). The output is a single key-value pair containing all tokens below the node B, as required.

GroupCompare: PKey keyA, PKey keyB → { <, =, > }
  IF (keyA.group AND keyB.group) THEN
    // group based on grouping-path
    RETURN LexicCompare(keyA.gpath, keyB.gpath)
  ELSE
    // don't group (returns < or > for two different fragments)
    RETURN SortCompare(keyA, keyB)

SortCompare: PKey keyA, PKey keyB → { <, =, > }
  // always lexicographically compare "leading path ⊕ start"
  RETURN LexicCompare(keyA.lp ⊕ keyA.start, keyB.lp ⊕ keyB.start)

Listing 3.5: Group and sort for Parallel strategy

Order of fragments. The IDs inside the TIDList of the leading path lp, together with the ID start in a fragment's key, can be used to order all fragments in document order. Since IDs are unique and increasing within one level of the XML data, the list of IDs on the path leading from the root node to any token in the document forms a global numbering scheme for each token whose lexicographical order corresponds to standard document order. Further, since each fragment contains the leading path to its first token and the ID start (a local ID smaller than the ID of the first token), the leading path's ID list extended by start can be used to globally order the fragments. See, for example, Fig. 3.5: in the intermediary row, the ID lists (0.5) < (1, 0.5) < (1, 1, 0.5) < (1, 2.5) < (1.5) < (2, 0.5) order the fragments from left to right. We use this ordering to sort the fragments such that they are presented in the correct order to the reduce functions. Listing 3.5 shows the definitions of the grouping and sorting comparators used in the Parallel strategy. Two keys that both have the group flag set are compared based on the lexicographical order of their gpath entries. Keys that do not have the group flag set are simply compared with SortCompare; this ensures that one of them is strictly before the other and that the returned order is consistent. The sorting comparator simply compares the IDs of the leading paths, extended by start, lexicographically.
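A small Python sketch of this global ordering and of the grouping decision (the key layout with fields lp, start, group, and gpath is assumed):

    def sort_key(key):
        # "leading path ⊕ start": the IDs along the leading path extended by start.
        # Lexicographic comparison of these lists yields document order of fragments.
        return [tid for tid, _label in key.lp] + [key.start]

    def same_group(key_a, key_b):
        # Fragments are merged by one reducer iff both carry the group flag
        # and share the same grouping path.
        return key_a.group and key_b.group and key_a.gpath == key_b.gpath

    # sorted(fragment_keys, key=sort_key) restores document order of the fragments.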

3.4.5

Summary of Strategies

Table 3.1 summarizes the main differences between the presented strategies Naive, XMLFS, and Parallel. Note that, while Naive has the simplest data structures, it splits and groups the data in a centralized manner. XMLFS parallelizes grouping via the file system but still has a centralized split phase. The Parallel strategy is fully parallel for both splitting and grouping, at the expense of more complex data structures and multiple reduce tasks.

                Naive                        XMLFS                              Parallel
Data            XML file                     File system representation         Key-value pairs
Split           Centralized                  Centralized                        Parallel
Group           Centralized by one reducer   Via file system + naming           Parallel by reducers
                                             (no shuffle, no reduce)
KeyStructure    One integer                  Leading path with IDs              Leading path with IDs and
                                                                                grouping information
ValueStructure  SAX elements                 SAX elements with XML IDs          SAX elements with XML IDs

Table 3.1: Main differences for compilation strategies

3.5

Experimental Evaluation

Our experimental evaluation of the different strategies presented above focuses on the following questions:

1. Can we achieve significant speed-ups over a serial execution?
2. How do our strategies scale with an increasing data load?
3. Are there significant differences between the strategies?

Execution Environment. We performed our experiments on a Linux cluster with 40 3GHz Dual-Core AMD Opteron nodes with 4GB of RAM each, connected via a 100MBit/s LAN. We installed Hadoop [Bor07] on the local disks (running Hadoop from the NFS home directory results in extremely long start-up times for mappers and reducers), which also serve as the storage space for hdfs. Having approximately 60GB of locally free disk storage per node provides us with 2.4TB of raw storage inside the Hadoop file system (hdfs). In our experiments, we use an hdfs replication factor of 3, as is typically used to tolerate node failures. The cluster runs the ROCKS [roc] software and is managed by Sun Grid Engine (SGE) [Gen01].

We created a common SGE parallel environment that reserves machines to be used as nodes in the Hadoop environment while performing our tests. We used 30 nodes running as "slaves", i.e., they run the MapReduce tasks as well as the hdfs storage daemons for the Hadoop file system. We use an additional node, plus a backup node, running the master processes for hdfs and for MapReduce, to which we submit jobs. We used Hadoop version 0.18.1 as available on the project web page. We configured Hadoop to launch mapper and reducer tasks with 1024MB of heap space (-Xmx1024m) and restricted the framework to 2 Map and 2 Reduce tasks per slave node. Our measurements are done using the UNIX time command to measure wall-clock times for the main Java program that submits the job to Hadoop and waits until it is finished. While our experiments were running, no other jobs were submitted to the cluster, so as not to interfere with our runtime measurements.

Handling of Data Tokens. We first implemented our strategies reading the XML data, including the images, into the Java JVM. Not surprisingly, the JVM ran out of memory in the split function of the Naive implementation as it tried to hold all data in memory. This happened for as few as #B = 50 and #C = 10: as each picture was around 2.3MB in size, the raw data alone already exceeds the 1024MB of heap space in the JVM. Although all our algorithms could be implemented in a streaming fashion (the required memory is of the order of the depth of the XML tree; output is successively returned as indicated by the EMIT keyword), we adopted a trick that is often used in practice: we place references in the form of file names into the XML data structure, while keeping the large binary data at a common storage location (inside hdfs). Whenever we place an image reference into the XML data, we obtain an unused filename from hdfs and store the image there. When an image is removed from the XML structure, we also remove it from hdfs. Not storing the image data physically inside the data tokens also has the advantage that only the data actually requested by a pipeline step is (lazily) shipped to it. Another consequence is that the data actually shipped from the mapper to the reducer tasks is small, which makes even our Naive strategy a viable option.


Number of Mappers and Reducers. As described in Section 3.2, a split method is used to group the input key-value pairs into so-called input splits. For each input split, one mapper is created, which processes all key-value pairs of this split. Execution times of MapReduce jobs are influenced by the number of mapper and reducer tasks. While many mappers are beneficial for load balancing, they also increase the overhead of the parallel computation, especially if the number of mappers significantly outnumbers the available slots on the cluster. A good choice is to use one mapper for each key-value pair if the work per pair is significantly higher than the task creation time. In contrast, if the work A per scope match is fast, then the number of slots, or a small multiple thereof, is a good choice. All output key-value pairs of the mappers are distributed to the available reducers according to a hash function on the key. Of course, keys that are to be reduced by the same reducer (as in Naive) should be mapped to the same hash value. Only our Parallel approach has more than one reducer; since the work for each group is rather small, we use 60 reducers in our experiments. The hash function we used is based on the GroupBy part of the PKey. In particular, for all fragments that have the group flag set, we compute a hash value h based on the IDs inside gpath: let l be the flattened list of all the digits (longs) inside the IDs of gpath; divide each element of l by 25, interpret l as a number N in base 100, and compute h = (N mod 2^63) mod R, where R is the number of available reduce tasks. For fragments without the group flag set, we simply return a random number to distribute these fragments uniformly over the reducers. (Hadoop does not support special handling for keys that will not be grouped with any other key; instead of shuffling such a fragment to a random reducer, the framework could just reduce the pair at the closest available reducer.) Our hash function resulted in an almost even distribution of all key-value pairs over the available reducers.
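A small Python sketch of this reducer-assignment scheme (constants as stated above; the key layout is assumed):

    import random

    def reducer_for(key, num_reducers):
        """Pick the reducer for a fragment, following the hashing scheme above."""
        if not key.group:
            # ungrouped fragments are spread uniformly at random over the reducers
            return random.randrange(num_reducers)
        digits = [d for tid, _label in key.gpath for d in tid]   # flatten the gpath IDs
        n = 0
        for d in digits:
            n = n * 100 + (d // 25)        # scaled digits interpreted in base 100
        return (n % 2**63) % num_reducers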

3.5.1

Comparison with Serial Execution

We used the image transformation pipeline (Fig. 3.3), which represents pipelines that perform intensive computations by invoking external applications over CData organized in a hierarchical manner. We varied the number #C of ⟨C⟩ collections inside each ⟨B⟩, i.e., the total number of ⟨C⟩-labeled collections in a particular input is #B · #C.

[Figure: relative speed-up over serial execution for the Naive, XMLFS, and Parallel strategies; panels (a) #C = 1, (b) #C = 5, (c) #C = 10.]

Figure 3.6: Serial versus MapReduce-based execution. Relative speed-ups with respect to serial execution of the image processing pipeline (Fig. 3.3). All three strategies outperform a serial execution. The achieved speed-ups for #C = 1 are only around 13x, whereas in the experiments with more data, speed-ups of more than 20x were achieved. #B was set to 200.

Execution times scaled linearly with increasing #B (from 1 to 200) for all three strategies. We also ran the pipeline serially on one host of the cluster. Fig. 3.6 shows the execution times for #B = 200 and #C ranging over 1, 5, and 10. All three strategies significantly outperform the serial execution. With #C = 10, the speed-up is more than twenty-fold. Thus, although the parallel execution with MapReduce has overhead for storing images in hdfs and copying data from host to host during execution, speed-ups are substantial if the individual steps are relatively compute-intensive in comparison to the data size being shipped. In our example, each image is about 2.3MB in size; blur executes on an input image in around 1.8 seconds, coloring an image once takes around 1 second, and the runtime of montage varies from around 1 second for one image to 13 seconds for combining 50 images (there are 5 differently colored images under each ⟨C⟩; with #C = 10, 50 images thus have to be "montaged"). We also experimented with the number of mappers. When creating one mapper for each fragment, we achieved the fastest and most consistent runtimes (shown in Fig. 3.7).

When fixing the number of mappers to 60, runtimes started to show high fluctuations due to so-called "stragglers", i.e., single mappers that run slowly and cause all others to wait for their termination. For this pipeline, all our approaches showed almost the same runtime behavior, with Naive performing slightly worse in all three cases. The reason for the similar runtimes is that the XML structure used to organize the data is rather small. Therefore, not much overhead is caused by splitting and grouping the XML structure, especially compared to the workload performed by each processing step.

3.5.2

Comparison of Strategies

To analyze the overhead introduced by splitting and grouping, we use the pipeline given in the introduction (Fig. 3.1). Since it does not invoke any expensive computations in its steps, the runtimes directly correspond to the overhead introduced by MapReduce in general and by our strategies in particular. In the input data, we always use 100 empty ⟨D⟩ collections as leaves, and vary #B and #C as in the previous example. The results are shown in Fig. 3.7. For small data sizes (#C = 1 and small #B), Naive and XMLFS are both faster than Parallel, and XMLFS outperforms Naive. This confirms our expectations: Naive uses fewer reducers than the Parallel approach (1 vs. 60); even though the 60 reducers are executed in parallel, there is some overhead involved in launching the tasks and waiting for their termination. Furthermore, the XMLFS approach has no reducers at all and is thus a mapper-only pipeline, and therefore very fast. We ran the pipeline with #C = 1 up to #B = 1000 to investigate the behavior with more data. From approximately #B = 300 to around 700, all three approaches had similar execution times. Starting from #B = 800, Naive and XMLFS perform worse than Parallel (380s and 350s versus 230s, respectively). Runtimes for #C = 10 are shown in Fig. 3.7(b). Here, Parallel outperforms Naive and XMLFS at around #B = 60 (with a total number of 60,000 ⟨D⟩ collections). This is very close to the 80,000 ⟨D⟩ collections at the "break-even" point for #C = 1. In Fig. 3.7(c) this trend continues: our fine-grained measurements for #B = 1 to 10 show that the "break-even" point is, again, around 70,000 ⟨D⟩ collections.

[Figure: three line plots of wall-clock runtime in seconds (Y-axis, 0-1000) versus #B (X-axis) for the naive, xmlfs, and parallel strategies; panels (a) #C = 1, (b) #C = 10, (c) #C = 100.]

Figure 3.7: Runtime comparison of the compilation strategies. Runtimes for executing the pipeline given in Fig. 3.1 are compared. On the X-axis, #B is varied; the Y-axis shows the wall-clock runtime of the pipeline. For small XML structures, Naive and XMLFS outperform Parallel since fewer tasks have to be executed. On the other hand, the larger the data, the better Parallel performs in comparison to the other two approaches.

The consistency of the "break-even" point numbers suggests that our Parallel strategy outperforms XMLFS and Naive once the number of fragments to be handled and regrouped from one task to the next is on the order of 100,000. In this experiment, we set the number of mappers to 60 for all steps, as the work for each fragment is small in comparison to task startup times. As above, we used 60 reducers for the Parallel strategy.

Experimental results. We confirmed that our strategies can decrease execution time for (relatively) compute-intensive pipelines. Our image-processing pipeline executed with a speed-up factor of 20. For moderately sized XML data, all three strategies work well, often with XMLFS outperforming the other two. However, as the data size increases, Parallel clearly outperforms the other two strategies due to its fully parallel splitting and grouping.

3.6

Related Work

Although the approaches presented here are focused on efficient parallelization techniques for executing XML-based processing pipelines, our work shares a number of similarities with other systems (e.g., [TSWR03, Dee05, ZHC+07, FPD+05]) for optimizing workflow execution. For example, the Askalon project [FPD+05] has a similar goal of automating aspects of parallel workflow execution so that users are not required to program low-level grid-based functions. To this end, Askalon provides a distributed execution engine, in which workflows can be described using an XML-based "Abstract Grid Workflow Language" (AGWL). Our approach, however, differs from Askalon (and similar efforts) in a number of ways. We adopt a more generic model of computation that supports the fine-grained modeling and processing of (input and intermediate) workflow data organized into XML structures. Our model of computation also supports and exploits processes that employ "update semantics" through the use of explicit XPath scope expressions.


This computation model has been shown to have advantages over traditional workflow modeling approaches [MBZL09], and a number of real-world workflows have been developed within the Kepler system using this approach (e.g., for phylogenetics and meta-genomics applications). Also unlike Askalon, we employ an existing and broadly used open-source distribution framework for MapReduce (i.e., Hadoop) [DG08] that supports task scheduling, data distribution, and check-pointing with restarts. This approach further inherits the scalability of the MapReduce framework, which was demonstrated, e.g., by solving the Tera-sort challenge, where Hadoop successfully scaled to close to 1000 nodes and Google's MapReduce to 4000 nodes on the Peta-sort benchmark. Our work also differs significantly from Askalon by providing novel approaches for exploiting data parallelism in workflows modeled as XML processing pipelines. Alternatively, Qin and Fahringer [QF07] introduce simple data collections (compared with nested XML structures) and collection-shipping constructs that can reduce unnecessary data communication (similar approaches are also described in [FLL09, Goo07, OGA+02]). Using special annotations for different loop constructs and activities, they compute matching iteration data sets for executing a function, and forward only the necessary data to each iteration instance. Within a data collection, each individual element can be addressed and shipped separately. This technique requires users to specify additional constraints during workflow creation, which can make workflow design significantly more complex. In Chapter 4, we address similar problems for XML processing pipelines; however, the necessary annotations in our approach can be automatically inferred from the workflow scope descriptions. We complement these approaches here by focusing on strategies for efficient and robust workflow execution through data parallelization, while leveraging the data and process distribution and replication provided by Hadoop. Thus, through our compilation strategies, we directly take advantage of the grouping and sorting capabilities of the MapReduce framework for data packaging and distribution. MapReduce is also employed in [FLL09] for executing scientific workflows. This approach extends map and reduce operations for modeling workflows, requiring users to design workflows explicitly using these constructs. In contrast, we provide a high-level workflow modeling language and automatically compile workflows to standard MapReduce operations.


Our work also has a number of similarities to the area of query processing over XML streams (e.g., see [KSSS04a, CCD+03, CDTW00, BBMS05, KSSS04b, GGM+04, CDZ06]). Most of these approaches consider optimizations for specific XML query languages or language fragments, sometimes taking into account additional aspects of streaming data. FluXQuery [KSSS04a], for instance, focuses on minimizing the memory consumption of XML stream processors. Our approach, in contrast, is focused on optimizing the execution of compute- and data-intensive "scientific" functions and on developing strategies for the parallel and distributed execution of corresponding pipelines of such components. DXQ [FJM+07] is an extension of XQuery to support distributed applications; similarly, in Distributed XQuery [RBHS04], remote-execution constructs are embedded within standard XQuery expressions. Both approaches are orthogonal to ours in that they focus on expressing the overall workflow in a distributed XQuery variant, whereas we focus on a dataflow paradigm with actor abstractions, along the lines of Kahn process networks [Kah74]. A different approach is taken in Active XML [ABC+03], where XML documents contain special nodes that represent calls to web services. This constitutes a different type of computation model, applied more directly to P2P settings, whereas our approach targets XML processing for scientific applications deployed within cluster environments. To the best of our knowledge, our approach is the first to consider applications of the MapReduce framework for efficiently executing XML processing pipelines. Most relevant to the approach presented here is the work around Google's MapReduce framework (see [DG08] or [Läm08]). Similar to MapReduce, our approach provides a framework in which user-defined functions (or external programs) can be applied to sets of data. In addition to map and reduce, however, our framework itself is aware of whole pipelines comprised of many such functions for data analysis. Furthermore, our framework provides a hierarchical data model and a declarative middle layer to configure the granularity at which user-defined functions are applied to the data. Questions such as "Should f be called on each A, or on each B (each of which in turn contains a list of As)?" can easily be answered using a configuration language, whereas this aspect is not considered by the MapReduce programming model.

3.7

Summary

In this chapter, we have presented novel approaches for exploiting data parallelism for the efficient execution of XML-based processing pipelines. We considered a general model of computation for scientific workflows that extends existing approaches by supporting fine-grained processing of data organized as XML structures. Unlike other approaches, our computation model also supports processes that employ "update semantics" [MBZL09]. In particular, each step in a workflow can specify (using XPath expressions) the fragments of the overall XML structure it takes as input. During workflow execution, the framework supplies these fragments to processes, receives the updated fragments, combines these updates with the overall structure, and forwards the result to downstream processes. To efficiently execute these workflows, we introduced and analyzed new strategies for exploiting data parallelism in processing pipelines, based on a compilation of workflows to the MapReduce framework [DG08]. While MapReduce has been shown to support efficient and robust parallel processing of large (relational) data sets [YDHP07], similar approaches had not been developed to leverage MapReduce for efficient XML-based data processing. The work presented here addresses these open issues by describing parallel approaches to efficiently split and partition XML-structured data sets, which are input to and produced by workflow steps. Similarly, we described mechanisms for dynamically merging partitions at any level of granularity while maximizing parallelism. Our parallel strategy allows for maximally decentralized splitting and grouping at any level of granularity: if there are more fragments than slots for parallel execution (i.e., hosts or cores), then any re-grouping is performed in parallel. This is achieved via specific key structures and MapReduce's sorting support. Furthermore, our framework also allows the data to be merged into a very small number of very large partitions. This is in contrast to existing approaches, in which the partitions are either computed centrally (which can lead to bottlenecks) or a fixed partitioning scheme is assumed.


Supporting a dynamic level of data partitioning is beneficial to the workflow tasks, as they are provided the data at exactly the granularity they requested via declarative scope expressions. Our experimental results verify the efficiency benefits of our parallel re-grouping in comparison to the more centralized approaches (Naive and XMLFS). By employing MapReduce, we also obtain a number of benefits "for free" over more traditional workflow optimization strategies, including fault tolerance, monitoring, logging, and recovery support. As future work, we intend to extend the Kepler Scientific Workflow System with support for our compilation strategies, as well as to combine the data-parallel approaches presented here with the pipeline-parallel and data-shipping optimizations presented in Chapter 4.

Chapter 4

Optimization II: Minimizing Data Shipping

Efficiency is intelligent laziness. (David Dunham)

As we have seen in Chapter 2, VDAL provides a number of advantages over traditional workflow modeling and design. However, there is an associated problem: the flexibility of the VDAL approach comes in part from its "ignore data that is out of scope" approach. When implemented directly, this can introduce significant overhead in a distributed environment, because all data is sent to each actor even if it is configured to ignore some or even most of it. This is, especially in data-intensive scientific applications, a major drawback of VDAL workflows. In this chapter, which is based on [ZBML09b], we show how this problem can be solved by the workflow system itself, without requiring the scientist to explicitly define the data routing as is the case in, for example, plain process networks. We can thus provide the high-level modeling features to the user while keeping the overhead at a minimum. We consider a special variant of VDAL workflows called ∆-XML. In particular, we use ordered trees as the basic data model.

Since this is the classic model for XML data, we can not only leverage a large body of existing work on XML, but our results are also widely applicable to other distributed XML processing systems.

Contributions. Using a type system based on XML schemas, we show how to perform a data dependency analysis to determine which parts of the input data are actually consumed by an actor. We then deploy additional components into the VDAL workflow (distributors and mergers) to eliminate unnecessary data transport between actors. The key idea is to dynamically partition the data stream, ship each part to the "right" place, and then reassemble the stream. We describe this process in detail and present an experimental evaluation that shows the effectiveness of this approach.

4.1

∆-XML: Virtual Assembly Lines over XML Streams

We adopt a simplified XML data model to represent nested data collections in Comad-style workflows. Here, an XML stream consists of a sequence of tokens of the usual form, i.e., opening tags "[t", closing tags "]t", and data nodes "#d". A well-formed XML stream corresponds to a labeled ordered tree. In general, we view an XML stream as a tree for typing, but as a sequence of tokens in XML process networks.

Definition 4.1 (Streams). Data streams s are given by the following grammar:

    s ::= ε | #d | [t s ]t | s s

where t is a label from the label set T.

From the perspective of ∆-XML, we assume data nodes to contain binary data whose specific representation is unknown to the framework but understood by the actors. Although element labels are typically attached directly to opening and closing delimiters, we use the more convenient notation t[. . .] when writing streams. Note that element attributes are not considered in the model, but they can be emulated as singleton subtrees.
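A small Python sketch of this token view of streams (the class names are ours, not part of the ∆-XML formalism):

    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class Open:              # opening tag "[t"
        label: str

    @dataclass
    class Close:             # closing tag "]t"
        label: str

    @dataclass
    class Data:              # data node "#d", opaque to the framework
        value: bytes

    Token = Union[Open, Close, Data]

    def well_formed(stream: List[Token]) -> bool:
        """True iff the token sequence corresponds to a labeled ordered tree (or forest)."""
        stack = []
        for tok in stream:
            if isinstance(tok, Open):
                stack.append(tok.label)
            elif isinstance(tok, Close):
                if not stack or stack.pop() != tok.label:
                    return False
        return not stack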


In the following, we describe specific languages to express ∆-XML data schemas and actor configurations, and we define type propagation procedures for ∆-XML pipelines based on them. In the next section, we use the type propagation approach described here to optimize shipping costs for distributed ∆-XML pipelines.

4.1.1

Types and Schemas

To describe the structure of ∆-XML data streams, we use XML regular expression types [HVP05, MLMK05]. These correspond to XML-Schema, but use a more compact notation. As is well known, DTDs can be expressed via XML-Schema; in fact, XML-Schemas are more expressive than DTDs since they can capture context-dependence [BNdB04, LC00].

Definition 4.2 (Type declaration). A type declaration (or production rule) has the form T ::= ⟨t⟩ R, where T is a type name, ⟨t⟩ is a label (or tag), and R is a regular expression over types. Regular expressions are defined as usual, i.e., using the symbols "," (sequence), "|" (alternation or union), "?" (optional), "*" (zero or more), "+" (one or more), and "ε" (empty word). When clear from the context, we omit "," in sequence expressions.

Definition 4.3 (Schema). We define a schema τ as a finite set of type declarations. Every schema τ implies a set of labels Lτ and types Tτ = Cτ ∪̇ Bτ. The complex types Cτ are those that occur on the left-hand side (lhs) of a declaration in τ, and the base types Bτ are those that only occur on the right-hand side (rhs) of a declaration in τ. By convention, we use the type name Z to denote the base type representing any data item #d. For the purposes of type propagation and optimization, we place the following restrictions on schemas τ:


1. τ has a single, distinguished type S (i.e., the start or root type), such that S does not occur on the rhs of a declaration in τ. Thus, schemas can be represented as trees (see Fig. 4.1).

2. Each complex type of τ is defined in at most one type declaration.

3. The type declarations of τ are non-recursive.

4. τ is non-ambiguous [MLMK05], i.e., each stream s that is an instance of τ has a unique mapping (i.e., interpretation; see below) from s to τ. This very common restriction is also known as a deterministic content model; it is, for example, required for element types by the W3C recommendation for XML [BPSM+08].

Restrictions (1) and (2) simplify notation. Restriction (3) is an assumption made by the X-CSR algorithm presented here; a generalization to recursive schemas is, however, possible by computing shipping destinations at run time. Restriction (4) is crucial and necessary for the definition of signatures and schema-based operations on streams. By dropping (4), we would allow signatures for which there exist data streams with ambiguous mappings to the signature, and thus the question of whether an actor reads a particular data item would no longer be well defined. This restriction is common in the XML community.

Definition 4.4 (Reachability, down-closed and up-closed). We define reachability on types in a schema τ as follows: the type B is directly reachable from A, denoted A ⇒τ B, iff B occurs in the rhs of the declaration for A; as usual, we define ⇒*τ as the transitive and reflexive closure of ⇒τ. Let T be a set of types in schema τ. We say that T is down-closed iff T is closed under ⇒*τ. We define the down-closed extension of T (denoted T↓τ, or simply T↓ when the context is clear) to be the smallest down-closed set T′ that contains T. Similarly, we define up-closed and the up-closed extension (denoted T↑τ) for the inverse of the relation ⇒τ.


Definition 4.5 (Roots, independence). Using reachability, we can define the roots T∧ of a set of types T in τ (i.e., the "top-most" types of the set) as the smallest set with (T∧)↓ = T↓. Similarly, we say that a set of types T is independent for a schema τ if it is not possible to reach a type T2 ∈ T from another type T1 ∈ T, i.e., it is not the case that T1 ⇒*τ T2 for T1 ≠ T2.

Definition 4.6 (Interpretation). An interpretation I of a stream s against a schema τ is a mapping from each node n in s to a type T such that:

1. I(n) = S if n is the root node of s (where S is the start type of τ), and

2. for each node n and its child nodes n1, n2, . . . , nm, there exists a type declaration X ::= ⟨a⟩ R with: (i) I(n) = X such that n has the tag ⟨a⟩; and (ii) I(n1), I(n2), . . . , I(nm) ∈ ⟦R⟧, where ⟦R⟧ is the set of strings over type names denoted by the regular expression R.

Definition 4.7 (Instance). A stream s is an instance of the schema τ (denoted s ∈ ⟦τ⟧) iff there exists an interpretation of s against τ.

Definition 4.8 (Subtype). A schema τ1 is a subtype of a schema τ2, denoted τ1 ≺ τ2, iff ⟦τ1⟧ ⊆ ⟦τ2⟧.

As discussed in [HVP05], the problem of determining whether one regular expression type is a subtype of another is decidable, but EXPTIME-complete; however, highly optimized implementations seem to work well in practice.

Example 4.9 (Types, Closures, and Roots). Consider the schema τ in Fig. 4.1. Types are shown graphically such that S is the root type of τ, and each downward-pointing arrow denotes a type declaration (with tags given on the edges). Moreover, as shown in Fig. 4.1, the set {B, C, Z4} is down-closed, but not up-closed, because S ⇒ B and S is not a member of the set. {B, C} has the single root B. Also, {D}↑ = {S, A, D} and {D}↓ = {D, F, G, Z1, Z2}. An instance s of τ is also given, such that an interpretation I maps each node of s to the type with the corresponding node label.


[Figure: the schema tree of τ, with type declarations
    S ::= ⟨s⟩ (A | B)*      A ::= ⟨a⟩ (D+ | E*)     B ::= ⟨b⟩ C
    D ::= ⟨d⟩ F* G          E ::= ⟨e⟩ H*            C ::= ⟨c⟩ Z4
    F ::= ⟨f⟩ Z1            G ::= ⟨g⟩ Z2            H ::= ⟨h⟩ Z3
together with Tτ = {S, A, . . . , Z}, Lτ = {⟨s⟩, ⟨a⟩, . . . , ⟨h⟩},
{D}↑ = {S, A, D}, {D}↓ = {D, F, G, Z1, Z2}, {B, C}∧ = {B},
and the instance s ∈ ⟦τ⟧:  s = s[ b[c] a[d[ffg] d[gg]] a[e[hh] e[hhh]] b[c] b[c] ].]

Figure 4.1: A simple schema τ along with: the types and labels of τ; example up-closed, down-closed, and root sets; and an instance s. Base values of type Z are omitted in s.
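To make the closure operations of Definitions 4.4 and 4.5 concrete, here is a small Python sketch over the schema of Fig. 4.1; the representation (a dict from type name to tag and content expression) and helper names are our own, and the content expressions are kept as plain strings since only the type names occurring in them matter for reachability:

    import re

    # Schema of Fig. 4.1: type name -> (tag, content regular expression over types)
    tau = {
        "S": ("s", "(A | B)*"),  "A": ("a", "D+ | E*"),  "B": ("b", "C"),
        "D": ("d", "F* G"),      "E": ("e", "H*"),       "C": ("c", "Z4"),
        "F": ("f", "Z1"),        "G": ("g", "Z2"),       "H": ("h", "Z3"),
    }

    def children(schema, t):
        """Types directly reachable from t (Definition 4.4); base types have no declaration."""
        if t not in schema:
            return set()
        _tag, rhs = schema[t]
        return set(re.findall(r"[A-Z]\w*", rhs))

    def down_closure(schema, types):
        result, todo = set(types), list(types)
        while todo:
            for c in children(schema, todo.pop()):
                if c not in result:
                    result.add(c)
                    todo.append(c)
        return result

    def up_closure(schema, types):
        return {t for t in set(schema) | set(types)
                if down_closure(schema, {t}) & set(types)}

    def roots(schema, types):
        """Smallest subset of 'types' with the same down-closure (Definition 4.5)."""
        return {t for t in types
                if not any(t in down_closure(schema, {u}) for u in types if u != t)}

    # down_closure(tau, {"D"}) == {"D", "F", "G", "Z1", "Z2"}
    # up_closure(tau, {"D"})   == {"S", "A", "D"}
    # roots(tau, {"B", "C"})   == {"B"}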

4.1.2

Actor Configurations

In ∆-XML process networks, actor configurations describe where an actor may modify its input, how the actor's output is structured, and where it is put back into the stream.

Definition 4.10 (Actor Configuration). A configuration ∆A = ⟨σ, τα, τω⟩ consists of a specialized actor signature σ for selecting and replacing relevant subtrees of an input stream, an input selection schema τα (A's read-scope), and an output replacement schema τω (A's write-scope).

In general, an actor can be given different configurations in different ∆-XML pipelines. The schema τα describes how a subtree must be structured to be in scope for an actor. The signature σ determines which parts of the in-scope input are to be replaced by new data. The schema of the collection data produced by the actor is given by τω. Next, we define actor signatures, which are used to describe how actors modify the XML stream.


Definition 4.11 (Signature, match rules, match and replacement types). A signature is a set of match rules. A match rule has the form X → R, where X is a type of τα and R is a regular expression over types of τω. We call X the match type, and each type in R a replacement type. We require the match types of a configuration to be independent (see Def. 4.5), thus avoiding ambiguous or nested matches. Additionally, no match type is used in two different rules of σ, i.e., for each match type there is exactly one replacement.

Intuitively, a match rule says that for any fragment of type X in the input stream, the actor A will put in place of X an output of type R. Unlike data stream schemas, configuration schemas are allowed to contain multiple root types, which provides greater flexibility when configuring actors. For example, a common root type is not required for types X1 and X2 (similarly Y1 and Y2) used in different match rules X1 → Y1 and X2 → Y2. Match types can be additionally constrained within τα as follows: if a match type X occurs on the lhs of a τα type declaration, the declaration constrains X's content model; whereas if X occurs on the rhs, the declaration constrains X's "upper" context. Similar upper-context constraints are not allowed in τω: we require all lhs types of τω to be reachable from a replacement type Y in σ such that Y is not a type of τα. We also require that the result of applying σ to τα be a non-recursive, non-ambiguous schema (and similarly for propagated schemas τ′ = ∆A(τ)). Once workflows are configured, we can detect cases that violate this constraint and reject such designs.

Example 4.12 (Actor Configurations). Consider an actor A1 that produces a thumbnail from an image, and an input type

    τ : {S ::= ⟨s⟩ G*, G ::= ⟨g⟩ Z}


representing a set of images of type G. To replace each image in the given set with the corresponding thumbnail, we use the configuration ∆A1 = ⟨σ, τα, τω⟩ with

    σ : {G′ → T},   τα : {G′ ::= ⟨g⟩ Z},   τω : {T ::= ⟨t⟩ Z}.

The type that results from applying ∆A1 to τ is thus

    τ′ : {S ::= ⟨s⟩ T*, T ::= ⟨t⟩ Z}.

Now consider an actor A2 that takes an image and produces a collection containing both the thumbnail and the original image. We can use A2 to replace each image in a stream of type τ with a thumbnail collection using the configuration ∆A2 = ⟨σ, τα, τω⟩ with

    σ : {G′ → C},   τα : {G′ ::= ⟨g⟩ Z},   τω : {C ::= ⟨c⟩ G′ T, T ::= ⟨t⟩ Z}.

The type that results from applying ∆A2 to τ is

    τ′ : {S ::= ⟨s⟩ C*, C ::= ⟨c⟩ G T, G ::= ⟨g⟩ Z, T ::= ⟨t⟩ Z}.

Finally, given an input schema τ with intermediate levels of nesting

    τ : {S ::= ⟨s⟩ X* Y*, X ::= ⟨x⟩ G*, Y ::= ⟨y⟩ G*, G ::= ⟨g⟩ Z},

we can configure A2 to work only on the images under X (and not on those under Y) by simply adding the declaration X′ ::= ⟨x⟩ G′* to τα in ∆A2.
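A minimal Python sketch of the thumbnail example, using the dict-based schema representation from the sketch after Fig. 4.1; it identifies configuration types with input types of the same name, ignores the context constraints of τα, and is only meant to illustrate how match types are replaced (it is not the full propagation procedure of Section 4.1.3):

    import re

    tau       = {"S": ("s", "G*"), "G": ("g", "Z")}
    sigma     = {"G": "T"}                 # match rule G' -> T from Example 4.12
    tau_omega = {"T": ("t", "Z")}          # replacement declaration

    def apply_config(schema, sigma, tau_omega):
        out = {}
        for name, (tag, rhs) in schema.items():
            for match, repl in sigma.items():
                rhs = re.sub(rf"\b{match}\b", repl, rhs)   # replace match type occurrences
            out[name] = (tag, rhs)
        out.update(tau_omega)                              # add replacement declarations
        used = {t for _tag, rhs in out.values() for t in re.findall(r"[A-Z]\w*", rhs)}
        return {n: d for n, d in out.items() if n == "S" or n in used}

    # apply_config(tau, sigma, tau_omega) == {"S": ("s", "T*"), "T": ("t", "Z")}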

4.1.3

Type Propagation

Given an input schema τ and an actor configuration ∆A = ⟨σ, τα, τω⟩, we can infer the schema τ′ = ∆A(τ) from ∆A and τ as follows. Without loss of generality, we assume that complex type names are disjoint between τ and τα ∪ τω. Let Mσ = M(σ) be the set of match types of σ. We define Mτ as the types of τ that correspond to the match types Mσ, where T ∈ Mτ iff there exists a T′ ∈ Mσ such that T and T′ have the same element tag. Let T0 = (Mτ↓)↑ be the set of potentially relevant types, which includes the types of Mτ and (intuitively) the types in τ that are "above" and "below" them. Here T0 defines the initial context and corresponding content model of the match types of σ mapped to τ. The actual context (the relevant types) Tα = πα(T0) is obtained from a so-called relevance filter πα. We use πα : Tτ → Tτ here specifically to remove, or filter out, the types of T0 that do not satisfy the context and content-model constraints of τα. We can define πα as follows. Notice that from T0 we can obtain a set of type paths P of the form

    X1/X2/ . . . /Xn    (n ≥ 1)

where X1 = S, Xi is a parent type of Xi+1 in τ (i.e., Xi ⇒τ Xi+1), and there is an Xj = T ∈ Mτ. Thus each path of P starts at the root type of τ and passes through a corresponding match type from σ. Informally, a path is removed from P if: (1) the types above T (wrt. τ) along the path do not satisfy the context constraints of T in τα; or (2) the types below T are either not mentioned in, or do not satisfy, the content-model constraints of T in τα. (Match types T with multiple root types in τα can be satisfied from any one of the corresponding root types, i.e., the set of root types of T can be viewed as a union of constraints.) Similar to [HVP05], both tests for determining whether a given path satisfies the constraints of τα for T can be performed by checking inclusion between tree automata [CDG+97]. Thus Tα = πα(T0) consists of the set of nodes along the unfiltered paths P′ ⊆ P.

∆-XML Type Propagation. The output type τ′ = ∆A(τ) can now be inferred from τ and ∆A as follows. Let P′ be the set of unfiltered paths as described above.


Further, let P ∈ P′ be a path that passes through a match type T ∈ Mτ such that T′ ∈ Mσ is the corresponding match type in σ, and T′ → R is the replacement rule for T′. We construct τ′ from τ for all such P by replacing the particular occurrence of T (according to P) by R, and then adding the associated replacement type declarations of τω. In the following section, we use the equivalent formulation τ′ = ∆A(Tα) to denote type propagation, where Tα refers only to the rooted fragments of τ that are relevant for an actor configuration ∆A, as opposed to the entire schema τ. Given the above type propagation procedure for ∆-XML, type propagation through a ∆-XML pipeline is straightforward. In particular, we sequentially infer types τi+1 = ∆Ai(τi) for 1 ≤ i ≤ n, where the output of actor Ai is sent directly to actor Ai+1, τ = τ1 is the input schema of the pipeline, and τ′ = τn+1 is the output schema of the pipeline.
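Pipeline-level propagation is then a simple left fold of the per-actor propagation functions. A tiny sketch, reusing the dict-based schema representation introduced after Fig. 4.1 (the function names are ours):

    def match_types_in_tau(schema, sigma_match_tags):
        """M_tau: the types of tau whose element tag equals the tag of a match type of sigma."""
        return {t for t, (tag, _rhs) in schema.items() if tag in sigma_match_tags}

    def propagate_pipeline(tau, deltas):
        """Sequential type propagation tau_{i+1} = Delta_{A_i}(tau_i); each element of
        'deltas' is a function implementing one actor's Delta_A."""
        schemas = [tau]
        for delta in deltas:
            schemas.append(delta(schemas[-1]))
        return schemas   # schemas[0]: pipeline input schema, schemas[-1]: output schema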

4.2

Optimizing ∆-XML Pipelines

After a ∆-XML pipeline has been built and configured, each actor automatically receives the data that matches its read scope τα, whereas data outside of the read scope is automatically routed around the actor. Note, however, that all data, regardless of whether it is within an actor's scope or not, is still shipped to each actor. Here we describe the X-CSR dataflow optimization, which exploits the schema and signature information provided by ∆-XML pipelines to route data directly to the relevant "downstream" actors. Intuitively, we distribute the workflow's input data, as well as the data generated by actors, to the first actor in the pipeline that has it in scope. Here the distribution problem arises:

Problem 4.13 (Distribution). Given the data to be distributed, what is the destination for each collection token and data element?

At the input of downstream actors (and at the very end of the workflow), data might arrive from various upstream locations in the workflow.


Since we want to keep the optimized workflow equivalent to the unoptimized one, we need to guarantee that each actor receives the same data in its scope with and without optimization. It is therefore important to carefully order the data arriving from various upstream locations when merging it again:

Problem 4.14 (Merge). How can multiple streams be merged such that the original order of the data and collections is restored?

We solve the distribution problem by analyzing which parts of the input are used first, and where in the workflow, leveraging the signature information of the actors and the workflow's input schema. We solve the merge problem by putting additional information into the streams when they are split up by the distributors. On the "main stream" going serially from actor to actor, we put "hole" markers at the positions where we cut out parts of the data that are sent further down the stream (on "bypassing lanes"). By grouping bypassed token sequences using filler markers, we can then pair holes and fillers to merge the streams in an order-preserving way.

4.2.1 Cost Model

For our routing optimization, we assume that base data is significantly larger than the opening and closing tokens used to provide the context. This assumption holds especially in data-driven scientific workflows [PLK07], which typically deal with complex and large data sets. For simplicity of presentation, we also assume that all actors are allocated on different hosts and that data can be sent between arbitrary hosts. We strive to minimize the total amount of data that is shipped during the execution of a scientific workflow (modeled as a ∆-XML pipeline).
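Stated as a formula (our own notation, not taken from the original), the objective of the routing optimization is to minimize the total number of base-data bytes placed on the wire, ignoring the comparatively small collection tokens:

\[
\mathrm{cost}(W) \;=\; \sum_{c \,\in\, \mathrm{channels}(W)} \;\; \sum_{d \,\in\, \mathrm{baseData}(c)} \mathrm{sizeof}(d),
\]

where baseData(c) denotes the multiset of base data items shipped over channel c during the workflow run.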

4.2.2 X-CSR: XML Cut, Ship, Reassemble

In ∆-XML, by definition, actor results are independent of all data fragments that lie outside the actor's read scope. Therefore, we can alter the incoming stream of an actor as long as we guarantee that all data within its read scope is presented to the actor as before.


Example. Consider a stream with the schema {S ::= ⟨s⟩ (A | B | C)∗, A ::= ⟨a⟩ Z, B ::= ⟨b⟩ Z, C ::= ⟨c⟩ Z} and a workflow consisting of three actors A1, A2, and A3 that consume A, B, and C while producing A′, B′, and C′, respectively. We introduce a stream splitter, or distributor, in front of the three actors. The distributor has three output channels, each of which leads directly to one of the actors. After all three actors have "done their job", there are three separate streams, each containing the output of a single actor. A stream merger is then used to merge these streams together to form a single output. Of course, we expect the output stream of the optimized workflow to be equivalent to that of the unoptimized one. It is therefore essential that we keep track of the order of the events—especially when splitting the stream. We therefore insert markers, or holes (denoted ◦), into the main stream³ whenever we decide to send irrelevant data onto the "fastlane". When we later merge the bypassed data fragments back into the stream, we only need to fit those fragments into their original positions, denoted by the corresponding holes.⁴ We further "group" fragments on the bypassing lane by adding filler tags (denoted •) to match single holes within a possibly longer sequence of bypassed elements.
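The following fragment illustrates, for this three-actor example, what the distributor's output might look like at the token level. The textual token encoding ("hole", "filler[...]") is purely hypothetical and meant only to visualize the idea, not the actual wire format used by the implementation.

#include <string>
#include <vector>

int main() {
  using Stream = std::vector<std::string>;
  // Input stream:  <s> <a> z1 </a> <b> z2 </b> <c> z3 </c> </s>   (z1..z3 are large data items)
  Stream toA1 = {"<s>", "<a>", "z1", "</a>", "hole", "hole", "</s>"};  // main stream: B- and C-data cut out
  Stream toA2 = {"filler[", "<b>", "z2", "</b>", "]"};                 // bypassed fragment for A2
  Stream toA3 = {"filler[", "<c>", "z3", "</c>", "]"};                 // bypassed fragment for A3
  // The merger at the end pairs holes with filler groups in order, restoring
  // the original positions of the bypassed fragments in the output stream.
  (void)toA1; (void)toA2; (void)toA3;
  return 0;
}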

General Case. In the general setting, we deploy distributors after each actor to route its output to the closest downstream actor that has the data in its scope. Similarly, we deploy mergers in front of every actor, as it might receive data from various upstream locations. This general pattern is shown in Fig. 4.2.⁵ Consider the input type τ of the ∆-XML pipeline in Fig. 4.2, together with the read scope of actor A1: only F is in the match type of its signature σ1.

Footnote 3: The "main stream" is the stream on the original assembly-line channels, i.e., from one actor to the next.
Footnote 4: This approach is similar to promise pipelining, a technique that greatly reduces latency in distributed messaging settings.
Footnote 5: There is no merger in front of the first actor and no distributor after the last one, as there is obviously no data to be merged, and only one final destination to send data to.

4.2. Optimizing ∆-XML Pipelines

108

[Figure 4.2 (diagram): (a) the original (conceptual) pipeline with input schema τ, inferred intermediate schemas, and actor signatures σA1 = {F → B E G}, σA2 = {G → (E | B)}, σA3 = {E → X}; (b) the X-CSR optimized pipeline with distributors D0–D2 and mergers M2–M4 inserted around A1–A3; (c) the schema partitioning performed by the distributors D0, D1, and D2.]

Figure 4.2: X-CSR (“X-scissor ”) in action: (a) conceptual user design (unoptimized) with actor signatures σAi (part of configurations ∆Ai ), initial input schema τ and inferred intermediate schemas (dash-boxed schema trees, above channels) and final inferred schema (after A3 ). (b) optimized (system-generated) design: The sub-network M2 ; A2 ; D2 (M2 and D2 reside on the same host as A2 ) shows the general pattern: A2 receives, via the merger M2 , all parts relevant for its read-scope, then performs its localized operation. The distributor D2 “cuts away” parts that are irrelevant to the subsequent actor A3 and ships them further downstream, but not before leaving marked “holes” behind where the cutting was done. This allows downstream mergers to pair the cut-out, bypassed chunks (which were going on the “fastlane”) back with the holes from which the chunks were originally taken; (c) distributors D0 , D1 , and D2 “cut” the schemas on the wire as indicated.


The relevant type path leading to F is S/A/D/F; we therefore send the "F-data" and its context to actor A1. Now consider the second actor A2: its match type is G. As there is a G in τ (right next to F), this G is relevant for the second actor. Note that, because actor A1 is not allowed to change parts of the stream that are not in its scope (such as G), it is safe to send G "on the fastlane" directly to the front of actor A2, where it will be merged back into the main stream. Next, consider actor A3, which operates on E. Since both A1 and A2 ignore the "E-data" (including the list of H-data beneath it), we can safely ship this portion of the stream to the third actor. We have now determined shipping destinations for all types except B and C. As they are not "picked up" by any actor in the workflow, we bypass them to the very end. In summary, we imagine the input schema τ to be cut into pieces as shown at the bottom of Fig. 4.2. The immediately following actor receives the data inside its read scope together with its corresponding context. All other downstream actors then take turns cutting out their portion of the stream, possibly together with some of the remaining context, i.e., context that has not yet been shipped to a preceding actor. The distributor then partitions the stream according to the partitioning of the schema. We use a partition of types d◦i,i+1, d•i,i+2, ..., d•i,i+j ⊆ Tτ of a schema τ to describe the action of a distributor Di. While d◦i,i+1 contains the main context (up to the type S) and holes wherever data is cut out, the d•i,j contain bypassed data grouped under •-labels.
Labeling Holes. To be able to attach bypassed parts back into the main context, which is sent on the main line, the distributors put hole markers (◦) into the stream; the bypassed fragments are grouped using filler tags (•). To match up the fillings with the holes later on, holes need to be indexed: if holes were not distinguishable, merger M2 would not know whether an encountered hole corresponds to data sent on d•02 (and should thus be filled), or to data on channel d•03 (and should therefore be ignored because merger M3 will fill it).

 1  INPUT:  τ, ∆1, ..., ∆n;  ∆i = ⟨σi, ταi, τωi⟩             -- input schema; actor configurations
 2  OUTPUT: d◦i−1,i  for i = 1, ..., n                        -- distribution queries
 3          d•ij     for i + 1 < j;  j = 1, ..., n + 1
 4  CODE:
 5  Mσi := M(σi),  i = 1, ..., n                              -- actor match types
 6
 7  τ1 := τ                                                   -- intermediate schemas
 8  FOR i := 1 TO n DO
 9      d◦i−1,i := παi((Mσi ↓τi) ↑τi)                         -- ship what is asked for
10      R := d◦i−1,i
11      FOR j := i + 1 TO n DO
12          d•i−1,j := παj((Mσj ↓τi) ↑τi) \ R
13          R := R ∪ d•i−1,j                                  -- already assigned variables
14      ENDFOR
15      d•i−1,n+1 := Tτi \ R
16      τi+1 := ∆i( d◦i−1,i( mi( τi ) ) )                     -- type propagation
17  ENDFOR
18  RETURN: all d◦ij and d•ij, with labeling according to channel number

Listing 4.1: X-CSR algorithm for statically computing distributor specifications

However, marking a hole only with the index of the merger that is supposed to fill it is not sufficient: consider merger M3 when it receives a hole marked "to be filled by M3"; it cannot decide from which bypassing channel it is supposed to take the filling. Since the order within the main and bypassing channels is never changed, however, it is not necessary to number the holes and fillings uniquely; the pair of "source distributor" and "destination merger" provides enough information for each merger to unambiguously augment the stream with the formerly bypassed data.

4.2.3 Distributor and Merger Specifications

As illustrated above, mergers are not very complex. A merger Mi with one main-line input and fastlane inputs 1, ..., n sequentially reads the main-line stream. This "actor" ignores all tokens in the stream except holes labeled ◦jk with k = i. When such a hole ◦ji is read, the merger reads a new filling from channel j and inserts the data within the filler markers back into the main stream.


Distributors, on the other hand, need to be configured, i.e., the correct partitioning d◦i,i+1, d•i,i+2, ..., d•i,i+j of the set of types needs to be inferred. This can be done in the spirit of the example above; the general algorithm is given in Listing 4.1. Starting with the input type τ of the workflow (line 7), the match types are complemented by all types and data below them (down-close operation in line 9), and the types up to the root symbol S are added (up-close operation). Then, the relevant parts are selected via the παi operator as described in the previous section. This set of types "denotes" the operation on the main line. We accumulate the types to which we have already assigned a destination in the set R (line 13). We then loop over all downstream actors to find the "left-over" data that is still relevant for them. If some types are not relevant for any actor in the workflow, they are added to the last bypassing channel, which merges at the very end (like B and C in Fig. 4.2). Once one distributor is fully specified, the current type is propagated through the "hole-making" operation and through the merger, and the resulting type is then propagated through the next actor (line 16). All following distributors are then configured by performing the same steps.
From Schema-level to Instance-level. At runtime, the distributor continuously maps incoming tokens to type symbols in its schema.⁶ It then sends the data to the correct destination based on the partition of its schema, adding holes and fillers as appropriate.
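The merger's runtime loop can be sketched as follows under the hole/filler convention described above; it is written over complete token sequences for brevity, whereas the real merger works incrementally on streams, and Token and the channel containers are illustrative placeholders rather than the actual PPN types.

#include <deque>
#include <map>
#include <vector>

// Illustrative token model: a hole records the distributor that cut the data
// out (src) and the merger that is responsible for filling it (dst).
struct Token {
  enum Kind { Open, Close, Data, Hole, FillerOpen, FillerClose } kind;
  int src = -1;
  int dst = -1;
};

// Merger M_i: reads the main line sequentially; holes addressed to other
// mergers are forwarded untouched, holes addressed to M_i are replaced by the
// next filler group from the bypass channel of the corresponding distributor.
std::vector<Token> merge(int i, std::deque<Token> mainLine,
                         std::map<int, std::deque<Token>>& bypass) {
  std::vector<Token> out;
  while (!mainLine.empty()) {
    Token t = mainLine.front(); mainLine.pop_front();
    if (t.kind != Token::Hole || t.dst != i) { out.push_back(t); continue; }
    std::deque<Token>& ch = bypass[t.src];           // channel from distributor t.src
    ch.pop_front();                                  // consume the FillerOpen marker
    while (!ch.empty() && ch.front().kind != Token::FillerClose) {
      out.push_back(ch.front()); ch.pop_front();     // splice the bypassed tokens in place
    }
    if (!ch.empty()) ch.pop_front();                 // consume the FillerClose marker
  }
  return out;
}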

4.3 Implementation and Evaluation

In this section, we describe the implementation and experimental evaluation of the X-CSR optimization. The cost savings enabled by X-CSR are in part based on the following observation: assuming actors perform in-place substitution, and given a basic cost model that considers only the cost of shipping base data, X-CSR yields optimal shipping costs:

Footnote 6: Recall that this mapping exists because our schemas are unambiguous; to increase "stream-ability", we can further demand that the mapping can be computed as the tokens come in.


Proposition 4.15 (Shipping Optimality). Every base data element is sent directly from its originating actor Ai on host Hi to its destination actor Ak on host Hk, for i ≤ k, without being sent to an intermediate actor Aj on host Hj for which it is irrelevant.
To see that X-CSR is shipping optimal, notice that as soon as a data token is produced by an actor (or provided to the first actor of a pipeline), X-CSR finds the closest actor downstream that has the data item in scope. The data is then sent directly to this actor without passing through intermediate actors (as would happen in the unoptimized case). Because the data item must be received by this actor to guarantee equivalence with the unoptimized version of the pipeline, this shipping is indeed necessary, and thus optimal. We also show in the evaluation below that the overhead introduced by X-CSR is minimal. In an unoptimized workflow, shipping data ⟨d⟩ from actor Ai to actor Ai+n, the closest downstream actor that has ⟨d⟩ in scope, involves shipping sizeof(⟨d⟩) · n bytes. The optimized version sends the data directly to Ai+n and thus ships only sizeof(⟨d⟩) bytes, resulting in savings of sizeof(⟨d⟩) · (n − 1) bytes. It can be shown that the saved shipping cost is linear in the number of bypassed actors as well as in the size of the total base data involved in the shipping optimization. Thus, in X-CSR, the more data is shipped, the bigger the savings.

4.3.1 Experimental Setup

We have implemented a distributed stream-processing system based on the Kahn Process Network [Kah74] model. We use PVM (Parallel Virtual Machine) [PVMa] for process creation and message passing between actors. Each actor is implemented as its own process and runs on a different host. Opening and closing tags (including holes and fillers) are sent using PVM messages, whereas (large) data tokens are kept as files on the local filesystems and are sent between hosts using remote secure copy (scp). Keeping large data files on disk is common in scientific workflows; e.g., actors often invoke external programs whose input and output are provided and generated as files. This setting fits our assumption that data is generally expensive to ship compared to collection delimiters.


To evaluate the approach, we deployed the system on a 40-node Linux cluster with 2.5 GHz dual-Opteron processors. Nodes are connected to each other by a 100 MBit/s switched LAN. Each actor is started on one of the cluster nodes using a round-robin assignment of actors to hosts.
Example Workflow. The following example is used to analyze and explain the benefits of the X-CSR optimization. Consider a 3-actor workflow A1 → A2 → A3 with replacement rules ∆A1: σ = {A → B | U}, ∆A2: σ = {B → C | V}, and ∆A3: σ = {C → W}. The corresponding actor match types τα and replacement types τω contain type declarations of the form X ::= ⟨x⟩ Z for each type X in the replacement rules. For instance, actor A1 works on input labeled with ⟨a⟩; it outputs data tagged with ⟨b⟩ or ⟨u⟩, depending on the actual data it reads. The property on which A1 'chooses' its output is not observed by the type system: A1 could perform an expensive analysis of its input data and, depending on the quality of the outcome, its output might need further refinement by A2, or no further refinement by A2 or A3 might be needed. Analogously, A2 could either produce a final result V or pass its result to the third actor. In our experiments, we consider data tokens 5MB in size. We also conducted experiments with varying data sizes of 1, 10, 20, and 100MB, where execution times scaled linearly across data sizes. In addition, we assume each actor Ai immediately outputs its result without any extensive computation or delay.

4.3.2 Experimental Results

Based on the actor configurations, several dataflow scenarios are possible. In Table 4.1, we present three cases, called parallel, serial, and mixed, to study the savings in data shipping as well as in overall execution time.

Scenario       Actor signatures                                 Input data                        Data shipped (MB)             Exec. time (sec)
(a) Parallel   A1: ⟨a⟩ ↦ ⟨u⟩, A2: ⟨b⟩ ↦ ⟨v⟩, A3: ⟨c⟩ ↦ ⟨w⟩      s[ (a[z] b[z] c[z] w[z]) ∗ i ]    orig. 80i, opt. 35i (−56%)    orig. ≈3.6i, opt. ≈1.1i (−69%)
(b) Serial     A1: ⟨a⟩ ↦ ⟨b⟩, A2: ⟨b⟩ ↦ ⟨c⟩, A3: ⟨c⟩ ↦ ⟨w⟩      s[ a[z] ∗ i ]                     orig. 80i, opt. 80i (0%)      orig. ≈3.6i, opt. ≈2.6i (−28%)
(c) Mixed      A1: ⟨a⟩ ↦ ⟨b⟩, A2: ⟨b⟩ ↦ ⟨v⟩, A3: ⟨c⟩ ↦ ⟨w⟩      s[ (a[z] ∗ i) (c[z] ∗ i) ]        orig. 80i, opt. 50i (−38%)    orig. ≈3.6i, opt. ≈2.2i (−39%)

Table 4.1: X-CSR optimized vs. standard: reduction in data shipping and execution times (the "actual dataflow" diagrams of the original table are omitted).

Fig. 4.3 shows wall-clock execution times for these cases. We varied the amount of input data to the workflow on the x-axis and executed each workflow five times for each input.⁷ Crosses and pluses represent individual runs (optimized and original workflow, respectively); the curves connect the averaged times to factor out noise and to show the overall trend.
(a) "Parallel" actors. Consider an input s such that A1 and A2 always output U and V data, respectively. In this case, the three actors work independently of each other. The input data has the structure s[(a[z] b[z] c[z] w[z]) ∗ i], i.e., a stream with i repetitions of a[z] b[z] c[z] w[z], in which each z stands for a data item of size 5MB. We varied i from 1 to 20. In the unoptimized pipeline, 4i data tokens are sent between the three actors, the source, and the sink. Hence, a total of 4i · 4 · 5MB = 80i MB is sent between hosts. In the X-CSR optimized pipeline, the first distributor D0 sends each data item directly to the actor that "picks it up", or to the end of the stream if no actor has the data item in scope (w[z] here). Each actor's output is likewise sent directly to the end of the stream. This decreases the total amount of shipped data from 80i MB to 35i MB, a reduction of 56% (see Table 4.1(a)). Runtime measurements are shown in Fig. 4.3: execution times scale linearly with the amount of data streamed through. On average, the unoptimized workflow took 3.6i seconds, whereas the optimized version took 1.1i seconds to complete.
Footnote 7: For some configurations, we ran the workflow more often, obtaining the same results with equally low variance.
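One way to account for the 35i MB figure (a back-of-the-envelope sketch, assuming D0 runs on the source host and that each actor output, having no downstream consumer, is shipped exactly once to the final merger M4):

\[
\underbrace{4i \cdot 5\,\mathrm{MB}}_{a,\,b,\,c,\,w \text{ shipped once by } D_0}
\;+\;
\underbrace{3i \cdot 5\,\mathrm{MB}}_{\text{actor outputs shipped once to } M_4}
\;=\; 35i\,\mathrm{MB},
\]

compared to \(4i \cdot 4 \cdot 5\,\mathrm{MB} = 80i\,\mathrm{MB}\) in the unoptimized pipeline.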


[Figure 4.3: three plots, (a) Parallel, (b) Serial, (c) Mixed; runtime in seconds vs. number of data items (0–20), showing individual runs and averages for the original and the X-CSR optimized workflows.]

Figure 4.3: X-CSR experiments standard versus optimized. Execution times with and without X-CSR optimization for increasing number of data items in the different scenarios.


This is a reduction of 69% relative to the original time. The system experiences a larger speed-up than the saved amount of data alone would suggest: by using distributors and mergers, the expensive data transfer is moved away from the actors themselves, allowing additional concurrency. We will observe a speed-up due to this effect in all other cases as well.
(b) "Serial" actors. Now consider the other extreme case, in which A1 and A2 always output ⟨b⟩- and ⟨c⟩-tagged data, respectively. If only ⟨a⟩ data is provided as input to the workflow, no data item can be bypassed. The dataflow structure of the optimized workflow does not differ from the original workflow structure (Table 4.1(b)). Since the same data has to be shipped in both versions, we would expect their execution times to be very similar. In our experiments, however, the optimized version outperformed the original one by 28%. We attribute this additional speed-up to the increased concurrency gained by decoupling sending and receiving from the actors' execution through the introduced distributors and mergers.
(c) "Mixed parallel and serial" actors. Let A1 and A2 always output ⟨b⟩- and ⟨v⟩-labeled data, respectively. If we provide the workflow with only ⟨a⟩ data (for the first actor) and ⟨c⟩ data (for the third), the dataflow shown in Table 4.1(c) arises. The savings in data shipping as well as in execution time lie, as expected, between our two extreme cases "parallel" and "serial".
Dynamic routing. In practice, processing pipelines can often involve combinations of the cases given in (a)-(c). That is, a single run of a pipeline can involve different types of routing and levels of parallelism. Because our distributors are implemented using a hierarchical state machine to parse incoming data streams, the correct routing decision is made dynamically, at runtime. Having all of the possible actor dependencies within one workflow demonstrates the generality of the X-CSR approach: while it would be possible to explicitly model a task-parallel network as depicted in Table 4.1(a), this model would not be able to accommodate the case that A1 produces output for A2. On the other hand, modeling the workflow as in (b) results in expensive, unnecessary shipping when data is not serially flowing from one actor to the next.


Leveraging X-CSR, the workflow can be conveniently modeled as a linear pipeline while the data is dynamically routed by the framework itself, ensuring shipping optimality for large data items (Proposition 4.15).
Overhead. To investigate the overhead introduced by the additional actors (distributors and mergers), as well as by the additional tokens (holes and fillers) sent, we ran the workflow on the same input structure but without data items. The execution times without data are very small in general. In fact, the time spent sending tokens through the workflow was 0.5 seconds⁸ for both the optimized and the original workflow. However, the time for starting and connecting the actors on the different hosts increased from 0.2 to 0.4 seconds from the original to the optimized version. Hence, the total execution time (running + setup) increased from 0.7 to 0.9 seconds for the given workflow when run without the actual data. We believe that this initial delay is a tolerable overhead, and we do not expect the shipping of additional tokens, distributors, and mergers to slow down the workflow significantly, considering that in real workflows significant execution time is spent in shipping and computation.

Comparison to a "central database" approach. The parallel example would also perform reasonably well in a more traditional setting where the data is kept in a central repository and only references are shipped through the stream. Each actor would fetch the relevant data, process it, and then push it back to the central server. However, since we would need to ship 30i MB⁹ from and to the actors, this would take at least 2.4i seconds¹⁰ for a server connected at 100 MBit/s (as in our cluster) — more than twice the time it takes in the X-CSR optimized version. The situation is even worse in the serial case, where 6 · 4 · i shipments of 5MB chunks are involved,¹¹ resulting in a lower bound of 9.6i seconds — more than 3 times slower than our X-CSR approach.

Footnote 8: Filtering out some larger execution times that were caused by noise on the cluster.
Footnote 9: Or even 5i MB more if we assume that source and target are on different nodes, as in our scenarios.
Footnote 10: 30i MB · 8 bits/byte divided by 100 MBit/s.
Footnote 11: Each of the actors fetches and puts the data.
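The two lower bounds quoted above follow from the link speed of 100 MBit/s = 12.5 MB/s; the sketch below reads Footnotes 10 and 11 as counting one fetch and one push per actor for each of the 4i data items in the serial case, which is our interpretation of the original figures:

\[
\frac{30i\,\mathrm{MB} \cdot 8\,\mathrm{bit/byte}}{100\,\mathrm{MBit/s}} = 2.4i\,\mathrm{s},
\qquad
\frac{6 \cdot 4i \cdot 5\,\mathrm{MB} \cdot 8\,\mathrm{bit/byte}}{100\,\mathrm{MBit/s}} = 9.6i\,\mathrm{s}.
\]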


Read-only access of actors. The default mode of computation in ∆-XML assumes that actors perform in-place substitution, i.e., in general, matched fragments are replaced with new data. If, however, an actor only adds new data to the input stream—keeping its matched input data intact—the presented ∆-XML type system is not aware of this. Our framework can easily be extended to handle such add-only actors by allowing configurations that declare which matched types are to be replaced and which are to be left in place by an actor. An extended X-CSR algorithm could then take this additional information into account and ship read-only data to multiple destinations in parallel to increase concurrency.

4.4 Related Work

Closely related to the shipping optimization presented here is work on query processing over XML streams; see [LMP02, CCD+03, CDTW00, KSSS04b, GGM+04, CDZ06] among many others. However, most of these approaches consider optimizations for specific XML query languages or language fragments, sometimes taking into account additional aspects of the streaming data (e.g., sliding windows). They do not, however, evaluate their approaches against the specific desiderata given in Section 1.3.4. They also do not focus on incorporating existing scientific functions into the framework, and the XML documents dealt with in these approaches usually do not contain large chunks of leaf-node data—which is very common in scientific applications. Consequently, they do not address the implications that come with this assumption. To the best of our knowledge, there exists, for example, no work that exploits a regular-expression-based type system to analyze dataflow and subsequently optimize data shipping for distributed XML process networks. Since VDAL workflows are usually executed as streaming collections flowing through data processors (the actors), previous work on stream processing, and in particular on streaming XML data, becomes relevant. Work in the general area of stream processing (see [BBD+02, CBB+03] for an overview) is concerned with the following aspects:

Blocking/unblocking. A significant amount of work focuses on operators for stream processing.


As streaming datasets can only be read once (due to their possibly infinite size), the focus is often on "unblocking" traditionally blocking operators. The punctuations work [TMSF03] is related to ours, since the holes we use in the X-CSR optimization can be seen as special punctuations that inform stream processors about properties of the stream, namely that data arriving on another stream has to be read to fill in the hole.
Bounding memory of stream-processing elements. Since input streams can potentially be infinite, it is necessary to restrict the memory usage of the stream processor. Here, a large body of work on using automata to parse and process XML streams exists; see [Sch07] for a survey.
Closely related to the X-CSR shipping optimization is the work presented in [CLL06]. There, Chirkova et al. consider the problem of minimizing the data that is sent as answers to related relational queries in a distributed system. For a set of conjunctive queries, they try to find a minimal set of views that is sufficient to answer the queries, where a minimal view set is one that takes the least number of bytes to store. We are not aware of any work in the field of XML processing that tries to minimize the size of the data shipped between stream processors.

4.5 Summary

In this chapter, we showed how to utilize type-level information about actors to optimize data transport in scientific workflows. We presented a formalism to represent schema information about the data sent between actors. We also defined actor signatures to characterize which parts of the input are used by the actors to produce outputs. By performing a data-dependency analysis, we were able to insert distributors into the pipeline that ship base data only to actors that will read, modify, delete, or replace this data. Using labeled hole and filler items in the data stream allowed us to merge the forked data streams back into one single output. We showed the optimality of our approach for large base data sizes with respect to the information captured by the actor signatures. Our experimental analysis, performed on a cluster, showed the effectiveness of the overall approach.


Chapter 5

Implementation: Light-weight Parallel PN Engine

Science is what we understand well enough to explain to a computer. Art is everything else we do. — Donald Knuth

In this chapter, we describe the implementation of our light-weight, parallel process network engine (PPN). This engine has been used to perform the experiments described in Chapter 4. We first describe the design decisions made, then give an overview of the system architecture, and present experiments demonstrating system performance. The PPN engine has been coupled to the Kepler system via a specific Kepler director, as an effort to support transparent parallel execution within the Kepler system. We describe this effort in the last part of this chapter.¹

5.1 General Design Decisions

Our work here is intended as an experimentation platform for the execution of scientific workflows that are computationally intensive, data-intensive, or both. We provide a clean process-network engine with Actor and Port abstractions.

Footnote 1: The PPN engine and the Kepler coupling have been presented in [ZLL09].


We also put a strong emphasis on being able to write actors in different programming languages. Currently, actors can be written in C++, Perl, Python, and Java, and as shell scripts. To obtain a scalable base system, workflow execution is completely decentralized: a central component is only needed to orchestrate (set up, monitor, and terminate) workflow execution. As implementation language, we chose C++, as it is easy to link C++ libraries to other languages such as Perl, Java, or Python via SWIG [SWI]. PPN is implemented on top of PVM, a portable software library for message passing that provides abstractions for hosts, tasks, and messages between tasks. We used PVM++ [PVMb] to interface with PVM, as well as the Boost library [Boo] for interacting with the filesystem and for parsing command-line options. All network and messaging access is done through PVM++, which makes it easy to exchange PVM for another library, for example an MPI implementation. In PPN, each actor is implemented as its own process. We provide a base class Actor from which all components in a workflow inherit (see Listing 5.1). For logging and debugging purposes, each actor has an actor name and an instance name. The method initialize() is called when the actor is created, before ports are connected; go() is called repeatedly while the workflow is running and go() itself returns true. When the execution is done, the method cleanup() is invoked to perform user-defined cleanup work. For sending data, we created template port classes that can be instantiated with primitive types such as int, char, or std::string. The port classes encapsulate sending and receiving completely (see Listing 5.2). From the actor's point of view, the operators << and >> are used to send data through and receive data from a port. We also implemented a custom BLOB data type that represents data residing in the actor's temporary directory. The BLOB class provides methods to obtain a filename for this data and to create new BLOB tokens from existing files. When BLOB data is sent through ports, it is sent via scp to the host on which the receiving actor is located. The workflow system can be configured not to copy files if both actors are on the same host; instead, a hard link is created to the old location.²
Footnote 2: A more portable reference-counting mechanism is also conceivable.

class ActorImpl;

class Actor {
public:
  Actor(const std::string & instanceName = "");
  virtual ~Actor();

  virtual void initialize();
  virtual bool go();
  virtual void cleanup();

  std::string getInstanceName();
  virtual std::string getActorName() { return "Actor"; }

  static void sleep(unsigned long microseconds);
  static void system(const std::string &cmd);

protected:
  ActorImpl *myImpl;
};

Listing 5.1: Actor class declaration

template <class T>
class InPort : public InPortB {
public:
  InPort(const char *name);
  InPort & operator>>(T &data);
  void read(T &data);
  T readR() { T d; read(d); return d; }
  ~InPort();

protected:
  InPortImpl *myImpl;
  virtual void reset();
};

template <class T>
class OutPort : public OutPortB {
public:
  OutPort(const char *name);
  OutPort & operator<<(T &data);
  void write(T &data);
  void writeV(T data) { write(data); }
  ~OutPort();

protected:
  OutPortImpl *myImpl;
  virtual void reset();
};

Listing 5.2: Port class declarations
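As a usage illustration of these interfaces, the following hypothetical actor forwards string tokens from an input port to an output port. It is a sketch against the declarations of Listings 5.1 and 5.2 only; whether ports are registered purely via the constructor's name argument, as assumed here, may differ in the actual PPN code.

#include <string>

// Assumes the Actor, InPort and OutPort declarations of Listings 5.1 and 5.2.
class StringPipe : public Actor {
public:
  StringPipe(const std::string &instanceName)
      : Actor(instanceName), in("in"), out("out") {}

  virtual std::string getActorName() { return "StringPipe"; }

  // go() is invoked repeatedly while it returns true; the read blocks until a
  // token is available (PN semantics), then the token is forwarded downstream.
  virtual bool go() {
    std::string token;
    in >> token;
    out << token;
    return true;
  }

private:
  InPort<std::string> in;
  OutPort<std::string> out;
};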

5.1.1 Workflow Setup

When all actors are implemented, they can be started and their ports can be connected to each other. To connect ports, we need to communicate with each actor's port manager to initialize the member variables storing the connection's endpoint. Starting actors and connecting ports can be done using the methods of a workflow manager.

#!/usr/bin/perl -w
use strict;
use PWFS;

my $sender   = wfs::start("TokenSender_p");
my $receiver = wfs::start("TokenReceiver_p");

PortConnect($sender, "out", $receiver, "in");

wfs::RPC_simple_initializeDone($receiver);
wfs::RPC_simple_initializeDone($sender);

exit 0;

Listing 5.3: Sample PPN workflow setup script

The workflow manager is comparable to Kepler's director; however, the actors do not depend on the workflow manager. In case the workflow manager fails, the actors will happily continue to work. In fact, it is even possible to have multiple managers giving the actors instructions. Also, since the workflow managers are not necessary during workflow run-time, these tasks are often terminated after the workflow has been set up. Usually, workflow setup comprises three phases: first, all actors are started; then all ports are connected with each other; and finally, a specific message is sent to all actors to switch from the initialize state to calling the method go(). A sample workflow setup script is shown in Listing 5.3. We benchmarked the runtime of several workflows to get a better understanding of our system. We wanted to investigate message sending and workflow start-up times on one machine and on a cluster. For these experiments, we used the process-spawning facilities provided by PVM and started our actors round-robin on the available hosts. The first experiment, shown in Fig. 5.1, investigates workflow startup time and how fast non-BLOB data tokens are sent through the workflow. As experimental setup, we used a workflow that has a source, n pipe actors, and a sink; we varied n from 0 to 100, creating workflows with up to 102 actors. As data, we created a collection s[a[] * 100] containing 100 empty a collections.


In Fig. 5.1, the results of our experiments are shown. The graphs show the time at which each phase of the execution finished; i.e., the time for connecting all ports is the difference between the finish time of the previous step (startup) and the finish time of portConnect. Fig. 5.1(a) shows measurements for running the experiment on a single host. As expected, the overall execution time increases with increasing workflow length. Starting all the actors and finally running the workflow take the longest time, while connecting ports is fairly quick, and sending the final notification to start workflow execution is too fast to show up explicitly in the graph. Comparing this figure with Fig. 5.1(b), where we used 35 hosts to execute the workflow, shows that even when no actual work is performed by the actors, using multiple hosts already reduces the overall execution time. In particular, it is interesting that starting 100 actors remotely is significantly faster than starting all of them on the same host, i.e., process creation time (probably due to the necessary memory allocations, etc.) outweighs the increased time needed for communicating over the network (to start a task remotely as opposed to on the same machine). Also, sending empty collections in parallel between hosts outperforms sending them on a single host (with a dual-core processor). In Fig. 5.2, we ran the same workflows but sent binary data through them. In both settings, local execution and cluster execution, the data is copied when being sent from one actor to another. Here, too, we see that the slowdown introduced by sending the data over the network is outweighed by the speed-up due to parallel execution of the workflow. Note that during the execution of the workflow with 40 actors, 41 times 200MB of data is moved. On the single machine, the overall data throughput is 190 MBit/s³; in the cluster, it is 1000 MBit/s due to parallelism. Note that when sending data from one cluster node to another on the command line, we achieved a maximum speed of 260 MBit/s for a file of size 1GB; sending a file of 20MB (as in the workflow experiment) was possible at a speed of 110 MBit/s. In other words, even with small file sizes of 20MB, parallel execution reduced the overall workflow execution time to around 25%. We anticipate even better speed-ups for workflows that perform CPU-intensive computations.

Footnote 3: 200MB · 8 bits/byte / 340 sec.
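The 190 MBit/s figure is the aggregate throughput over all 41 channels; assuming the roughly 340-second single-host runtime used in the footnote (our reading of how the number was obtained):

\[
\frac{41 \cdot 200\,\mathrm{MB} \cdot 8\,\mathrm{bit/byte}}{340\,\mathrm{s}} \approx 193\,\mathrm{MBit/s}.
\]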


[Figure 5.1: two plots, (a) runtimes on a single host and (b) runtimes on a cluster with 35 hosts; runtime in seconds vs. number of pipe actors, broken down into the phases startup, portConnect, unleash, and run.]

Figure 5.1: Analysis of Workflow execution times with varying number of actors (only collection-tokens: input ‘s[ a[] * 100]’)

[Figure 5.2: two plots, (a) runtimes on a single host and (b) runtimes on a cluster with 35 hosts; runtime in seconds vs. number of pipe actors, same phase breakdown as in Fig. 5.1.]

Figure 5.2: Analysis of Workflow execution times with varying number of actors (with data: input ‘s[d @ 10MB * 20]’)

5.1.2 Workflow Run

During a workflow run, no central component is involved in the execution. Actors read from channels through the port interface and block in case no data is available. After performing their computation, actors write out data through the port interface. The programmer only provides scientific functionality that runs locally (or, in future versions, at some remote place, when the scientific workflow tool integrates remote services as well), and the workflow tool takes care of appropriate parallelization and efficient deployment to a target platform.

S ::= (A | B)* ;
A ::= (D . D*) | E* ;
B ::= C* ;
D ::= G . F* ;
E ::= H* ;
F ::= PCDATA ;
G ::= PCDATA

Listing 5.4: Sample schema declaration

'sigma:  F -> X . Y*
'talpha: F ::= PCDATA
'tomega: X ::= PCDATA;
         Y ::= PCDATA

Listing 5.5: Sample signature declaration

5.2 VDAL-specific Functionality

We designed textual representations for production rules, schemas, and actor signatures as defined in Chapter 4. We also designed a textual representation for describing dummy input data. Using the Perl module collection Parse::Eyapp, we automatically parse these descriptions into abstract syntax trees, which we use for further processing. Sample schemas, signatures, and dummy data descriptions are shown in Listing 5.4, Listing 5.5, and Listing 5.6, respectively. In particular, for schemas we check whether the parsed regular expressions in the content models of the type rules are one-unambiguous, using the algorithm described in [BKW98] (we use a slightly corrected version). Of course, we check one-unambiguity not with respect to the type variables but with respect to the labels used in the respective rules. In the type rule X ::= ⟨a⟩ B.C.D, for example, we consider the type rules for B, C, and D when checking one-unambiguity:


if B ::= ⟨b⟩Z, C ::= ⟨c⟩Z, and D ::= ⟨d⟩Z, the rule is suitable, whereas it would not be if, for example, B ::= ⟨b⟩Z and C ::= ⟨b⟩Z. Based on our type rules being one-unambiguous, we create DFAs for each content model and integrate them into a hierarchical state machine with a stack for parsing the XML stream content as it comes in (a minimal sketch of this stack-based matching follows Listing 5.6). We implemented the X-CSR algorithm, which, for a set of schemas and signatures, generates a hierarchical DFA for the distributors that outputs the destination actor for each data item processed by the distributor. Utilizing this information, we send holes and fillers to the appropriate channels as needed.

s[ b[ C @ 1MB ] * 3 . a[ d[ g[ PCDATA @ 10KB ]]] . a[ e[ H @ 20MB * 3 ]]]

Listing 5.6: Sample description of synthetic data
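A minimal sketch of this stack-based matching is given below; it is our own illustration, not the code generated by the PPN tooling, and the transition-table representation is a toy stand-in.

#include <map>
#include <stack>
#include <string>
#include <utility>

struct Matcher {
  // (current state, child label) -> (next state in the parent's content model,
  //                                  start state of the child's content-model DFA)
  std::map<std::pair<int, std::string>, std::pair<int, int>> delta;
  std::stack<int> states;   // one entry per currently open collection; seeded
                            // with the start state of the root content model

  void open(const std::string& label) {
    auto next = delta.at({states.top(), label});   // a real matcher would report schema violations
    states.top() = next.first;                     // advance in the enclosing content model
    states.push(next.second);                      // descend into the child's content model
  }
  void close() { states.pop(); }                   // return to the enclosing content model
};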

5.3 Kepler as a PPN GUI

In this section, we describe the coupling of PPN to the Kepler system.⁶ Our Kepler extension, in the form of a Kepler director named PWS (for parallel workflow system), allows PPN actors to be represented as actors in Kepler. Users are thus able to use the intuitive Kepler GUI (an adaptation of Vergil, the Ptolemy II GUI [Pto06]) to build and execute PPN workflows that can run locally as well as in a cluster environment, utilizing parallel resources. Kepler is loosely coupled to PPN; in particular, the data processed on the cluster is not routed through the Kepler system. Instead, Kepler is used only to set up, launch, and monitor cluster PN workflows. However, by exploiting the communication channel from Kepler to PWS, Kepler actors and PWS actors coexisting in the same workflow can exchange data with each other through String ports.


Figure 5.3: General Architecture of Kepler-PPN Coupling

5.3.1 Architecture

In Fig. 5.4, the general architecture of the Kepler coupling is shown. Inside the Kepler workflow system, stub actors are used to represent PPN actors. Information about the available PPN actors and their details, such as name, ports, and parameters, is provided to Kepler via an XML file, which could easily be generated automatically from a repository of PPN actors. A sample configuration, describing the ConvertResize actor, which wraps the convert command-line tool, is shown in Listing 5.7. Thus, the scientist uses PPN actors in Kepler just as if they were local Kepler actors: placing them on the canvas via drag'n'drop and then connecting ports and configuring parameter settings. Once the workflow is started from the Kepler system, the Kepler director establishes a connection to a PPN workflow manager. It then advises the PPN workflow manager to start the appropriate PPN actors with the parameters configured in Kepler. In the next phase, the Kepler director connects the ports of the PPN actors and then unleashes the actors. From then on, the PPN workflow runs without any dependency on Kepler. In particular, data from one PPN actor to the next is not channeled through Kepler, which would create a centralized bottleneck.

Footnote 6: This section is based on joint work with Xuan Li [ZLL09].

[Listing 5.7 (abridged; the XML markup is not reproduced here): the actor description for ConvertResize declares an input port ("Input image. Type: PWS-based File-BLOB"), an output port ("Output image. Type: PWS-based File-BLOB"), and a size parameter with default value "320x200" ("Specify the new size of the image here. (convert syntax)").]

Listing 5.7: Kepler configuration file of existing PPN actors

In contrast, details about workflow execution and data transport are handled by the PPN engine. This architecture clearly separates the workflow model from low-level decisions about workflow execution. The user can also specify hints for the parallel execution, such as which actors should be co-located on the same host of the cluster. In future work, these specifications could be made more elaborate.

5.3.2 PPN Monitoring Support in Kepler

To provide the user with feedback about the PPN execution, Kepler periodically contacts the PPN actors, inquiring about their current status and about how many tokens have already flowed through their ports. During a workflow execution, a PPN actor can be in one of four states:
(1) Waiting for input data: when an actor reads from an input port that has no tokens queued, the actor blocks according to PN semantics. We use a yellow traffic light to symbolize this state (see the StringPipe actors in Fig. 5.5).


(2) Working: the data has been read and is currently being processed. Note that in this state, the actor can also call external programs, as is done in the command-line actor that wraps standard UNIX programs. We use a green traffic light here (see the Mono actor in Fig. 5.6).
(3) Waiting for downstream actors: when the FIFO buffer capacity for sent messages has been reached, an actor also blocks on writing. We use a red light here (see the Movie2Frames actor in Fig. 5.6).

(4) Waiting for termination: in this category are source actors that have output all their data, and downstream actors that are blocked on a read but cannot receive further data because the upstream actor is waiting for termination and the FIFO channel has been drained. Here, we let all three lights of the traffic light shine green.
The number of sent and received messages is also shown next to the actor symbol in Kepler (see Fig. 5.5). Notice that because Kepler polls this information, the workflow run itself does not wait for the central Kepler component.

5.3.3 Communication with Kepler Actors

The loose coupling of Kepler and PPN moves the authority over data tokens away from the Kepler system into PPN. While this design decision is beneficial for scalability, it causes problems if normal Kepler actors are used alongside PPN actors in a single workflow. Since PPN actors in Kepler are only stubs, data sent to them from regular Kepler actors needs to be transported to the PPN system; similarly, data sent from PPN to a normal Kepler actor needs to be passed from the PPN environment to the Kepler system. We implemented this mechanism of Kepler-PPN actor communication via special actors on the PPN side: if a PWS actor A is connected to a Kepler actor B on the canvas, we create an additional PPN actor PK alongside the PPN counterpart of A in the PPN system. This PPN actor PK thus acts as a proxy for the Kepler actor in the PPN system.


[Figure 5.4 (diagram): Kepler (K) connected to the PPN director; the PVM cluster nodes run the pvm daemon and exchange BLOB data via scp.]

Figure 5.4: Kepler-PPN Coupling

Figure 5.5: Demonstrating communication between Kepler and PPN. In this workflow, a Kepler actor creates the string "hello world", which is sent to a StringPipe actor inside the PPN system, on to another PPN StringPipe actor, and back to Kepler to be displayed in a standard Text Display. Note the yellow traffic-light icons at the PWS actors, which symbolize the "waiting for input data" state, and the monitors counting how many tokens have been sent through the PPN ports on the cluster.


Figure 5.6: Kepler-PPN coupling in action. On top, the PWS workflow in Kepler is shown. Below, the XPVM GUI shows the different processes (bars) with color-coded status (green = working, white = waiting, yellow = waiting for network) and their communication messages (red lines).

5.3.4 Demonstration: Movie Conversion Workflow

Fig. 5.6 shows a workflow that converts a color movie into a grey-scale movie. The FileReader reads the movie file from disk; the Fork sends one copy to Movie2Frames and one to another Fork. On the first branch, the movie is converted into images, each image is turned into a grey-scale image, and a movie is then assembled again. To this movie, the sound from the original movie is added, and the result is played by an Mplayer actor and saved to a directory. This workflow can be run locally or on a cluster, which speeds up execution.

5.4 Summary and Related Work

In this chapter, we described our light-weight process network engine PPN. We performed experiments showing improvements in execution time when running on a cluster. We furthermore described how PPN workflows can be created, run, and monitored using the Kepler workflow system. We imagine our system being used for future experimentation, and we think that its architecture can be used as a model for future couplings of specialized execution engines to a general workflow system such as Kepler. In [YB05], an overview of building and executing workflows on grid infrastructures is given. A pipeline engine for the grid environment is presented in [BKC+01]; the system employs a more restricted data model than VDAL and, in particular, does not support hierarchical data and processing at multiple levels. Several other workflow systems for deployment on the Grid have been proposed, for example Gridant [FMJG+05, AvLH+04], as well as interactive parallel programming environments in general, as in [vL96]. To the best of our knowledge, however, none of these systems deploys a hierarchical data model as in our case.


Chapter 6

Workflow Analysis I: Supporting Workflow Design

It takes less time to do a thing right, than it does to explain why you did it wrong. — Henry Wadsworth Longfellow

In this chapter¹, we illustrate how a scientific workflow system based on the VDAL approach can support scientists during the workflow design phase. We propose system features that will increase the scientist's productivity by providing valuable information and feedback about the workflow under construction. To this end, we first describe design use cases, e.g., displaying actor dependencies or detecting so-called starving actors. After briefly reviewing the general VDAL design and its components, we then show how a VDAL workflow can be compiled to existing XML languages, and we demonstrate how static analysis can be used to address some of the design use cases. Supporting the remaining use cases is part of future work.

Footnote 1: This chapter is based on [ZBL10].

6.1 Design Use Cases

Predicting output schema from configured workflow with input schema. Given a configured workflow with input schema τi, the workflow system should be able to predict the output schema of the workflow. That is, an output schema τo (as tight as possible) should be constructed such that for all inputs i ∈ τi for the workflow, the workflow output o is of schema τo.

Detecting starving actors. Given a configured workflow with input schema τi, find all actors A such that for all inputs i ∈ τi the scientific function of A will never be executed when running the workflow on i. This case can occur if the input provided to the workflow does not contain all the data necessary for the scientific function to execute. It is also possible that there is some configuration mismatch, caused for example by a spelling error; in both cases, providing this information to the workflow designer can help catch design errors early on.
Displaying actor dependencies. In contrast to conventional dataflow networks, actors in virtual assembly lines are not tightly coupled to each other via explicit data channels; instead, the configurations determine how black boxes are applied to which parts of the structure-rich XML stream. While this enables black boxes to be flexibly applied to data in the input stream without additional wiring, it also hides actor dependencies. Consider for instance an actor A2 that immediately follows an actor A1 in a VDAL workflow. Unlike in a conventional dataflow network, A2 does not necessarily depend on the data produced by A1, for example if A2's scope does not select any of the data that A1 has produced. While constructing and modifying scientific workflows, however, it would be helpful to know about the dependencies between actors (just as in conventional dataflow networks). Removing an actor Ai from a workflow, for example, might be completely fine if Ai's work is not essential for subsequent actors; a workflow might work just as well without some optional refinement steps.

Detecting inconsistent workflows. Given a configured workflow, is there an input schema such that all actors are non-starving? In other words, the workflow system should warn the user if, for all possible input schemas, there would exist starving actors. An example of an inconsistent workflow is one with two actors: the first is configured to replace all A-data by B-data, whereas the second is supposed to replace all A-data by C-data. Clearly, for any possible input, the second actor will never find any A-data in its input stream, as all of it has already been replaced by B-data.
Generating canonical input schema and instance. Given a configured workflow that is not inconsistent, generate an input schema τ0 for which no actor in the workflow is starving. Furthermore, find a smallest such input schema (for some complexity measure on schemas). Then, generate an input instance i ∈ τ0 by choosing canonical values for the base types in τ0. This could not only help to "dry-run" the workflow, but also inform the scientist which input data the workflow expects with its current configurations.
Displaying schema-level provenance graph. Generate a graphical representation of how the scientific functions are applied to which data items and how new data is generated. The graph should contain abstractions for the base data items and the collection structure. Each collection node or base-data node should be connected via "read"/"write" edges to nodes representing scientific functions if the scientific function will read or modify the corresponding data. Newly created data items or collections should be linked via "created by" edges to the scientific function that created them. Data that is duplicated or moved to a different location in the tree should be linked to the data in the old schema via "relocated" edges. This graph is not only useful for tracking data provenance, but can also serve as a description of what the scientific workflow will compute. By having both graphical representations, the workflow graph and the schema-level provenance graph, the workflow designer will be able to quickly spot errors that would be hard (or even impossible) to detect based on the workflow graph alone.


Figure 6.1: Components and dataflow inside a VDAL actor.

Suggesting actor configurations. Given a partially configured workflow, i.e., a workflow in which not all actors are configured, suggest missing actor configurations such that the workflow will be well-formed. In cases where only a few configurations would make the workflow well-formed, automatically suggesting them could significantly speed up the workflow designer's work.

6.2 Well-formed Workflows

Before we make our design-related notions more concrete, we will briefly review virtual assembly lines in general, and the individual components of VDAL-actors.

6.2.1 Review: Virtual Assembly Lines

In virtual assembly lines, actors operate on XML input data (see Chapter 2). As summarized in Fig. 6.1, a VDAL actor comprises a scientific function (here A, B, or C), a scope expression σ, an input assembler configuration γ, and a write expression ω. The scope defines which part of the input an actor can read and modify. In particular, given an input i, the actor components interact in the following way: the scope σ selects a list of non-overlapping trees di from the input i (see Fig. 6.1 for an illustration).

              σ(i) ≠ [ ]          σ(i) = [ ]
  ∃i ∈ τ      reading             sometimes blind
  ∀i ∈ τ      always reading      blind

Table 6.1: Actor definitions: reading versus blind

For each selected tree di ∈ σ(i), the actor's input assembler γ specifies how data is extracted from the current tree di to produce a list of inputs for the scientific function. Let us call this list the firing list F = γ(di). For each element e ∈ F, the scientific function B is called (or fired) using e as input data, creating an output data item e′. Thus, for one scope match di, there is a list, called F′, which contains all output data e′. Since each output corresponds to one input, the lists F and F′ are equal in size. The write expression ω then specifies how the output F′ is used together with di to create the output ri of the actor invocation. In a last step, which is often implicit in a streaming implementation, the output ri replaces the input scope di.
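The interaction just described can be summarized in the following sketch; all types and the trivial stand-in method bodies are placeholders for illustration and are not the actual VDAL implementation.

#include <vector>

struct Tree {};                                   // a nested data collection (placeholder)
using FiringList = std::vector<Tree>;

struct Scope {                                    // sigma: selects non-overlapping trees d_i
  std::vector<Tree> select(const Tree& input) const { return {input}; }        // stand-in body
};
struct InputAssembler {                           // gamma: builds the firing list F from d
  FiringList assemble(const Tree& d) const { return {d}; }                     // stand-in body
};
struct ScientificFunction {                       // the black box, fired once per element of F
  Tree fire(const Tree& e) const { return e; }                                 // stand-in body
};
struct WriteExpr {                                // omega: combines d and F' into the result r
  Tree write(const Tree& d, const FiringList& Fp) const {
    return Fp.empty() ? d : Fp.front();                                        // stand-in body
  }
};

// One actor invocation: for each scope match d_i, build F = gamma(d_i), fire
// the function once per element to obtain F', and let omega produce r_i,
// which replaces d_i in the stream.
std::vector<Tree> invokeActor(const Tree& input, const Scope& sigma,
                              const InputAssembler& gamma,
                              const ScientificFunction& f, const WriteExpr& omega) {
  std::vector<Tree> results;
  for (const Tree& d : sigma.select(input)) {
    FiringList F = gamma.assemble(d), Fp;
    for (const Tree& e : F) Fp.push_back(f.fire(e));   // |F'| == |F|
    results.push_back(omega.write(d, Fp));
  }
  return results;
}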

6.2.2 Notions about Well-formed Workflows

As already mentioned, there are several interesting properties a workflow system can ensure. Most importantly, it should be guaranteed that all scientific functions are called with data that is compatible with the input types of the scientific function. Other desirable properties originate from the virtual assembly-line approach: an actor will only execute its scientific function if (1) the scope selects data from the input, and (2) its input assembler creates a non-empty firing list. In this section, we make these characteristics of workflows more concrete.
Definition 6.1 (Configured actor). An actor is configured, partially configured, or unconfigured if all, some, or none of its Cow parameters (scope σ, input assembler γ, and write expression ω) are specified. Similarly for workflows: a configured/unconfigured workflow is a workflow in which all actors are configured/unconfigured. We call a workflow that is neither configured nor unconfigured partially configured.

  Table 6.2: Actor definitions: feeding versus starving (for σ reading τ)

                | ∃i ∈ τ . ∃d ∈ σ(i)  | ∀i ∈ τ . ∀d ∈ σ(i)
    γ(d) ≠ [ ]  | feeding             | always feeding
    γ(d) = [ ]  | sometimes starving  | starving

We now formally define feeding, starving, and other concepts that describe how the configurations σ and γ relate to sets I of possible input collections.

Definition 6.2 (Reading and blind actors). Given a scope σ and an input type τ, we define the relations reading, always reading, sometimes blind, and always blind as given in Table 6.1. We say, for example, that σ is reading τ iff ∃i ∈ τ . σ(i) ≠ [ ].

Remark 6.3. If a read scope σ is blind on an input type τ, then the actor will not change the input. Thus, the scientific function of the actor will never be called when the workflow is executed, which is often a configuration or design error.

For a scope σ and an input type τ such that σ reads τ, we now define concepts that describe whether the scientific function will be executed.

Definition 6.4 (Feeding and starving actors). Given a scope σ and an input type τ such that σ reads τ, we define the relations feeding, always feeding, sometimes starving, and starving between γ and σ(τ) as given in Table 6.2. We say, for example, that γ is feeding on σ(τ) iff ∃i ∈ τ . ∃d ∈ σ(i) . γ(d) ≠ [ ].

Note that we lift the notions of reading, blind, feeding, and starving from the configurations to the actors. We call an actor, for example, starving on its input type τ if its scope is either not reading τ or its input assembler starves on σ(τ).

Definition 6.5 (Well-formed workflow). A configured workflow W is well-formed with respect to an input schema τ0 if it is all-reading-and-feeding.
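Since the notions of Tables 6.1 and 6.2 quantify over all instances of a type, they cannot in general be decided by enumeration; the following sketch merely illustrates the terminology by classifying σ and γ over a finite sample of instances drawn from τ. The sampling itself, as well as the function names, are assumptions of this illustration.

    def classify_scope(scope, samples):
        """Approximate Table 6.1 over a finite sample of instances of tau."""
        hits = [len(scope(i)) > 0 for i in samples]
        if hits and all(hits):
            return "always reading"
        return "reading" if any(hits) else "blind"

    def classify_assembler(scope, assemble, samples):
        """Approximate Table 6.2 over the scope matches sigma(i) of the sampled instances."""
        feeds = [len(assemble(d)) > 0 for i in samples for d in scope(i)]
        if feeds and all(feeds):
            return "always feeding"
        return "feeding" if any(feeds) else "starving"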


Definition 6.6 (Inconsistent workflow). A configured workflow W is inconsistent iff there is no input schema τ0 such that the workflow is well-formed.

We now define the concept of a required-for relation between actors, and the notion of required actors.

Definition 6.7 (Required-for relation). An actor Ai is required for another actor Aj in a workflow W (in symbols Ai ⇝ Aj) iff (1) Aj is feeding in W, and (2) Aj is starving in W when Ai has been removed. The inverse relation is called requires. In general, an actor Ai is required in W if there exists another actor Aj such that Ai is required for Aj.

In the following, we show how VDAL workflows can be compiled to existing XML languages and how we can exploit their existing type systems to detect starving actors, to verify that a workflow is well-formed, and to compute the required-for relation.

6.3 Compilation of VDAL to FLUX

In the following, we will show how a VDAL actor can be translated to FLUX [Che08], an update language that respects referential transparency and is thus especially suited for static analysis.

6.3.1 Necessary FLUX Extensions

To compile VDAL actors to FLUX programs, FLUX and its type system need to be extended in three ways: (1) adding primitive BaseTypes, (2) adding a construct to call the black-box functions, and (3) adding support for the descendant axis in iteration and read-scope expressions. We further add two typing rules to improve the type system's precision.

Adding BaseTypes. FLUX and the core language LUX, to which FLUX queries are compiled, only contain one primitive type, string. However, adding BaseTypes to the type system and extending the expression language to reflect this change does not pose major problems [CGMS04, Che08].

Adding support to call black-boxes. Cheney proposes type rules for procedures in [Che08]. Since black-box functions create a number of named output lists from a number of named input lists, with each list containing only BaseTypes, they can easily be incorporated as procedures into the FLUX update language. In the FLUX implementation, control can be delegated to the wrappers that interface with the actual black-boxes. We use the black-box function name inside the body of a let statement to denote the function call (see Fig. 6.3, line 9). Input parameters are provided in parentheses; output parameters are bound to the output values inside the let statement. For ease of presentation, parameters are matched by position.

Adding support for the descendant axis. FLUX does not allow the descendant axis, in order to avoid overlapping selections for the focus of an update. Descendant operators in the read-scope are therefore defined to use a first-match semantics, which prevents overlapping scope matches. When compiling FLUX to LUX (as done in [Che08]), it is thus possible to rewrite a //-operator into a procedure that exactly implements the first-match semantics. Descendant operators in the iteration scope are not used to select the input focus and are already allowed in FLUX, because they are part of µXQ (the dos operator) [CGMS04], the XQuery fragment used in FLUX.

A slightly more precise productivity check for if statements. Our detection of starving actors is based on the dead-code analysis of FLUX. However, FLUX's rule for the if statement only detects an if statement as unproductive if both alternatives are unproductive. Since µXQ can analyze emptiness of variables [CGMS06], we can slightly improve FLUX's analysis to additionally mark the if statement unproductive if the type of the condition expression is the empty sequence and the else branch is unproductive, as follows:

We follow the convention, common in XQuery, that the empty sequence “( )” evaluates to false when used in if-then-else expressions. We can thus improve the dead-code analysis for unproductive if statements by replacing the generic rule (6.1):

  Γ ⊢ e : bool    Γ ⊢^a {τ}(s1)^l1 {τ1} & L1    Γ ⊢^a {τ}(s2)^l2 {τ2} & L2
  ─────────────────────────────────────────────────────────────────────────────  (6.1)
  Γ ⊢^a {τ}( if e then (s1)^l1 else (s2)^l2 )^l {τ1|τ2} & (L1 ∪ L2)[l1, l2 ⇒ l]

with the following two rules:

  Γ ⊢ e <: ()    Γ ⊢^a {τ}(s2)^l2 {τ2} & L2
  ─────────────────────────────────────────────────────────────────  (6.2)
  Γ ⊢^a {τ}( if e then (s1)^l1 else (s2)^l2 )^l {τ2} & (L2)[l2 ⇒ l]

  Γ ⊢ e ≮: ()    Γ ⊢^a {τ}(s1)^l1 {τ1} & L1    Γ ⊢^a {τ}(s2)^l2 {τ2} & L2
  ─────────────────────────────────────────────────────────────────────────────  (6.3)
  Γ ⊢^a {τ}( if e then (s1)^l1 else (s2)^l2 )^l {τ1|τ2} & (L1 ∪ L2)[l1, l2 ⇒ l]

While rule (6.1) marks the if-then-else statement unproductive only if both branches are unproductive (l1, l2 ⇒ l), the new rule (6.2) checks whether the type of e is (); if so, it infers a tighter result type for the update (τ2 instead of τ1|τ2) and marks the if-then-else statement unproductive already if the else branch is unproductive (l2 ⇒ l). If the type of e is not the empty sequence, the old rule is used as (6.3), because a type τ ≠ () does not guarantee that all of its values are non-empty; consider, for example, the type τ = true | ( ), which has both an empty and a non-empty instance.
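The decision encoded by rules (6.1)-(6.3) can be summarized in a few lines of Python; the function below is a hedged sketch that assumes the facts required by the premises (emptiness of the condition type, productivity of each branch, the branch result types) have already been computed elsewhere, and its name and parameters are illustrative only.

    def if_statement_analysis(cond_type_is_empty_seq: bool,
                              then_unproductive: bool,
                              else_unproductive: bool,
                              then_type: str,
                              else_type: str):
        """Return (result type, whole if statement unproductive?)."""
        if cond_type_is_empty_seq:
            # rule (6.2): only the else branch matters, tighter result type
            return else_type, else_unproductive
        # rules (6.1)/(6.3): both branches contribute
        return f"{then_type}|{else_type}", then_unproductive and else_unproductive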

6.3.2 Rewriting VDAL to FLUX

A VDAL workflow W = A1, …, An is compiled to a FLUX program F by rewriting each VDAL actor Ai into FLUX statements fi, which are then strung together in the order of the original actors:

  W = A1, …, An   ⇝   f1; …; fn = F

We will explain the transformation based on the example VDAL actor in Fig. 6.2.

   1  Scientific Actor: CipresTreeInference
   2  Input:   method of String
   3           geneSequences of GeneSequence*
   4           seed of Float
   5           maxIterations of Integer
   6  Output:  tree of PhyloTree*
   7           quality of Float
   8  VDAL Actor Parameters:
   9    σ: //Species
  10    γ: method ← foreach $p in //Method return $p/String
  11       geneSequences ← return //Alignment//GeneSequence
  12       seed ← {42}, {23}
  13       maxIterations ← return /MaxIteration/Integer
  14    ω: INSERT AS LAST INTO . VALUE Trees[ $result ]

Figure 6.2: Example of a VDAL actor configuration

   1  UPDATE //Species AS $readS BY {
   2    LET $result :=
   3      for $methodGrp ∈ $readS//Method return let $method := $methodGrp/String in
   4      let $geneSequences :=
   5          $readS//Alignment//GeneSequence in
   6      for $seed ∈ (42, 23) return
   7      if ($readS/MaxIteration/Integer) then
   8        let $maxIterations := $readS/MaxIteration/Integer return
   9        let ($tree, $quality) = CipresTreeInference(
  10            $method, $geneSequences, $seed, $maxIterations) in
  11        tuple[ method[$method],
  12               geneSequences[$geneSequences],
  13               seed[$seed], maxIterations[$maxIterations],
  14               tree[$tree], quality[$quality] ]
  15      else ()
  16    IN
  17    IF ($result) THEN
  18      INSERT AS LAST INTO . VALUE Trees[ $result ]
      }

Figure 6.3: FLUX code corresponding to the VDAL actor given in Fig. 6.2

In the compiled version (Fig. 6.3), FLUX update statements, as defined in Fig. 2.10, are written in capital letters, whereas statements from the query language µXQ are written in lowercase. As shown in Fig. 6.3, line 1, each actor is transformed into one UPDATE .. AS .. BY statement; consequently, the update given after BY is performed on each result returned by the read scope (here //Species). Additionally, the current scope is bound to the variable $readS. In the LET statement (line 2), the result list $result is created. For each input parameter of the black-box, a variable (e.g., $method) is introduced. If the binding was given via a grouping XPath expression (line 10 in Fig. 6.2), a fresh variable (here $methodGrp) is used in a for loop to iterate over the first path; the second path (here $p/String) is adjusted if it refers back to the variable $p (here, $p is replaced by $methodGrp). If the binding was a non-grouping XPath expression in the actor (Fig. 6.2, line 11), the variable (here $geneSequences) is bound via a simple let statement (Fig. 6.3, line 4). For literal values, for loops are introduced if additional groups have been indicated via {...} (Fig. 6.2, line 12 and Fig. 6.3, line 6); otherwise a let statement is used. If a non-list parameter is bound with a simple let statement, originating from a non-grouping binding (as in Fig. 6.2, line 13), an additional if statement is used so that the black-box function is called only if the parameter was bound (line 7). Without this if statement, the FLUX type system would not type-check the program, as the black-box procedure could possibly be called with the value “( )”, the empty sequence. In the body of all nested for loops and let statements, the black-box function is called and provided with the input parameters. The output values (or lists of values) are bound to the corresponding variables (line 9). Then (lines 11-14), the result tuple is created, with subtrees that are labeled with the names of the input and output parameters and that contain the corresponding data. The write-scope statement, which updates the scope τα if the black-box function was called (and thus $result is not empty), is pasted in line 18.
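The following is a rough, hypothetical sketch of the shape of this rewriting for the three binding kinds discussed above. It emits FLUX-like text from a small Python description of one γ binding; the tuple encoding, the function name, and the emitted formatting are illustrative assumptions and not the actual Kepler/FLUX implementation.

    def binding_to_flux(name, binding):
        """Translate one gamma binding into its FLUX-like fragment."""
        kind = binding[0]
        if kind == "group":                 # grouping XPath: fresh variable + for loop
            _, first_path, second_path = binding
            return (f"for ${name}Grp ∈ $readS{first_path} return "
                    f"let ${name} := ${name}Grp{second_path} in")
        if kind == "path":                  # non-grouping XPath: simple let binding
            _, xpath = binding
            return f"let ${name} := $readS{xpath} in"
        _, literals = binding               # literal values with {...} groups: for loop
        values = ", ".join(str(v) for v in literals)
        return f"for ${name} ∈ ({values}) return"

    # Illustrative calls mirroring Fig. 6.2:
    #   binding_to_flux("method", ("group", "//Method", "/String"))
    #   binding_to_flux("geneSequences", ("path", "//Alignment//GeneSequence"))
    #   binding_to_flux("seed", ("literal", [42, 23]))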

6.4 Static Analysis for FLUX-compiled VDAL Workflows

Once a VDAL workflow has been compiled into FLUX programs, static analysis techniques available for FLUX programs can readily be used to provide additional benefits. In particular, at compile time we can (1) guarantee that all black-box functions are provided with the correct base data, (2) predict the schema of the workflow output, and (3) display actor dependencies.

Type safety. Using the FLUX type system, we can verify before the workflow is run that all binding expressions will select data compatible with the black-box functions. This can be done by adding a type declaration ∆ for each black-box function and by simply type-checking the FLUX program f1; f2; …; fn. The typing rule for procedures, which uses the declarations ∆ (see [Che08]), ensures that black-boxes will be called with compatible base data only.

Output schema prediction. Given a specific input schema (or the Any-type) as input, we can make use of FLUX's type system and predict the output schema of the workflow by simply applying the rules given in [Che08].

Actor dependencies. The basis for detecting starving actors and actor dependencies is the dead-code analysis available for FLUX and µXQ. Dead-code analysis for FLUX [Che08] is an extension of the path-error analysis for µXQ described in [CGMS06]. In FLUX, the analysis detects subexpressions (or statements) that are equivalent to the operation skip, i.e., that do not change the input data. For query expressions in µXQ, the analysis finds expressions that are equal to the empty sequence ( ). Examples of dead code in FLUX are a for loop ranging over a path that will not have any bindings, or an UPDATE Path BY statement in which the Path will always evaluate to an empty list. We can therefore use the algorithm in [Che08] to detect cases in which no read-scope match will occur. Furthermore, the black-box function will not be called if one of the non-list parameters is not provided any data: if the parameter is filled with a simple XPath expression, the if statement guarding the let binding for the variable will not be satisfied; if the parameter is filled with a for loop, no data will be available to loop over and the black-box function is not invoked either. Whenever the black-box function is not called, $result will be empty and no update will be performed. With the slightly improved FLUX analysis for if statements, we can thus detect starving actors Ai in VDAL workflows by checking the associated FLUX statements fi.

To analyze actor dependencies in a workflow W, we simply check which actors cause other actors to turn starving when removed. To generate the whole required-for relation as defined in Section 6.2.2, the pseudo-code in Fig. 6.4 can be used.

  INPUT:  F = f1, f2, …, fn
  OUTPUT: required-for relation R
  for i := 1 to n do
      F' := RemoveActor(F, i)
      turned-starving := NotProdIn(F') ∩ ProdIn(F)
      foreach j ∈ turned-starving do
          R.Insert(i, j)
  return R

Figure 6.4: Generating the required-for relation R of the actors in a workflow that has been compiled to a FLUX program F. The code uses the checks NotProdIn(..) and ProdIn(..) available for FLUX programs: ProdIn(X) returns the set of non-starving actors in X, and NotProdIn(X) returns the starving actors.

The required-for relation can be displayed to the user when integrating multiple actors into a workflow. With this information, the developer can verify that there are no typos in the XPath expressions (as otherwise actors would be starving). More importantly, this information also provides feedback on which actors are essential for downstream steps (and should thus not be removed).
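A direct Python transcription of Fig. 6.4 looks as follows; the two FLUX checks are assumed to be available as functions prod_in and not_prod_in (hypothetical names mirroring ProdIn/NotProdIn) that report stable actor names.

    from typing import Callable, Dict, Set, Tuple

    def required_for_relation(
        flux_program: Dict[str, str],                       # actor name -> compiled FLUX statement f_i
        prod_in: Callable[[Dict[str, str]], Set[str]],      # names of non-starving actors (assumed check)
        not_prod_in: Callable[[Dict[str, str]], Set[str]],  # names of starving actors (assumed check)
    ) -> Set[Tuple[str, str]]:
        """(i, j) is in R iff actor j is feeding in the full program but turns
        starving once actor i is removed (cf. Definition 6.7)."""
        relation: Set[Tuple[str, str]] = set()
        productive_before = prod_in(flux_program)
        for name in flux_program:
            reduced = {n: f for n, f in flux_program.items() if n != name}   # RemoveActor(F, i)
            turned_starving = not_prod_in(reduced) & productive_before
            relation.update((name, j) for j in turned_starving)
        return relation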

6.5 Discussion and Related Work

In comparison to work on XQuery and XPath, research on XML update languages is relatively young and much more confined. Several languages based on XQuery have been proposed. In [SHS04], Sur et al. present UpdateX, a declarative update language that “seamlessly” integrates with XQuery and all of its constructs and capabilities. UpdateX provides a high-level, user-friendly language for updating XML documents. The language is implemented on top of Galax [FSC+03], a query engine for XQuery 1.0 written in OCaml. UpdateX supports a complex update construct “For-Let-Where-Update” (FLWU). This construct is similar in expressiveness to the actor configurations we are proposing. It is, however, not specifically geared towards invoking scientific functions. This work also does not consider the design aspects, nor the issues that arise in a distributed execution of networks of stream processors.

In [HPV06], Hidders et al. define an XML update language called LiXQuery+ that contains many constructs of previous proposals for XML update languages. The authors provide a formal definition of their language and investigate subsets of it to examine the expressive power of certain constructs.

Bex et al. present a formal syntax and a precise semantics for an expressive fragment of XSLT in [BMN02]. They show that XSLT “can compute all unary monadic second-order (MSO) structural properties. In brief, MSO is first-order logic extended with set quantification and is an expressive and versatile logic.”

In this chapter, we used the well-behaved XML update language FLUX [Che08] to implement VDAL actors. We can thus rely on existing work to statically analyze our VDAL workflows. This allows for additional features such as guaranteeing type safety for black-box function calls, predicting the output schema of a workflow, and revealing actor dependencies. Static analysis is based on the work on FLUX [Che08], µXQ [CGMS06], and earlier work on regular expression types [HVP05, BCF03]. Our work heavily utilizes research on FLUX and its type system. As we have shown, VDAL actors can be programmed in the FLUX language (with the few extensions described in this work). In addition to existing research on FLUX, we focus on providing an abstraction for creating large FLUX programs from smaller ones, i.e., on constructing workflows using actors as components. We show how actors with scopes (i.e., certain FLUX programs) facilitate robust and change-resilient workflows with simple wiring, and improve the workflow design process by providing feedback about starving actors.

6.6 Future Work: Workflow Resilience

In this section, we outline interesting work that could be done with respect to workflow resilience.

6.6.1 Input Resilience

In the introduction (Section 1.3.4), we presented evidence that collection-oriented workflows are generally more resilient against changes to the input data schema than, for example, plain PN workflows. It would therefore be useful to classify the changes to the input schema that do not affect workflow well-formedness. While well-formedness is no guarantee that the workflow's operation on the changed input data is the desired “adaptation” to the change, maintaining well-formedness is a good indicator. As future work, we plan to characterize changes to the input schema by queries. We hope to be able to partition the set of all possible queries into three classes according to the following distinction:

1. Queries q such that for all inputs i of the old schema τ, a configured workflow that is well-formed wrt. τ will also be well-formed for the input {q(i)}. (Note that we will deploy a set-oriented type system, which contains a “singleton type” τ′ for each data value d; calling a workflow well-formed for an input x is then equivalent to calling it well-formed wrt. τ′ = {x}.)

2. Queries q that render some (or all) workflows w that were well-formed wrt. τ not well-formed for some i′ ∈ q(i), with i ∈ τ; but for which we can automatically change the configurations of the actors in w to obtain a workflow w′ that is well-formed for all i′ ∈ q(i), i ∈ τ.

3. Queries q for which neither of the two other cases applies.

6.6.2 Resilience against Workflow Changes

It is important that the workflow system helps scientists create new workflows from existing ones. To do so, scientists want to insert, remove, or exchange actors.

6.6.3 Inserting Actors

When a new actor is inserted into an already configured workflow with a given input schema, two cases can occur:


1. the actor will always be starving for any possible configuration of it (the actor is useless), or

2. the actor can be configured to be possibly non-starving (the actor is useful).

In the second case, the workflow system could suggest such a configuration. An inserted actor that is useful is called somewhat/completely destructive if some/all of the downstream actors become definitely starving due to the insertion. A workflow tool could highlight these actors and point them out to the user. It is possible that either the inserted actor should be restricted in its read-scope, or that the downstream actors have become obsolete and should be removed.

6.6.4 Deleting Actors

Similar to inserting actors, it would be helpful if the workflow system could highlight actors that are affected by a deletion. The scientist can then decide how to proceed, for example by removing them as well or by inserting other actors that might close the gap.

6.6.5 Replacing Actors

When exchanging one or multiple configured actors for a new actor, it would be helpful if the workflow system could predict and present how the workflow's well-formedness is affected. Several use cases arise here. Given one or multiple consecutive actors o and one or multiple consecutive actors n, we can define the following properties: for both o and n configured, we define n to be a perfect replacement iff in all configured well-formed workflows, and for all input schemata τ for these workflows, we can replace the o actor(s) by the n actor(s) without losing the well-formedness of the workflow. More generally, we define an unconfigured list n of actors to be a general perfect replacement for an unconfigured list o of actors if for any configuration of the o actors there is a configuration of n such that n is a perfect replacement for o. These notions can be used in workflow-actor repositories to search for alternative methods for accomplishing a specific task. This can then help the scientist to quickly


explore different algorithms or approaches to accomplish a certain task. The notion of perfect replacement can also be considered with respect to a single configured workflow, or, even more specifically, with respect to a single workflow with a particular input schema. It is possible that although the replacement actors n are not a perfect replacement for some other actors o, they would still not change the well-formedness of the specific workflow, or of the specific workflow with a specific input schema. If a replacement would destroy well-formedness, the workflow tool can help by pointing out which actors are now starving, etc. This knowledge can be very helpful for spotting which other actors also need modifications of their configurations or might need to be replaced completely. In case a configuration change is sufficient, the workflow tool could suggest one (or a smallest) configuration change.

Chapter 7

Static Analysis II: Towards Deciding Workflow Equivalence

A problem worthy of attack proves its worth by fighting back. (Piet Hein)

The static analysis based on the FLUX type system as described in the previous chapter is not exact; it only approximates the behavior of FLUX and thus of the VDAL workflow. Therefore, it is possible that a starving actor is not detected by the FLUX type system. In this chapter, we present work towards an exact static analysis of VDAL workflows. To this end, we present a novel static-analysis technique for XML-processing programs. Our technique is based on possible-value types, structures that, although they have similarities with regular expression types, capture the semantics of a query exactly. We conjecture that with the static analysis developed here, it will not only be possible to detect unproductive code fragments, and thus starving actors, but we will also be able to test the equivalence of two programs.

Testing equivalence for query languages is a classical problem in databases. While recent research considered the problem of XQuery equivalence for unordered data models [DeH09], query equivalence for restricted fragments of XQuery with an ordered data model has not been well studied. Considering a for-let-return fragment of XQuery, we show how query equivalence (with input restrictions in the form of DTDs) reduces to checking equivalence of pv-types. We then investigate the problem of checking equivalence for pv-types. It turns out that this is a hard problem due to the ordered data model we use. Here, we work towards a solution by defining a mathematical problem, string-polynomial equivalence (SPE), that lies at the core of pv-type equivalence. We have made considerable progress in solving SPE but have not fully finished its solution. Our partial results, however, can be used as approximations for query equivalence: we have sound tests for equivalence and sound tests for inequivalence.

Contributions. The contributions of this chapter are as follows:

(1) We develop possible-value types (pv-types), structures suitable for static analysis of XML-processing programs. Pv-types are inspired by regular expression types with annotations for tracking provenance [Che09]. A pv-type is a mapping that takes a valuation as input and produces a specific value as output. For a given valuation, a typing of a query thus maps an individual input value to the corresponding output value. In other words, a pv-typing, i.e., input and output type, exactly captures the query semantics (in contrast to a set-semantic typing, which only relates sets of values to each other).

(2) Inspired by conventional types, we define requirements for a pv-typing to be correct, i.e., sound and complete. We then give propagation rules for a for-let-return fragment of XQuery and formally prove their correctness. Interesting cases occur due to the way for-loops are handled.

(3) We then show how the problem of deciding the equivalence of two XQuery expressions under input restrictions (given either as DTDs or as pv-types) can be reduced to deciding the equivalence of pv-types.

(4) In Section 7.5, we further define the problem of string-polynomial equivalence (SPE) and show that pv-type equivalence is at least as hard as SPE. Here, we present several normal forms for string polynomials that allow a very precise (but not exact) approximation for deciding their equivalence. Solving SPE exactly is left to future work.

(5) In Section 7.6, we show that for two pv-types τ1 and τ2, it is undecidable whether they map to different values under all valuations.

(6) In Section 7.7, we show that for XQuery with a deep-comparison operator, equivalence and related questions are undecidable.

7.1 Relation to Conventional Regular Expression Types

In the context of XML processing, regular expression types [HVP05] (close relatives of DTDs and XML Schema) are often used as a foundation for statically typed languages such as CDuce [BCF03]. Regular expression types employ a set-semantics approach similar to common types like integer or string in general-purpose programming languages. For example, the type int represents the set of all integers, whereas the type a[ b[ ]* ] represents the set of all trees with an a-labeled root node and a possibly empty list of children whose labels are b. Given a type expression τ, we use [[ τ ]] to denote the set of its values, e.g., [[ int ]] = Z. A query q with input type τ1 has an output type τ2 if, when applied to a database with schema τ1, i.e., a value v1 ∈ [[ τ1 ]], q always produces a value v2 with schema τ2, i.e., v2 ∈ [[ τ2 ]]. This approach allows verifying properties that are characterized using sets of values. The most prominent example application is semantic type checking. Here, it is verified that a query q with input DTD τ1 produces values consistent with a second DTD τo. The problem is solved by showing that the output type τ2 of q with input type τ1 is a subtype of τo, i.e., [[ τ2 ]] ⊆ [[ τo ]]. However, set-semantic type systems do not facilitate value-based comparisons of types.


For a query q with input and output types τ1 and τ2, respectively, we do not know which value v ∈ [[ τ1 ]] is mapped to which value v′ ∈ [[ τ2 ]]. We can thus not answer questions such as “Is q(v) = v for all v ∈ [[ τ ]]?” or “Are two queries q1 and q2 equivalent on some input type τ, in symbols, is q1(v) = q2(v) for all v ∈ [[ τ ]]?” As a simple example, consider the identity function over integers and the function “negate”, which inverts the sign of an integer. Both functions have the same conventional typing, i.e., int → int; however, they are clearly not equivalent. In this chapter, we investigate how to extend set-semantic type systems to capture value-based relationships, and we apply our new type system to decide query equivalence for a for-let-where-return fragment of XQuery. To this end, we combine regular tree types with provenance-like labels and constraints [Che09] and adapt type propagation accordingly.
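The point is easy to state in code: both functions below have the conventional type int → int, so no set-based typing distinguishes them, whereas a value-based typing would.

    def identity(x: int) -> int:
        return x          # maps every value to itself

    def negate(x: int) -> int:
        return -x         # same conventional type int -> int, different value mapping

    assert identity(3) == 3 and negate(3) == -3   # equal types, unequal behavior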

7.2 General Notions

In this section, we define our data model and review regular expression types. We will also introduce the notion of sound and complete propagations for conventional type systems.

7.2.1 Data Model

Our XML data model is the standard model of labeled, rooted, ordered trees. For presentation reasons, we consider only one base type ‘str’ (string); other base types such as integer, Boolean, or other domain-specific types can easily be added to our model and analysis. The set of all possible values is defined by the following grammar:

  v ::= () | w | v, v′ | a[v]        with w ∈ str, a ∈ Σlabels

Thus, a value is either the empty sequence (), a data value w of some base type (here: string), a sequence v, v′ of values, or a tree a[v] with a label a from a label alphabet Σlabels as root and a value as child(ren). We use V to denote the set of all values. We define (deep value) equality as usual: “,” is associative, so x, (y, z) = (x, y), z, or short: x, y, z. Similarly, “()” is neutral in sequences: v, () = (), v = v. However, we do not identify two strings in sequence with their concatenation: foo, bar ≠ foobar. Note that this data model is purely value-based and thus does not model node identifiers. It also omits XML attributes.
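A minimal sketch of this value model in Python, under the assumption that a value is encoded as a flat tuple of top-level trees (which makes “,” associative and “()” neutral by construction); the names below are illustrative, not part of the dissertation's implementation.

    from typing import Tuple, Union

    # A value is a (possibly empty) tuple of trees; a tree is either a string w
    # or a (label, value) pair.
    Tree = Union[str, Tuple[str, "Value"]]
    Value = Tuple[Tree, ...]

    def seq(*values: Value) -> Value:
        """Sequence constructor v, v': concatenation of flat tuples."""
        return tuple(t for v in values for t in v)

    def tree(label: str, child: Value) -> Value:
        """Tree constructor a[v] as a singleton sequence."""
        return ((label, child),)

    def string(w: str) -> Value:
        """A base-type value w as a singleton sequence."""
        return (w,)

    # Deep equality is tuple equality; note ("foo", "bar") != ("foobar",),
    # i.e., two strings in sequence are not identified with their concatenation.
    empty: Value = ()
    assert seq(empty, string("x")) == seq(string("x"), empty) == string("x")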

7.2.2 Core XQuery Fragment XQ

Our core XQuery fragment XQ is a variant of the µXQ core language presented in [CGMS06]. (We do not explicitly distinguish between tree and forest variables in our syntax; however, we still require chldr and dos to be applied only to variables originating from for-bindings.) The syntax of a query expression e is defined by the following grammar:

  e ::=  ()  |  e, e′  |  a[e]  |  w
      |  x  |  x/chldr  |  x/dos  |  e :: a
      |  for x in e ret e′
      |  if e then e′ else e″          (for XQcond)
      |  deepEQ(e, e′)                 (for XQdeepEQ)        (7.1)

Here, a ranges over the label alphabet Σlabels, w is an atomic base-type symbol (in our case a string), and x ranges over variable names. The semantics, or result value, of a query expression e is defined in a value environment Σ that maps variable names to values. We use

  Σ ⊢ e ⇒ v        (7.2)

to denote that in environment Σ the expression e evaluates to v. The “input” to a query is simulated by binding the input value to the variable $root. We sometimes use the shorthand e(v) to denote v′ with {$root ↦ v} ⊢ e ⇒ v′. An example evaluation is the following:

  {$root ↦ a[“x”, “y”]}  ⊢  for $x in $root/chldr ret b[$x, $x]  ⇒  b[“x”, “x”], b[“y”, “y”]        (7.3)

The operational semantics of our XQ variants is defined in Fig. 7.1. Each of the evaluation rules is of the form

  premises
  ──────────
  Σ ⊢ e ⇒ v

denoting that if the premises hold, then e evaluates to v under the environment Σ. We use the auxiliary function :: a to select only those trees in a list of values that are labeled with a. The function trees(v) returns a list of the top-level trees in a value, i.e., it flattens the hierarchical sequence operator “,” and removes top-level empty sequences. Example: trees( ((a[], “x”), ((), b[b[]])) ) = a[], “x”, b[b[]]. We use chldr and dos to select all children, or all descendants together with the value itself, respectively. Note that we require chldr and dos to be used only on variables that were introduced by a for-loop. This requirement does not restrict expressiveness, as for-loops can easily be introduced; however, it simplifies some of the case analyses later. Thus, we do not need to define chldr for a list of values or for the empty sequence.
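Continuing the tuple encoding sketched in Section 7.2.1 (an assumption of these sketches, not the dissertation's code), the auxiliary functions of Fig. 7.1 can be written directly:

    def trees(value):
        """Top-level trees of a value; with the flat-tuple encoding this is the value itself."""
        return value

    def label_filter(value, a):
        """v :: a — keep only the top-level trees labeled a."""
        return tuple(t for t in value if isinstance(t, tuple) and t[0] == a)

    def chldr(t):
        """chldr(t) for a single (for-bound) tree t: chldr(w) = (), chldr(a[v]) = v."""
        return t[1] if isinstance(t, tuple) else ()

    def dos(value):
        """dos(v): every labeled tree together with all of its descendants, in order."""
        out = ()
        for t in value:
            if isinstance(t, tuple):          # dos(a[v]) = a[v], dos(v); dos(w) = ()
                out += (t,) + dos(t[1])
        return out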

7.2.3 Expressive Power of XQ

Note that a let statement can easily be simulated via a for statement:

  let x = e in e′   can be simulated by   for x in a[e] ret e″,   where e″ is e′ with x replaced by x/chldr        (7.4)

Also, chained XPath expressions (e.g., //b/c//d) can be simulated via for-loops, the single-step XPath axes, and the node-filter operation given in XQ.

  Empty sequence:              Σ ⊢ () ⇒ ()
  String constant:             Σ ⊢ w ⇒ w
  Variable lookup:             Σ(x) = v   ⟹   Σ ⊢ x ⇒ v
  Tree constructor:            Σ ⊢ e ⇒ v   ⟹   Σ ⊢ a[e] ⇒ a[v]
  Sequence constructor:        Σ ⊢ e ⇒ v,  Σ ⊢ e′ ⇒ v′   ⟹   Σ ⊢ e, e′ ⇒ v, v′
  Label filter:                Σ ⊢ e ⇒ v,  v′ = v :: a   ⟹   Σ ⊢ e :: a ⇒ v′
  Empty for loop:              Σ ⊢ for x in () ret e′ ⇒ ()
  For loop with one element:   t ∈ {a[v], w},  Σ, x ↦ t ⊢ e′ ⇒ t′   ⟹   Σ ⊢ for x in t ret e′ ⇒ t′
  For loop over sequence:      Σ ⊢ for x in v1 ret e′ ⇒ v1′,  Σ ⊢ for x in v2 ret e′ ⇒ v2′
                                  ⟹   Σ ⊢ for x in v1, v2 ret e′ ⇒ v1′, v2′
  For loop over expression:    Σ ⊢ e ⇒ v,  Σ ⊢ for x in v ret e′ ⇒ v′   ⟹   Σ ⊢ for x in e ret e′ ⇒ v′
  If-true:                     Σ ⊢ e ⇒ v,  v ≠ (),  Σ ⊢ e′ ⇒ v′   ⟹   Σ ⊢ if e then e′ else e″ ⇒ v′
  If-false:                    Σ ⊢ e ⇒ v,  v = (),  Σ ⊢ e″ ⇒ v″   ⟹   Σ ⊢ if e then e′ else e″ ⇒ v″
  deepEQ-true:                 Σ ⊢ e ⇒ v,  Σ ⊢ e′ ⇒ v′,  v = v′   ⟹   Σ ⊢ deepEQ(e, e′) ⇒ “true”
  deepEQ-false:                Σ ⊢ e ⇒ v,  Σ ⊢ e′ ⇒ v′,  v ≠ v′   ⟹   Σ ⊢ deepEQ(e, e′) ⇒ ()

  With:
    () :: a = ()      w :: a = ()      a[v] :: a = a[v]      b[v] :: a = () for b ≠ a      (v1, v2) :: a = (v1 :: a), (v2 :: a)
    chldr(a[v]) = v      chldr(w) = ()      chldr(()) = ⊥      chldr(v1, v2) = ⊥
    dos(()) = ()      dos(w) = ()      dos(a[v]) = a[v], dos(v)      dos(v1, v2) = dos(v1), dos(v2)

Figure 7.1: Semantics of XQ

7.2.4 Regular Expression Types

Conventional regular expression types (conv-types), which are used in several transformation and query languages for XML [HVP05, CGMS06, FSW01, Che08], are structures of the following form:

  τ ::= () | str | τ, τ′ | τ|τ′ | a[τ] | τ*        with a ∈ Σlabels

A conv-type is either the type of the empty sequence (); a base type such as string; a sequence of two types; the alternative of two types; a tree type a[t] with a label a from the label alphabet; or a repetition type t*. Each conv-type t corresponds to a set of values according to the semantics [[ . ]]; thus [[ . ]] is a function from types to the powerset of all values:

  [[ . ]] : conv-types → 2^V        (7.5)

It is recursively defined as follows:

  [[ () ]]    = { () }
  [[ str ]]   = { x | x is a str }
  [[ t, t′ ]] = { x, y | x ∈ [[ t ]], y ∈ [[ t′ ]] }
  [[ t|t′ ]]  = [[ t ]] ∪ [[ t′ ]]
  [[ a[t] ]]  = { a[x] | x ∈ [[ t ]] }
  [[ t* ]]    = { a0, a1, …, an | n ∈ N, 0 ≤ i ≤ n, ai ∈ [[ t ]] }

Our current work does not consider recursive regular expression types. Indeed, most DTDs and XML Schemas encountered in practice are non-recursive [Cho02, BNdB04].
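To make the set semantics concrete, here is a small membership test over the tuple encoding of values used earlier; the nested-tuple encoding of conv-types is an assumption of this sketch, the check is naive (it tries every split of a sequence), and the star case is read as also containing the empty sequence.

    # conv-type encoding (illustrative):  ("empty",) | ("str",) | ("seq", t1, t2)
    #                                     | ("alt", t1, t2) | ("tree", label, t) | ("star", t)

    def matches(value, ty) -> bool:
        """Does the value (a flat tuple of trees) belong to [[ ty ]]?  Naive check."""
        kind = ty[0]
        if kind == "empty":
            return value == ()
        if kind == "str":
            return len(value) == 1 and isinstance(value[0], str)
        if kind == "tree":
            return (len(value) == 1 and isinstance(value[0], tuple)
                    and value[0][0] == ty[1] and matches(value[0][1], ty[2]))
        if kind == "alt":
            return matches(value, ty[1]) or matches(value, ty[2])
        if kind == "seq":
            # try every split of the top-level tree list
            return any(matches(value[:i], ty[1]) and matches(value[i:], ty[2])
                       for i in range(len(value) + 1))
        if kind == "star":
            if value == ():
                return True
            # peel one non-empty prefix matching the inner type, then recurse
            return any(matches(value[:i], ty[1]) and matches(value[i:], ty)
                       for i in range(1, len(value) + 1))
        raise ValueError(f"unknown type constructor {kind!r}")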

7.2.5 Conventional Type Propagation

A type propagation through a query can be seen as an abstract interpretation of the query on the structure of types. The type propagation rules are very similar to the rules that define the operational semantics of XQ as shown in Fig. 7.1. Instead of a value environment Σ, we have a type environment Γ that maps variable names to types. We use

  Γ ⊢ e : τ        (7.6)

to denote that under the environment Γ, the (output) type of the expression e is τ. An example propagation is:

  {$root ↦ a[str, str]}  ⊢  for $x in $root/chldr ret b[$x, $x]  :  b[str, str], b[str, str]        (7.7)

The properties of soundness and completeness relate type propagation and value evaluation to each other. In particular, a conv-type propagation {$root ↦ τ} ⊢ e : τ′ is sound if for every input value v ∈ [[ τ ]] with {$root ↦ v} ⊢ e ⇒ v′, the expression's output value v′ is in the set denoted by τ′, i.e., v′ ∈ [[ τ′ ]]. For a sound propagation, we can guarantee the constraint that every output v′ will conform to a desired type τ″ by showing that [[ τ″ ]] ⊇ [[ τ′ ]]. Furthermore, a conv-type propagation {$root ↦ τ} ⊢ e : τ′ is complete if we can find an input value for each (predicted output) value in [[ τ′ ]], or, more precisely, if for every v′ ∈ [[ τ′ ]] there exists v ∈ [[ τ ]] such that {$root ↦ v} ⊢ e ⇒ v′. Completeness characterizes the precision of a type system: returning to the example above, if the output type τ′ is not a subtype of the desired output type τ″, i.e., [[ τ″ ]] ⊉ [[ τ′ ]], then completeness guarantees that there is an input v that actually violates the constraint. For many languages, a non-trivial and useful type system cannot be both sound and complete. Here, one settles for type systems that are sound but only reasonably complete. The typing given in (7.7), for example, is sound but not complete for the query from (7.3): we cannot find any input value such that the expression in (7.3) evaluates to the value v′ := b[“x”, “y”], b[“z”, “w”], although v′ ∈ [[ b[str, str], b[str, str] ]] = [[ τ′ ]].

  In general:    Γ ⊢ e : τ    --- v --->    v(Γ) ⊢ e ⇒ [[ τ ]]v

  Example:       {$x ↦ strc} ⊢ $x, $x : strc, strc    --- v = {c ↦ “z”} --->    {$x ↦ “z”} ⊢ $x, $x ⇒ “z”, “z”

Figure 7.2: Commutativity diagram. A sound and complete typing corresponds to a concrete evaluation once the valuation v is fixed.

7.3 General Idea of Possible-Value Types

As in the conventional typing approach, possible-value typing is a form of abstract interpretation: we evaluate a query expression e under a typing environment Γ to obtain a result type. However, with pv-types we maintain the association between each individual input value v and output value v′. In fact, we symbolically simulate the evaluation of the query for all possible input values. To achieve this goal, we make the mapping between a type and the set of its values explicit via a valuation v. In particular, given a pv-type τ and a valuation v, the semantics of τ under v, in symbols [[ τ ]]v, is a specific value, i.e., a possibly empty list of trees. Via a valuation we can thus obtain a specific value from a type, and via the obvious lifting, we can obtain a value environment Σ = v(Γ) from a type environment Γ. Fig. 7.2 depicts how a sound and complete pv-typing Γ ⊢ e : τ corresponds to a specific evaluation Σ ⊢ e ⇒ v via a valuation v: constructing the value environment v(Γ) and evaluating e in this environment produces the same value as the semantics of the result type τ under the valuation v.

A valuation is a function that, loosely speaking, stores the information via which a concrete value can be constructed from a pv-type. If a valuation v provides enough information to fully construct the value, we say v defines τ. Formally, defines is a binary relation between valuations and pv-types. Furthermore, if a valuation v defines a pv-type τ, in symbols v ⊨ τ, then the semantics [[ τ ]]v of τ is a particular value v ∈ V; thus [[ . ]]v is a function:

  [[ . ]]v : pv-types → V        (7.8)

In the following, we use vτ to highlight that we only consider valuations v that define τ; we omit the subscript when it is clear from the context. To form pv-types, we add annotations to conventional regular expression types. Valuations then map these annotations to actual values: for the pv-type τ = stra, the valuation v = {a ↦ “x”} defines τ, whereas v′ = {b ↦ “x”} does not. Thus [[ τ ]]v′ is not defined, whereas [[ τ ]]v = “x”.

Extending valuations. Given a valuation v that does not define a type τ, we can always extend (the function) v to a (function) v′ such that v′ defines τ. It is easy to see that [[ τ ]]v = [[ τ ]]v′ for all types τ that were already defined by v.

7.4 Possible-Value Types

In this section, we introduce the machinery deployed by possible-value types for XML. Using several XQ examples, we first motivate the structure and details of pv-types, which we then use to construct correct typings for our XQ variants. Possible-value types for XML are essentially regular expression types with structured annotations on some constructs. Formally, pv-types are of the following form:

  τ ::= () | ‘w’ | τ, τ′ | a[τ] | str^{cx} | τ |^{ox} τ′ | τ^{sx}
        with w ∈ str, a ∈ Σlabels, c ∈ Σc, o ∈ Σo, s ∈ Σsi, x ∈ A*        (7.9)

We use the distinct sets Σc, Σo, and Σs for annotating constants, or-types, and star-types, respectively. The suffix x in annotations is a string over array indices A; we define these suffixes below. Star annotations can additionally be subscripted by natural numbers: the set Σsi used in (7.9) denotes the star annotations s ∈ Σs together with their variants carrying an integer subscript i, in symbols Σsi := Σs ∪ {si | s ∈ Σs, i ∈ N}. We will need this additional structure over the star annotations later.

Definition 7.1 (Annotation suffixes). Suffixes x for annotations as in (7.9) are strings over the set A := Σsi ∪ N. We often write each letter in an annotation x inside square brackets, reminiscent of array notation. An example suffix x with a ∈ Σs is “[1][5][a3][1]”.

Example 7.2 (Possible-value types). With Σc = {c1, c2, …}, Σo = {o1, o2, …}, and Σs = {s1, s2, …}, the following are examples of pv-types:

  • str^{c1}
  • ((str^{c1[s11]}, str^{c1[s12]})^{s12})^{s11}
  • str^{c2} |^{o1} ‘foo’

We now define the structure of valuations, which are necessary for the semantics of pv-types:

Definition 7.3 (Valuation). A valuation is a partial function from annotations that are suffixed only by natural numbers to string literals, the constants “Left” and “Right”, and integers. In symbols:

  v  ⊆  Σc N* → string
     ∪  Σo N* → {Left, Right}
     ∪  Σs N* → N        (7.10)

Example 7.4 (Valuation). With Σc = {c1, c2, …}, Σo = {o1, o2, …}, and Σs = {s1, s2, …}, an example valuation is {c1 ↦ “foo”, c2[3][1] ↦ “bar”, o4[42] ↦ Left, s5[23][1] ↦ 7}.

Definition 7.5 (Free annotation indexes). Free and bound annotation indexes are defined analogously to free and bound variables; here, star-types τ^{s} bind all free occurrences of the suffix s in τ. Formally, the set FI(τ) of free indexes of a type τ is defined


recursively as follows:

  FI(())           = FI(‘w’) = ∅
  FI(τ1, τ2)       = FI(τ1) ∪ FI(τ2)
  FI(a[τ])         = FI(τ)
  FI(str^{cx})     = FI(x)
  FI(τ1 |^{ox} τ2) = FI(τ1) ∪ FI(x) ∪ FI(τ2)                                        (7.11)
  FI(τ^{sx})       = ( FI(x) ∪ FI(τ) ) \ {s}        for s ∈ Σsi
  FI(x)            = the annotations from Σsi occurring in x,
                     e.g., FI([s11][2][1][s25][s13][s37]) = {s11, s25, s13, s37}

As usual, we call indexes that occur in τ bound if they are not free.

Example 7.6. The pv-type ((str^{c1[s11]}, str^{c1[s12]})^{s12})^{s11} does not have any free annotation index; both indexes s11 and s12 are bound. However, the pv-type (str^{c1[s11]}, str^{c1[s12]})^{s12} has the free annotation index s11.

Via a “defines” criterion that we state below, we can now formally define the semantics of pv-types. A pv-type represents a specific value under a valuation v as defined in Fig. 7.3. The pv-type () represents the empty sequence () independently of the valuation v; for any w ∈ str, the type ‘w’ represents the value w, also independently of the valuation v. The sequence type τ, τ′ represents the sequence of the values represented by its components; similarly, the type a[τ] represents the tree a[v] if τ represents the value v. The semantics of the other type constructs depends on the valuation. As defined above, each valuation v is a function that maps annotations starting with c ∈ Σc to string values; annotations starting with o ∈ Σo are mapped to either “Left” or “Right”, and annotations starting with s ∈ Σs are mapped to an integer value l ∈ N. Note that it is easy to see that all bound indexes of a type τ will be replaced by integers via the star rule in Fig. 7.3. Before we provide examples that motivate the interplay between star-types and the suffixes x in annotations, we formally define when a valuation defines a type τ without free indexes:

  [[ () ]]v          = ()
  [[ ‘w’ ]]v         = w                                    for w ∈ str
  [[ τ, τ′ ]]v       = [[ τ ]]v , [[ τ′ ]]v
  [[ a[τ] ]]v        = a[ [[ τ ]]v ]                        for a ∈ Σlabels
  [[ str^{cx} ]]v    = v(cx)
  [[ τ |^{ox} τ′ ]]v = if v(ox) = Left then [[ τ ]]v else [[ τ′ ]]v
  [[ τ^{si x} ]]v    = [[ τ[si/1] ]]v , [[ τ[si/2] ]]v , … , [[ τ[si/v(sx)] ]]v
                        for c ∈ Σc, o ∈ Σo, s ∈ Σs, si ∈ Σsi, x ∈ A*

Figure 7.3: Semantics [[ . ]]v for XML pv-types without free indexes

Definition 7.7 (The defines relation). A valuation v defines a type τ without any free indexes iff v defines every lookup v(cx), v(ox), and v(sx) that occurs when computing the semantics of τ according to Fig. 7.3.

Since the interplay between star-types and the suffixes x in annotations is a little intricate, we now use examples for motivation and explanation.

Non-suffixed annotations. The type str^c represents the value v(c) in the valuation v. For an or-type, the valuation determines which branch of the or is to be taken. As an example, consider the pv-type τ = str^a |^b str^c; under a valuation v = {a ↦ “foo”, b ↦ Left, c ↦ “bar”}, τ represents “foo”, in symbols [[ τ ]]v = “foo”.

Star-types. A star-type τ^s represents a list of values according to its inner type τ and the annotation s. Here, the valuation defines the length of the list, i.e., v(s) ∈ N. A star-type represents the empty sequence () if its annotation is mapped to zero by v:

  [[ τ^s ]]_{s ↦ 0} = ()

The inner type τ represents all elements of the sequence simultaneously. Intuitively, we distinguish two cases: (1) all elements in the sequence have equal values, and (2) the values of the individual elements differ. An example of the first case is:

  [[ (str^a)^s ]]_{a ↦ “x”, s ↦ 3} = “x”, “x”, “x”        (7.12)


A query could create such a type with a for-loop whose body does not refer to the loop variable, for example “for $i in $root ret $x”. If the inner value varies over the sequence, we use the star annotation (here s) as an array suffix in the inner type; for each suffixed annotation, the valuation then defines a whole array of mappings:

  [[ (str^{a[s]})^s ]]_{s ↦ 3, a[1] ↦ “x”, a[2] ↦ “y”, a[3] ↦ “z”} = “x”, “y”, “z”        (7.13)

Note that using this type, we can construct a valuation to represent any string sequence. That is, the conv-type str* lifted to a pv-type is (str^{a[s]})^s, in symbols:

  ⋃_v [[ (str^{a[s]})^s ]]v = [[ str* ]]        (7.14)

Correlating the annotation s with the inner type allows us to express over which sequence the inner type changes. Consider the following two XQuery expressions:

  e1 := for $x in $u ret ( for $y in $v ret $x )        (7.15)
  e2 := for $x in $u ret ( for $y in $v ret $y )        (7.16)

Assume both expressions are evaluated in the environment Γ = {$u ↦ (str^{a[u]})^u, $v ↦ (str^{b[v]})^v}. We can now capture that when query (7.15) is executed, the value of $x does not change in the inner loop, while in the second query (7.16), the value in the inner loop changes and the whole sequence is repeated in the outer loop:

  Γ ⊢ e1 : ((str^{a[u]})^v)^u        Γ ⊢ e2 : ((str^{b[v]})^v)^u

For the value environment Σ = {$u ↦ “1”, “2”, $v ↦ “1”, “2”}, the expressions e1 and e2 evaluate to:

  Σ ⊢ e1 ⇒ “1”, “1”, “2”, “2”        Σ ⊢ e2 ⇒ “1”, “2”, “1”, “2”


However, the above annotations fail for nested loops ranging over the same variable:

  e3 := for $x in $u ret ( for $y in $u ret $x, $y )        (7.17)

Here, it is important to keep track of which for-loop the data originates from. We therefore always create a fresh symbol by using a fresh subscript on the original star annotation inside for-loops. As an example, the XQ expression (7.17) in the environment Γ = {$u ↦ (str^{a[u]})^u} has the typing:

  Γ ⊢ e3 : ((str^{a[u1]}, str^{a[u2]})^{u2})^{u1}        (7.18)

As an example, consider the valuation v = {u ↦ 3, a[1] ↦ “x”, a[2] ↦ “y”, a[3] ↦ “z”}; then the result type in (7.18) represents the following value:

  [[ ((str^{a[u1]}, str^{a[u2]})^{u2})^{u1} ]]v
    = [[ (str^{a[1]}, str^{a[u2]})^{u2} ]]v , [[ … ]]v , [[ (str^{a[3]}, str^{a[u2]})^{u2} ]]v
    = [[ str^{a[1]}, str^{a[1]} ]]v , [[ str^{a[1]}, str^{a[2]} ]]v , … , [[ str^{a[3]}, str^{a[3]} ]]v
    = (“x”, “x”, “x”, “y”, “x”, “z”), (…), (“z”, “x”, “z”, “y”, “z”, “z”)

These considerations lead to the semantics described in Fig. 7.3: the final (star) rule “breaks down” star-types into actual sequences. As illustrated above, the length of the sequence is defined by the (possibly suffixed) annotation sx. The individual elements of the sequence are formed by taking the inner type and replacing every occurrence of si by j for the jth element of the sequence. Note that (1) since we create a fresh si for each new loop, only the correlated si is replaced, and (2) we drop the index i from si x when looking up the length of a star-type annotated with si x; likewise, when star variables are replaced by the actual index, the information about which loop they originated from is discarded. Intuitively, this is because the elements of a sequence are the same independently of the loop they originated from.
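A compact sketch of the semantics in Fig. 7.3, over illustrative Python encodings (pv-types as nested tuples, valuations as dicts keyed by flattened annotation strings); in particular, the star rule substitutes the bound index by 1..n and drops the freshness subscript when looking up the length. The encoding, names, and the assumption that base annotation names carry no trailing digits are all assumptions of this sketch.

    # pv-type encoding (illustrative):
    #   ("empty",) | ("const", w) | ("seq", t1, t2) | ("tree", a, t)
    #   | ("strc", c, x) | ("or", o, x, t1, t2) | ("star", s, x, t)
    # where c, o, s are annotation names and x is a list of suffix letters
    # (integers or star indexes such as "u1").

    def _key(name, suffix):
        return name + "".join(f"[{p}]" for p in suffix)

    def _subst(suffix, index, j):
        # replace a bound star index by the integer j (loop origin is discarded)
        return [j if p == index else p for p in suffix]

    def _subst_type(ty, index, j):
        kind = ty[0]
        if kind in ("empty", "const"):
            return ty
        if kind == "seq":
            return ("seq", _subst_type(ty[1], index, j), _subst_type(ty[2], index, j))
        if kind == "tree":
            return ("tree", ty[1], _subst_type(ty[2], index, j))
        if kind == "strc":
            return ("strc", ty[1], _subst(ty[2], index, j))
        if kind == "or":
            return ("or", ty[1], _subst(ty[2], index, j),
                    _subst_type(ty[3], index, j), _subst_type(ty[4], index, j))
        return ("star", ty[1], _subst(ty[2], index, j), _subst_type(ty[3], index, j))

    def evaluate(ty, v):
        """[[ ty ]]_v as a flat tuple of trees (the value encoding used earlier)."""
        kind = ty[0]
        if kind == "empty":
            return ()
        if kind == "const":
            return (ty[1],)
        if kind == "seq":
            return evaluate(ty[1], v) + evaluate(ty[2], v)
        if kind == "tree":
            return ((ty[1], evaluate(ty[2], v)),)
        if kind == "strc":
            return (v[_key(ty[1], ty[2])],)
        if kind == "or":
            branch = ty[3] if v[_key(ty[1], ty[2])] == "Left" else ty[4]
            return evaluate(branch, v)
        s, x, inner = ty[1], ty[2], ty[3]        # star type
        base = s.rstrip("0123456789")            # drop the freshness subscript i from s_i
        out = ()
        for j in range(1, v[_key(base, x)] + 1):
            out += evaluate(_subst_type(inner, s, j), v)
        return out

    # Example mirroring (7.18): the first six elements of the represented value.
    e3_type = ("star", "u1", [], ("star", "u2", [],
               ("seq", ("strc", "a", ["u1"]), ("strc", "a", ["u2"]))))
    v = {"u": 3, "a[1]": "x", "a[2]": "y", "a[3]": "z"}
    assert evaluate(e3_type, v)[:6] == ("x", "x", "x", "y", "x", "z")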

7.4.1 Propagation Rules for XQ

Via the notion of correctness, we relate a pv-typing to the corresponding evaluations of e, as illustrated in Fig. 7.2:

Definition 7.8 (Typing correctness). A typing Γ ⊢ e : τ is called correct if for all v that define Γ, v also defines τ and v(Γ) ⊢ e ⇒ [[ τ ]]v.

Propagation rules for the basic XQ fragment (from Fig. 7.1) are shown in Fig. 7.4. To show type-correctness of this rule set, we need to generalize the typing-correctness definition above so that it also covers pv-types that contain free indexes. Furthermore, since we added a for-in-τ-ret statement to XQ that operates on types, we also need to consider the case where expressions have types embedded in them. Here, we lift the valuation v to expressions in the obvious way, i.e., v(e) replaces each type τ occurring in e by the value v(τ) while leaving everything else as is. Note that v(e) differs from e only in for loops containing types; moreover, for these expressions, v(e) is a valid XQ expression, as the value v(τ) can be inserted using the obvious XQ value-construction operators.

Definition 7.9 (Extended typing correctness). Given a pv-typing T = Γ ⊢ e : τ and an arbitrary replacement S : Σsi → N of the free indexes in Γ, e, and τ by integers, let Γ^S, e^S, and τ^S denote the structures to which S has been applied, i.e., in which the free indexes have been replaced by integers. The typing T is correct if for all valuations v that define Γ^S, e^S, and τ^S the following holds: vS(Γ) ⊢ vS(e) ⇒ vS(τ), where we write vS(.) to denote v((.)^S).

Example 7.10. {$x ↦ str^{c[s1]}} ⊢ $x, $x : str^{c[s1]}, str^{c[s1]} is correct. Proof: Consider a replacement S = {s1 ↦ i} with an arbitrary integer i. We then need to show that for any defining v, say v(c[i]) = w for a string w, the following holds: v({$x ↦ str^{c[i]}}) ⊢ v($x, $x) ⇒ v(str^{c[i]}, str^{c[i]}). This is easy, since v({$x ↦ str^{c[i]}}) = {$x ↦ w}, further


v($x, $x) = ($x, $x), and v(str^{c[i]}, str^{c[i]}) = w, w. Left to show is {$x ↦ w} ⊢ $x, $x ⇒ w, w, which directly follows from the value semantics of XQ (Fig. 7.1).

Theorem 7.11. The type propagation rules in Fig. 7.4 are correct.

Although the proof is lengthy and technical, it shows how the indexing and annotation scheme captures the semantics of the query. The proof ends on page 175.

Proof. Correctness for Γ ⊢ () : () and Γ ⊢ w : ‘w’. Trivial.

Correctness for Γ ⊢ x : τ if Γ(x) = τ. We need to show that for any substitution S of free indexes and any defining valuation v, the following holds: vS(Γ) ⊢ vS(x) ⇒ vS(τ). Since Γ(x) = τ, also Γ^S(x) = τ^S, and thus vS(Γ(x)) = vS(Γ)(x) = vS(τ), which leads to the desired result via the semantic rule for variables in XQ.

Correctness for Γ ⊢ a[e] : a[τ] if Γ ⊢ e : τ. Since e is a smaller XQ expression, we can apply the induction hypothesis to the premise, i.e., for any substitution S of free indexes we know that vS(Γ) ⊢ vS(e) ⇒ vS(τ) holds for all defining v. We now need to show vS(Γ) ⊢ vS(a[e]) ⇒ vS(a[τ]) for an arbitrary substitution S and defining v. Since vS(a[e]) = a[vS(e)], the goal follows from the value semantics for a[e] of XQ.

Correctness for Γ ⊢ e, e′ : τ, τ′ if Γ ⊢ e : τ and Γ ⊢ e′ : τ′. From the induction hypothesis, we know that vS(Γ) ⊢ vS(e) ⇒ vS(τ) and vS(Γ) ⊢ vS(e′) ⇒ vS(τ′) for all substitutions S and all defining v. Now, vS(Γ) ⊢ vS(e, e′) ⇒ vS(τ, τ′) for all substitutions S and defining v follows from vS(e, e′) = vS(e), vS(e′) as well as vS(τ), vS(τ′) = vS(τ, τ′) and the value-semantics rule for e, e′ of XQ.

To show correctness of the rules with axes and label filters, we need to relate the operations chldr, dos, and :: a over types, as defined in Fig. 7.4, to their counterparts over values, as defined in Fig. 7.1. To distinguish these “overloaded” operations, we use τ and v as subscripts to denote the type and value versions, respectively. Concretely, for each operation op ∈ {chldr, dos, :: a}, we need to show its correctness, i.e., vS(opτ(τ)) = opv(vS(τ)) for all pv-types τ, all substitutions S, and all valuations v defining τ^S.

  Γ ⊢ () : ()          Γ ⊢ w : ‘w’          Γ(x) = τ  ⟹  Γ ⊢ x : τ

  Γ ⊢ e : τ  ⟹  Γ ⊢ a[e] : a[τ]
  Γ ⊢ e : τ,  Γ ⊢ e′ : τ′  ⟹  Γ ⊢ e, e′ : τ, τ′
  Γ(x) = τ,  τ′ = chldr(τ)  ⟹  Γ ⊢ x/chldr : τ′
  Γ(x) = τ,  τ′ = dos(τ)  ⟹  Γ ⊢ x/dos : τ′
  Γ ⊢ e : τ,  τ′ = τ :: a  ⟹  Γ ⊢ e :: a : τ′

  Γ ⊢ e : τ,  Γ ⊢ for x in τ ret e′ : τ′  ⟹  Γ ⊢ for x in e ret e′ : τ′
  Γ ⊢ for x in () ret e′ : ()
  T ∈ {a[τ], ‘w’, str^{cx}},  Γ, x ↦ T ⊢ e′ : τ′  ⟹  Γ ⊢ for x in T ret e′ : τ′
  Γ ⊢ for x in τ1 ret e′ : τ1′,  Γ ⊢ for x in τ2 ret e′ : τ2′  ⟹  Γ ⊢ for x in τ1, τ2 ret e′ : τ1′, τ2′
  Γ ⊢ for x in τ1 ret e′ : τ1′,  Γ ⊢ for x in τ2 ret e′ : τ2′  ⟹  Γ ⊢ for x in τ1 |^{ox} τ2 ret e′ : τ1′ |^{ox} τ2′
  j ∈ N fresh,  Γ ⊢ for x in τ[si/sj] ret e′ : τ1  ⟹  Γ ⊢ for x in τ^{si x} ret e′ : τ1^{sj x}

  With:
    chldr(a[τ]) = τ      chldr(‘w’) = chldr(str^{cx}) = ()      chldr(τ1 |^{ox} τ2) = chldr(τ1) |^{ox} chldr(τ2)
    dos(()) = ()      dos(‘w’) = ‘w’      dos(str^{cx}) = str^{cx}      dos(a[τ]) = a[τ], dos(τ)
    dos(τ1, τ2) = dos(τ1), dos(τ2)      dos(τ^{sx}) = (dos(τ))^{sx}      dos(τ1 |^{ox} τ2) = dos(τ1) |^{ox} dos(τ2)
    () :: a = ‘w’ :: a = str^{cx} :: a = b[τ] :: a = ()  for b ≠ a      a[τ] :: a = a[τ]
    (τ1, τ2) :: a = (τ1 :: a), (τ2 :: a)      (τ^{sx}) :: a = (τ :: a)^{sx}      (τ1 |^{ox} τ2) :: a = (τ1 :: a) |^{ox} (τ2 :: a)

Figure 7.4: Propagation rules for XQ (constraint-free pv-types)


Correctness for chldr. We show vS(chldrτ(τ)) = chldrv(vS(τ)) for all pv-types τ, substitutions S, and defining valuations v by induction over the structure of τ. Since /chldr is only applied to tree values and tree types, we only need to verify the claim for the type constructors a[τ], ‘w’, str^{cx}, and τ1 |^{ox} τ2. For τ ∈ {str^{cx}, ‘w’}, chldr is clearly correct, as both sides evaluate to (). Next, consider the case a[τ]. Here, vS(chldrτ(a[τ])) = vS(τ) according to the definition of chldrτ; also vS(a[τ]) = a[vS(τ)] according to the definition of evaluations (Fig. 7.3); thus chldrv(vS(a[τ])) = chldrv(a[vS(τ)]) = vS(τ), which concludes the case for a[τ]. For the last case, we need to show vS(chldrτ(τ1 |^{ox} τ2)) = chldrv(vS(τ1 |^{ox} τ2)). From the definition of chldrτ:

  vS(chldrτ(τ1 |^{ox} τ2)) = vS(chldrτ(τ1) |^{ox} chldrτ(τ2)) =: X

Now we do a case distinction. (1) Case v(S(ox)) = Left: then X = vS(chldrτ(τ1)) according to the evaluation semantics of v. With the induction hypothesis (since τ1 is strictly smaller than the or-type), we obtain X = chldrv(vS(τ1)). Starting from the right-hand side of the claim and using v(S(ox)) = Left gives chldrv(vS(τ1 |^{ox} τ2)) = chldrv(vS(τ1)) = X, which concludes case (1). Case (2) is analogous.

Correctness for dos. We show vS(dosτ(τ)) = dosv(vS(τ)) for all pv-types τ, substitutions S, and defining valuations v by induction over the structure of τ.

Case (): trivial.
Case ‘w’: vS(dosτ(‘w’)) = vS(‘w’) = w = dosv(w) = dosv(vS(‘w’)).
Case str^{cx}: vS(dosτ(str^{cx})) = vS(str^{cx}) = v(S(cx)) = dosv(v(S(cx))) = dosv(vS(str^{cx})).
Case a[τ]: vS(dosτ(a[τ])) = vS(a[τ], dosτ(τ)) = vS(a[τ]), vS(dosτ(τ)) = vS(a[τ]), dosv(vS(τ)) = a[vS(τ)], dosv(vS(τ)) = dosv(a[vS(τ)]) = dosv(vS(a[τ])).
Case τ1, τ2: follows from the definitions by applying the induction hypothesis to vS(dosτ(τi)).


Case τ sx :

vS(dosτ (τ sx )) = vS((dosτ (τ ))sx ) = vS(dosτ (τ )[s7→1] ), . . . , vS(dosτ (τ )[s7→vS(sx)] ) = vS1(dosτ (τ )), . . . , vSn(dosτ (τ )) = dosv (vS1(τ )), . . . , dosv (vSn(τ )) = dosv (vS(τ[s7→1] )), . . . , dosv (vS(τ[s7→vS(sx)] )) = dosv ( vS(τ[s7→1] ), . . . , vS(τ[s7→vS(sx)] ) ) = dosv (vS(τ sx )). Case τ1 |ox τ2 : analog to or-case of chldr.

Correctness for :: a. We show vS(τ ::τ a) = (vS(τ)) ::v a for all pv-types τ, substitutions S and defining valuations v by induction over the structure of τ.
Case b ≠ a, τ ∈ {(), str^cx, ‘w’, b[τ′]}: vS(τ ::τ a) = (vS(τ)) ::v a = ().
Case τ = a[τ′]: vS(τ ::τ a) = (vS(τ)) ::v a = a[vS(τ′)].
Cases τ1 , τ2 and τ1 |ox τ2: analogous to dos.
Case τ^sx:

  vS(τ^sx ::τ a) = vS((τ ::τ a)^sx)
    = vS(τ[s↦1] ::τ a), . . . , vS(τ[s↦vS(sx)] ::τ a)
    = vS1(τ ::τ a), . . . , vSn(τ ::τ a)
    = (vS1(τ)) ::v a, . . . , (vSn(τ)) ::v a
    = (vS1(τ), . . . , vSn(τ)) ::v a
    = (vS(τ^sx)) ::v a.

Correctness for Γ ⊢ opτ(x) : τ′ if Γ(x) = τ and τ′ = opτ(τ), with opτ ∈ {:: a, chldr, dos}. We need to show vS(Γ) ⊢ vS(opτ(x)) ⇒ vS(τ′) for every substitution S and valuation v. Since Γ(x) = τ, we can apply the substitution S and valuation v on both sides and obtain vS(Γ(x)) = vS(τ); since valuation with substitution and Γ commute, (i) vS(Γ)(x) = vS(τ). Further, τ′ = opτ(τ) implies vS(τ′) = vS(opτ(τ)); and with the correctness of op, we have (ii)


vS(τ′) = opv(vS(τ)). Setting Σ := vS(Γ), v := vS(τ), and v′ := vS(τ′), the statements (i) and (ii) are the premises of the value-semantics rules for op in Fig. 7.1, which directly conclude the goal, observing that opv(x) = vS(opτ(x)) since op(x) has no type embedded.

Correctness of for loops. We start by showing correctness for the for-in-τ-return variants. We perform a secondary induction on the structure of τ according to the different propagation rules.

Case (): vS(Γ) ⊢ vS(for x in () ret e0) ⇒ vS(()) follows from the value-semantics of for loops.

Case T ∈ {a[τ], ‘w’, str^cx}: We need to show that

  vS(Γ) ⊢ vS(for x in T ret e0) ⇒ vS(τ′)   if   Γ, x ↦ T ⊢ e0 : τ′.

From the induction hypothesis on the smaller expression e0, we know that vS(Γ, x ↦ T) ⊢ vS(e0) ⇒ vS(τ′). Since vS(T) ∈ {a[v], w} for a string w, we can set t := vS(T) and apply the tree rule in the value-semantics of the for expression, which, together with t′ := vS(τ′), vS(Γ, x ↦ T) = vS(Γ), x ↦ vS(T), and Σ := vS(Γ), yields the desired result.

Case τ1 , τ2: We need to show

  vS(Γ) ⊢ vS(for x in τ1 , τ2 ret e0) ⇒ vS(τ1′ , τ2′), i.e.,
  vS(Γ) ⊢ for x in vS(τ1 , τ2) ret vS(e0) ⇒ vS(τ1′ , τ2′)   if   Γ ⊢ for x in τi ret e0 : τi′ for i = 1, 2.

Applying the induction hypothesis yields, for i = 1, 2: vS(Γ) ⊢ vS(for x in τi ret e0) ⇒ vS(τi′), or vS(Γ) ⊢ for x in vS(τi) ret vS(e0) ⇒ vS(τi′). With Σ := vS(Γ), vi := vS(τi), vi′ := vS(τi′), and the fact that vS distributes over sequences, the claim directly follows from the value-evaluation rule of the for expression for sequences.


Case τ1 |ox τ2: Follows from a case distinction on the value of v(S(ox)) and applying the induction hypothesis to the premises of the type-propagation rule.

Case τ^(si x): We need to show

  vS(Γ) ⊢ vS(for x in τ^(si x) ret e0) ⇒ vS(τ1^(sj x)), i.e.,
  vS(Γ) ⊢ for x in vS(τ^(si x)) ret vS(e0) ⇒ vS(τ1^(sj x))

when the premises of the propagation rule hold. Consider:

  (1) vS(τ^(si x))
  = (2) vS(τ[si/1]), . . . , vS(τ[si/vS(si x)])
  = (3) vS(τ[si/sj][sj/1]), . . . , vS(τ[si/sj][sj/vS(si x)])
  = (4) vS,{sj↦1}(τ[si/sj]), . . . , vS,{sj↦vS(si x)}(τ[si/sj])
  = (5) vS1(τ[si/sj]), . . . , vSn(τ[si/sj])

Remarks. (1→2) via the definition of vS over star-types. (2→3): introduce a new fresh integer j, first replace all free indexes si by sj, and then sj by the actual integer. (3→4): move the substitution into S; this is possible since every occurrence of si was a free index in τ and thus so is sj. (All occurrences have to be free, since we just removed the binding superscript in the star-type, and si does not occur in deeper-nested superscripts, as this cannot be produced by our propagation rules; this can be proven by induction over the rules.) (4→5) is a simple renaming to new substitutions.

Applying the induction hypothesis to the premise of the propagation rule leads to vS(Γ) ⊢ vS(for x in τ[si/sj] ret e0) ⇒ vS(τ1), or vS(Γ) ⊢ for x in vS(τ[si/sj]) ret vS(e0) ⇒ vS(τ1), for all substitutions S and defining valuations v. Now, for p = 1, . . . , n, the substitution Sp is essentially the same as S; in particular, it only replaces free indexes in Γ and e0, since both


Γ and e0 do not contain any occurrence of the fresh sj. We therefore have, for all p = 1, . . . , n:

  vSp(Γ) ⊢ for x in vSp(τ[si/sj]) ret vSp(e0) ⇒ vSp(τ1)

Using these as the premises to the sequence rule for the value-evaluation of for on the sequence v := vS1(τ[si/sj]), . . . , vSn(τ[si/sj]), and exploiting that vS(Γ) = vSp(Γ) and vS(e0) = vSp(e0) for p = 1, . . . , n (since Sp differs from S only on the fresh indexes sj, which are not present in Γ or e0), yields:

  vS(Γ) ⊢ for x in v ret vS(e0) ⇒ vS1(τ1), . . . , vSn(τ1)

But now,

  (1) vS1(τ1), . . . , vSn(τ1)
  = (2) vS,{sj↦1}(τ1), . . . , vS,{sj↦vS(si x)}(τ1)
  = (3) vS,{sj↦1}(τ1), . . . , vS,{sj↦vS(sj x)}(τ1)
  = (4) vS(τ1[sj/1]), . . . , vS(τ1[sj/vS(sj x)])
  = (5) vS(τ1^(sj x))

Remarks: (1→2) simply undoes the shortcut of Sp. (2→3) changes the length of the list from vS(si x) to vS(sj x), which are identical since the valuation discards subscripts in star-colors. (3→4) moves the renaming from free-index renaming to global formula-renaming, which is fine since sj never occurs as a bound index. (4→5) simply applies the valuation of star-types. This yields vS(Γ) ⊢ for x in v ret vS(e0) ⇒ vS(τ1^(sj x)).

Correctness for for-in-e-return. We need to show vS(Γ) ⊢ for x in vS(e) ret vS(e0) ⇒ vS(τ′). Applying the induction hypothesis to the premises yields vS(Γ) ⊢ vS(e) ⇒ vS(τ) and vS(Γ) ⊢ for x in vS(τ) ret vS(e0) ⇒ vS(τ′). With Σ := vS(Γ), v := vS(τ), and v′ := vS(τ′), the value-semantics rule of the for loop yields the claim.

7.4.2 From PV-Typings to Query Equivalence

Via the notion of correctness, we relate a pv-typing with the corresponding evaluations of e as illustrated in Fig. 7.2. This notion is sufficient to solve query equivalence for XQ. To use pv-typings for deciding query equivalence with input restrictions, we need two more building blocks: (1) constructing a pv-type τ from a DTD D, and (2) comparing pv-types.

Lift / Strip: From Conventional Types to pv-Types and Back. We now relate conventional, non-recursive, regular expression types with possible-value types. Remember that the pv-type semantics is more fine-grained, as it makes the relation to its values explicit via the valuations:

  [[ . ]]_v :: pv-types → V     versus     [[ . ]] :: conv-types → 2^V

However, if we union over a set of valuations, we also obtain a set of values.

Lemma 7.12 (Lift). We can construct a pv-type lift(T) from a conv-type T such that

  ⋃_v [[ lift(T) ]]_v = [[ T ]]

by using fresh annotations for strings, or-types and star-types, and by suffixing all annotations inside star-types with the star-index. The algorithm for the function "lift" is given in Fig. 7.5.

Proof. Correctness of the lift algorithm in Fig. 7.5 is easy to show via a construction of the valuation v during a structural recursion over τ, for any arbitrary v ∈ [[ T ]]. Since all annotations are distinct, each of their values in v can be defined uniquely whenever it occurs. Inside types of star-types are fully suffixed, so for each element in the sequence the values can be chosen according to the value v.

Strip. By simply dropping all annotations in an XML pv-type τ and replacing the superscript star-colors by ∗, we obtain a conventional type T; we call this operation "strip".


Input: non-recursive regular expression type T
Output: pv-type lift(T) with [[ T ]] = ⋃_v [[ lift(T) ]]_v
Algorithm: structural recursion collecting star-annotations
  lift(T)        := L_ε(T)                 -- start with the empty prefix ε
  L_x( () )      := ()
  L_x(T1 , T2)   := L_x(T1), L_x(T2)
  L_x(a[T])      := a[L_x(T)]
  L_x(str)       := str^cx                 -- suffix a fresh c from Σc
  L_x(T1 | T2)   := L_x(T1) |^ox L_x(T2)   -- suffix a fresh o ∈ Σo
  L_x(T*)        := (L_(x[s])(T))^sx       -- fresh s ∈ Σs; use x as suffix for s; add [s] to the current suffixes

Figure 7.5: Lift algorithm for XML pv-types.

It is evident that all representations of τ are also of type T; however, since we can express equality constraints in pv-types but not in conv-types, the sets of denoted values are not equal in general. For a pv-type τ the following holds:

  ⋃_v [[ τ ]]_v ⊆ [[ strip(τ) ]]

An example in which the inclusion is strict is the type τ1 = str^a, str^a. Here, "foo","bar" is in [[ strip(τ1) ]] but not in ⋃_v [[ τ1 ]]_v, which contains only sequences of two identical strings. Note that strip is an inverse of lift: strip(lift(T)) = T for all conv-types T.

Lift-Operation. Let "lift" be an operation to obtain a pv-type from a regular expression type T such that the pv-type τ = lift(T) satisfies

  ⋃_v [[ τ ]]_v = [[ T ]]        (7.19)

where the union ranges over all v that define τ. The algorithm in Fig. 7.5 shows that lift exists and how to compute it. Further, we define a relation ≡ for two pv-types:

Definition 7.13 (Value-equality). Two pv-types τ1 and τ2 are value-equal (in symbols τ1 ≡ τ2) iff [[ τ1 ]]_v = [[ τ2 ]]_v for all v that define τ1 and τ2.
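To make lift and strip concrete, here is a minimal Python sketch. The tuple-based encodings of conv-types and pv-types, as well as the function names, are our own illustration and not part of the formalism; annotations are generated as fresh strings.

    import itertools

    _fresh = itertools.count()

    def lift(T, suffix=()):
        """Lift a conventional (non-recursive) regexp type to a pv-type (Fig. 7.5 sketch).

        Conv-types:  ("empty",) | ("str",) | ("elem", a, T) | ("seq", T1, T2)
                     | ("or", T1, T2) | ("star", T)
        PV-types add annotations: str/or/star constructors carry a fresh
        annotation, suffixed with the indices of all enclosing star-types.
        """
        kind = T[0]
        if kind == "empty":
            return ("empty",)
        if kind == "str":
            return ("str", ("c%d" % next(_fresh), suffix))        # fresh string color, suffixed
        if kind == "elem":
            return ("elem", T[1], lift(T[2], suffix))
        if kind == "seq":
            return ("seq", lift(T[1], suffix), lift(T[2], suffix))
        if kind == "or":
            return ("or", ("o%d" % next(_fresh), suffix),
                    lift(T[1], suffix), lift(T[2], suffix))
        if kind == "star":
            s = "s%d" % next(_fresh)                               # fresh star color
            return ("star", (s, suffix), lift(T[1], suffix + (s,)))
        raise ValueError(kind)

    def strip(tau):
        """Drop all annotations again; strip(lift(T)) == T."""
        kind = tau[0]
        if kind in ("empty", "str"):
            return (kind,)
        if kind == "elem":
            return ("elem", tau[1], strip(tau[2]))
        if kind == "seq":
            return ("seq", strip(tau[1]), strip(tau[2]))
        if kind == "or":
            return ("or", strip(tau[2]), strip(tau[3]))
        if kind == "star":
            return ("star", strip(tau[2]))
        raise ValueError(kind)

    T = ("seq", ("elem", "a", ("star", ("str",))), ("or", ("str",), ("empty",)))
    assert strip(lift(T)) == T

Note how strip simply forgets the annotations again, so strip(lift(T)) = T as stated above.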


Input: query expressions e1 and e2, a non-recursive DTD D
Output: "Equivalent" if e1(x) = e2(x) for all x ∈ D, "Different" otherwise.
Algorithm:
  Construct the conv-type T from the DTD D
  Construct the pv-type τ as τ := lift(T)
  Propagate τ through e1, i.e., obtain τ1 from {$root ↦ τ} ⊢ e1 : τ1
  Propagate τ through e2, i.e., obtain τ2 from {$root ↦ τ} ⊢ e2 : τ2
  If τ1 ≡ τ2 return "Equivalent"      -- note that the crux is to decide τ1 ≡ τ2
  Else return "Different".

Figure 7.6: Deciding query equivalence with input restrictions for XQ via possible-value typings.

This relation is an equivalence relation, but this is not essential for deciding query equivalence.

Theorem 7.14 (Query Equivalence Reduction). With the propagation rules for XQ and the lift operation, we can reduce deciding query equivalence to deciding the equivalence of pv-types via the algorithm given in Fig. 7.6.

Proof. (1) The algorithm states "Different". Then there exists a valuation v for which [[ τ1 ]]v ≠ [[ τ2 ]]v. Extend v to v′ such that v′ defines the input type τ. Let the value v := [[ τ ]]v′; because the propagations are correct, e1(v) = [[ τ1 ]]v′ and e2(v) = [[ τ2 ]]v′. Also, [[ τi ]]v′ = [[ τi ]]v because v′ ⊇ v. Further, since τ = lift(T) and T corresponds to the DTD D, we have v ∈ D. We have thus found a v ∈ D with e1(v) = [[ τ1 ]]v′ = [[ τ1 ]]v ≠ [[ τ2 ]]v = [[ τ2 ]]v′ = e2(v).

(2) The algorithm states "Equivalent"; we need to show e1(v) = e2(v) for all v ∈ D, i.e., with {$root ↦ v} ⊢ e1 ⇒ v1 and {$root ↦ v} ⊢ e2 ⇒ v2, we need to show v1 = v2. Consider an arbitrary v ∈ D; then v ∈ [[ T ]] for the conventional regular expression type T and thus v ∈ ⋃_v [[ τ ]]_v due to (7.19). Choose a valuation v for which [[ τ ]]v = v. From the correctness of the two propagations, v also defines τ1 and τ2. Further, since τ1 ≡ τ2, we have [[ τ1 ]]v = [[ τ2 ]]v. Hence v1 = e1(v) = [[ τ1 ]]v = [[ τ2 ]]v = e2(v) = v2.
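For orientation, the following Python skeleton mirrors the steps of Fig. 7.6; all helper functions are passed in as parameters, since their realizations (type construction, lift, propagation, and pv-type comparison) are exactly what this chapter develops. The function name and signature are illustrative only.

    def decide_query_equivalence(e1, e2, dtd, *, type_of_dtd, lift, propagate, pv_equal):
        """Driver mirroring Fig. 7.6; every helper is supplied by the caller."""
        T = type_of_dtd(dtd)            # conventional type from the non-recursive DTD
        tau = lift(T)                   # pv-type with fresh annotations
        tau1 = propagate(e1, tau)       # {$root -> tau} |- e1 : tau1
        tau2 = propagate(e2, tau)       # {$root -> tau} |- e2 : tau2
        return "Equivalent" if pv_equal(tau1, tau2) else "Different"

As the figure notes, the crux is the last step, deciding τ1 ≡ τ2.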

7.5 Equality of String Polynomials

Deciding pv-type equivalence restricted to a very small subset of the constructors for pv-types already poses a non-trivial decision problem. When only considering the empty type (), constant types ‘w’, concatenation τ1 , τ2 and star-types without suffixes τ^s, pv-type equivalence is equivalent to the problem of string-polynomial equivalence as defined in the following.

Definition 7.15 (String Polynomials). Let Σ be an alphabet, let · be the standard string concatenation operator of type Σ∗ × Σ∗ → Σ∗, and define a scalar multiplication ↑ of type Σ∗ × N → Σ∗ such that

  x^n := x · x · . . . · x   (n many).

We also assume a set of variable identifiers V. A string polynomial with variable multipliers (string polynomial for short) is a well-typed expression over {Σ, V, ·, ↑} in which all exponents are replaced by variables. Formally, a string polynomial is generated by the following grammar:

  E ::= a | E · E′ | E^x     with a ∈ Σ, x ∈ V

Example 7.16. For Σ = {a, b, c} and V = {x, y, z}, example string polynomials are ab, a(ba)^x, or (a((b^y ca)^x)^y aa)^z.

Definition 7.17 (Evaluation). Given a valuation function Γ : V → N, the value of a string polynomial at position Γ, in symbols Γ(E), is defined in the obvious way: replace all variables by the integers given in Γ and evaluate the resulting string expression via the operators · and ↑. Note that Γ can be interpreted as a type-environment whose domain is restricted to Σs = V.

Problem 7.18 (String Polynomial Equivalence (SPE)). The problem of string-polynomial equivalence (short ≡S[V] or just ≡) is to decide whether two given string polynomials E1 and E2 agree on all their positions, i.e., whether Γ(E1) = Γ(E2) for all valuations Γ.
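The evaluation of Definition 7.17 is straightforward to implement. The following Python sketch uses an assumed tuple encoding of the grammar above ("lit" for a letter, "cat" for ·, and "pow" for ↑), which is ours and not part of the dissertation.

    def evaluate(E, gamma):
        """Evaluate a string polynomial at a valuation gamma: V -> N (Def. 7.17 sketch).

        E is ("lit", a) for a letter a, ("cat", E1, E2) for concatenation,
        or ("pow", E1, x) for E1 raised to the variable x.
        """
        kind = E[0]
        if kind == "lit":
            return E[1]
        if kind == "cat":
            return evaluate(E[1], gamma) + evaluate(E[2], gamma)
        if kind == "pow":
            return evaluate(E[1], gamma) * gamma[E[2]]
        raise ValueError(kind)

    # a(ba)^x at x = 3  ->  "abababa"
    E = ("cat", ("lit", "a"), ("pow", ("cat", ("lit", "b"), ("lit", "a")), "x"))
    assert evaluate(E, {"x": 3}) == "abababa"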

Remark 7.19. Note that regular multivariate polynomials with natural-number coefficients N[V] can be embedded into string polynomials by using only one letter a from the alphabet Σ: each natural number i can be encoded in unary using concatenations of the single letter a; then "plus" is encoded as concatenation and "times a variable x" is encoded as "↑ x". Thus SPE is at least as hard as multivariate polynomial equivalence. Furthermore, it is undecidable whether for two string polynomials E1 and E2 there exists a valuation Γ that makes their evaluations match; this is an immediate consequence of the undecidability of Diophantine equations. On the other hand, two polynomials A, B ∈ N[V] are the same on all valuations iff A and B are identical, hence SPE is decidable for |Σ| = 1.

Theorem 7.20. SPE can be reduced to pv-type equivalence.

Proof

Given two string-polynomials E1 , E2 over the alphabet Σ with variables over V.

To construct two pv-types τ1 and τ2, let the set of strings str be Σ, and let the set of star-colors Σs be V. Then apply the following transformation steps to E1 and E2:

1. replace each occurrence of a letter a ∈ Σ by the type ‘a’ representing the string a;
2. replace the concatenation operator "·" by the sequence operator ",";
3. interpret the power operator ↑ as a star-type τ^s; since all exponents are variables, they result in correct star-types without annotation prefixes.

Via the above definitions, and in particular with the associativity of the sequence operator and () being the identity, it is easy to see that E1 ≡ E2 iff τ1 ≡ τ2.

7.5.1 Restriction to a two-letter alphabet

Without loss of generality, we can assume that Σ = {0, 1}. If Σ has more than two letters, we can encode each letter as a binary string of fixed length. Clearly, two polynomials over the larger alphabet are equivalent if and only if their respective encodings are equivalent over {0, 1}.
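A hypothetical sketch of this re-encoding, reusing the tuple encoding from the evaluation example above: every letter is replaced by a fixed-width binary code word, which preserves equivalence precisely because all code words have the same length.

    from math import ceil, log2

    def binary_recode(E, alphabet):
        """Recode a string polynomial over a larger alphabet into one over {0, 1}
        by replacing each letter with a fixed-length binary code word (Sec. 7.5.1 sketch)."""
        width = max(1, ceil(log2(len(alphabet))))
        code = {a: format(i, "0%db" % width) for i, a in enumerate(alphabet)}
        kind = E[0]
        if kind == "lit":                          # expand the letter into its code word
            letters = [("lit", bit) for bit in code[E[1]]]
            out = letters[0]
            for lit in letters[1:]:
                out = ("cat", out, lit)
            return out
        if kind == "cat":
            return ("cat", binary_recode(E[1], alphabet), binary_recode(E[2], alphabet))
        if kind == "pow":
            return ("pow", binary_recode(E[1], alphabet), E[2])
        raise ValueError(kind)

    # c^x over {a, b, c}: the letter c (code "10") is expanded inside the power.
    E = ("pow", ("lit", "c"), "x")
    assert binary_recode(E, "abc") == ("pow", ("cat", ("lit", "1"), ("lit", "0")), "x")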

7.5.2 Reduction to Restricted Equivalence

First we show that SPE is decidable iff it is decidable on valuations Γ for which all variables have a value greater than some constant. Our decision procedure will further need to restrict some variables xi to be greater than some constant ci. For ease of notation, we extend the comparison relations to Γ in the obvious way: Γ ≥ c iff Γ(x) ≥ c for all x ∈ V. It is easy to see that once we have a decision procedure P for deciding Γ(E1) = Γ(E2) for all Γ with Γ(xi) ≥ ci for xi ∈ V and |V|-many constants ci ∈ N, we can also decide SPE, i.e., Γ(E1) = Γ(E2) for all Γ without lower bounds on the variables. The strategy is to first verify the restricted variable domain, and then add the missed variable assignments (at the boundaries of the N^|V| space for Γ) by fixing the values of a subset of variables to values below their respective boundaries. Once all subsets and all assignments below the boundaries (of which there are only finitely many for each) have been tested for equivalence, the complete space N^|V| has been verified. In the following text, we will assume by default that all variables are non-zero. To make this assumption easier to spot, we will write ≡≥1 for equivalences that hold for all Γ with Γ(x) ≥ 1.

7.5.3 Simple Normal Form

In the following, we transform a string polynomial E into an extended string polynomial E′ of a certain form, called simple-NF. An extended string polynomial is a string polynomial in which exponents can not only be variables from V but also (regular) polynomials over N[V]; we call such an a^p with a ∈ Σ, p ∈ N[V] a monomial. Identifying a with a^1 for a ∈ Σ, the following rules transform E into E′ in simple-NF:

  a^p · a^q → a^(p+q)      a ∈ Σ, p, q ∈ N[V]
  (a^p)^q  → a^(pq)        a ∈ Σ, p, q ∈ N[V]


It is quite obvious that this rewrite system is terminating (terms get smaller in the number of letters from Σ) and confluent (treating equivalent polynomials in the exponents as equal). Also, in an expression in simple-NF, two monomials a^p and b^q can only be directly concatenated with each other if a ≠ b. Furthermore, for any scalar multiplication that needs a parenthesis, both 0-monomials and 1-monomials occur within the parenthesis.

Example 7.21. The expression 0(00(00)^x)^y 01111(11)^y 00 is rewritten to 0^(2+2y+2xy) 1^(4+2y) 0^2; the expression 00(1(11)^x 00)^y 0 is rewritten to 0^2 (1^(1+2x) 0^2)^y 0^1. The second simple-NF contains a parenthesis with both kinds of monomials inside.

7.5.4 Alternating Normal Form

An interesting property of strings over {0, 1} is how many groups of zeros and ones the string has:

Definition 7.22 (Alternations). The number of alternations alts(s) of a string s ∈ {0, 1}∗ is defined as alts(s) := n for the decomposition s = A1 B2 A3 B4 · · · into n maximal non-empty blocks, where a, b ∈ {0, 1}, a ≠ b, Ai ∈ a∗, and Bi ∈ b∗.

Example 7.23. alts(00110) = 3, alts(0001110010001) = 6.

For two strings to be equivalent, they clearly need to have the same number of alternations. Furthermore, two strings are equivalent iff they have the same number of alternations, start with the same letter (either both start with 0 or both with 1), and there are the same number of letters in the corresponding groups of repeated letters.
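Counting alternations amounts to counting maximal blocks of equal letters; a short Python sketch (using itertools.groupby) reproduces Example 7.23.

    from itertools import groupby

    def alts(s):
        """Number of alternations: the number of maximal blocks of equal letters (Def. 7.22)."""
        return sum(1 for _ in groupby(s))

    assert alts("00110") == 3
    assert alts("0001110010001") == 6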

Computing alts from string polynomials. We now design a second "normal form" that makes it easy to compute alts(Γ(E)) as a function of Γ from a given polynomial E. Clearly, the number of alternations is not affected if we drop all polynomials p inside monomials a^p, since these only affect the length of each zero-or-one block; we call these


polynomials the small polynomials of the expression; later, we will have big polynomials around parentheses as well.

Intuition: Note that for some expressions in simple-NF with dropped small polynomials, the number of alternations always coincides with their length: |Γ(0(1010)^y 1)| = alts(Γ(0(1010)^y 1)) for all Γ; but for others, such as E := 0(010)^y 0, they do not: |Γ(E)| > alts(Γ(E)) for many Γ. The reason for the mismatch in the second case is that when the multiplication by Γ(y) is performed, equal letters end up next to each other (in our case the letter 0), resulting in fewer alternations than letters in the string.

Definition 7.24 (Alternating-NF). An expression E in simple-NF is in alternating normal form (alt-NF) iff for E′, which results from E by dropping all small polynomials, and for all Γ > 0, the length of Γ(E′) equals the number of alternations in Γ(E′).

This is a semantic property of expressions, but we will now show that it is easy to test syntactically whether an expression is in alt-NF. Then we will show how to reduce testing equivalence of two polynomials in simple-NF to a set of equivalence tests for several pairs of polynomials in alt-NF. The syntactic test uses the following notions of first and last letters.

Definition 7.25 (First and Last Letters). For a string s ∈ Σ∗ let f(s) and l(s) denote the first and last letter of s, respectively. Note that if Γ > 0 then f(Γ(E)) and l(Γ(E)) are independent of Γ; so let us lift f and l to expressions, keeping in mind that we restrict valuations to not contain zeroes. It is easy to see that the following recursion computes f(E) and l(E) for a ∈ Σ, x ∈ V,


and p ∈ N[V]:

  f(a^p)     := a          l(a^p)     := a
  f(E · E′)  := f(E)       l(E · E′)  := l(E′)
  f(E^x)     := f(E)       l(E^x)     := l(E)

Theorem 7.26 (Checking E for alt-NF). An expression E in simple-NF is in alt-NF if E is collision-free, with collision-free (cfree) being recursively defined as follows:

  cfree(a^p)     := true
  cfree(E · E′)  := l(E) ≠ f(E′) ∧ cfree(E) ∧ cfree(E′)
  cfree(E^x)     := l(E) ≠ f(E) ∧ cfree(E).

Proof. ¬cfree(E) ⇒ E is not in alt-NF: Choose Γ ≡ 1 to have a valuation in which the length of E without small polynomials is greater than its number of alternations. cfree(E) ⇒ E is in alt-NF: simple structural induction.
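The recursions for f, l, and cfree translate directly into code. The sketch below assumes a tuple encoding of extended string polynomials in simple-NF ("mono" for a monomial a^p, "cat" for concatenation, "pow" for a star exponent); the encoding and names are illustrative, not the dissertation's.

    def first(E):
        """First letter of any value of E (well defined for valuations with all variables >= 1)."""
        return E[1] if E[0] == "mono" else first(E[1])

    def last(E):
        if E[0] == "mono":
            return E[1]
        return last(E[2]) if E[0] == "cat" else last(E[1])

    def cfree(E):
        """Syntactic collision-freeness test of Theorem 7.26 (sufficient for alt-NF)."""
        if E[0] == "mono":
            return True
        if E[0] == "cat":
            return last(E[1]) != first(E[2]) and cfree(E[1]) and cfree(E[2])
        if E[0] == "pow":
            return last(E[1]) != first(E[1]) and cfree(E[1])
        raise ValueError(E[0])

    # 0 (10)^x 1 is collision-free; 0 (010)^y 0 is not.
    E1 = ("cat", ("mono", "0", 1),
          ("cat", ("pow", ("cat", ("mono", "1", 1), ("mono", "0", 1)), "x"), ("mono", "1", 1)))
    E2 = ("cat", ("mono", "0", 1),
          ("cat", ("pow", ("cat", ("mono", "0", 1),
                           ("cat", ("mono", "1", 1), ("mono", "0", 1))), "y"), ("mono", "0", 1)))
    assert cfree(E1) and not cfree(E2)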

Transforming from simple-NF to alt-NF. Given an expression E in simple-NF, we construct an expression E′ in alt-NF that is "almost" equivalent to E. Here, "almost" can be understood in the following sense: we will partially expand some scalar multiplications and substitute the respective variables with new ones, e.g., (010)^x becomes 010(010)^(x′′)010. Since our variables are natural numbers, the new expression then "misses" some values (e.g., for Γ(x) = 0 and for Γ(x) = 1). However, if we "move" the variables by the same amount on both sides of the SPE equation and compare these new polynomials, then we have essentially solved a problem for restricted equivalence of the original polynomials, which is enough according to our earlier reduction.

We now explain the crucial insights of the transformation on simple examples. This explanation will not be very formal, but it provides enough intuition to understand the general algorithm, which will be presented afterwards. The crucial insights are:


(1) Concatenating expressions with simple ends is easy: Consider two expressions E and E′ with simple ends, i.e., whose tail and head, respectively, are monomials. Example: The violating concatenation 01^x · 1^(y+x)0^3 is easily fixed by merging the ends to form 01^(y+2x)0^3; however, concatenating (01)^x with 1^(y+x)0^3 is harder, since we cannot easily merge the ends with each other.

(2) From an expression E that does not have simple ends, we can create an expression E′ that does: We can easily transform (01)^x, which has neither a simple head nor a simple tail, to the expression 01(01)^(x′) with a simple head, and further to 01(01)^(x′′)01 with a simple head and tail. Note that this transformation is only valid for Γ(x) ≥ 2, with x′′ then ranging from 0 to infinity as usual. We will take care of this problem later. (In fact, as we will show later, we can just replace x′′ by x − 2; possibly negative exponents can be interpreted as inverse letters that cancel out normal letters.)

(3) Any violating expression of the form E^x, i.e., with l(E) = f(E), can be "fixed": Consider E^x = (0^2 1 0^3)^x; clearly E^x is not in alt-NF although E is. By expanding x once to the right, we obtain E′ := E^(x′) · E = (0^2 1 0^3)^(x′) 0^2 1 0^3. Note that the 0^3 will be right next to 0^2 for Γ(x′) = 1. Further, if Γ(x′) > 1, then the outside ends, here 0^2 and 0^3, will always be next to each other and form a regular pattern with the inside, here 1, of E. We can thus rewrite the term by moving the 0^2 inside the parenthesis to the end of the parenthesis after 0^3, adding a single 0^2 before the parenthesis to make up for the loss, and removing the 0^2 right after the parenthesis. This turns the expression into alt-NF; in our example, we obtain 0^2 (1 0^5)^(x′) 1 0^3. Again, this transformation is only valid for x ≥ 1 and 0 ≤ x′ ∈ N.

The algorithms for the general case are shown in Fig. 7.7 and Fig. 7.8.

7.5.5 Collecting Exponents into Big Polynomials

Starting from string polynomials in alt-NF, we now develop more machinery for building normal forms. As a relatively simple rewriting, we collect cascading exponentiation into polynomials. Although these are regular polynomials, we call them big polynomials to emphasize their position in the string polynomial.

to-alt-NF(a^p): return a^p

to-alt-NF(E1 · E2):
  N1 := to-alt-NF(E1)
  N2 := to-alt-NF(E2)
  if l(N1) ≠ f(N2) then return N1 · N2
  else
    N1′ := makeEndingSimple(N1)
    N2′ := makeBeginningSimple(N2)
    return to-simple-NF(N1′ · N2′)

to-alt-NF(E^x):
  N := to-alt-NF(E)
  if f(N) ≠ l(N) return N^x
  else
    N′ := makeEndingSimple(N)
    S := makeBeginningSimple(N′)
    expand S^x to S^(x′) · S
    M := MoveParenthesisIn(S^(x′) · S)
    return to-simple-NF(M)

Figure 7.7: Algorithm to transform into alt-NF.

makeBeginningSimple(E):
  if E = a^p return E
  if E = F^x return makeBeginningSimple(F) · F^(x′)
  // Now E is of type E1 · E2
  if E = a^p · E2 return E
  if E = (F^x) · E2 return makeBeginningSimple(F^x) · E2
  if E = (E1 E2) · E3 return makeBeginningSimple(E1) · E2 E3

makeEndingSimple(E):
  // analogous to makeBeginningSimple

MoveParenthesisIn(E^x · E):
  // E needs to have a simple/monomial head and tail
  // of the same kind, i.e., is of the form:
  Let a^p · E′ a^q := E
  return a^p (E′ a^(q+p))^x E′ a^q

Figure 7.8: Helper algorithms for alt-NF.


Given an expression in alt-NF (and thus in simple-NF), apply the following rewrite rule anywhere in the tree:

  (E^p)^q → E^(pq)      p, q ∈ V

It is very easy to see that this rewrite rule is normalizing, again modulo polynomial equivalence in the big polynomials. Note that all big polynomials we create (so far) are products of variables. Furthermore, this transformation obviously translates E into E′ with E′ ≡ E.
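A small sketch of this collection step, using the same tuple encoding as before; exponents are assumed to be single variables (as produced by the alt-NF transformation), and a big polynomial is represented as a sorted tuple of variables standing for their product.

    def collect_big_polynomials(E):
        """Collapse cascading exponentiation (E^p)^q -> E^{pq}.  Big polynomials are
        represented as a sorted tuple of variables, standing for their product."""
        kind = E[0]
        if kind == "mono":
            return E
        if kind == "cat":
            return ("cat", collect_big_polynomials(E[1]), collect_big_polynomials(E[2]))
        if kind == "pow":
            body, exps = E, []
            while body[0] == "pow":                 # walk down the cascade of powers
                exps.append(body[2])
                body = body[1]
            return ("pow", collect_big_polynomials(body), tuple(sorted(exps)))
        raise ValueError(kind)

    # ((0^p)^x)^y  ->  0^p raised to the big polynomial x*y
    E = ("pow", ("pow", ("mono", "0", "p"), "x"), "y")
    assert collect_big_polynomials(E) == ("pow", ("mono", "0", "p"), ("x", "y"))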

7.5.6 Comparing Lists of Monomials

Theorem 7.27 (Monomial equivalence). Testing monomials for equivalence reduces to testing regular polynomial equivalence:

  a^p ≡≥c b^q for a c ≥ 1   ⟺   a = b ∧ p ≡ q

Proof. "⇐" is clear. "⇒": For a^p and b^q to be equivalent for all Γ ≥ c with c ≥ 1, clearly a = b. Further, p and q need to agree on all Γ ≥ c. Since there are infinitely many such Γ and the maximal degree of p and q is finite, the polynomials have to be equivalent.

Similarly, for a concatenation of simple monomials M1 := a^p1 b^p2 · · ·^pn and M2 := ā^q1 b̄^q2 · · ·^qm to be equivalent for all Γ ≥ c ≥ 1, they need to start with the same letter (thus ā = a and b̄ = b) and have the same number of alternations (m = n), which makes the pi align with the qi; and since there are infinitely many Γ ≥ c, the pi agree with the qi at infinitely many positions, and therefore pi ≡ qi:

Theorem 7.28 (List-of-monomials equivalence). With the usual symbols:

  a^p1 b^p2 · · ·^pn ≡≥c ā^q1 b̄^q2 · · ·^qm for a c ≥ 1   ⟺   a = ā ∧ b = b̄ ∧ n = m ∧ pi ≡ qi for all i

Proof. "⇐" is clear; "⇒": see the explanation above.

Note that once M1 ≡≥c M2 for some c, then M1 ≡≥c′ M2 for all those c′ for which all small polynomials evaluate to a positive number. During all our transformations, we will guarantee that the small polynomials are positive for Γ > 1.
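Theorem 7.28 suggests the following letterwise comparison of two limos. The sketch delegates the small-polynomial identity test to SymPy (an assumption of this illustration, not part of the dissertation) and represents a limo as a list of (letter, polynomial string) pairs.

    from sympy import expand, sympify

    def limos_equal(M1, M2):
        """Compare two lists of monomials a^{p1} b^{p2} ... letterwise (Theorem 7.28):
        same letters in the same order and pairwise-equivalent small polynomials.
        Each limo is a list of (letter, small_polynomial_string) pairs."""
        if len(M1) != len(M2):
            return False
        for (a, p), (b, q) in zip(M1, M2):
            if a != b:
                return False
            if expand(sympify(p) - sympify(q)) != 0:   # polynomial identity over N[V]
                return False
        return True

    # 0^{x+y} 1^{2x}  vs  0^{y+x} 1^{x+x}
    assert limos_equal([("0", "x + y"), ("1", "2*x")],
                       [("0", "y + x"), ("1", "x + x")])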

7.5.7 Towards Distributive Alternating Normal Form

Given two lists of monomials M1 and M2 as defined above. For ease of notation, let us call expressions of this type limos and use Mi to denote them from now on. We now want to compare M1^P with M2^Q for some big polynomials P and Q. Our goal is to characterize for which Mi the equivalence of M1^P and M2^Q implies that P ≡ Q and M1 ≡ M2. We will see that this is the case iff we cannot distribute the big polynomials over the Mi, that is, M1^P is not equivalent to M1′^P M1′′^P for some M1′ M1′′ = M1, and similarly for M2. We now make these notions more precise:

Definition 7.29 (Distributively-minimal). For a limo M := a^p1 b^p2 · · ·^pn and a big polynomial P, M^P is distributively minimal iff there does not exist a j such that M = M1 M2 with M1 = a^p1 b^p2 · · ·^pj and M2 = d^pj+1 · · ·^pn, and M^P ≡≥c M1^P M2^P for a c ≥ 1.

Checking distributive-minimality for monomials. The following result allows for a simple procedure to check distributive minimality for a limo M.

Theorem 7.30 (Non-distributively-minimal limos have a repeating core). For a limo M := M1 · M2 with (M1 M2)^P in alt-NF, and 2 ≤ c ∈ N:

  (M1 M2)^P ≡≥c M1^P M2^P   iff   ∃M3 with M1 = (M3)^k and M2 = (M3)^l for k, l ∈ N      (7.20)

Proof

“⇐” is easy since M1 M2 ≡≥c M2 M1 . “⇒” via a case distinction as follows:

Case alts(M1 ) = alts(M2 ): Choose infinitely many ci with Γ ≡ ci ≥ c, therefore Γ(P ) ≥ 2


for all these infinitely many Γ. For all those Γ, we have

  M1 M2 M1 · · · M2 ≡≥c M1 M1 · · · M1 M2 M2 · · · M2,      (7.21)

where the left-hand side consists of 2Γ(P)-many M-groups and the right-hand side of Γ(P)-many M1 followed by Γ(P)-many M2. Since alts(M1) = alts(M2) and (M1 M2)^P is in alt-NF, the first M2 on the left and the second M1 on the right side perfectly align with each other (for all of the infinitely many Γ). But therefore the monomials inside M1 and M2 also agree with each other for infinitely many valuations, which requires them to be equal; thus M1 = M2, and we set M3 := M1 to prove the claim.

Case k · alts(M1) = alts(M2): With a large enough d ≥ c and infinitely many Γ ≥ d, a similar argumentation as in the previous case aligns the first M2 on the left with k-many M1 on the right; with the same arguments as above, M2 = (M1)^k, which shows the claim.

Case k · alts(M2) = alts(M1): Analogous to the previous case, aligning from the right side.

Case alts(M1) ≠ alts(M2) and gcd 1 or 2: Let mi := alts(Mi). If the greatest common divisor of m1 and m2 is 1, then at least one of them has to be an odd number. But then M1^P M2^P is not in alt-NF, and because (M1 M2)^P is in alt-NF, alts(Γ(M1^P M2^P)) < alts(Γ((M1 M2)^P)) for Γ ≡ c; consequently, the premise is wrong and we have nothing to show. Now consider the case that the gcd of m1 and m2 is 2. Then let A1 B2 . . . A_(m1−1) B_(m1) := M1, i.e., it has m1/2 groups comprising two monomials each. Similarly, M2 has m2/2 groups: let A′1 B′2 . . . A′_(m2−1) B′_(m2) := M2. Now consider those Γ for which Γ ≥ c and Γ(P) > 2(m1 + m2 + 42); obviously, there are infinitely many such Γ. We will now show that for all of these Γ and for all i and j, the monomial Ai will align with the monomial A′j, and similarly for the B monomials: each B monomial in M1 will match up with each of the B′ monomials in M2. Since there are infinitely many Γ on which they agree with each other, they actually have to have the same small polynomial! Therefore, M1 = (A1 B1)^(m1/2) and M2 = (A1 B1)^(m2/2), q.e.d. To see that they will match up with each other, imagine (7.21) with no M1 on the left side and many, many M1 on the right side. Clearly, the groups would then be aligned after m1 placements of M2 or m2 placements of M1 (not earlier, since


the gcd of m1/2 and m2/2 is 1). In the alignments from then on, every single displacement of the groups will occur (gcd = 1). Adding in the M1 on the left side only displaces the ending position of the M2 by exactly one m1, and thus does not change the pattern but only requires more M1 on the right side. Since we place one M2 for each M1, we need around m1 more of the M1 on the right side, totaling about m1 + m2 of M1's needed; so with Γ(P) > 2(m1 + m2 + 42) we are on the safe side.

Case alts(M1) ≠ alts(M2) and gcd > 2 and mi ≠ k·mj: This case can be proven analogously to the previous case, with the only difference that the group size for the group of monomials that will be aligned with each other is larger; in fact, it equals gcd(m1, m2). Thus, the always-repeating core in M1 and M2 does not have length 2 but length l := gcd(m1, m2). Since l < m1 and l < m2 for mi ≠ k·mj, we are done.

Algorithm for "Is M = (M′)^k for some k > 1?". Note that we only consider M that are in alt-NF. For each pair of small polynomials pi and pj in M, test whether they are equivalent; then build the small-polynomial-characteristic strings s0 and s1 as follows: For s0, map each 0^pi to a new letter C0(0^pi) from a new alphabet Σ′ such that C0(0^pi) = C0(0^pj) iff pi ≡ pj; map all 1^pi to the empty string, C0(1^pi) = ε. Then s0 := C0(M). For s1, map the 1^pi to letters and remove the 0^pi. Then try, for all divisors k of (1/2)·alts(M), whether s0 = (s′)^k for some string s′ over Σ′ and s1 = (s′′)^k for the same k and an s′′ over Σ′. If such a k is found, answer YES, else NO.

Corollary 7.31. For M^P, we can check whether it is distributively minimal; if it is not, we can distribute the P to the smallest repeating group Mg and equivalently transform M^P into Mg^P · · · Mg^P. Then Mg will be distributively minimal.
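The characteristic-string test described just before Corollary 7.31 can be sketched as follows; the list-of-pairs representation of a limo and the pluggable small-polynomial equivalence test are our own illustration.

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    def is_power(s, k):
        """Is the sequence s a k-fold repetition of its prefix of length len(s) // k?"""
        return len(s) % k == 0 and s == s[: len(s) // k] * k

    def has_repeating_core(M, poly_equiv):
        """Check whether the limo M (list of (letter, small_poly) pairs, in alt-NF)
        equals (M')^k for some k > 1, via characteristic strings over classes of
        equivalent small polynomials (sketch of the procedure in Sec. 7.5.7)."""
        classes = []                                    # canonical id per small polynomial
        def cls(p):
            for i, q in enumerate(classes):
                if poly_equiv(p, q):
                    return i
            classes.append(p)
            return len(classes) - 1
        s0 = tuple(cls(p) for a, p in M if a == "0")    # characteristic string for 0-blocks
        s1 = tuple(cls(p) for a, p in M if a == "1")    # ... and for 1-blocks
        return any(k > 1 and is_power(s0, k) and is_power(s1, k)
                   for k in divisors(len(M) // 2))

    # 0^x 1^y 0^x 1^y has a repeating core; 0^x 1^y 0^x 1^z does not.
    eq = lambda p, q: p == q                            # placeholder small-poly test
    assert has_repeating_core([("0", "x"), ("1", "y"), ("0", "x"), ("1", "y")], eq)
    assert not has_repeating_core([("0", "x"), ("1", "y"), ("0", "x"), ("1", "z")], eq)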

7.5.8 Deciding M1^P ≡≥c M2^Q for dist-minimal Mi

Theorem 7.32. For two distributively-minimal limos M1 and M2 and two big polynomials P and Q, the following holds for all c ≥ 2:

  M1^P ≡≥c M2^Q   ⟺   P ≡ Q ∧ M1 ≡≥c M2

Proof. "⇐" is trivial. "⇒": Case distinction on the relationship between m1 := alts(M1) and m2 := alts(M2).

Case m1 = m2: Then clearly M1 ≡≥c M2, since their groups are perfectly aligned; further, P and Q agree on infinitely many Γ and thus have to be equivalent.

Case m1 = k·m2 for a 1 < k ∈ N: Similar to the proof above, M1 now lines up with exactly k many copies of M2 for infinitely many Γ. Thus M1 = M2^k, which is a contradiction to M1 being distributively minimal.

Case k·m1 = m2: analogous to the previous case.

Case l·m1 = k·m2 for 1 < l, k ∈ N: Also similarly to the proof above, M1 and M2 now need to be composed of some repeating core Mj, since the cores align with each other for infinitely many Γ. Thus M1 is not distributively minimal, a contradiction.

7.5.9 Summary of Findings and Future Steps

We proposed steps to transform string polynomials into an alternating normal form with big polynomials as exponents. Aligning the substituted variables with each other and then checking syntactic equivalence of two polynomials in this normal form already provides a sound, but not precise, algorithm for checking their equivalence; in other words, syntactic equivalence in this normal form is a sufficient but not necessary condition for string-polynomial equivalence. We have also presented several ideas on how to develop procedures that err on the other side, i.e., checks that are necessary conditions:

• One test is to imagine the concatenation operator to be commutative, and thus to test for equivalence of multivariate polynomials with integer coefficients. Clearly, if the string polynomials are equivalent, so are the "commutatively relaxed" polynomials.


• Another procedure is to compare the polynomials that compute the number of alternations for an alphabet of size two. Clearly, this is also a sound check.

• Yet another check would be to replace all variables by "*" and compare the resulting regular expressions with each other. This is clearly also a sound procedure.

We conjecture that SPE is decidable. We further believe that merging neighboring, identical limos into one limo with a merged big polynomial, and applying this process bottom-up to more complex string polynomials, will actually yield a precise decision procedure for SPE. For this, we would need to design another normal form in which neighboring groups are not equivalent (otherwise they would have been merged). We would then need to prove that in this normal form two polynomials are only equivalent if they have the same syntactic form. Once we have found a decision procedure for SPE, we can then add the other constructors of pv-types, i.e., annotation suffixes and or-types. We hope that handling these is orthogonal to the SPE problem.

7.6 Undecidability of Value-Difference for PV-Types

In this section, we show that for two XML pv-types τ1 and τ2 it is undecidable whether they represent different values under all valuations. Besides being an interesting result by itself, it also motivates our later proof of the undecidability of equivalence for XQ with a deep equality operator. Before we continue, we quickly define a symbol ≶ for this relation over pv-types:

Definition 7.33 (Value-difference). Two pv-types τ1 and τ2 are (always-)value-different (in symbols τ1 ≶ τ2) iff [[ τ1 ]]v ≠ [[ τ2 ]]v for all v that define τ1 and τ2.

We now show that the value-difference relation is undecidable for general pv-types. We show this by reducing the problem of solving Diophantine equations with integer coefficients and variables in natural numbers to the question whether two pv-types are value-different.


Solving a Diophantine equation is to find integer solutions for a polynomial equation P(x1, x2, . . . , xn) = 0 with integer coefficients. The decision problem is to answer the question whether there exists a solution. From the well-known fact that the decision problem for Diophantine equations in integers (IntD) is undecidable [Mat93], it is fairly easy to see that the decision problem for Diophantine equations restricted to solutions in natural numbers (NatD) is undecidable as well.

Theorem 7.34. NatD is undecidable (this is a known result; a different proof can be found in Matiyasevich's book [Mat93]).

Proof

Assume there is a decider for NatD. We can now construct a decider for IntD as follows: Given a Diophantine equation E with variables x1, . . . , xn, consider the set E of 2^n equations resulting from E by replacing each xi by either yi or −yi. We now test via NatD whether any of these equations has a solution in the natural numbers; if so, E has an integer solution, else it does not. Proof of this claim: Assume E has an integer solution x1, . . . , xn. Then the equation in which exactly those variables xi with xi < 0 are replaced by −yi has a solution in the natural numbers. Conversely, if any of the equations in E has a solution y1, . . . , yn in the natural numbers, then we can construct an integer solution to E by setting xi := yi if xi was replaced by yi and xi := −yi if xi was replaced by −yi.

Theorem 7.35. τ1 ≶ τ2 is undecidable for our pv-types.

Proof

NatD can be reduced to deciding τ1 ≶ τ2: We now show how to encode an instance of NatD into the decision problem of whether two types are always-value-different. The general idea is to represent the number 1 by the type a[] (we use a[] as shorthand for a[()]), addition by concatenation of types, and multiplication with a variable x by a star-type colored with x. In particular, any Diophantine equation E : P(x1, . . . , xn) = 0 can be transformed into an equation P1(x1, . . . , xn) = P2(x1, . . . , xn) such that each Pi is a sum of products with only positive coefficients. We now create τ1 and τ2 for P1 and P2, respectively. Each product in Pi is of the form c · xj · · · xk. We create types for each product by encoding the constant


c as a sequence of c copies of the type a[], and each multiplication by xj via a star-type (..)^xj. As an example, we would encode 2xy² as (((a[], a[])^x)^y)^y. The summation of products is encoded as concatenation, so x + y becomes (a[])^x, (a[])^y. It is now easy to see that if E has a solution in the natural numbers, then τ1 ̸≶ τ2; and if τ1 ≶ τ2, then E has no solutions in the natural numbers.
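The encoding used in this proof is easy to mechanize. The following sketch builds the pv-type for one side of the equation, using a nested-tuple representation of pv-types ("elem" for a[], "star" for a colored star-type) that is ours for illustration only.

    def encode_product(coeff, variables):
        """Encode the product c * x_j * ... * x_k as a pv-type: the constant c becomes
        a sequence of c copies of a[], and each variable multiplier wraps the sequence
        in a star-type colored with that variable (sketch of the reduction in Thm. 7.35)."""
        tau = tuple(("elem", "a") for _ in range(coeff))   # c-fold sequence of a[]
        for x in variables:
            tau = (("star", x, tau),)                      # wrap in (..)^x
        return tau

    def encode_polynomial(products):
        """Encode a sum of products as the concatenation of their encodings."""
        return tuple(t for p in products for t in encode_product(*p))

    # 2*x*y^2 + y   ->   (((a[], a[])^x)^y)^y , (a[])^y
    tau = encode_polynomial([(2, ["x", "y", "y"]), (1, ["y"])])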

Remark. Intuitively, in any programming language with a (standard) if-then-else statement for which we want to make non-trivial statements about program output or program behavior, we need to be able to predict, for an arbitrary program p, whether both branches of an if-then-else statement in p are taken, or only one of them, when p is executed on the set of considered values. Otherwise, we could just use such an "undecidable" if statement to switch between a sub-program p1 that clearly causes p to satisfy the property and another one p2 that clearly causes p to not satisfy the property:

  p := build x; if "undecidable(x)" then p1 else p2

Such non-trivial properties could, for example, be "does p ever output a?", "does p always output a?", or "does p ever execute an ill-defined statement?". The second question, for example, is a specific case of program equivalence, and not being able to decide the third one prevents one from creating a sound and complete check for some "well-definedness" property as described in [Van05, dBGV04]. If the language allows comparing arbitrary values, then not being able to decide τ1 ≶ τ2 or τ1 ≡ τ2 is a good indication that an innocent-looking deepEQ($x, $y) operator can cause an unpredictable if-then-else statement. However, the undecidability of τ1 ≶ τ2 or τ1 ≡ τ2 does not necessarily imply the existence of an unpredictable if-statement: the type system itself might be too complex for the target language. For example, consider a type system that is able to describe sets of values that


cannot be created in the program. Comparing these (over-specifying) types can then be very hard even if the language is well-analyzable. It is thus often easier to take the undecidability of τ1 ≶ τ2 or τ1 ≡ τ2 as inspiration to create a class of programs for which it can be shown that they are not analyzable, which is what we do next.

7.7 Undecidability of Query-Equivalence for XQdeep-EQ

The idea of encoding Diophantine equations can easily be adapted to show that query equivalence for XQdeep-EQ is undecidable, leading to the following theorem: Theorem 7.36. Query equivalence for XQdeep-EQ is undecidable. Proof

The proof is indirect: assume we could solve equivalence for XQdeep-EQ. To solve an arbitrary but fixed Diophantine equation in natural numbers with positive coefficients, E : P1(x1, . . . , xn) = P2(x1, . . . , xn), we can construct XQ programs p1 and p2 that represent the two sides of the equation, similarly to the previous section. As an example, consider how the following statements simulate P1 = 2xy² + y:

  let $p1 = (
    let $x = ( for $xh in $root :: x return a[] ) in
    let $y = ( for $yh in $root :: y return a[] ) in
    let $s1 = ( for $i in $x return
                  for $j in $y return
                    for $k in $y return a[], a[] ) in
    let $s2 = $y in
    $s1, $s2
  ) in ...

The input $root ↦ x[], x[], x[], y[], y[], for example, would represent the case x = 3 and


y = 2. With a similar sub-program for P2 we can now build two queries:

  q1 := let $p1 = . . . in ( let $p2 = . . . in ( if deepEQ($p1, $p2) then "a" else () ))
  q2 := ()

Clearly, E has a solution iff q1 and q2 are not equivalent.

7.8 Related Work

Related to the work on string polynomials is the work of Bogdanov et al. [BW05] and Raz et al. [RS05], which consider noncommutative polynomial identity testing. However, it is not clear how the models used in these works can be used to solve SPE. Andrej Bogdanov mentioned in a personal communication that he does not think that the problem of SPE as suggested here has been considered before.

The remainder of this related-work section discusses work related to XML processing. Colazzo et al. [CGMS04] describe a sound and complete type system for "path-correctness" of XML queries; that is, their method can statically decide whether an XQuery sub-query will create a non-empty result set for some input to the whole query. Their type language, supporting recursion with minor restrictions, is as expressive as regular tree languages and thus powerful enough to capture possibly recursive DTDs or XML Schema. Queries are the for-let-return queries with child and descendant-or-self axes for navigation. The data model is equivalent to ours, i.e., they have lists of ordered labeled trees, where leaves from a base data type are allowed. We plan on extending their results in the following way: being able to solve query equivalence subsumes path-correctness, since testing equivalence to the query "()" essentially amounts to the question of path correctness. For the non-recursive types, our work is thus strictly more general.


Kepser describes a "simple proof" of the Turing-completeness of a more expressive fragment of XQuery in [Kep04b], using the fragment's capability of defining recursive functions and XPath's capability of doing integer arithmetic. Our result is orthogonal, as we analyze the core of XQuery, with the result that even without functions and without integer arithmetic, Diophantine equations can be encoded, causing the fragment XQdeep-EQ to be "not analyzable".

Vansummeren [Van05] analyzes well-definedness for XQuery fragments under a depth-bounded type system. Well-definedness is closely related to query equivalence (for both, if-statements have to be predictable), and thus his work can be seen as complementary to ours, since we consider the problem of query equivalence and present a different approach (pv-typing) for our positive and negative results. It is noteworthy that [Van05] mentions that XQuery is not analyzable if the language contains + and × as base operations to modify atoms, because of a possible reduction from Diophantine equations. In our work, we show that Diophantine equations can be encoded into core XQuery if a deep-equality operator is allowed, even without explicit operations on base values. In [Van07], he characterizes for which base operations well-definedness is decidable for XQuery; in particular, these are the monotone base operations.

Hidders et al. [HPV06] study the expressive power of XQuery fragments. Our work is complementary to theirs, since we consider equivalence of an, admittedly, very small fragment of XQuery. In recent work, DeHaan [DeH09] studies the equivalence of nested queries with mixed semantics. He shows how to decide equivalence under several unordered data models by encoding nested relations into flat relations. It is not clear how this approach could be ported to an ordered data model. Work that also considers containment and equivalence of queries returning nested objects is Levy and Suciu's work in [LS97]; again, this mainly focuses on unordered data models. There is also work on XPath containment and equivalence [Woo03, MS04, Sch04]; however, these works do not consider for-loops.


Recent work by ten Cate et al. [CL09] studies query containment for XPath 2.0 fragments (with for-loops); however, the semantics of XPath expressions is defined as a relation over sets of nodes, which is different from the XQuery semantics, which returns labeled, ordered trees.

Buneman et al. use colors in [BCV08] to track individual data values through complex operations. Besides being applied to the Nested Relational Algebra instead of to ordered, labeled trees, our approaches also differ significantly in goals and methods. Buneman et al. propose a coloring scheme with propagation rules to automatically track provenance, i.e., the origin of data, while we use colors to statically analyze queries, namely to check query equivalence. Consequently, Buneman et al. color data values instead of types. Issues that arise in our approach due to star-types and for loops do not occur if coloring is performed at the value level. Coloring at the value level, however, does not allow one to decide query equivalence. While the authors note in Lemma 1 that two functions f : s → t and g : s → t are equivalent if f(v) = g(v) for all distinctly colored v ∈ s, they do not provide a decision procedure to check the premise. Using colors only at the value level, we would need to check all possible values. Our approach performs this check on the type level via a symbolic simulation of many values at once. A second difference between provenance and checking for value-equality is that data origin matters for the former but not for the latter. From a provenance point of view, there is a difference between grabbing an element X from the input and creating a new X from scratch. In contrast, when comparing two queries for equivalence, we are interested in their input/output behavior; that is, it does not matter how or from where the output was constructed, as long as it is a specific value.

7.9 Summary

In this chapter, we introduced the concept of possible-value types for semantic simulation. In contrast to conventional, set-semantic types, which denote a set of values, pv-types denote a function from a set of possible worlds to values. For a concrete set of pv-types suited for XML processing, we showed that it is undecidable whether two pv-types denote different values in all worlds, while it is, of course, decidable whether they are syntactically the same. The negative result translates to XQuery, where we showed that when a deep equality operator is allowed, query equivalence is no longer decidable. We furthermore adapted the concept of sound and precise typing from conventional set-semantic typing and showed how the problem of query equivalence (with input restrictions) can be reduced to type questions. We introduced the problem of string-polynomial equality, which lies at the core of pv-type equivalence. Here, we further proposed several normal forms that provide sound (but not complete) tests for string-polynomial equivalence, as well as several other insights resulting in decidable necessary conditions.


Chapter 8

Concluding Remarks

Men love to wonder, and that is the seed of science.
Ralph Waldo Emerson

This dissertation considers the problem of designing and optimizing data-driven workflows, in particular dataflow-oriented scientific workflows. The main challenges arise from the heterogeneity of existing algorithms, libraries and tools, their computational complexities, the large amounts of inter-related data, and the exploratory nature of the scientific process. This problem involves three main areas of computer science. Finding the right process for scientists to build these systems is a software engineering challenge; designing a language for workflow specification together with appropriate methods for its static analysis lies in the realm of programming languages; and inventing domain-specific query languages and data models that are suited for efficient execution is at the heart of database research. Virtual data assembly lines, our proposed paradigm for building such scientific workflows, combines existing ideas from dataflow networks with principles of XML data processing. Virtual assembly lines deploy a tree-based data-model, with an “assembly-line” of existing tools that interact with the data via XQuery-based configurations. We showed that this approach solves many design-problems that are common to scientific workflows

and arise in dataflow-oriented modeling approaches. We further demonstrated how static analysis can be used to support the design process and to guide the scientist during workflow construction, maintenance and evolution. Moreover, we showed how VDAL workflows can be parallelized and how static analysis can be used to compile VDAL workflows into equivalent dataflow networks that exhibit more efficient data routing. We also developed a type system for XQuery with an ordered data model that is more precise than existing ones, and showed that query equivalence reduces to type equivalence. Type equivalence inspires theoretical questions about equality of polynomials over strings with multiplications. This structure exhibits a non-commutative "+" operation together with a scalar multiplication from scalars forming a ring. Here, we made progress towards solving this problem by constructing a normal form that allows a sound approximation.

Future Work

This work opens many opportunities for future research and development, ranging from solving open theoretical questions and investigating more aspects of workflow design support and resilience to evaluating new strategies for efficient execution and providing better integration with Kepler. We will now detail each of these points:

Basic theoretical research. As a first milestone, we want to either provide a sound and complete procedure to decide string-polynomial equality, or to prove its undecidability. From there, it is still a long way to deciding pv-type equivalence: the first step here would be to add types with suffixed annotations; then, it is interesting to investigate how the different conditional tests (emptiness and base-value equivalence) affect pv-type equivalence.

Workflow design support and resilience. Based on an exact type system for VDAL workflows, the use cases from Chapter 2 can be reconsidered and addressed more precisely than we did here. We will further investigate the use cases that have not

been addressed in this dissertation: creating a schema-level provenance graph, and a canonical input schema with sample instance data. It is also interesting to show how all use cases can be solved in an unordered data model. Recent work about static analysis of XQuery on unordered data [DeH09] looks very promising as a foundation here. Other interesting questions to be addressed are the ones related to workflow resilience outlined at the end of Chapter 2.

Alternative VDAL execution strategies. Besides the data-parallel MapReduce strategy and the pipeline-parallel strategy, other approaches should be investigated. It would, for example, be interesting to create DAGMan [Dag02] models to execute VDAL workflows. Since the intermediate data products are dynamically generated during workflow execution, the DAGMan model would need to be extended while it is running. A completely different strategy that is worth benchmarking is to store the VDAL data as "shredded relations" in a standard relational database. A VDAL workflow could then be compiled down to SQL. With horizontal fragmentation, this approach could also lead to acceptable performance.

Implementation. Our light-weight PPN engine as well as our Hadoop-based MapReduce implementation are research prototypes. In our ongoing work, we would like to improve their respective implementations and make them available to a broader user base.


Bibliography

[ABB+03]

Ilkay Altintas, Sangeeta Bhagwanani, David Buttler, Sandeep Chandra, Zhengang Cheng, Matthew Coleman, Terence Critchlow, Amarnath Gupta, Wei Han, Ling Liu, Bertram Ludäscher, Calton Pu, Reagan Moore, Arie Shoshani, and Mladen A. Vouk. A Modeling and Execution Environment for Distributed Scientific Workflows. In SSDBM, pages 247–250, 2003. xi, 8, 9

[ABC+ 03]

Serge Abiteboul, Angela Bonifati, Grégory Cobéna, Ioana Manolescu, and Tova Milo. Dynamic XML documents with distribution and replication. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 527–538, New York, NY, USA, 2003. ACM Press. 93

[ABE+ 09]

David Abramson, Blair Bethwaite, Colin Enticott, Slavisa Garic, and Tom Peachey. Parameter Space Exploration Using Scientific Workflows. In Intl. Conf. on Computational Science, LNCS 5544, pages 104–113, 2009. 57

[ABJF06]

Ilkay Altintas, Oscar Barney, and Efrat Jaeger-Frank. Provenance Collection Support in the Kepler Scientific Workflow System. In Intl. Provenance and Annotation Workshop (IPAW), pages 118–132. 2006. 13

[ABL09]

Manish Kumar Anand, Shawn Bowers, and Bertram Ludäscher. A navigation model for exploring scientific workflow provenance graphs. In Deelman and Taylor [DT09]. 13

[ABML09]

Manish Kumar Anand, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. Efficient provenance storage over nested data collections. In Martin L. Kersten, Boris Novikov, Jens Teubner, Vladimir Polutin, and Stefan Manegold, editors, EDBT, volume 360 of ACM International Conference Proceeding Series, pages 958–969. ACM, 2009. 4, 13

[AJB+ 04]

Ilkay Altintas, Efrat Jaeger, Chad Berkley, Matthew Jones, Bertram Ludäscher, and Steve Mock. Kepler: An Extensible System for Design and Execution of Scientific Workflows. In 16th Intl. Conf. on Scientific and Statistical Database Management (SSDBM), pages 423–424, Santorini, Greece, 2004. 14

[AvLH+04]

K. Amin, G. von Laszewski, M. Hategan, NJ Zaluzec, S. Hampton, and A. Rossi. GridAnt: A Client-controllable Grid Workflow System. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences, pages 210–219, 2004. 134

[BBD+02]

B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pages 1–16, 2002. 118

[BBMS05]

Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, and Mike Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, MD, June 2005. 93

[BCC+ 05]

Louis Bavoil, Steven P. Callahan, Patricia J. Crossno, Juliana Freire, Carlos E. Scheidegger, Claudio T. Silva, and Huy T. Vo. VisTrails: Enabling Interactive Multiple-View Visualizations. In Proceedings of IEEE Visualization, pages 135–142, Minneapolis, Oct 2005. 11, 57

[BCF03]

Véronique Benzaken, Giuseppe Castagna, and Alain Frisch. CDuce: an XML-centric general-purpose language. In Intl. Conf. on Functional Programming (ICFP), pages 51–63, New York, NY, USA, 2003. 66, 148, 154

[BCG+ 03]

C. Barton, P. Charles, D. Goyal, M. Raghavachari, M. Fontoura, and V. Josifovski. Streaming XPath Processing with Forward and Backward Axes. In Proceedings of the International Conference on Data Engineering, pages 455– 466. IEEE Computer Society Press; 1998, 2003. 71

[BCV08]

Peter Buneman, James Cheney, and Stijn Vansummeren. On the expressiveness of implicit provenance in query and update languages. ACM Transactions on Database Systems (TODS), 33(4):28, 2008. 198

[Bio09]

Bioperl tutorial. http://www.bioperl.org/wiki/Bptutorial.pl, 2009. 5

[BKC+ 01]

M.D. Beynon, T. Kurc, U. Catalyurek, C. Chang, A. Sussman, and J. Saltz. Distributed processing of very large datasets with DataCutter. Parallel Computing, 27(11):1457–1478, 2001. 134

[BKW98]

A. Bruggemann-Klein and D. Wood. One-unambiguous regular languages. Information and Computation, 142(2):182–206, 1998. 127

[BL04]

Shawn Bowers and Bertram Ludäscher. An Ontology Driven Framework for Data Transformation in Scientific Workflows. In International Workshop on Data Integration in the Life Sciences (DILS), LNCS 2994, pages 25–26, Leipzig, Germany, March 2004. 13, 57

[BL05]

Shawn Bowers and Bertram Ludäscher. Actor-Oriented Design of Scientific Workflows. In 24th Intl. Conference on Conceptual Modeling (ER), LNCS, Klagenfurt, Austria, October 2005. Springer. 57

[BLL+ 08]

Christopher Brooks, Edward A. Lee, Xiaojun Liu, Stephen Neuendorffer, Yang Zhao, and Haiyang Zheng. Heterogeneous Concurrent Modeling and Design in Java (Volume 1: Introduction to Ptolemy II). Technical Report No. UCB/EECS-2008-28, April 2008. 12

[BLNC06]

Shawn Bowers, Bertram Lud¨ascher, Anne H.H. Ngu, and Terence Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. In Post-ICDE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), Atlanta, GA, April 2006. 38

[BML+ 06]

Shawn Bowers, Timothy McPhillips, Bertram Ludäscher, Shirley Cohen, and Susan B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. In Intl. Provenance and Annotation Workshop (IPAW), pages 133–147. 2006. 13

[BML08]

S. Bowers, T.M. McPhillips, and B. Ludäscher. Provenance in collection-oriented scientific workflows. Concurrency and Computation: Practice & Experience, 20(5):519–529, 2008. 13

[BMN02]

G.J. Bex, S. Maneth, and F. Neven. A formal model for an expressive fragment of XSLT. Information Systems, 27(1):21–39, 2002. 148

[BMR+ 08]

Shawn Bowers, Timothy McPhillips, Sean Riddle, Manish Anand, and Bertram Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. In Intl. Provenance and Annotation Workshop (IPAW), 2008. 33

[BNdB04]

Geert Jan Bex, Frank Neven, and Jan Van den Bussche. DTDs versus XML Schema: A Practical Study. In WebDB, pages 79–84, 2004. 98, 159

[Boo]

Boost C++ Libraries. http://www.boost.org/. 121

[Bor07]

Dhruba Borthakur. The Hadoop Distributed File System: Architecture and Design. Apache Software Foundation, 2007. http://svn.apache.org/repos/asf/hadoop/core/tags/release-0.15.3/docs/hdfs_design.pdf. 64, 73, 85

[BPSM+ 08]

Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau. Extensible Markup Language (XML) 1.0 (Fifth Edition), November 2008. W3C Recommendation. http://www.w3.org/TR/2008/REC-xml-20081126/. 99

[BW05]

Andrej Bogdanov and Hoeteck Wee. More on noncommutative polynomial identity testing. In Proceedings of the 20th Annual IEEE Conference on Computational Complexity, pages 92–99. Citeseer, 2005. 196

[CBB+ 03]

M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. Zdonik. Scalable Distributed Stream Processing. CIDR Conference, 2003. 118

[CCD+ 03]

Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul Shah. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR’03), Asilomar, CA, January 2003. 93, 118

[CDG+ 97]

H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata, 1997. release October, 1st 2002. 104

[CDTW00]

Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 379–390, Dallas, Texas, USA, 2000. ACM Press. 93, 118

[CDZ06]

Yi Chen, Susan B. Davidson, and Yifeng Zheng. An Efficient XPath Query Processor for XML Streams. In Intl. Conf. on Data Engineering (ICDE), page 79, 2006. 93, 118

[CGMS04]

D. Colazzo, G. Ghelli, P. Manghi, and C. Sartiani. Types for path correctness of XML queries. In Proceedings of the ninth ACM SIGPLAN international conference on Functional programming, pages 126–137. ACM New York, NY, USA, 2004. 142, 196

[CGMS06]

Dario Colazzo, Giorgio Ghelli, Paolo Manghi, and Carlo Sartiani. Static analysis for path correctness of XML queries. J. Funct. Program., 16(4-5):621–661, 2006. 142, 146, 148, 156, 159

[Che08]

James Cheney. FLUX: functional updates for XML. In ICFP ’08: Proceeding of the 13th ACM SIGPLAN international conference on Functional programming, pages 3–14, New York, NY, USA, 2008. ACM. xi, 48, 49, 141, 142, 146, 148, 159

[Che09]

J. Cheney. Provenance, XML, and the Scientific Web. In ACM SIGPLAN Workshop on Programming Language Technology and XML (PLAN-X 2009), 2009. Invited paper. 153, 155

[Cho02]

Byron Choi. What are real DTDs like? In WebDB, pages 43–48, 2002. 159

[CKR+ 07]

Peter Couvares, Tevfik Kosar, Alain Roy, Jeff Weber, and Kent Wenger. Workflow Management in Condor, pages 357–375. In Taylor et al. [TDGS07], 2007. 11

207 [CL09]

B. Cate and C. Lutz. The complexity of query containment in expressive fragments of XPath 2.0. Journal of the ACM (JACM), 56(6):31, 2009. 198

[CLL06]

R. Chirkova, C. Li, and J. Li. Answering queries using materialized views with minimum size. The VLDB Journal, 15(3):191–210, 2006. 119

[CPE]

Center for Plasma Edge Simulation. http://www.cims.nyu.edu/cpes/. 4

[Dag02]

The directed acyclic graph manager (DAGMan), 2002. http://www.cs.wisc.edu/condor/dagman/. 202

[DBE+ 07]

Susan B. Davidson, Sarah Cohen Boulakia, Anat Eyal, Bertram Ludäscher, Timothy M. McPhillips, Shawn Bowers, Manish Kumar Anand, and Juliana Freire. Provenance in Scientific Workflow Systems. IEEE Data Engineering Bulletin, 30(4):44–50, 2007. 8

[dBGV04]

Jan Van den Bussche, Dirk Van Gucht, and Stijn Vansummeren. Well-Definedness and Semantic Type-Checking in the Nested Relational Calculus and XQuery. CoRR, cs.DB/0406060, 2004. 194

[Dee05]

E. Deelman. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3):219–237, 2005. 8, 57, 91

[DeH09]

David DeHaan. Equivalence of nested queries with mixed semantics. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 207–216. ACM, 2009. 153, 197, 202

[DG08]

Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. 22, 62, 92, 93, 94

[DGST08]

Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Generation Computer Systems, In Press, 2008. 7

[DGST09]

Ewa Deelman, Dennis Gannon, Matthew Shields, and Ian Taylor. Workflows and e-Science: An Overview of Workflow System Features and Capabilities. Future Gen. Computer Systems, 25(5):528–540, 2009. 11

[DT09]

Ewa Deelman and Ian Taylor, editors. Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS 2009, November 16, 2009, Portland, Oregon, USA. ACM, 2009. 203, 215

[EJ03]

J. Eker and J.W. Janneck. CAL Language Report: Specification of the Cal Actor Language. Technical Report UCB/ERL M03/48, EECS Department, University of California, Berkeley, 2003. 57

[Fel04]

J. Felsenstein. PHYLIP (phylogeny inference package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, 2004. 32

[FG06]

Geoffrey C. Fox and Dennis Gannon, editors. Concurrency and Computation: Practice & Experience, Special Issue: Workflow in Grid Systems, volume 18(10). Wiley, August 2006. 7, 210

[FJM+ 07]

Mary F. Fernández, Trevor Jim, Kristi Morton, Nicola Onose, and Jérôme Siméon. Highly distributed XQuery with DXQ. In Proc. ACM SIGMOD, pages 1159–1161, 2007. 93

[FLL09]

X. Fei, S. Lu, and C. Lin. A MapReduce-Enabled Scientific Workflow Composition Framework. In IEEE International Conference on Web Services (ICWS), pages 663–670, 2009. 92

[FMJG+ 05]

R.A. Ferreira, W. Meira Jr., D. Guedes, L.M.A. Drummond, B. Coutinho, G. Teodoro, T. Tavares, R. Araujo, and G.T. Ferreira. Anthill: A Scalable Run-Time Environment for Data Mining Applications. In Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing, pages 159–167, 2005. 134

[FPD+ 05]

T. Fahringer, R. Prodan, R. Duan, F. Nerieri, S. Podlipnig, J. Qin, M. Siddiqui, H.L. Truong, A. Villazon, and M. Wieczorek. ASKALON: A Grid Application Development and Computing Environment. International Workshop on Grid Computing, pages 122–131, 2005. 91

[FSC+ 03]

Mary F. Fernández, Jérôme Siméon, Byron Choi, Amélie Marian, and Gargi Sur. Implementing XQuery 1.0: The Galax Experience. In VLDB, pages 1077–1080, 2003. 147

[FSC+ 06]

Juliana Freire, Claudio Silva, Steven Callahan, Emanuele Santos, Carlos Scheidegger, and Huy Vo. Managing Rapidly-Evolving Scientific Workflows. In Intl. Provenance and Annotation Workshop (IPAW), LNCS 4145, pages 10– 18, 2006. 11, 57

[FSW01]

M. Fernandez, J. Simeon, and P. Wadler. A semi-monad for semi-structured data. In Proceedings of the 8th International Conference on Database Theory, pages 263–300. Springer, 2001. 159

[GDR07]

C.A. Goble and D.C. De Roure. myExperiment: social networking for workflow-using e-scientists. In Proceedings of the 2nd workshop on Workflows in support of large-scale science, page 2. ACM, 2007. 15

[Gen01]

W. Gentzsch. Sun Grid Engine: Towards Creating a Compute Power Grid. In First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001. Proceedings, pages 35–36, 2001. 85

[GGM+ 04]

Todd J. Green, Ashish Gupta, Gerome Miklau, Makoto Onizuka, and Dan Suciu. Processing XML Streams with Deterministic Automata and Stream Indexes. ACM Transactions on Database Systems (TODS), 29(4):752–788, 2004. 93, 118

[Goo07]

D. J. Goodman. Introduction and evaluation of Martlet: a scientific workflow language for abstracted parallelisation. In International World Wide Web Conference (WWW), pages 983–992, 2007. 92

[GRD+ 07]

Yolanda Gil, Varun Ratnakar, Ewa Deelman, Gaurang Mehta, and Jihie Kim. Wings for Pegasus: Creating Large-Scale Scientific Applications Using Semantic Representations of Computational Workflows. In National Conference on Artificial Intelligence, pages 1767–1774, 2007. 57

[GS03]

A.K. Gupta and D. Suciu. Stream Processing of XPath Queries with Predicates. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 419–430. ACM New York, NY, USA, 2003. 71

[HHMW07]

T. Härder, M. Haustein, C. Mathis, and M. Wagner. Node labeling schemes for dynamic XML documents reconsidered. Data & Knowledge Engineering, 60(1):126–149, 2007. 75

[HKS+ 05]

Jan Hidders, Natalia Kwasnikowska, Jacek Sroka, Jerzy Tyszkiewicz, and Jan Van den Bussche. Petri Net + Nested Relational Calculus = Dataflow. In OTM Conferences, LNCS 3760, pages 220–237, 2005. 57

[HPV06]

Jan Hidders, Jan Paredaens, and Roel Vercammen. On the expressive power of XQuery-based update languages. Lecture Notes in Computer Science, 4156:92, 2006. 147, 197

[HS08]

J. Hidders and J. Sroka. Towards a Calculus for Collection-Oriented Scientific Workflows with Side Effects. In On the Move to Meaningful Internet Systems: OTM 2008 Confederated International Conferences (CoopIS, DOA, GADA, IS, and ODBASE), Part I, page 391. Springer, 2008. 57

[HSL+ 04]

Duncan Hull, Robert Stevens, Phillip Lord, Chris Wroe, and Carole Goble. Treating shimantic web syndrome with ontologies. In First Advanced Knowledge Technologies Workshop on Semantic Web Services (AKT-SWS04), Open University, Milton Keynes, UK., 2004. CEUR-WS.org ISSN:1613-0073. 57, 58

[HT05]

T. Hey and A.E. Trefethen. Cyberinfrastructure for e-Science. Science, 308(5723):817, 2005. 2

[HTT09]

Anthony J. G. Hey, Stewart Tansley, and Kristin M. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009. 2

[HVP05]

Haruo Hosoya, Jerome Vouillon, and Benjamin C. Pierce. Regular expression types for XML. ACM Transactions on Programming Languages and Systems (TOPLAS), 27(1):46–90, 2005. 98, 100, 104, 148, 154, 159

[Ima08]

ImageMagick. http://www.imagemagick.org, 2008. 69

[JAZ+ 05]

Efrat Jaeger, Ilkay Altintas, Jianting Zhang, Bertram Ludäscher, Deana Pennington, and William Michener. A Scientific Workflow Approach to Distributed Geospatial Data Processing using Web Services. In 17th Intl. Conference on Scientific and Statistical Database Management (SSDBM), Santa Barbara, California, June 2005. 34

[Kah74]

G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Proc. of the IFIP Congress 74, pages 471–475. North-Holland, 1974. 11, 93, 112

[KBLN04]

Miryung Kim, Lawrence Bergman, Tessa Lau, and David Notkin. An ethnographic study of copy and paste programming practices in OOPL. In International Symposium on Empirical Software Engineering, pages 83–92. Citeseer, 2004. 7

[Kep04a]

Kepler Actors User Manual. http://poc.vl-e.nl/distribution/manual/kepler-1.0.0alpha7/kepler-ActorUserManual.pdf, 2004. 8

[Kep04b]

S. Kepser. A simple proof for the Turing-completeness of XSLT and XQuery. In Extreme Markup Languages, 2004. 197

[KLN+ 07]

C. Kamath, B. Ludäscher, J. Nieplocha, S. Parker, R. Ross, N. Samatova, and M. Vouk. SDM Center Technologies for Accelerating Scientific Discoveries. Journal of Physics: Conference Series, 78, (2007) 012068:1–5, 2007. 14

[KSC+ 08]

D. Koop, C.E. Scheidegger, S.P. Callahan, J. Freire, and C.T. Silva. VisComplete: Automating Suggestions for Visualization Pipelines. IEEE Transactions on Visualization and Computer Graphics, 14(6):1691–1698, 2008. 57

[KSSS04a]

C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. FluXQuery: An Optimizing XQuery Processor for Streaming XML Data. Proc. VLDB 2004, pages 1309–1312, 2004. 93

[KSSS04b]

C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier. Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. In 28th Conf. on Very Large Data Bases (VLDB), pages 228–239, 2004. 93, 118

[LAB+ 06]

Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific Workflow Management and the Kepler System. In Concurrency and Computation: Practice & Experience [FG06], pages 1039–1065. 8, 14, 56

[LAB+ 09]

Bertram Ludäscher, Ilkay Altintas, Shawn Bowers, Julian Cummings, Terence Critchlow, Ewa Deelman, David De Roure, Juliana Freire, Carole Goble, Matthew Jones, Scott Klasky, Timothy McPhillips, Norbert Podhorszki, Claudio Silva, Ian Taylor, and Mladen Vouk. Scientific Process Automation and Workflow Management. In Arie Shoshani and Doron Rotem, editors, Scientific Data Management: Challenges, Existing Technology, and Deployment, Computational Science Series, chapter 13. Chapman & Hall/CRC, 2009. 7, 10, 31

[Läm08]

R. Lämmel. Google’s MapReduce programming model—Revisited. Science of Computer Programming, 70(1):1–30, 2008. 93

[LBM09]

Bertram Ludäscher, Shawn Bowers, and Timothy McPhillips. Scientific Workflows. In Encyclopedia of Database Systems. Springer, 2009. 11

[LC00]

Dongwon Lee and Wesley W. Chu. Comparative analysis of six XML schema languages. SIGMOD Rec., 29(3):76–87, 2000. 98

[LLF+ 09]

Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In IEEE Intl. Conf. on Services Computing, Bangalore, India, 2009. 57

[LMP02]

Bertram Ludäscher, Pratik Mukhopadhyay, and Yannis Papakonstantinou. A Transducer-Based XML Query Processor. In 28th Conf. on Very Large Data Bases (VLDB), pages 227–238, Hong Kong, 2002. 118

[LP95]

Edward A. Lee and Thomas Parks. Dataflow Process Networks. Proceedings of the IEEE, 83(5):773–799, May 1995. 14

[LS97]

Alon Y. Levy and Dan Suciu. Deciding containment for queries with complex objects and aggregations. Proc. of PODS, Tucson, Arizona, 1997. 197

[LSS]

Large Synoptic Survey Telescope (LSST). www.lsst.org. 4, 5

[LSV98]

Edward A. Lee and Alberto L. Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(12):1217–1229, 1998. 17, 57

[Mat93]

Y. Matiyasevich. Hilbert’s 10th Problem. Foundations of Computing Series. The MIT Press, 1993. 193

[MB05]

Timothy M. McPhillips and Shawn Bowers. An Approach for Pipelining Nested Collections in Scientific Workflows. SIGMOD Record, 34(3):12–17, 2005. 17, 25, 58

[MBL06]

Timothy McPhillips, Shawn Bowers, and Bertram Ludäscher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. In 3rd Intl. Workshop on Data Integration in the Life Sciences (DILS), LNCS, pages 248–263, European Bioinformatics Institute, Hinxton, UK, July 2006. Springer. xi, 17, 20, 25, 31, 56, 58

[MBZL09]

Timothy McPhillips, Shawn Bowers, Daniel Zinn, and Bertram Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5):541–551, 2009. 1, 14, 18, 57, 91, 94

[MLMK05]

M. Murata, D. Lee, M. Mani, and K. Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology (TOIT), 5(4):660–704, 2005. 98, 99

[Mor94]

J. Paul Morrison. Flow-Based Programming – A New Approach to Application Development. Van Nostrand Reinhold, 1994. 58

[MS04]

G. Miklau and D. Suciu. Containment and equivalence for a fragment of XPath. Journal of the ACM (JACM), 51(1):2–45, 2004. 197

[MSM97]

D.R. Maddison, D.L. Swofford, and W.P. Maddison. NEXUS: An Extensible File Format for Systematic Information. Systematic Biology, 46(4):590–621, 1997. 14

[Net]

NetCDF (Network Common Data Form). http://www.unidata.ucar.edu/software/netcdf/. 14

[OAF+ 04]

T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, et al. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045, 2004. 34

[OGA+ 02]

Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: Lessons in Creating a Workflow Environment for the Life Sciences. Concurrency and Computation: Practice & Experience, pages 1067–1100, 2002. 11, 37, 57, 92

[OOP+ 04]

P. O’Neil, E. O’Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 903–908. ACM New York, NY, USA, 2004. 75

[PLK07]

Norbert Podhorszki, Bertram Ludäscher, and Scott Klasky. Workflow Automation for Processing Plasma Fusion Simulation Data. In 2nd Workshop on Workflows in Support of Large-Scale Science (WORKS), June 2007. xi, 10, 11, 14, 38, 69, 106

[Pto06]

Ptolemy II project and system. Department of EECS, UC Berkeley, 2006. http://ptolemy.eecs.berkeley.edu/ptolemyII/. 56, 128

[PVMa]

Parallel Virtual Machine. http://www.csm.ornl.gov/pvm/. 112

[PVMb]

PVM++: A C++-Library for PVM. http://pvm-plus-plus.sourceforge.net/. 121

[QF07]

Jun Qin and Thomas Fahringer. Advanced data flow support for scientific grid workflow applications. In Proceedings of the ACM/IEEE conference on Supercomputing (SC), pages 1–12. ACM, 2007. 92

[RBHS04]

C. Re, J. Brinkley, K. Hinshaw, and D. Suciu. Distributed XQuery. Workshop on Information Integration on the Web, pages 116–121, 2004. 93

[roc]

Rocks clusters. http://www.rocksclusters.org/. 85

[RS05]

R. Raz and A. Shpilka. Deterministic polynomial identity testing in noncommutative models. Computational Complexity, 14(1):1–19, 2005. 196

[SBB+ 02]

J.E. Stajich, D. Block, K. Boulez, S.E. Brenner, S.A. Chervitz, C. Dagdigian, G. Fuellen, J.G.R. Gilbert, I. Korf, H. Lapp, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome research, 12(10):1611, 2002. 5

[Sch04]

T. Schwentick. XPath query containment. ACM SIGMOD Record, 33(1):101– 109, 2004. 197

[Sch07]

T. Schwentick. Automata for XML – A survey. Journal of Computer and System Sciences, 73(3):289–315, 2007. 119

[SCZ+ 07]

K. Stevens, D. Cutler, M. Zwick, P. de Jong, K.H. Huang, M. Koriabine, B. Ludäscher, C. Marston, S. Lee, D. Okou, K. Osoegawa, J. Warrington, D.J. Begun, and C.H. Langley. DPGP Cyberinfrastructure and Open Source Toolkit for Chip Based Resequencing. In Advances in Genome Biology and Technology (AGBT), 2007. 35

[SHS04]

G.M. Sur, J. Hammer, and J. Simeon. An XQuery-based language for processing updates in XML. Proceedings of PLAN-X, 2004. 147

[SOL05]

A. Stamatakis, M. Ott, and T. Ludwig. RAxML-OMP: An Efficient Program for Phylogenetic Inference on SMPs. Lecture Notes in Computer Science, 3606:288–302, 2005. 33

[SWI]

Simplified Wrapper and Interface Generator. http://www.swig.org/. 121

[TDGS07]

Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Mark Shields, editors. Workflows for e-Science: Scientific Workflows for Grids. Springer, 2007. 7, 206

[TMG+ 07] D. Turi, P. Missier, C. Goble, D. De Roure, and T. Oinn. Taverna workflows: Syntax and semantics. In Proceedings from the 3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India, 2007. 57

[TMSF03]

P.A. Tucker, D. Maier, T. Sheard, and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. IEEE Transactions on Knowledge and Data Engineering, pages 555–568, 2003. 119

[TSWH07]

I. Taylor, M. Shields, I. Wang, and A. Harrison. The Triana workflow environment: Architecture and applications. Workflows for e-Science, pages 320–339, 2007. 11, 14, 31, 57

[TSWR03]

I. Taylor, M. Shields, I. Wang, and O. Rana. Triana Applications within Grid Computing and Peer to Peer Environments. Journal of Grid Computing, 1(2):199–217, 2003. 91

[Van05]

Stijn Vansummeren. Deciding well-definedness of XQuery fragments. In PODS ’05: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 37–48, New York, NY, USA, 2005. ACM. 194, 197

[Van07]

Stijn Vansummeren. On deciding well-definedness for query languages on trees. Journal of the ACM (JACM), 54(4):19, 2007. 197

[vHHH+ 09]

Kees van Hee, Jan Hidders, Geert-Jan Houben, Jan Paredaens, and Philippe Thiran. On the relationship between workflow models and document types. Information Systems, 34(1):178–208, March 2009. 57

[vL96]

G. von Laszewski. An Interactive Parallel Programming Environment Applied in Atmospheric Science. Making Its Mark, Proceedings of the 6th Workshop on the Use of Parallel Processors in Meteorology, pages 311–325, 1996. 134

[Woo03]

P.T. Wood. Containment for XPath fragments under DTD constraints. Lecture notes in computer science, pages 300–314, 2003. 197

[YB05]

Jia Yu and Rajkumar Buyya. A taxonomy of scientific workflow systems for grid computing. SIGMOD Record, 34(3):44–49, September 2005. 17, 134

[YDHP07]

Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Mapreduce-merge: simplified relational data processing on large clusters. In SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029–1040, New York, NY, USA, 2007. ACM. 94

[ZBKL09]

Daniel Zinn, Shawn Bowers, Sven Köhler, and Bertram Ludäscher. Parallelizing XML data-streaming workflows via MapReduce. Journal of Computer and System Sciences, In Press, 2009. 1, 27, 59

[ZBL10]

Daniel Zinn, Shawn Bowers, and Bertram Ludäscher. XML-Based Computation for Scientific Workflows. In Intl. Conf. on Data Engineering (ICDE), 2010. To appear; see also technical report. 1, 28, 135

[ZBML09a]

Daniel Zinn, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. Scientific workflow design with data assembly lines. In Deelman and Taylor [DT09]. 1, 26, 31

[ZBML09b]

Daniel Zinn, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. X-CSR: Dataflow Optimization for Distributed XML Process Pipelines. In Intl. Conf. on Data Engineering (ICDE), pages 577–580, 2009. Also see Technical Report CSE-2008-15, UC Davis. 1, 27, 96

[ZDF+ 05]

Yong Zhao, Jed Dobson, Ian Foster, Luc Moreau, and Michael Wilde. A notation and system for expressing and executing cleanly typed workflows on messy scientific data. SIGMOD Rec., 34(3):37–43, 2005. 69

[ZHC+ 07]

Y Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde. Swift: Fast, Reliable, Loosely Coupled Parallel Computation. In IEEE Congress on Services, pages 199–206, 2007. 91

[Zin08]

Daniel Zinn. Modeling and optimization of scientific workflows. In Ph.D. ’08: Proceedings of the 2008 EDBT Ph.D. workshop, pages 1–10, New York, NY, USA, 2008. ACM. 1

[ZLL09]

Daniel Zinn, Xuan Li, and Bertram Ludäscher. Parallel Virtual Machines in Kepler. Eighth Biennial Ptolemy Miniconference, UC Berkeley, California, April 2009. 1, 28, 120, 129
