Process-Oriented Recovery for Operations on Cloud Applications

Min Fu, Liming Zhu, Anna Liu, Xiwei Xu, Len Bass
Software Systems Research Group, NICTA, Sydney, Australia
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia

{Min.Fu, Liming.Zhu, Anna.Liu, Xiwei.Xu, Len.Bass}@nicta.com.au

A large number of cloud application failures happen during sporadic operations on cloud applications, such as upgrade, deployment reconfiguration, migration and scaling-out/in. Most of these failures are caused by operator and process errors [1]. From a cloud consumer’s perspective, recovery from these failures is constrained by the limited control and visibility that cloud providers offer. In addition, a large-scale system often has multiple operation processes running simultaneously, which further complicates error diagnosis and recovery.

Existing built-in or infrastructure-based recovery mechanisms often assume random component failures and use checkpoint-based rollback, compensation actions [2], redundancy and rejuvenation to handle recovery [3]. These mechanisms do not consider the characteristics of a specific operation process, which consists of a set of steps carried out by scripts and humans interacting with fragile cloud infrastructure APIs and uncertain resources [4]. Other approaches such as FATE/DESTINI [5] look at the process implied by a system’s internal protocols and rely on the built-in recovery protocol to detect and recover from bugs. The problem we target is at a different level: the external sporadic activities operating on a hosted cloud application.

Copyright © 2013 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page in print or the first screen in digital media. Copyrights for components of this work owned by others than ACM must be honored. For all other uses, contact the Owners/Authors. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Send written requests for republication to ACM Publications, Copyright & Permissions at the address above or fax +1 (212) 869-0481 or email permissions@acm.org.

Our overall approach is to explicitly model and analyze an operation as a process. We divide the process into sections consisting of steps. The division criteria can differ depending on the purpose, such as error diagnosis, conformance checking or recovery. For recovery, our initial division criteria include: 1) atomicity, to achieve all-or-nothing behaviour for a group of actions, which makes recovery easier; 2) idempotence, to enable the same or parameterized actions to be re-executed for recovery; 3) fine granularity, to allow higher-level reuse of existing steps during recovery; 4) alternatives-friendliness, to allow alternative actions to be executed to reach the same expected result during recovery. We specify a set of assertions representing the expected outcomes produced by the execution of each section. We use our run-time assertion evaluation and monitoring system [7] to detect errors at the end of each section and trigger recovery actions if necessary.

We applied our approach to a typical AMI-driven rolling upgrade process for AWS-hosted applications. Netflix has a well-known tool, Asgard [6], that supports this process. We divided Asgard’s internal recovery and error-handling mechanisms using our division criteria. We also enabled log-triggered assertion evaluation at the end of some sections. We were able to detect errors earlier and to catch errors that Asgard would have missed. However, we found that Asgard was not designed with the granularity and idempotence required for re-execution, so we could not conduct re-execution-based recovery using selected pieces of Asgard’s code. We did develop external alternative actions for recovering from some subtle errors. For example, upon detecting unexpectedly stopped instances, our alternative actions either restarted these instances or killed them and relied on the auto-scaling group to replace them. Recovery with these actions took significantly less time than Asgard would otherwise have taken.
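The section-based approach can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a process is a list of sections, each carrying its steps, one end-of-section assertion, an idempotence flag and an optional alternative action. All names (`Section`, `run_section`, `run_process`, and the simulated steps) are hypothetical, and a plain dictionary stands in for the cloud state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]  # stand-in for the observable cloud state


@dataclass
class Section:
    name: str
    steps: List[Callable[[State], None]]
    assertion: Callable[[State], bool]      # expected outcome of the section
    idempotent: bool = False                # True if re-execution is safe
    alternative: Optional[Callable[[State], None]] = None


def run_section(section: Section, state: State, max_retries: int = 1) -> str:
    """Run all steps, evaluate the assertion, and attempt recovery on failure."""
    for step in section.steps:
        step(state)
    if section.assertion(state):
        return "ok"
    if section.idempotent:
        # Recovery option 1: re-execute the whole section.
        for _ in range(max_retries):
            for step in section.steps:
                step(state)
            if section.assertion(state):
                return "recovered-by-reexecution"
    if section.alternative is not None:
        # Recovery option 2: reach the same expected result another way.
        section.alternative(state)
        if section.assertion(state):
            return "recovered-by-alternative"
    return "failed"


def run_process(sections: List[Section], state: State) -> List[Tuple[str, str]]:
    """Run sections in order and stop at the first unrecoverable one."""
    outcomes = []
    for section in sections:
        outcome = run_section(section, state)
        outcomes.append((section.name, outcome))
        if outcome == "failed":
            break
    return outcomes


def create_launch_config(state: State) -> None:
    state["launch_config"] = "lc-v2"


def make_flaky_launch() -> Callable[[State], None]:
    calls = {"n": 0}

    def launch(state: State) -> None:
        calls["n"] += 1
        if calls["n"] >= 2:  # the first call silently does nothing
            state["instances"] = 2

    return launch


sections = [
    Section("create-launch-config", [create_launch_config],
            lambda s: s.get("launch_config") == "lc-v2", idempotent=True),
    Section("launch-instances", [make_flaky_launch()],
            lambda s: s.get("instances") == 2, idempotent=True),
]
state: State = {}
outcomes = run_process(sections, state)
print(outcomes)
```

In this simulated run the second section fails its assertion once and passes after a single re-execution. A real deployment would evaluate assertions against cloud APIs or logs rather than a dictionary.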

We are currently developing patterns and frameworks that let operators design and instrument their scripts to mark atomic actions, re-execution blocks and alternative actions for recovery purposes. We are also integrating these with our assertion evaluation and monitoring framework [7].
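One possible shape for such script-level markers is sketched below in Python. It is an illustration under stated assumptions, not the framework under development: the decorators `reexecutable` and `with_alternative`, the context manager `atomic`, and the simulated `fleet` and `config` dictionaries are all hypothetical, and failures are modelled as `RuntimeError`.

```python
import functools
from contextlib import contextmanager


def reexecutable(max_attempts=3):
    """Mark an idempotent step that may simply be run again on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RuntimeError as err:
                    last_error = err
            raise last_error

        wrapper.recovery = "re-execution"
        return wrapper

    return decorate


def with_alternative(alternative):
    """Mark a step that has another action reaching the same expected result."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except RuntimeError:
                return alternative(*args, **kwargs)

        wrapper.recovery = "alternative"
        return wrapper

    return decorate


@contextmanager
def atomic(undo_log):
    """All-or-nothing block: on error, run the undo actions added inside it."""
    start = len(undo_log)
    try:
        yield undo_log
    except Exception:
        while len(undo_log) > start:
            undo_log.pop()()  # undo in reverse order
        raise


# Simulated fleet: one instance has stopped unexpectedly and cannot restart.
fleet = {"i-1": "running", "i-2": "stopped"}
broken = {"i-2"}


def terminate_and_let_group_replace(instance_id):
    # Stand-in for killing the instance and relying on an auto-scaling group.
    del fleet[instance_id]
    fleet[instance_id + "-replacement"] = "running"
    return "replaced"


@with_alternative(terminate_and_let_group_replace)
def restart(instance_id):
    if instance_id in broken:
        raise RuntimeError(f"cannot restart {instance_id}")
    fleet[instance_id] = "running"
    return "restarted"


result = restart("i-2")

# Atomic block: the configuration change is rolled back when a later step fails.
config = {"ami": "ami-old"}
undo_log = []
try:
    with atomic(undo_log):
        previous = config["ami"]
        config["ami"] = "ami-new"
        undo_log.append(lambda: config.update(ami=previous))
        raise RuntimeError("a later step in the block failed")
except RuntimeError:
    pass

print(result, fleet, config)
```

Attaching a `recovery` attribute to each wrapped function is one way a monitoring component could discover which recovery option a step supports. Whether markers of this kind suit real operation scripts is an open design question.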

SoCC'13, 1–3 Oct. 2013, Santa Clara, California, USA. ACM 978-1-4503-2428-1. http://dx.doi.org/10.1145/2523616.2525958


Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

[1] D. Oppenheimer and D. A. Patterson, “Why do Internet services fail, and what can be done about it?”, Proc. 10th ACM SIGOPS European Workshop, September 2002.
[2] C. Colombo and G. J. Pace, “Recovery within Long Running Transactions”, ACM Transactions on Computational Logic, pp. 1-40, August 2011.
[3] L. DuBois, “Disaster Recovery for Virtualized Environments: A DR Approach to Fit the New Datacentre”, IDC Presentation, March 2013.
[4] Q. Lu, L. Zhu, L. Bass, X. Xu, Z. Li and H. Wada, “Cloud API Issues: an Empirical Study and Impact”, Proc. 9th ACM SIGSOFT conference, 2013.
[5] H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, J. M. Hellerstein, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, K. Sen, and D. Borthakur, “FATE and DESTINI: A Framework for Cloud Recovery Testing”, NSDI, 2011.
[6] Website: https://github.com/Netflix/asgard (last access time: 12 Aug 2013, 12:30).
[7] I. Weber, X. Xu, et al., “Detecting Cloud Provisioning Errors Using an Annotated Process Model”, Submitted to Middleware 2013.
