preventing mixed version race conditions.pdf


A Method for Preventing Mixed Version Race Conditions

Len Bass, Hiroshi Wada
NICTA, Sydney

Abstract

A mixed version race condition occurs when two incompatible versions of an application are simultaneously active during a rolling upgrade. This happens in large Internet service organizations and it has been identified by the Director of Engineering at Facebook as their most significant technical problem. In this paper, we propose a method that will prevent mixed version race conditions for browser/front end interactions using HTTP. Our method involves identifying a version in HTTP messages and making load balancers version aware. The overhead introduced by the method consists of having a small number of idle instances whenever an upgrade is performed.

1. Introduction

Organizations that run applications on a large cluster of servers have a problem when upgrading from one version of the application to the next. One option is to partition the servers, upgrade one partition at a time, and restrict clients of the application to a single partition, whether it runs the new version or the old. A second option is to upgrade one server at a time (a rolling upgrade) while clients may use any active server. Rolling upgrades are considered best industrial practice [3]. Partitioning the servers reduces the capacity of the pool of servers available to clients, but rolling upgrades introduce the possibility of what has come to be called the “mixed version race condition”.

When upgrading an application from one version to the next, it is possible that the two versions are incompatible. This does not mean that either version is incorrect, but version N+1 could have introduced new functionality not present in version N. Then a client that interacts initially with version N+1 and subsequently interacts with version N may suffer a race condition, since it assumes functionality that is present in version N+1 but not in version N. This is the mixed version race condition.

The mixed version race condition problem was presented in the keynote address by David Reiss at HotSWUp’09 and subsequently identified by the Facebook Director of Engineering as their most significant technical problem [4]. More recently, the chief architect of an eBay/PayPal company stated “[The mixed version race condition] is definitely a problem that's commonly experienced by web-based properties” [2].

In this paper, we present a method that introduces minimal overhead and that prevents the mixed version race condition for browser/front end interactions. The method requires the server to identify the version number of the relevant application in response messages, the client to request a version number in its messages, and the load balancers to be version sensitive in their scheduling. The overhead introduced by our method consists of an additional field added to HTTP messages, a small amount of additional scheduling overhead introduced into the load balancer, and the creation of a small number of unused instances of load balancers. To our knowledge, this is the first solution to the mixed version race condition that preserves the essence of the rolling upgrade.

We begin by introducing the problem in more detail. We then present our solution and close by considering next steps.

2. Mixed Version Race Condition

A rolling upgrade proceeds in the following fashion. For each server:

1. take the server out of service
2. upgrade the application
3. return the server to service.

A rolling upgrade has the virtue that at any point in time, only one server is out of service. It has the drawback that it takes time to upgrade a large cluster with multiple thousands of servers.

Figure 1. A mixed version race condition [5].

During the execution of a rolling upgrade, the sequence shown in Figure 1 can occur. In this sequence, the rolling upgrade begins (1), a browser sends a message (2) that is handled by version N+1 (3) of the application. Version N+1 responds to that message with embedded JavaScript that will control the browser’s behavior. The next message sent from the browser (4) is handled by version N (5) since both versions are simultaneously running, albeit on different servers. This causes an error because the versions are incompatible.

3. Assumptions and Requirements

We make an assumption without which our solution will not work - versions are upward compatible. That is, any request that can be satisfied by version N can also be satisfied by version N+1.

We also assume that it is possible to distinguish between a failed installation and a slow installation such as caused by the instance performing an fsck. The difficulty of detecting such a condition was pointed out by Tudor Dumitras [5]. Although this is a valid problem, it is a problem with upgrades in general and not specific to the mixed version race condition. Thus, we are assuming that the distinction between failure and slowness can be made by some agent associated with the upgrade process.

Within our assumptions, we have two requirements for any solution.

1. Clients never interact with decreasing versions during a single session. That is, once a client interacts with version N, it will never interact within the same session with version M where M < N.
2. Messages are as balanced across servers after introducing a solution as they were prior to introducing that solution. That is, we make no a priori assumptions about the effectiveness of the load balancing scheduling strategy but we require that any solution does not impact that effectiveness.

4. Our Solution

In essence, our solution requires each version to identify itself in messages to clients and to adapt the load balancer to understand version information in routing. We begin by discussing the most critical portion of our solution - the load balancer.

4.1 The load balancers

Adapting the load balancer to prevent a mixed version race condition has been mentioned in the literature, but the authors claim that it would add significant overhead [5]. They state “The mixed-version race described above could have been avoided by extending the load balancer…to determine the appropriate server side version for each request. This approach would require adding significant complexity and processing delays to a key component of the enterprise infrastructure [the load balancer]…” Obviously, we disagree.

There are two issues with load balancers: managing distributed load balancers and implementing the scheduling rules. We deal with these in turn.

Distributed load balancers

Large internet services will, typically, have more than a single load balancer. They will have a hierarchy of load balancers. Our solution is that the hierarchy be structured as shown in Figure 2. That is, messages are distributed based initially on version number. Then the subsequent levels schedule as they normally would with guaranteed version compatibility. New instances are placed on an endpoint specific for the version of the application running in the instance.

    /service
        /service/vN
            server
            server
        /service/vN+1
            server
            server

Figure 2: load balancer hierarchy

Figure 2 shows how our solution prevents the mixed version race condition. It does that by brute force at the /server level as long as all distributed /service load balancers have been made aware of a version N+1 instance. We next discuss how to solve the distributed /service problem.
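The hierarchy of Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `ServiceBalancer` and `VersionPool`, the use of round robin inside a pool, and the choice of the oldest pool for unlabeled messages are all assumptions made for illustration.

```python
from itertools import cycle


class VersionPool:
    """Second-level balancer for one version, e.g. /service/vN (hypothetical name)."""

    def __init__(self, version, servers):
        self.version = version
        self._servers = cycle(servers)  # ordinary round robin inside the pool

    def dispatch(self):
        return next(self._servers)


class ServiceBalancer:
    """Top-level /service balancer: the first routing decision is the version."""

    def __init__(self):
        self.pools = {}  # version number -> VersionPool

    def register(self, pool):
        self.pools[pool.version] = pool

    def route(self, version=None):
        if version is None:
            # Unlabeled message: any pool is acceptable. Picking the oldest
            # one is an arbitrary choice made for this sketch.
            version = min(self.pools)
        pool = self.pools.get(version)
        if pool is None:
            raise LookupError(f"no endpoint for version {version}")
        return pool.version, pool.dispatch()
```

Once a message carries a version, every server it can reach below that point runs the same version, so the lower levels can keep whatever scheduling they already use.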

Managing multiple /service load balancers

    /service (left)
        /service/vN
            server
            server
        /service/vN+1
            server
            server

    /service (right)
        /service/vN
            server
            server

Figure 3: multiple /service load balancers

Figure 3 shows a situation where there are two /service load balancers. The one on the left has been made aware of version N+1 instances and the one on the right has not. Now if a client message is serviced by a version N+1 server via a path on the left side and a subsequent message from that client arrives on the right side, an error will occur since the /service node on the right side is unaware of version N+1.

Before dealing with this problem, we make an observation that is true of every installation process. Our observation is that registration of an instance must follow the sequence: installation of the instance, registration with its parent load balancer, and so forth until the root node is reached. Any parent load balancer that does not exist must be created. In our case, if the parent is a /service load balancer, the new scheduling rule pertaining to version N+1 is added. This sequence ensures that there is always a path to an instance prior to messages being sent specifically to it.

Now we present our solution to the distributed /service load balancer consistency problem. It has three portions. The first portion serves to reduce the time window during which the /service load balancers are inconsistent, the second portion specifies what should happen if a failure occurs, and the third provides a fallback if a message for version N+1 does arrive at an un-updated /service load balancer.

1. Install one version N+1 instance for each /service load balancer prior to registering any of the instances. Then they all register more or less simultaneously, each with a different /service load balancer. This registration is not an atomic process, although it will reduce the time window during which a message might cause a race condition.
2. If a failure occurs during the installation of one of the instances, then another instance is created as a child of the relevant /service load balancer. A failure can be detected by the agent we are assuming can distinguish between slow and failed.
3. If a message arrives at the /service load balancer specifically targeted at an endpoint that does not exist, return a “retry” message to the client.
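The third portion, the retry fallback, can be sketched as follows. This is only an illustration under stated assumptions: the `RETRY` marker, the `ServiceNode` class, the `send_with_retry` helper, and the `max_attempts` bound are hypothetical and do not come from the paper, which leaves the form of the “retry” message open.

```python
RETRY = "retry"  # hypothetical marker for the "retry" reply


class ServiceNode:
    """A /service load balancer that knows only some version endpoints."""

    def __init__(self, known_versions):
        self.known_versions = set(known_versions)

    def handle(self, version):
        if version not in self.known_versions:
            # The endpoint /service/v<version> does not exist here yet.
            return RETRY
        return f"served by v{version}"


def send_with_retry(balancers, version, max_attempts=10):
    """Client side: resend until some /service node knows the version.

    The attempt bound is an assumption of this sketch so that the loop
    always terminates.
    """
    for attempt in range(max_attempts):
        node = balancers[attempt % len(balancers)]
        reply = node.handle(version)
        if reply != RETRY:
            return reply, attempt + 1
    raise TimeoutError(f"version {version} not reachable after {max_attempts} attempts")
```

With the right-hand node of Figure 3 not yet updated, a message for version N+1 that lands there is bounced once and succeeds when it next reaches the left-hand node. After the right-hand node registers its N+1 instance, no retry is needed.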

Our solution will require the client to retry messages until one of the efforts reaches an instance of version N+1. The reason for the first portion of the solution is to reduce the time window so that the number of retry messages is reduced. The solution is guaranteed as long as the client continues to retry messages. This solution represents several tradeoffs.

1. Delaying the client is chosen rather than causing an error because of the race condition.
2. Keeping top level load balancers idle until all of them can be placed in service reduces the time window in which retries are required, at the cost of the resources consumed by the idle load balancers.

Scheduling rules for load balancers

L7 load balancers such as [9] allow custom distribution policies. In our case, this custom policy will choose the endpoint based on the version number used as an endpoint indication in the message to be scheduled. It will never schedule a message targeted for a specific version with an instance with a smaller version number. Unlabeled messages may be scheduled with any top level load balancer. When we discuss the client actions, we will describe how the endpoint is specified.

Scheduling solely based on version number will avoid the mixed version race condition but will not guarantee load balancing. For example, the number of instances containing the new version will increase over the lifetime of the upgrade and the number of instances with the old version will decrease. Using a scheduling policy at the /server level which is strictly based on local information such as the dispatch of the last message, e.g. round robin, will not adapt to the changing number of instances of each version. Instead, we propose that the heartbeat protocol used between load balancers to indicate health also carry the number of instances they are currently managing. Each load balancer will then know the number of instances in the sub-tree of which it is the root and can use a standard load balancing schedule based on that number, e.g. dispatch the message to the child with the smallest number of instances in its portion of the sub-tree.
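The following Python sketch illustrates the scheduling rule described above. It is a simplified model rather than a real heartbeat protocol: the `Node` class, the `is_instance` flag, and the recursive `heartbeat` call (which pulls counts downward instead of receiving them asynchronously from children) are assumptions. The default selection rule is the paper's own example, the child with the smallest instance count, and any other count-based rule could be passed as `key`.

```python
class Node:
    """A load balancer, or a single application instance when is_instance is True."""

    def __init__(self, name, version=None, children=None, is_instance=False):
        self.name = name
        self.version = version          # set on version endpoints such as /service/vN
        self.children = children or []
        self.is_instance = is_instance
        self.instance_count = 0         # last count learned via heartbeat

    def heartbeat(self):
        """Report the number of instances in this sub-tree to the parent."""
        if self.is_instance:
            self.instance_count = 1
        else:
            self.instance_count = sum(c.heartbeat() for c in self.children)
        return self.instance_count


def eligible_children(balancer, requested_version=None):
    """Children that may serve the message: never a smaller version than requested."""
    return [
        c for c in balancer.children
        if c.instance_count > 0
        and (requested_version is None or c.version >= requested_version)
    ]


def choose_child(balancer, requested_version=None, key=lambda c: c.instance_count):
    """Pick among eligible children; the default key follows the paper's example."""
    candidates = eligible_children(balancer, requested_version)
    return min(candidates, key=key) if candidates else None
```

Because versions are assumed to be upward compatible, a message labeled for version N may be served by either the N or the N+1 endpoint, while a message labeled for N+1 is restricted to the N+1 endpoint.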

4.2 The Application

The application has three responsibilities for our solution to work:

1. The JavaScript that the application sends to the client must not contain any hard coded version numbers. Version numbers will be appended by the client when constructing a message. Hard coding the version numbers will introduce problems when the JavaScript is cached, since the client may receive a message from version N+1 but be using a cached version of the JavaScript that is unaware of the existence of version N+1.
2. The application must include its version number in each reply message that it sends to a client. The version number can be either a header item or a cookie. If it is a cookie, the expiration date must be such that the cookie value does not persist beyond the end of a session. The application can know its version number through a wide variety of simple techniques.
3. The application registers itself with a version specific load balancer once it has ensured that the load balancer exists.
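The second responsibility can be sketched in Python as below. The header name `X-App-Version`, the cookie name `app_version`, and the constant `APP_VERSION` are hypothetical, since the paper does not name them. The cookie is given no `Expires` or `Max-Age` attribute, which makes it a session cookie that browsers discard when the session ends.

```python
from http.cookies import SimpleCookie

APP_VERSION = 2  # hypothetical: however the application learns its own version


def version_header():
    """Version carried as a response header (header name is an assumption)."""
    return ("X-App-Version", str(APP_VERSION))


def version_cookie():
    """Version carried as a session cookie (cookie name is an assumption)."""
    cookie = SimpleCookie()
    cookie["app_version"] = str(APP_VERSION)
    cookie["app_version"]["path"] = "/"
    # Deliberately no "expires" or "max-age": the value must not outlive the session.
    return ("Set-Cookie", cookie["app_version"].OutputString())
```

Either function would be called for every reply, so the client always sees the version of the instance that served its latest message.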

4.3 The Client

We are assuming that the actions of the client are controlled by the mobile code received from the application and that the client is a browser. Consequently, our requirements for the actions of the client are really requirements on the mobile code sent to the client by the application. There are two such requirements.

1. The client must retrieve the application version number from each message and maintain the highest version number that it has seen in local memory so that it does not persist across sessions. If the version number is included as a cookie value, then the expiration time must be such that the cookie value is not persisted across sessions.
2. The client must append /version number to the URI of any message it sends to the application.
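The two client requirements amount to a small piece of state plus a URI rule. The sketch below is written in Python for consistency with the other examples; in a browser the same logic would live in the JavaScript sent by the application. The class name `VersionTracker`, the header name `X-App-Version`, and the `/vN` suffix format (modeled on the endpoint names in Figure 2) are assumptions.

```python
class VersionTracker:
    """Remembers the highest application version seen during one session."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self.highest = None  # kept in memory only, so it vanishes with the session

    def observe(self, reply_headers):
        """Update the stored version from a reply's headers (header name assumed)."""
        raw = reply_headers.get("X-App-Version")
        if raw is None:
            return
        version = int(raw)
        if self.highest is None or version > self.highest:
            self.highest = version

    def uri(self):
        """URI for the next message: unlabeled until a version has been seen."""
        if self.highest is None:
            return self.base_uri
        return f"{self.base_uri}/v{self.highest}"
```

Because only the maximum is kept, a late reply from an older instance cannot lower the version that the client requests next.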

5. Meeting Requirements

We identified two requirements that any solution that prevents mixed version race conditions must meet. It must dispatch messages to prevent a client interacting with instances in reverse version number order and it must preserve the load balancing while a rolling upgrade is in progress. Our solution prevents a client interacting with instances in reverse version order by controlling the dispatch of messages at the load balancer. It balances messages through using a load balancing algorithm that utilizes knowledge of the load placed on various instances without changing the fundamental characteristics of the existing load balancing scheduling algorithm.
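The first requirement can be stated as a check over a trace of served messages. The sketch below assumes a trace format of `(session_id, version)` pairs in the order they were served; the function name `first_violation` and that format are hypothetical and serve only to make the requirement concrete.

```python
def first_violation(trace):
    """Return the index of the first message served by a version lower than one
    already seen in the same session, or None if the trace meets the requirement.
    """
    highest = {}  # session id -> highest version seen so far
    for index, (session, version) in enumerate(trace):
        seen = highest.get(session, version)
        if version < seen:
            return index
        highest[session] = max(seen, version)
    return None
```

A version-unaware round robin over a half-upgraded cluster can produce a trace such as `[("a", 2), ("a", 1)]`, which this check flags, whereas a trace produced under version-aware dispatch never decreases within a session.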

6. Related work

The mixed version race condition was identified during the Second ACM Workshop on Hot Topics in Software Upgrades (HotSWUp’09) [4]. Dumitras et al. [5] described the condition very clearly and identified scheduling the load balancer as a key element in the solution. They did not present a solution to prevent the mixed version race condition but did present a model that could be used to manage the risk that it might occur.

Neamtiu and Dumitras [8] present a survey of different problems that occur during upgrade and propose explicit versioning of code and data as a programming language feature. Including the version number in response messages from the application is a portion of our solution but without the overhead of modifying a programming language.

Cadar and Hosek [1] propose a version of upgrading one partition at a time utilizing the cloud’s ability to easily add resources, but their work is focused on the problem of rolling back an upgrade that is incorrect. The mixed version race condition occurs even though both versions are bug free. The only requirement for it to occur is that the two versions are incompatible.

7. Discussion and Next Steps The most difficult to implement aspect of our solution is the modification of the heartbeat protocol used by load balancers. Given that the mixed version race condition problem is as pervasive as [2] and [4] indicate, the modification of this protocol should have sufficient political clout behind it to be easily adopted. The mixed version race condition was identified for browser facing applications. Our solution should work for different application level protocols as long as both the server and the client are able to implement the solution, whether directly or through the use of mobile code. The limits of our proposed solution should be tested with a variety of different application level protocols, as well as with HTTP.

A clear next step is to develop an implementation. The implementation will allow us to test the correctness of our solution as well as the overhead introduced by the load balancer scheduling strategy. Intuitively, this overhead is small, but measurements will enable us to provide some evidence for this assertion. The particular scheduling strategy we propose is one of many possible. How well load balancing scheduling strategies perform is an active area of research. Having version knowledgeable load balancers will add another dimension to this research.

8. Acknowledgements

We would like to thank Tudor Dumitras for pointing out several problems with the first version of our proposed solution. We would also like to thank Jeromy Carriere for confirming the continuing existence of the mixed version race condition. National ICT Australia is funded by the Australian Government’s Department of Communications, Information Technology, and the Arts and the Australian Research Council through Backing Australia’s Ability and the ICT Research Centre of Excellence programs.

9. References

[1] Cadar, C. and Hosek, P. Multi-version Software Updates. Proceedings of the Fourth International Workshop on Hot Topics in Software Upgrades, 2012.
[2] Carriere, J. Personal communication, May 2012.
[3] Dumitras, T. and Narasimhan, P. Why Do Upgrades Fail and What Can we Do about It. Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware, 2009.
[4] Dumitras, T., Neamtiu, I., Tilevich, E. Report on the Second ACM Workshop on Hot Topics in Software Upgrades (HotSWUp’09). Proceeding of the 24th ACM SIGPLAN conference companion on Object oriented programming systems languages and applications, 2009.
[5] Dumitras, T., Narasimhan, P., Tilevich, E. To Upgrade or Not to Upgrade. ACM Sigplan Notices 45(10), 2010.
[6] Dumitras, T. Personal communication.
[7] McKeag, L. Layer 7 Load Balancing, part 1. http://howto.techworld.com/networking/488/layer-7load-balancing-part-1/
[8] Neamtiu, I. and Dumitras, T. Cloud Software Upgrades: Challenges and Opportunities. Proceedings 2011 IEEE International Workshop on the Maintenance and Evolution of Service-Oriented and Cloud Based Systems.
[9] Redhat. http://docs.redhat.com/docs/enUS/Red_Hat_Enterprise_Linux/6/html/Load_Balancer_Administration/s1-lvs-scheduling-VSA.html
