ACM PODC 2017 Tutorial

High-Level Executable Specification of Distributed Algorithms

Y. Annie Liu, Scott Stoller, Bo Lin
Computer Science Department, Stony Brook University

1

Age of distributed programming

search engines, social networks, cloud computing, mobile computing, ...

distributed systems are increasingly important and challenging
at the core are distributed algorithms

2

Outline

• introduction to distributed algorithms and languages; examples include Paxos for distributed consensus
• a method for programming distributed algorithms with (1) high-level control flows as in pseudocode, and (2) precise semantics as in formal specification languages
• a language, DistAlgo, that minimally extends object-oriented languages for programming distributed algorithms, and methods for efficient implementations
• successes with an implementation of DistAlgo in Python, on well-known complex algorithms and services as examples
• with demonstrations and hands-on practice

From clarity to efficiency for distributed algorithms [OOPSLA 12, arxiv 16 / TOPLAS 17]
High-level executable specifications of distributed algorithms [SSS 12, arxiv 17]

Example problems

distributed mutual exclusion, leader election, distributed consensus, atomic commitment, replication, clock synchronization
distributed hash table (DHT), distributed file system (DFS), distributed database, MapReduce, ...
snapshot, termination detection, deadlock detection, failure detection, distributed garbage collection
routing, broadcast, ...

4

Example: distributed mutual exclusion

problem: multiple processes access a shared resource and need to access it mutually exclusively, in what is called a critical section (CS), i.e., no two processes are in CS at the same time.
• how do processes communicate?
• can processes or communications fail?
• correctness properties?
• efficiency measures?

5

Example: distributed consensus

problem: multiple processes agree on a value, or a sequence of values; essential for replicated services.
an easier problem than mutual exclusion
• same basic questions, and more can be asked
• how can processes or communications fail?
• could it be impossible to solve at all?

6

Models of computation

communication models:
• shared memory
• message passing: synchronous, asynchronous
failure models:
• process failure: crash failure, fail-stop
• link failure: message loss, message delay
• Byzantine failure: arbitrary

7

Correctness and efficiency

correctness criteria: safety, liveness, fairness
complexity measures: message complexity, round complexity

8

Kinds of algorithms

• master-slave, primary-backup
• majority, quorum
• client-server, peer-to-peer, gossip
• distributed graph algorithms, e.g., BFS, MST
• wave: e.g., floodmax, echo, traversal, ...
• deterministic, randomized, probabilistic, ...
• impossibility results, ...

9

Example algorithms for distributed mutex

based on logical timestamps: ask permission from all processes
• Lamport: 3(n − 1) messages; n: number of processes; requires no process failures and reliable FIFO channels
• Ricart-Agrawala: 2(n − 1) messages, does not require FIFO
token-based: ask permission from one process
• Ricart-Agrawala: n messages
• Suzuki-Kasami: n + 1 messages
• Raymond: O(log n)
quorum-based: ask permission from a subset of processes
• majority voting: O(n)
• Maekawa: O(√n); 3√n w/ deadlock, 5√n w/o, w/ timestamps
• Agarwal-ElAbbadi: O(log n), tolerates n − O(log n) failures

10
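As a rough sanity check of these message bounds, here is a plain-Python comparison; the function names and the sample values of n are mine, and each formula counts only the messages stated on the slide (Maekawa is shown in its deadlock-free 3√n form).

```python
import math

# Messages per critical-section entry for several distributed mutex
# algorithms, as functions of n (number of processes).
def lamport(n):          return 3 * (n - 1)       # request, ack, release to all others
def ricart_agrawala(n):  return 2 * (n - 1)       # request and reply only
def suzuki_kasami(n):    return n + 1             # n requests + 1 token transfer
def maekawa(n):          return 3 * math.isqrt(n) # ~3*sqrt(n) quorum messages

for n in (16, 64, 256):
    print(n, lamport(n), ricart_agrawala(n), suzuki_kasami(n), maekawa(n))
```

Even for modest n, the quorum- and token-based counts grow far more slowly than the timestamp-based ones.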

Example algorithms for distributed consensus

many variants of problems and algorithms:
• synchronous vs. asynchronous
• crash failures vs. Byzantine failures
• deterministic vs. randomized vs. probabilistic
Paxos for state machine replication:
• basic Paxos: agree on one value, for crash failures (n = 2f + 1)
• fast Paxos: save a round of message delay, by more msgs
• vertical Paxos: allow reconfiguration, with an aux master
• Byzantine Paxos: allow arbitrary failures (n = 3f + 1)
• multi-Paxos: agree on a sequence of values

11

Example: Lamport’s algorithm for distributed mutex

Lamport developed it to show the logical timestamps he invented.
n processes access a shared resource, need mutex, go in CS
a process that wants to enter critical section (CS):
• send requests to all
• wait for replies from all
• enter CS
• send releases to all
each process maintains a queue of requests:
• order by logical timestamps
• enter CS only if its request is the first on the queue
• when receiving a request, enqueue
• when receiving a release, dequeue
reliable, fifo channel — safety, liveness, fairness, efficiency

12
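The "enter CS only if..." test can be read as an ordinary predicate over the local state. A sketch in plain Python (not the DistAlgo code; the data representations, names, and sample values are my assumptions):

```python
def can_enter(my_req, queue, acks, others):
    """Lamport's entry condition as a predicate.

    my_req : (timestamp, pid) of this process's pending request
    queue  : set of (timestamp, pid) pending requests, including my_req
    acks   : dict pid -> latest ack timestamp received from that pid
    others : set of all other process ids
    Pairs compare lexicographically, which is exactly the total order
    on (logical time, process id).
    """
    t, me = my_req
    first = all(my_req < r for r in queue if r != my_req)   # first on the queue
    acked = all(p in acks and acks[p] > t for p in others)  # all replied after t
    return first and acked

# my request (1,'a') is earliest, and both peers acked with later times
print(can_enter((1, 'a'), {(1, 'a'), (2, 'b')}, {'b': 3, 'c': 2}, {'b', 'c'}))
```

The lexicographic comparison on (time, pid) pairs is what breaks timestamp ties, so the order is total.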

Example: multi-Paxos for dist consensus [vR11, vRA15]

[Figure 5 (time diagram, not reproduced): shows a client, two replicas, a leader (with a scout and a commander), and three acceptors, with time progressing downward. Arrows represent messages; dashed arrows are messages that end up being ignored. The leader first runs a scout in order to become active.]

13

Specifying/Programming algorithms

significant advances in programming languages: ... ALGOL ... C++ ... Java ... Python ... Prolog ...

• statements: assignments, conditionals, loops • expressions: arithmetic, Boolean, other data (sets) • subroutines: functions, procedures (recursion) • logic rules: predicates, deduction, though less used • objects: keep data and do operations (organization)

that’s mostly sequential and centralized.

14

Concurrent programming

threads: multiple threads accessing shared data

threads as concurrent objects:
• the concurrent programming model in Java
• adopted by other languages, such as Python and C#

Java made concurrent programming easier.
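For contrast with the message-passing algorithms in the rest of the tutorial, a minimal shared-memory example in Python's thread model (which follows Java's, as noted above); the counter and thread counts are arbitrary:

```python
import threading

# Two threads increment a shared counter; the Lock makes each
# read-modify-write atomic, i.e., a tiny critical section.
counter = 0
lock = threading.Lock()

def work(times):
    global counter
    for _ in range(times):
        with lock:          # critical section on shared data
            counter += 1

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 20000
```

Without the lock, lost updates would make the final count nondeterministic; this is exactly the hazard that mutual exclusion rules out.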

15

Distributed programming

low-level or complex libraries, or restricted programming models:
• sockets: C, Java, ... most widely used languages
• MPI: Fortran, C++, ... for high-performance computing
• RPC: C, ... just about any language; Java RMI
• processes: Erlang, and more theoretically studied languages ...
study of distributed algorithms, not building real applications:
• pseudocode, English: high-level but imprecise, not executable
• formal specification languages: precise but lower-level
much less progress

16

Pseudocode languages

in many textbooks and papers...

17

Formal specification languages

• TLA and PlusCal by Lamport, with the TLA Toolbox
• IOA and TIOA by Lynch’s group at MIT
• process algebra: CSP by Hoare, CCS by Milner, pi-calculus, ...
• EventML by Constable’s Nuprl group at Cornell, ...
• Promela by Holzmann, originally at AT&T Bell Labs
• Alloy by Jackson’s group at MIT
• Dafny by Leino at Microsoft Research, ...
• ProVerif at INRIA, Scyther at Oxford, Tamarin at ETH Zurich, AVISPA, ...

18

Programming languages

• Erlang by Joe Armstrong et al. at Ericsson
• Argus by Barbara Liskov et al. at MIT
• Emerald, from Eden, by Andrew Black et al. at U Washington
• Lynx by Michael Scott at U Rochester
• SR and MPD by Greg Andrews et al. at U Arizona
• Hermes and Concert/C at IBM Watson
• DML, on top of CML, at Cornell, ...
• Overlog by Joe Hellerstein et al. at UC Berkeley
• Bloom and Dedalus, also by Joe Hellerstein et al.
• Meld by Seth Goldstein et al. at CMU, ...
• Unity, Seuss, Mozart, ...
• libraries in C, C++, Java, Python, ...: socket, MPI, ...

19

Languages for distributed algorithms

should combine the best of all of:
• pseudocode languages: high-level control flow
• specification languages: precise semantics
• programming languages: executable

should do so for solving hard problems in general

20

DistAlgo

a new language, simple, powerful: high-level, precise, executable
• distributed processes as objects, sending messages
• yield points for control flow, handling of received messages
• await and synchronization conditions as queries of msg history
• high-level constructs for system configuration

powerful new optimization: generate efficient implementations
transform expensive synchronization conditions into efficient handlers as messages are sent and received, by incrementalizing queries, especially logic quantifications, via incremental aggregate ops on sophisticated data structures

experiments with well-known algorithms, many simplified too, including Paxos and multi-Paxos for distributed consensus

21

Example: distributed mutual exclusion

Lamport’s algorithm: developed to show logical timestamps
n processes access a shared resource, need mutex, go in CS
a process that wants to enter critical section (CS):
• send requests to all
• wait for replies from all
• enter CS
• send releases to all
each process maintains a queue of requests:
• order by logical timestamps
• enter CS only if its request is the first on the queue
• when receiving a request, enqueue
• when receiving a release, dequeue
reliable, fifo channel — safety, liveness, fairness, efficiency

22

How to express it

two extremes:
• English: clear high-level flow; imprecise, informal
• state-machine-based specs: precise; low-level control flow, e.g., Nancy Lynch’s I/O automata (1 1/5 pages, most 2-col.)
many in between, e.g.:
• Michel Raynal’s pseudocode: still informal and imprecise
• Leslie Lamport’s PlusCal on top of TLA+: still complex (90 lines excluding comments and empty lines, by Merz)
• Robbert van Renesse’s pseudocode: precise, partly high-level
all lack concepts for building real systems — much more complex
most of these are not executable at all.

23

Lamport’s original description in English [Lam78-CACM]

The algorithm is then defined by the following five rules. For convenience, the actions defined by each rule are assumed to form a single event.
1. To request the resource, process Pi sends the message Tm:Pi requests resource to every other process, and puts that message on its request queue, where Tm is the timestamp of the message.
2. When process Pj receives the message Tm:Pi requests resource, it places it on its request queue and sends a (timestamped) acknowledgment message to Pi.
3. To release the resource, process Pi removes any Tm:Pi requests resource message from its request queue and sends a (timestamped) Pi releases resource message to every other process.
4. When process Pj receives a Pi releases resource message, it removes any Tm:Pi requests resource message from its request queue.
5. Process Pi is granted the resource when the following two conditions are satisfied: (i) There is a Tm:Pi requests resource message in its request queue which is ordered before any other request in its queue by the relation <. (To define the relation < for messages, we identify a message with the event of sending it.) (ii) Pi has received an acknowledgment message from every other process timestamped later than Tm.
Note that conditions (i) and (ii) of rule 5 are tested locally by Pi.

There will be an interesting exercise later.

24

Challenges

each process must:
• act as both Pi and Pj in interactions with all other processes
• have an order of handling all events by the 5 rules, trying to enter and exit CS while also responding to msgs from others
• keep testing the complex condition in rule 5 as events happen
actual implementations need many more details:
• create processes, let them establish channels with each other
• incorporate appropriate clocks (e.g., Lamport, vector) if needed
• guarantee the specified channel properties (e.g., reliable, FIFO)
• integrate the algorithm with the overall application
how to do all of these in an easy and modular fashion?
• for both correctness verification and performance optimization

Original algorithm in DistAlgo

 1   def setup(s):
 2     self.s = s                                    # set of all other processes
 3     self.q = {}                                   # set of pending requests with logical clock

 4   def cs(task):                                   # for doing task() in critical section
 5     -- request
 6     self.t = logical_time()                       # rule 1
 7     send ('request', t, self) to s                #
 8     q.add(('request', t, self))                   #
 9     await each ('request',t2,p2) in q | (t2,p2) != (t,self) implies (t,self) < (t2,p2)
10       and each p2 in s | some received('ack',t2,=p2) | t2 > t   # rule 5
11     task()                                        # critical section
12     -- release
13     q.del(('request', t, self))                   # rule 3
14     send ('release', logical_time(), self) to s   #

15   receive ('request', t2, p2):                    # rule 2
16     q.add(('request', t2, p2))                    #
17     send ('ack', logical_time(), self) to p2      #

18   receive ('release', _, p2):                     # rule 4
19     q.del(('request', _, =p2))                    #

26

Complete program in DistAlgo

 0   class P extends process:
       ...                                           # content of the previous slide
20     def run(): ...
21       def task(): ...
22       cs(task) ...

23   def main(): ...
24     configure channel = {reliable, fifo}
25     configure clock = Lamport
26     ps = 50 new P
27     for p in ps: p.setup(ps-{p})
28     ps.start() ...

some syntax in Python:
  class P( process )
  send( m, to= ps )
  some( elem in s, has= bexp )
  config( channel= {'reliable','fifo'} )
  new( P, num= 50 )

27

Optimized algorithm after incrementalization

 0   class P extends process:
 1     def setup(s):
 2       self.s = s                                  # self.q was removed
 3       self.total = size(s)                        # total number of other processes
 4       self.ds = new DS()                          # aux DS for maint min of requests by other processes

 5     def cs(task):
 6       -- request
 7       self.t = logical_time()
 8       self.responded = {}                         # set of responded processes
 9       self.count = 0                              # count of responded processes
10       send ('request', t, self) to s              # q.add(...) was removed
11       await (ds.is_empty() or (t,self) < ds.min()) and count == total   # use maintained results
12       task()
13       -- release
14       send ('release', logical_time(), self) to s # q.del(...) was removed

15     receive ('request', t2, p2):
16       ds.add((t2,p2))                             # add to the auxiliary data structure
17       send ('ack', logical_time(), self) to p2    # q.add(...) was removed

18     receive ('ack', t2, p2):                      # new message handler
19       if t2 > t:                                  # test comparison in condition 2
20         if p2 in s:                               # test membership in condition 2
21           if p2 not in responded:                 # test whether responded already
22             responded.add(p2)                     # add to responded
23             count += 1                            # increment count

24     receive ('release', _, p2):                   # q.del(...) was removed
25       ds.del((_,=p2))                             # remove from the auxiliary data structure

28

Simplified algorithm by un-incrementalization

 0   class P extends process:
 1     def setup(s):
 2       self.s = s

 3     def cs(task):
 4       -- request
 5       self.t = logical_time()
 6       send ('request', t, self) to s
 7       await each received('request',t2,p2) | not (some received('release',t3,=p2) | t3 > t2) implies (t,self) < (t2,p2)
 8         and each p2 in s | some received('ack',t2,=p2) | t2 > t
 9       task()
10       -- release
11       send ('release', logical_time(), self) to s

12     receive ('request', _, p2):
13       send ('ack', logical_time(), self) to p2

29

Better simplified algorithm

 0   class P extends process:
 1     def setup(s):
 2       self.s = s

 3     def cs(task):
 4       -- request
 5       self.t = logical_time()
 6       send ('request', t, self) to s
 7       await each received('request',t2,p2) | not received('release',t2,p2) implies (t,self) < (t2,p2)
 8         and each p2 in s | some received('ack',t2,=p2) | t2 > t
 9       task()
10       -- release
11       send ('release', t, self) to s

12     receive ('request', _, p2):
13       send ('ack', logical_time(), self) to p2

30

Even better simplified algorithm

 0   class P extends process:
 1     def setup(s):
 2       self.s = s

 3     def cs(task):
 4       -- request
 5       self.t = logical_time()
 6       send ('request', t, self) to s
 7       await each received('request',t2,p2) | not received('release',t2,p2) implies (t,self) < (t2,p2)
 8         and each p2 in s | some received('ack',=t,=p2)
 9       task()
10       -- release
11       send ('release', t, self) to s

12     receive ('request', t2, p2):
13       send ('ack', t2, self) to p2

31

Optimized w/o queue after incrementalization

 0   class P extends process:
 1     def setup(s):
 2       self.s = s
 3       self.q = {}                                 # self.q is kept as a set, no aux ds
 4       self.total = size(s)                        # total num of other processes

 5     def cs(task):
 6       -- request
 7       self.t = logical_time()
 8       self.earlier = q                            # set of pending earlier reqs
 9       self.count1 = size(earlier)                 # num of pending earlier reqs
10       self.responded = {}                         # set of responded processes
11       self.count = 0                              # num of responded processes
12       send ('request', t, self) to s
13       q.add(('request', t, self))                 # q.add is kept, no aux ds.add
14       await count1 == 0 and count == total        # use maintained results
15       task()
16       -- release
17       q.del(('request', t, self))                 # q.del is kept, no aux ds.add
18       send ('release', logical_time(), self) to s

19     receive ('request', t2, p2):
20       if t != undefined:                          # if t is defined
21         if (t,self) > (t2,p2):                    # test comparison in conjunct 1
22           if ('request',t2,p2) not in earlier:    # if not in earlier
23             earlier.add(('request',t2,p2))        # add to earlier
24             count1 += 1                           # increment count1
25       q.add(('request',t2,p2))                    # q.add is kept, no aux ds.add
26       send ('ack', logical_time(), self) to p2

27     receive ('ack', t2, p2):                      # new message handler
28       if t2 > t:                                  # test comparison in conjunct 2
29         if p2 in s:                               # test membership in conjunct 2
30           if p2 not in responded:                 # test whether responded already
31             responded.add(p2)                     # add to responded
32             count += 1                            # increment count

33     receive ('release', _, p2):
34       if t != undefined:                          # if t is defined
35         if (t,self) > (t2,p2):                    # test comparison in conjunct 1
36           if ('request',t2,p2) in earlier:        # if in earlier
37             earlier.del(('request',t2,p2))        # delete from earlier
38             count1 -= 1                           # decrement count1
39       q.del(('request',_,=p2))                    # q.del is kept, no aux ds.del

32

DistAlgo language overview

as extensions to common object-oriented languages, including a syntax for extensions to Python

1. distributed processes and sending messages

2. control flows and receiving messages

3. high-level queries of message histories

4. configurations

33

1. Distributed processes, sending messages

process definition:
  class p extends process: process body
  class p (process): process body

setup, run, self

process creation, setup, and start:
  v = n new p at node_exp          v = new(p, num=n, at=node_exp)
  pexp.setup(args)                 setup(pexp, (args))
  pexp.start()                     start(pexp)

sending messages (usually tuples):
  send mexp to pexp                send(mexp, to=pexp)

34

2. Control flows, receiving messages

yield point with label:
  -- l :                           -- l

handling messages received:
  receive mexp from pexp at l1,...,lj : handler body
  def receive(msg=mexp, from=pexp, at=(l1,...,lj)): handler body

synchronization (nondeterminism):
  await bexp                       await(bexp)
  await bexp1: stmt1 or ... or bexpk: stmtk timeout t: stmt
  if await(bexp1): stmt1 elif ... elif bexpk: stmtk elif timeout(t): stmt

35
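One way to picture these constructs is a process that drains its inbox at yield points and polls the awaited condition until a timeout. The following Python encoding is purely illustrative (the class, queue-based inbox, and message format are my assumptions, not how DistAlgo is implemented):

```python
import queue
import threading
import time

# A toy "process": a thread that handles queued messages at yield
# points and awaits a condition with a timeout.
class Proc(threading.Thread):
    def __init__(self):
        super().__init__()
        self.inbox = queue.Queue()
        self.pings = 0
        self.done = False

    def receive(self, msg):                  # like `receive mexp: handler body`
        if msg[0] == 'ping':
            self.pings += 1

    def await_(self, cond, timeout=2.0):     # like `await bexp ... timeout t`
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:                             # yield point: handle one queued message
                self.receive(self.inbox.get(timeout=0.01))
            except queue.Empty:
                pass
            if cond():
                return True
        return False                         # timed out

    def run(self):
        self.done = self.await_(lambda: self.pings >= 2)

p = Proc()
p.start()
p.inbox.put(('ping',))                       # like `send ('ping',) to p`
p.inbox.put(('ping',))
p.join()
print(p.done)
```

The point of the real language is that the programmer writes only the condition; the scheduling of handlers around yield points is the runtime's job.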

3. High-level queries of message histories

message sequences: received, sent
  received mexp from pexp          received(mexp, from=pexp)
  mexp from pexp in received       (mexp, pexp) in received

1) comprehensions:
  {exp : v1 in sexp1, ..., vk in sexpk, bexp}
  setof(exp, v1 in sexp1, ..., vk in sexpk, bexp)
2) aggregates:
  agg_op comprehension_exp         agg_op(comprehension_exp)
3) quantifications:
  some v1 in sexp1, ..., vk in sexpk has bexp
  each v1 in sexp1, ..., vk in sexpk has bexp
  some(v1 in sexp1, ..., vk in sexpk, has=bexp)
  each(v1 in sexp1, ..., vk in sexpk, has=bexp)

tuple patterns, left side of membership clause

36
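These queries map closely onto Python's own comprehensions, len, and any/all. A small illustration on an ordinary set standing in for the received history (the sample messages and values are made up):

```python
# An ordinary set standing in for the `received` message history:
received = {('ack', 3, 'b'), ('ack', 2, 'c'), ('request', 1, 'b')}
s = {'b', 'c'}
t = 1

# comprehension: {p : ('ack',t2,p) in received | t2 > t}
acked = {p for (tag, t2, p) in received if tag == 'ack' and t2 > t}

# aggregate over a comprehension
num_acked = len(acked)

# quantifications: `some ... has ...` is any(), `each ... has ...` is all()
some_ack = any(tag == 'ack' for (tag, _, _) in received)
each_acked = all(p in acked for p in s)

print(acked, num_acked, some_ack, each_acked)
```

The difference is that DistAlgo's patterns (e.g. `=p2`) match against bound values inside the query, which plain comprehensions express with explicit equality tests as above.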

4. Configurations

channel types:
  configure channel = fifo         config(channel='fifo')
  (default is not FIFO or reliable)
message handling:
  configure handling = all         config(handling='all')
  (this is the default)
logical clocks:
  configure clock = Lamport        config(clock='Lamport')
  (call logical_time() to get the logical time)

overall: .da files contain process definitions, method main, and conventional parts; main contains configurations and process creation, setup, and start

37

Optimization by incrementalization

• introduce variables to store values of queries
• transform the queries to use introduced variables
• incrementally maintain stored values at each update

new: systematic handling of
1. quantifications for synchronization as expensive queries
2. updates caused by sending, receiving, and handling of msgs, in the same way as other updates in the program

transform expensive synchronization conditions into efficient tests and incremental updates as msgs are sent and received
sequences received and sent will be removed as appropriate; only values needed for incremental computation of synchronization conditions will be stored and incrementally updated

38
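The transformation in miniature, in plain Python (the class and harness are mine; the field names follow the optimized slide earlier): instead of re-evaluating "each p2 in s has some received ('ack',t2,p2) with t2 > t" on every await, maintain a set and a counter at each receive.

```python
# Incremental maintenance of one synchronization conjunct.
class Mutex:
    def __init__(self, s, t):
        self.s, self.t = s, t
        self.responded, self.count = set(), 0

    def on_ack(self, t2, p2):        # O(1) work per 'ack' message
        if t2 > self.t and p2 in self.s and p2 not in self.responded:
            self.responded.add(p2)
            self.count += 1

    def condition(self):             # O(1) test instead of a query over history
        return self.count == len(self.s)

m = Mutex({'b', 'c'}, t=1)
m.on_ack(2, 'b'); m.on_ack(2, 'b'); m.on_ack(3, 'c')   # duplicate ack is ignored
print(m.condition())
```

The stored `count` is exactly the "introduced variable" of the first bullet; the guarded updates in `on_ack` are the incremental maintenance of the third.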

High-level executable specifications of distributed algorithms

exploit high-level abstractions of computation and control:
1. high-level synchronization with explicit wait on received msgs
2. high-level conditions for when to send msgs and take actions
3. high-level queries for what to send in msgs to whom
4. collective send-actions for overall computation and control

experiments with important distributed algorithms, including Paxos, discovered errors and improvements, e.g., a liveness violation in [vRA15-ACMCS] for multi-Paxos
exercise: find a correctness violation on Slide 24 for mutex

39

Implementations of Lamport’s algorithm

Language  Dist. programming features used        Total  Clean
C         TCP socket library                       358    272
Java      TCP socket library                       281    216
Python    multiprocessing package                  165    122
Erlang    built-in message passing                 177     99
PlusCal   single-process simulation with array     134     90
DistAlgo  built-in high-level synchronization       48     32

program size in total number of lines of code, and number of lines excluding comments and empty lines

40

Program size for well-known algorithms

Algorithm   DistAlgo  PlusCal  IOA
La mutex       32        90     64
La mutex2      33
RA mutex       35
RA token       43
SK token       42
CR leader      30
HS leader      56
2P commit      44        68
DS crash       22
La Paxos       43        83
CL Paxos       63       166
vR Paxos      160

number of lines excluding comments and empty lines, compared with specifications written by others in other languages (Overlog and Bloom specifications were also compared)

41

Compilation time and generated prog. sizes

Algorithm   Compilation  DistAlgo  Compiled  Incremental-
            time (ms)    size      size      ized size
La mutex        13.3        32       1395       1424
La mutex2       15.3        33       1402       1433
RA mutex        12.3        35       1395       1395
RA token        12.9        43       1402       1402
SK token        16.5        42       1405       1407
CR leader       10.7        30       1395       1395
HS leader       18.7        56       1415       1415
2P commit       21.4        44       1432       1437
DS crash        10.5        22       1399       1414
La Paxos        20.7        43       1428       1498
CL Paxos        32.3        63       1480       1530
vR Paxos        43.4       160       1555       1606

compilation time not including incrementalization time (all < 30 s), and numbers of lines excluding comments and empty lines of generated programs (including 1300 lines of fixed library code)

Performance of generated implementation

[Plot not reproduced: running time (sec) and memory (kB) of the original and incrementalized programs, versus number of processes from 25 to 150.]

running time and memory usage for Lamport’s algorithm: CPU time for each process to complete a call to cs(task), including time spent handling messages from other processes, averaged over processes and over runs of 30 calls each; raw size of all data structures created, measured using Pympler

Example grad/undergrad projects in DistAlgo

Algorithm   Description
Paxos       fast; vertical; Byzantine
Chubby      Google’s distributed lock service
ZooKeeper   Apache distributed coordination service
Chord       distributed hash table (DHT)
Tapestry    DHT
Pastry      DHT
Kademlia    DHT, used in BitTorrent
Dynamo      Amazon’s distributed key-value store
GFS         Google file system
HDFS        Hadoop distributed file system
Bigtable    Google’s distributed storage for structured data
Cassandra   Apache distributed database
Megastore   Google’s transactional storage for interactive services
MapReduce   simplified data processing on large clusters
AODV        wireless mesh network routing
NS key      Needham-Schroeder public/shared key protocols

each is about 300-600 lines, took about half a semester.

DistAlgo—summary

specifying/programming distributed algorithms needs both clarity (both high-level and precise) and efficiency

DistAlgo: a new language, simple, powerful
• distributed processes and sending messages
• yield points and handling received messages
• await and synchronization conditions as queries of msg history
• high-level constructs for system configuration

powerful new optimization: transform expensive synchronization conditions into efficient handlers as messages are sent and received, by incrementalizing queries, especially logic quantifications, via incremental aggregate ops on sophisticated data structures

experiments with well-known distributed algorithms, including Paxos and multi-Paxos for distributed consensus

45

DistAlgo resources

http://distalgo.sourceforge.net

http://github.com/DistAlgo (see README)
• can download, unzip, and run the script without installation
• to install: add to the Python path, or run python setup.py install
• or skip the download if you have pip: run pip install pyDistAlgo

http://sites.google.com/site/distalgo
• tutorial (to update)
• language description
• formal operational semantics

46

Ongoing and future work

• formal verification of higher-level algorithm specifications, by translating to TLA and other languages of verifiers
• many additional, improved analyses and optimizations: type analysis, dead-code analysis, cost analysis, ...
• languages for more advanced computations: security protocols, probabilistic inference, ...
• generating implementations in lower-level languages: C, Erlang, ...
• deriving optimized distributed algorithms, reducing message complexity and round complexity
• formal verification of multi-Paxos [FM 2016], ...
• demand-driven incremental object queries [PPDP 2016], ...

47

Thanks !

48

Expensive queries using quantifications

expensive computation of synchronization condition:
  each ('request',t2,p2) in q | (t2,p2) != (t,self) implies (t,self) < (t2,p2)
  and each p2 in s | some received('ack',t2,p2) | t2 > t

all updates to variables used by expensive computations:
   2   self.s = s
   3   self.q = {}
   6   self.t = logical_time()
   8   q.add(('request', t, self))
  13   q.del(('request', t, self))
  16   q.add(('request', t2, p2))
  19   q.del(('request', _, p2))
   *   received.add(('ack',t2,p2))

transform queries into efficient incremental computation at updates. How?

49
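For comparison, the expensive form evaluated from scratch in plain Python (a sketch; the sample data is made up): each test walks the whole queue and the whole history, so re-running it at every update is what incrementalization avoids.

```python
# The synchronization condition evaluated from scratch each time.
def ready(q, received, s, t, me):
    first = all((t, me) < (t2, p2)                   # conjunct 1, over all of q
                for (tag, t2, p2) in q if (t2, p2) != (t, me))
    acked = all(any(tag == 'ack' and p == p2 and t2 > t   # conjunct 2, over all
                    for (tag, t2, p) in received)         # of received, per p2
                for p2 in s)
    return first and acked

q = {('request', 1, 'a'), ('request', 2, 'b')}
received = {('ack', 3, 'b'), ('ack', 2, 'c')}
print(ready(q, received, {'b', 'c'}, 1, 'a'))
```

Each evaluation is O(|q| + |s|·|received|); the incrementalized version replaces it with O(1) tests on maintained values.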

Incrementalization of quantifications

transform quantifications into aggregates:
  ({(t2,p2) : ('request',t2,p2) in q | (t2,p2) != (t,self)} == {}
   or (t,self) < min({(t2,p2) : ('request',t2,p2) in q | (t2,p2) != (t,self)}))
  and size({p2 : p2 in s, ('ack',t2,=p2) in received | t2 > t}) == size(s)

without queue:
  size({('request',t2,p2) in q | (t,self) > (t2,p2)}) == 0 and ...

use incrementally maintained query results:
  (ds.is_empty() or (t,self) < ds.min()) and count == total

without queue:
  count1 == 0 and ...

use max and min if no deletion — maintain a single value, not a set

50

Example: basic Paxos, Lamport’s description [Lam01-SIGACT News]

Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.

Phase 1. (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2. (a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

... To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. ...
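The acceptor's side of the two phases reads as a small state machine over two fields. A hedged Python sketch of rules 1(b) and 2(b) above (message plumbing omitted; the class, method names, and return values are my assumptions):

```python
# One acceptor's rules from the description above, as a plain class.
class Acceptor:
    def __init__(self):
        self.promised = -1       # highest prepare number responded to
        self.accepted = None     # highest-numbered accepted (n, v), if any

    def on_prepare(self, n):     # Phase 1(b)
        if n > self.promised:
            self.promised = n
            return ('promise', n, self.accepted)
        return None              # stale prepare: ignore

    def on_accept(self, n, v):   # Phase 2(b)
        if n >= self.promised:   # not promised to anyone with a larger number
            self.promised = n
            self.accepted = (n, v)
            return ('accepted', n, v)
        return None

a = Acceptor()
print(a.on_prepare(1))           # ('promise', 1, None)
print(a.on_accept(1, 'x'))       # ('accepted', 1, 'x')
print(a.on_prepare(1))           # None: already responded to number 1
```

Returning the highest-numbered accepted proposal in the promise is what lets the proposer pick v safely in phase 2(a).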

basic Paxos, in DistAlgo

 1  process Proposer:
 2    def setup(acceptors):                         # take in set of acceptors
 3      self.n := undefined                         # proposal number
 4      self.majority := acceptors                  # any majority of acceptors, so all is fine
 5    def run():
 6      n := self                                   # Phase 1a: select proposal num n
 7      send ('prepare',n) to majority              # send prepare n to majority
 8      await count {a: received ('respond',=n,_) from a}     # Phase 2a: wait to receive
 9            > (count acceptors)/2:                # responses to n from majority
10        v := any ({v: received ('respond',=n,(n2,v)),       # find in responses, value in
11                      n2 = max {n2: received ('respond',=n,(n2,_))} }   # max num'd proposal
12                  or {any 1..100})                # or any value, here in 1..100
13        responded := {a: received ('respond',=n,_) from a}  # find responded
14        send ('accept',n,v) to responded          # send accept for proposal n,v

15  process Acceptor:
16    def setup(learners): pass                     # take in set of learners
17    def run(): await false                        # wait only to handle messages
18    receive ('prepare',n) from p:                 # Phase 1b: receive prepare n
19      if each sent ('respond',n2,_) has n > n2:   # if n > each responded n2
20        max_prop := any {(n,v): sent ('accepted',n,v),      # find max numbered proposal
21                              n = max {n: sent ('accepted',n,_)} }
22        send ('respond',n,max_prop) to p          # respond with n,max_prop
23    receive ('accept',n,v):                       # Phase 2b: receive accept n,v
24      if not some sent ('respond',n2,_) has n2 > n:   # if not responded with larger n2
25        send ('accepted',n,v) to learners         # send accepted n,v to learners

26  process Learner:
27    def setup(acceptors): pass                    # take in set of acceptors
28    def run():
29      await some received ('accepted',n,v) has    # wait for some proposal that
30            count {a: received ('accepted',=n,=v) from a}   # has been accepted by
31            > (count acceptors)/2:                # majority of acceptors
32        output('learned',n,v)                     # output accepted proposal num n and value v

33  def main():
34    acceptors := 3 new Acceptor                   # create 3 Acceptor processes
35    proposers := 3 new Proposer(acceptors)        # create 3 Proposer processes, pass in acceptors
36    learners := 3 new Learner(acceptors)          # create 3 Learner processes, pass in acceptors
37    acceptors.setup(learners)                     # to acceptors, pass in learners
38    (acceptors + proposers + learners).start()    # start acceptors, proposers, learners

53

Example: commander in multi-Paxos [vR11, vRA15-ACMCS but replace p with c]

process Commander(λ, acceptors, replicas, ⟨b, s, p⟩)
  var waitfor := acceptors;
  ∀α ∈ acceptors : send(α, ⟨p2a, self(), ⟨b, s, p⟩⟩);
  for ever
    switch receive()
      case ⟨p2b, α, b′⟩ :
        if b′ = b then
          waitfor := waitfor − {α};
          if |waitfor| < |acceptors|/2 then
            ∀ρ ∈ replicas : send(ρ, ⟨decision, s, p⟩);
            exit();
          end if;
        else
          send(λ, ⟨preempted, b′⟩);
          exit();
        end if;
      end case
    end switch
  end for
end process

process Scout(λ, acceptors, b)
  var waitfor := acceptors, pvalues := ∅;
  ∀α ∈ acceptors : send(α, ⟨p1a, self(), b⟩);
  for ever
    switch receive()
      case ⟨p1b, α, b′, r⟩ :
        if b′ = b then
          pvalues := pvalues ∪ r;
          waitfor := waitfor − {α};
          if |waitfor| < |acceptors|/2 then
            send(λ, ⟨adopted, b, pvalues⟩);
            exit();
          end if;
        else
          send(λ, ⟨preempted, b′⟩);
          exit();
        end if;
      end case
    end switch
  end for
end process

Figure 3: (a) Pseudo code for a commander. Here λ is the identifier of its leader, acceptors the set of acceptor identifiers, replicas the set of replicas, and ⟨b, s, p⟩ the pvalue the commander is responsible for. (b) Pseudo code for a scout.

54

commander in multi-Paxos, in DistAlgo

 1  class Commander extends process:
 2    def setup(leader, acceptors, replicas, b, s, p): pass
 3    def run():
 4      send ('p2a', b, s, p) to acceptors
 5      await count {a: received ('p2b', =b) from a} > (count acceptors)/2:
 6        send ('decision', s, p) to replicas
 7      or received ('p2b', b2) and b2 != b:
 8        send ('preempted', b2) to leader

no local update — synchronization condition clear
exercise: find a bug on Slide 52 if < is integer division

55

Example: multi-Paxos pseudocode in [vRA15-ACMCS]

http://dl.acm.org/citation.cfm?id=2673577 or
http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
  Figures 1, 4, 6, 7 and two definitions in the middle of page 14
corresponding code in Java: http://paxos.systems/code.html

made simple in DistAlgo: [arxiv 17] http://arxiv.org/abs/1704.00082, Figure 3

56

Example: two-phase commit

a coordinator and a set of cohorts try to commit a transaction
phase 1:
• coordinator sends a prepare to all cohorts.
• each cohort replies with a ready vote if it is prepared to commit, or else replies with an abort vote and aborts.
phase 2:
• if coordinator receives a ready vote from all cohorts, it sends a commit to all cohorts; each cohort commits and sends a done to coordinator; coordinator completes when it receives a done from all cohorts.
• if coordinator receives an abort vote from any cohort, it sends an abort to all cohorts that sent a ready vote; each cohort that sent a ready vote aborts.
agreement, validity, weak termination, 4n − 4 msgs

57
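The coordinator's phase-2 decision is a pure function of the votes, and the message bound is simple arithmetic. A minimal Python sketch (vote collection, logging, and failure handling are elided; the names are mine):

```python
# Phase-2 decision: commit iff every cohort voted ready.
def decide(votes):
    """votes: dict cohort -> 'ready' or 'abort'."""
    return 'commit' if all(v == 'ready' for v in votes.values()) else 'abort'

print(decide({'c1': 'ready', 'c2': 'ready'}))   # commit
print(decide({'c1': 'ready', 'c2': 'abort'}))   # abort

# Message count on the commit path: prepare, vote, commit, done,
# i.e., 4 messages per cohort; with n-1 cohorts that is 4(n-1) = 4n-4.
def messages(n_cohorts):
    return 4 * n_cohorts

print(messages(3))   # 12
```

Note that `all()` on an empty vote set returns True, so a coordinator with zero cohorts trivially commits; a real implementation would likely guard that case.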

How to express it

two extremes, and many in between

1. English: clear high-level flow; imprecise, informal
2. state-machine-based specs: precise; low-level control flow
   Nancy Lynch’s I/O automata: book p183-184, but 2n-2 msgs

in between:
• Michel Raynal’s pseudocode: still informal and imprecise
• Leslie Lamport’s PlusCal: still complex (P2TwoPhase, 68 lines excluding comments and empty lines)
• Robbert van Renesse’s pseudocode: precise, almost high-level

lack concepts for building real systems — much more complex
most of these are not executable at all.

58

Original description in English

Summary of the protocol [KBL06 DB and TP]

Phase 1:
1. The coordinator sends a prepare message to all cohorts.
2. Each cohort waits until it receives a prepare message from the coordinator. If it is prepared to commit, it forces a prepared record to its log, enters a state in which it cannot be aborted by its local control, and sends “ready” in the vote message to the coordinator. If it cannot commit, it appends an abort record to its log. Or it might already have aborted. In either case, it sends “aborting” in the vote message to the coordinator, rolls back any changes the subtransaction has made to the database, releases the subtransaction’s locks, and terminates its participation in the protocol.

Phase 2:
1. The coordinator waits until it receives votes from all cohorts. If it receives at least one “aborting” vote, it decides to abort, sends an abort message to all cohorts that voted “ready”, deallocates the transaction record in volatile memory, and terminates its participation in the protocol. If all votes are “ready”, the coordinator decides to commit (and stores that fact in the transaction record), forces a commit record (which includes a copy of the transaction record) to its log, and sends a commit message to each cohort.
2. Each cohort that voted “ready” waits to receive a message from the coordinator. If a cohort receives an abort message, it rolls back any changes the subtransaction has made to the database, appends an abort record to its log, releases the subtransaction’s locks, and terminates its participation in the protocol. If the cohort receives a commit message, it forces a commit record to its log, releases all locks, sends a done message to the coordinator, and terminates its participation in the protocol.
3. If the coordinator committed the transaction, it waits until it receives a done message from all cohorts. Then it appends a completion record to its log, deletes the transaction record from volatile memory, and terminates its participation in the protocol.

Original algorithm in DistAlgo

class Coordinator extends process:
    def setup(tid, cohorts): pass  # transaction id and cohorts
    def run():
        send ('prepare',tid) to cohorts
        await each c in cohorts | received('vote',_,=tid) from c
        if each c in cohorts | received('vote','ready',=tid) from c:
            send ('commit',tid) to cohorts
            await each c in cohorts | received('done',=tid) from c
            print('complete'+tid)
        else:
            s = {c in cohorts | received('vote','ready',=tid) from c}
            send ('abort',tid) to s
        print('terminate'+tid)

class Cohort extends process:
    def setup(f): pass  # failure rate
    def run():
        await(False)

    receive ('prepare',tid) from c:
        if prepared(tid):
            send ('vote','ready',tid) to c  # await 'commit' or 'abort' after this?
        else:
            send ('vote','abort',tid) to c
            abort(tid)

    receive ('commit',tid) from c:
        commit(tid)
        send ('done',tid) to c

    receive ('abort',tid):
        abort(tid)

    def prepared(tid): return randint(0,100) > f
    def abort(tid): print('abort'+tid)
    def commit(tid): print('commit'+tid)

Complete program in DistAlgo

from random import randint

... # content of the previous slide

def main():
    cs = 25 new Cohort(10)     # create 25 cohorts
    c = new Coordinator(0,cs)  # create 1 coordinator
    cs.start()                 # start cohorts
    c.start()                  # start coordinator

Optimized after incrementalization (part 1)

class Coordinator extends process:
    def setup(tid, cohorts):
        ncohorts = size(cohorts)  # number of cohorts
        svoted = {}               # set of voted cohorts
        nvoted = 0                # number of voted cohorts
        sready = {}               # set of ready cohorts
        nready = 0                # number of ready cohorts
        sdone = {}                # set of done cohorts
        ndone = 0                 # number of done cohorts

    def run():
        send ('prepare',tid) to cohorts
        await nvoted == ncohorts        # replaced universal quantification
        if nready == ncohorts:          # replaced universal quantification
            send ('commit',tid) to cohorts
            await ndone == ncohorts     # replaced universal quantification
            print('complete'+tid)
        else:
            s = sready                  # replaced set query
            send ('abort',tid) to s
        print('terminate'+tid)

62

Optimized after incrementalization (part 2)

    # new message handler
    receive ('vote',v,tid) from c:
        if c in cohorts:
            if c not in svoted:
                svoted.add(c)
                nvoted += 1
                if v == 'ready':
                    if c not in sready:
                        sready.add(c)
                        nready += 1

    # new message handler
    receive ('done',tid) from c:
        if c in cohorts:
            if c not in sdone:
                sdone.add(c)
                ndone += 1

class Cohort extends process:
    ... # no change

def main():
    ... # no change

63
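The transformation shown in these two slides can be mimicked in plain Python (illustrative class and field names, not generated DistAlgo code): each message handler updates a set and a count in O(1), so the await condition becomes a constant-time comparison instead of a quantification over all cohorts.

```python
class CoordinatorState:
    """Incrementally maintains the result of
    'each c in cohorts | received(('vote',_,tid)) from c'
    as counts updated on each arriving message."""
    def __init__(self, cohorts):
        self.cohorts = set(cohorts)
        self.svoted, self.nvoted = set(), 0   # voted cohorts and their count
        self.sready, self.nready = set(), 0   # ready cohorts and their count

    def on_vote(self, c, v):
        # mirrors the generated 'vote' handler: O(1) work per message,
        # duplicates and messages from non-cohorts are ignored
        if c in self.cohorts and c not in self.svoted:
            self.svoted.add(c)
            self.nvoted += 1
            if v == 'ready':
                self.sready.add(c)
                self.nready += 1

    def all_voted(self):
        # replaces the universal quantification:
        # O(1) instead of O(|cohorts|) per evaluation
        return self.nvoted == len(self.cohorts)

s = CoordinatorState(['c1', 'c2'])
s.on_vote('c1', 'ready')
s.on_vote('c2', 'abort')
print(s.all_voted(), s.nready)   # True 1
```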

Performance of generated implementation

[graph: running time (s, 0 to 0.07) vs. total number of cohorts (25 to 150), for Original(Commit), Original(Abort), Incrementalized(Commit), and Incrementalized(Abort)]

for two-phase commit, for failure rates of 0 (Commit) and 100 (Abort), averaged over 50 rounds and 15 independent runs. 64
