Andrew Newell (Purdue University), Gabriel Kliot (Google), Ishai Menache (Microsoft), Aditya Gopalan (Indian Institute of Science), Soramichi Akiyama (University of Tokyo), and Mark Silberstein (Technion)
• Dynamic: state changes at runtime
• Interactive: users demand fast responses
• Run at large scale
Example: a chatroom
• Interactive: users are actively chatting
• Dynamic: state is always changing

Other examples, all at the scale of millions of users:
• Social networks
• Online gaming
• Internet of Things
[Diagram: a server hosting actors; user clients Bob, Sue, and Jon send "Hi" messages to a chatroom actor.]

Scaling to millions of users requires:
• CPU to handle requests
• Memory to store state

State is kept by actors. Upon receiving a message, an actor can (sketched below):
§ Update its state
§ Send messages to other actors
§ Create actors
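A minimal sketch of that actor contract, with hypothetical names (this is not the Orleans API):

```python
# Sketch of an actor: private state plus a message handler that can update
# state, message other actors, and create new actors. Names are illustrative.

class ChatroomActor:
    def __init__(self, runtime):
        self.runtime = runtime   # assumed helper for sending/creating actors
        self.members = set()     # private state kept by this actor

    def receive(self, msg):
        if msg["type"] == "join":
            self.members.add(msg["user"])                       # update state
        elif msg["type"] == "chat":
            for user in self.members:                           # send messages
                self.runtime.send(user, {"type": "deliver",
                                         "text": msg["text"]})
        elif msg["type"] == "split":
            self.runtime.create(ChatroomActor)                  # create actors
```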
[Diagram: actors distributed across many servers.]

• Examples: Orleans, Erlang, Akka
• They eliminate the cost of development at scale:
  § Add enough servers to handle load
  § Fault tolerance and correctness
• But latency suffers
Scaling dynamic interactive services raises two latency problems:
• Inter-server: messaging overhead
• Intra-server: resource allocation
Outline:
• Inter-server messaging problem & solution
• Intra-server resource allocation problem & solution
• Evaluation on Orleans
[Diagram: actors spread over several servers, with messages crossing between them.]

At scale, many messages cross server boundaries.
Profile of request latency (typical workload on multiple Orleans servers):
• Sender: 25%
• Network: 1%
• Receiver: 32%
• Worker: 32%
• Other: 10%

Over half of latency is due to inter-server message processing.
Goal: reduce remote messaging with better actor placement.
Existing placement strategies:
• Random placement: balances load well, but remote messaging is high on both static and dynamic workloads.
• Colocation placement (place an actor with its first caller): low remote messaging on static workloads, but load balancing suffers and remote messaging climbs on dynamic workloads.

Remote messaging is always high on dynamic workloads.
Formulate placement as balanced graph partitioning (sketch below):
§ Vertices: actors
§ Edge weights: messaging
§ Partitions: servers

[Diagram: example messaging graphs partitioned into groups of 4, 3, and 3 vertices with 2 cut edges.]

Messaging graphs:
§ A reasonable partition exists
§ But the graph dynamically changes

Cost constraints:
§ Must scale with the number of actors and servers
§ Minimize actor movements
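To make the objective concrete, here is a small sketch (illustrative names and data, not the paper's code) of the two quantities the partitioner trades off: the messaging weight cut by the partition and the balance of actors across servers.

```python
# Sketch of the partitioning objective: minimize messaging weight across
# servers while keeping actor counts balanced. Data below is made up.

def cut_weight(edges, placement):
    """edges: {(actor_a, actor_b): msgs_per_sec}; placement: {actor: server}."""
    return sum(w for (a, b), w in edges.items() if placement[a] != placement[b])

def imbalance(placement, num_servers):
    """Largest server's actor count relative to a perfectly even split."""
    counts = [0] * num_servers
    for server in placement.values():
        counts[server] += 1
    return max(counts) / (len(placement) / num_servers)

# Toy graph: 7 actors on 2 servers; this placement cuts 2 edges.
edges = {("A", "B"): 5, ("B", "C"): 4, ("C", "D"): 1, ("C", "E"): 1,
         ("D", "E"): 6, ("E", "F"): 3, ("F", "G"): 2}
placement = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1, "G": 1}
print(cut_weight(edges, placement), imbalance(placement, 2))  # 2, ~1.14
```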
The distributed partitioning protocol, run by every server:
1. Decentrally find a good partner
   § One swap at a time
   § Cooldown timer between retries
2. Perform the swap protocol, which aims to
   § Improve balance
   § Reduce messaging
3. Repeat

[Diagram: servers exchanging swap requests over successive rounds.]
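A rough sketch of step 1 under my own assumptions about the mechanics (the partner choice and timer value are illustrative): each server repeatedly pairs with the server it exchanges the most messages with, performs at most one swap, then puts that partner on a cooldown so the same pair is not retried immediately.

```python
import time

COOLDOWN_SECS = 60  # assumed value; the real timer is a tuning knob

def pick_partner(remote_msgs, cooldown_until):
    """remote_msgs: {server: msgs/sec we exchange with it}. Choose the
    busiest partner that is not currently cooling down."""
    now = time.time()
    eligible = {s: m for s, m in remote_msgs.items()
                if cooldown_until.get(s, 0) <= now}
    return max(eligible, key=eligible.get) if eligible else None

def run_round(remote_msgs, cooldown_until, do_swap_protocol):
    """One round: at most one swap, then start the partner's cooldown."""
    partner = pick_partner(remote_msgs, cooldown_until)
    if partner is not None:
        do_swap_protocol(partner)
        cooldown_until[partner] = time.time() + COOLDOWN_SECS
```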
The swap protocol between servers A and B:
1. A identifies a candidate actor set and sends it to B
2. B selects its own candidate set
3. B picks the swapping subsets so as to
   § Improve balance
   § Reduce remote messaging
4. B responds with the swap decision

[Diagram: Server A and Server B (roughly 18-20 actors each) exchanging actor subsets.]
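A sketch of the acceptance test behind steps 3 and 4, under the two stated goals; the names are mine, and edges inside the swapped subsets are ignored for simplicity:

```python
# Sketch of B's swap decision: accept an exchange of actor subsets only if
# it lowers remote-messaging weight without worsening the actor balance.
# msgs_to_a[x] / msgs_to_b[x]: messages/sec actor x exchanges with server A/B.

def exchange_gain(subset_a, subset_b, msgs_to_a, msgs_to_b):
    """Drop in remote messaging if subset_a moves A->B and subset_b moves
    B->A (edges within the swapped subsets are ignored for simplicity)."""
    before = (sum(msgs_to_b[x] for x in subset_a)
              + sum(msgs_to_a[x] for x in subset_b))
    after = (sum(msgs_to_a[x] for x in subset_a)
             + sum(msgs_to_b[x] for x in subset_b))
    return before - after  # positive means remote messaging goes down

def accept_swap(subset_a, subset_b, count_a, count_b, msgs_to_a, msgs_to_b):
    new_a = count_a - len(subset_a) + len(subset_b)
    new_b = count_b - len(subset_b) + len(subset_a)
    balance_ok = abs(new_a - new_b) <= abs(count_a - count_b)
    return balance_ok and exchange_gain(subset_a, subset_b,
                                        msgs_to_a, msgs_to_b) > 0
```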
Outline:
• Inter-server messaging problem & solution
• Intra-server resource allocation problem & solution
• Evaluation on Orleans
Inside an Orleans server, messages flow through a Staged Event Driven Architecture (SEDA):
• Stages: Receive → Worker → Send
• Each stage has an individual thread pool
• Orleans default: one thread per core per stage

How to allocate threads, and does it matter?
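A compact sketch of that stage structure (illustrative Python, not Orleans' implementation): three stages connected by queues, each served by its own thread pool whose size is the knob being studied.

```python
import queue
import threading
import time

def stage(in_q, out_q, handler, threads):
    """Run `threads` workers that drain in_q, apply handler, feed out_q."""
    def loop():
        while True:
            item = in_q.get()
            result = handler(item)
            if out_q is not None:
                out_q.put(result)
    for _ in range(threads):
        threading.Thread(target=loop, daemon=True).start()

receive_q, work_q, send_q = queue.Queue(), queue.Queue(), queue.Queue()

# Thread counts per stage are the allocation under study.
stage(receive_q, work_q, lambda m: m, threads=2)             # Receive: deserialize
stage(work_q, send_q, lambda m: m.upper(), threads=2)        # Worker: run actor code
stage(send_q, None, lambda m: print("sent:", m), threads=2)  # Send: ship out

receive_q.put("hello")
time.sleep(0.1)  # let the pipeline drain
```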
64 thread-allocation runs, average latency (ms) collected for each:

[Heatmap: worker threads (1-8) vs. sender threads (1-8). Latency ranges from about 10 ms with just enough threads up to 50 ms, or even unbounded (∞), when a stage is starved; Orleans' default allocation sits well above the best cells.]

Reducing to just enough threads per stage yields a 3x latency reduction.
Too few threads has huge repercussions.
Each stage can be characterized by measurements: its arrival rate, service rate, and processor usage.

• Existing work: allocate, check, repeat
• Our solution: take those measurements as input and directly find the globally optimal threads-per-stage allocation, across all stages
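A sketch of "measure and directly find the global optimum" under my own modeling assumptions: treat each stage as a simple queue, score every allocation that fits a fixed thread budget, and keep the best. The delay formula here is a stand-in, not the paper's model.

```python
from itertools import product

def stage_delay(arrival, service, threads):
    """Crude queueing approximation: delay explodes as a stage saturates."""
    capacity = service * threads          # msgs/sec the stage can absorb
    if arrival >= capacity:
        return float("inf")               # unstable: queue grows without bound
    return 1.0 / (capacity - arrival)

def best_allocation(stages, budget):
    """stages: [(arrival_rate, service_rate_per_thread)]; budget: total threads."""
    best, best_delay = None, float("inf")
    for alloc in product(range(1, budget + 1), repeat=len(stages)):
        if sum(alloc) > budget:
            continue
        delay = sum(stage_delay(a, s, t) for (a, s), t in zip(stages, alloc))
        if delay < best_delay:
            best, best_delay = alloc, delay
    return best

# Hypothetical measurements (msgs/sec) for receive, worker, send stages:
print(best_allocation([(900, 500), (400, 500), (900, 500)], budget=8))
```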
Outline:
• Inter-server messaging problem & solution
• Intra-server resource allocation problem & solution
• Evaluation on Orleans
• Implemented in Orleans
  § Distributed balanced graph partitioning
  § Dynamic thread allocation
• Mimics the production Halo Presence workload
  § Maintains stats of players in real time
  § Clients query for the stats of all players in some game
  § Both dynamic and interactive
[Diagram: clients issuing in-game queries for game stats against 10 Orleans servers.]

100k players, 12.5k games, dynamically changing at runtime:
• Players start playing
• Players move between games
• Players stop playing
[Plots: remote messaging (%) and actor movements per minute over time. Starting from random placement at 90% remote messaging, the partitioner converges to 12%; actor movements peak in the first minutes (on the order of thousands per minute) and fall off as placement converges.]

It takes under 10 minutes to go from 90% to 12% remote messaging.
Thread allocations chosen at the two operating points:
• At 90% remote messaging: Send 1 thread, Receive 5 threads, Worker 2 threads
• At 12% remote messaging: Send 1 thread, Receive 6 threads, Worker 1 thread
[Bar charts: median and 99th-percentile latency (ms) for Baseline, Actor Partitioning, and Actor Partitioning w/ Thread Allocation.]

• Median latency: actor partitioning cuts roughly 20% vs. baseline; thread allocation cuts roughly 40% more.
• 99th-percentile latency: actor partitioning cuts roughly 70%; thread allocation cuts roughly 20% more.
• Demonstrated latency problems and solutions for distributed actor systems
  § Actor placement
  § Thread management (in Orleans' open source)
• Better resource management in distributed actor systems gives
  § Easy development at scale
  § Low latency → interactive services
• These techniques can apply beyond actor systems
• Orleans, open source: https://github.com/dotnet/orleans
• E-mails:
  § [email protected]
  § [email protected]
  § [email protected]
  § [email protected]
  § [email protected]
  § [email protected]