Energy and Performance Benefits of Active Messages

CVA MEMO 131, Version 1.0

R. Curtis Harting, Vishal Parikh, and William J. Dally
Electrical Engineering, Stanford University, Stanford, CA 94305
E-mail: {charting, vparikh1, dally}@stanford.edu

February 13, 2012

Abstract

This paper analyzes the energy and performance benefits of active messages on cache-coherent many-core architectures. By allowing the user to manage data locality, active messages enable programs that run up to 3.5× faster, use as little as 1/3 the energy, and scale up to 6× better than programs that communicate only via shared memory. This paper focuses on how efficient active messages can be used to implement common parallel programming paradigms: reduction operations, contended object updates, and data walks. We propose a low-overhead hardware transfer mechanism and user-level API. This paper also describes how active messages and coherent shared memory can be used together advantageously. The energy and performance advantages of active messages are demonstrated on a set of micro-benchmarks and applications.

1 Introduction

Data motion consumes large amounts of energy in parallel computing systems. A parallel radix sort expends 45% of the total energy in the on-chip network, as shown in Figure 1(a). This network energy is consumed by the cache coherence protocol moving cache lines and sending coherence messages. For example, Figure 1(b) shows how at least six messages are required to lock and update a shared object.

Active messages [13, 15] eliminate the overhead of communicating through coherent shared memory. Messages are delivered to a destination core and cause a message handler to run upon arrival. Shared object updates are done in place using a message handler, removing the coherency overhead. This use of active messages, one of three we present, runs 14× faster and uses 65% less energy than doing updates via coherent shared memory. Combined with shared memory, active messages provide a scalable mechanism that enables locality-aware programming.

[Figure 1(a): energy breakdown of the radix sort — Core 31%, Network 45%, Caches 10%, DRAM 14%. Figure 1(b): timeline of the coherence messages (lock loads, data loads, invalidations, and unlock) exchanged by two threads updating a shared object.]

Figure 1: The inefficiencies of cache coherency. The left graph shows that the on-chip network consumes 45% of the energy during a 1M element SPLASH-2 [39] radix sort. The diagram on the right shows the cache coherency messages caused by two threads updating a shared object.

This paper presents active messages in a modern, many-core context and describes their use to enable programs with little data movement. We demonstrate both the energy and performance improvements of active messages. In the 1980s and 1990s, user-level and active messages were integrated into many shared-memory research systems [13–15, 17, 20, 25, 31, 34, 35, 38]. Much has changed since the mid-1990s, and while some conclusions of this prior work are still valid, others are not. These papers focused on using active messages for fast communication between nodes and to accelerate basic synchronization mechanisms. In contrast, our work focuses on using active messages to avoid communication by moving the computation to the data. Our paper brings hardware active messages onto a modern CMP, measures energy and performance gains, and provides insight into these gains.

1.1 Paper Contributions

This paper demonstrates the utility of active messages in many-core CMPs. We quantify, for the first time, the impact of active messages on energy-efficiency. Using an energy-first approach, we present novel usage scenarios and programming paradigms. Because most of the energy in our applications is in data movement, we use active messages to send computation to the data. This is in contrast to the fast data-movement scenarios of the literature. We use message handlers to execute complex remote atomic operations. Doing so further increases throughput and energy-efficiency beyond using active messages to simply transfer data, acquire locks, and perform barriers. Executing critical sections and reduction operations remotely minimizes data movement. We enumerate three programming paradigms where active messages provide benefit over shared memory: generalized reductions, updating contended variables, and data walks.


Figure 2: Basic active message execution. Thread U sends an active message to the home node of address A. At the home node, a handler thread executes the function specified by handler selector S and sends a reply.

Identifying these paradigms allows programmers and runtimes to better select between communication mechanisms. Our work is the first to study active messaging on a CMP. We find that the low overheads require rethinking “best practices” in communication, such as message coalescing [10]. In the CMP context, the energy to address and pack messages into the L1 cache is higher than the energy to send a separate message containing each individual datum.

2 Active Messages

In this paper, an active message is a communication sent from a user-level thread U on a CMP to a virtual address A, shown in Figure 2. The message, consisting of a handler selector S and a body B, is delivered to a processing core associated with A, where a special handler thread T is dispatched to run a message handler H specified by S. H may access B, perform normal computations, and may itself send active messages, including a reply message to U. H cannot, however, wait on an event (such as the arrival of a message) and must terminate in a bounded amount of time. Only a single handler is active on each core at a given time, providing atomicity for active messages directed to the address space associated with that core. Active messages that arrive while a handler is running on the core are buffered and handled in order of arrival.

We study active messages in the context of the 256-core CMP shown in Figure 3. Each of the 256 nodes in the system contains a processor core, an L1 cache, and slices of the upper level caches. L2 caches are distributed over 16 nodes, and the global L3 is spread across all 256 nodes. All caches are coherent with one another, and coherency messages often only travel to the lowest common cache level. For simplicity, we consider only a single parallel program running on the chip. However, our active message system can be extended to handle multiprogrammed workloads by including a process identifier in each message.


Figure 3: Diagram of node and system. Each node contains a simple core, private L1 cache, and slices of the upper level caches and directories. The caches are built up hierarchically such that 16 cores share an L2 cache, and all 256 cores share the last level cache.

Further details about the implementation of active messages in multi-programmed processors can be found in [20, 27]. For more information on our simulated system, refer to Section 5.3.
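To make the dispatch and atomicity rules above concrete, the following is a minimal software sketch of the per-core handler thread's behavior. The names and the software queue are illustrative assumptions; the hardware described in Section 3.2 buffers messages in the L1 cache rather than in a lock-protected queue.

#include <cstdint>
#include <queue>
#include <mutex>

// Sketch of the message format: destination address A, handler selector S,
// a size, and a user-defined body B (field widths follow Section 3.1).
struct ActiveMessage {
    void*   dest_addr;
    uint8_t selector;
    uint8_t size;
    uint8_t body[48];
};

using Handler = void (*)(void* dest_addr, void* body);
static Handler handler_table[256];          // indexed by the 1B selector

static std::queue<ActiveMessage> rx_queue;  // stands in for the L1-resident receive buffer
static std::mutex rx_lock;

// Handler thread T: runs one handler at a time, in arrival order, which is
// what gives active messages their per-core atomicity.
void handler_thread_once() {
    while (true) {
        ActiveMessage m;
        {
            std::lock_guard<std::mutex> g(rx_lock);
            if (rx_queue.empty()) return;   // queue empty: T deactivates until the next arrival
            m = rx_queue.front();
            rx_queue.pop();
        }
        handler_table[m.selector](m.dest_addr, m.body);  // runs to completion; may not wait
    }
}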

2.1 Example

Figure 4 demonstrates what happens when two threads T0 and T1 send messages AM1 and AM2 to update the same shared object. Each thread assembles its message in a dedicated active message register file (AMRF, Section 3.2) and transmits the message by executing an AM_Send instruction (Section 3.1). After its message has been sent, a thread may continue to execute. If the thread depends on a reply from the message, it may block until reception of the reply message. At the receiver, arriving messages are buffered in a message queue mapped in the L1 cache. If the queue is empty and the handler thread T is idle when the first message arrives, T is activated and begins processing the first message by dispatching on its handler selector. Otherwise, if T is busy, it finishes handling its current message and then dispatches on the next message in sequence. If the message queue is empty when a handler terminates, T is deactivated until the next message arrives.

3 Implementation

This section describes our hardware and software implementations of active messages. We offer a simple API to the programmer and use specialized hardware to reduce overheads.


Figure 4: Two threads updating a shared object with active messages. Each thread sends a message to the home node of the object. The first message to arrive executes its handler to completion. The second message is queued at the destination and begins execution when the previous handler completes.

3.1 Software API

A user thread sends an active message by filling in the fields of a structure and then executing an AM_Send intrinsic. The structure consists of a mandatory header containing the destination virtual address (8B), handler selector (1B), and message size (1B), and a body containing user-defined data. For example, the code on the left side of Table 1 assembles and sends an active message to insert a pair into a hash table. Although the code appears to be writing the AM_hash structure in memory, the compiler optimizes this sequence so that short messages are assembled in the active message register file (AMRF), avoiding the energy and latency of writing and then reading the L1 cache. The AM_Send intrinsic compiles into a single instruction that launches the message from the AMRF into the network. The send is a non-blocking operation, and the sending thread continues to execute after the active message is sent. The thread may compose new messages by reusing fields of the old message so that only the fields that change need be updated.

Message handlers are procedures that accept two arguments: a pointer to the message and the destination virtual address. For example, the handler code on the right side of Table 1 is executed when the message sent by the code on the left side of the table arrives. This code inserts the pair in the hash table and sends a reply to the specified address. The SendReply macro populates a reply message’s destination and return value, using AM_Send to send the reply message. Because the handler must execute in a bounded amount of time, the insert function is not allowed to wait on an event, such as the availability of a lock. The serialization of message handlers eliminates the need to lock the hash bucket.

After the sending thread launches the message, it continues executing until it needs the return value from the insert function. At this point, it executes the AM_wait_for_reply function, which waits for the reply message from the handler to update the reply structure.

Sender:

struct AM_hash {
    AM_Header head;
    long*     replyAddr;
    long      key;
    long      value;
};

AM_hash  am;
AM_reply replyVal;
...
am.head.daddr   = hash(hash_table, key);
am.head.handler = am_hash_handler;
am.head.size    = AM_HASH_SIZE;
am.replyAddr    = &replyVal;
am.key          = key;
am.value        = value;
AM_Send(am);
...
AM_wait_for_reply(replyVal);

Handler:

void am_hash_handler(void* daddr, void* msg) {
    // Statically cast inputs
    AM_hash*  amh = (AM_hash*)  msg;
    Hash_Bkt* bkt = (Hash_Bkt*) daddr;
    int retVal = bkt->insert(amh->key, amh->value);
    SendReply(amh->replyAddr, retVal);
}

Table 1: Code for a hash table insert. On the left, we show the user defined struct for inserting a key-value pair into a destination bucket and code for assembling and sending the message. The handler (right) casts the arguments, calls an insert function, and sends a reply message. The insert function does not have any locking because of handler atomicity.

There is nothing unique about reply messages; they are just like any other active message. In the example, the reply is sent to an argument of the original message. By putting a different address in this argument, the message could just as easily have been sent to a third node, rather than back to the original sender.

This section describes the API that is assembled directly into instructions. The programmer's burden is one of designing the algorithm, declaring structure types, and writing handlers. Partitioned global address space (PGAS) languages such as X10 [9] or UPC [12] can be directly compiled into active messages. The compilers, however, would need to optimize for the much smaller message overheads presented in this paper. In the future, we envision portable languages that express inherent sharing patterns. A compiler would then select the optimal sharing mechanism.

3.2 Hardware

Short active messages are assembled into and sent from the active message register file (AMRF), a small register file with low access energy. Larger messages can be stored into the L1, adding overhead. Each AM_Send intrinsic is compiled into either an AM_Send.R or an AM_Send.M instruction depending on the location of the message. When the AM_Send instruction executes, a hardware state machine loads the fixed-size header and, using the size field, sequences the message into the network. The AM_Send instruction is retired only when it is safe to overwrite the message. The network delivers each active message to the receive buffer of its destination core. Each receive buffer is allocated in the global address space and cached in the L1 cache of its associated core. The receive buffers are sufficiently large that the network never blocks for lack of buffer space.


This allows us to send reply messages on the same network virtual channel as request messages without causing protocol deadlock. It also allows message handlers to send an arbitrary number of active messages, provided they do not wait for a reply. When a message is at the head of the receive queue, the dedicated handler thread on the receiving core is activated to handle the message by vectoring it to the handler specified by the selector field of the message. A pointer to the message and the destination address of the message are passed to the handler in registers. When the handler terminates, the message is dequeued.

When a thread waits for a reply, it is simply waiting for a memory location to change value. Local changes, made by the co-located handler, are immediately observed, while remote changes are observed via the cache coherence protocol. When an invalidation message is received for the line, the thread is activated and re-polls the value. This mechanism for synchronizing on a shared memory location is orthogonal to the active message mechanism.
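A minimal software sketch of this reply-wait convention follows, assuming the reply body carries a single integer. The structure and function names are illustrative, not the exact hardware or library implementation.

#include <atomic>

// Reply structure the sender polls on; the reply handler fills it in.
struct AM_reply {
    std::atomic<int> ready{0};
    int              value{0};
};

// Runs at the original sender's core when the reply message arrives
// (assumes the first word of the reply body is the return value).
void am_reply_handler(void* dest_addr, void* body) {
    AM_reply* r = static_cast<AM_reply*>(dest_addr);
    r->value = *static_cast<int*>(body);
    r->ready.store(1, std::memory_order_release);
}

// Waiting side: poll the flag. In hardware the waiting thread sleeps and is
// re-activated when a coherence invalidation touches this cache line.
int AM_wait_for_reply_sketch(AM_reply& r) {
    while (r.ready.load(std::memory_order_acquire) == 0) { /* spin */ }
    return r.value;
}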

4 Usage scenarios

This section qualitatively describes three common programming paradigms that benefit from active messages. These paradigms consume an increasing fraction of program runtime and energy as benchmarks scale to more cores.

4.1 Object contention

Multiple threads in a parallel program often update a shared variable. This occurs, for example, when incrementing a global counter or inserting an object into a shared queue. To update a shared variable in memory, the programmer must lock the shared data structure, update it, and release the lock. A single update requires six or more coherence messages, as illustrated in Figure 1(b). It is not unusual for several threads to attempt to lock the data structure simultaneously, creating a global serialization point. The multiple coherence messages increase the time the lock is held, lengthening the critical section. Also, local copies of the lock are invalidated each time the lock changes hands, further increasing network traffic.

With active messages (Figure 4), each thread sends an active message to the home node of the shared object. The update handler runs atomically without locks. In total, each thread sends a single 26B message across the network and receives an 18B response. Accesses to the shared object from its home node are likely to hit in the L1 cache because the object is not accessed directly from other nodes. This approach works both with globally and locally shared variables. It is more general than remote hardware operations such as compare-and-swap.
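As a concrete illustration, a hedged sketch of such an update handler for a shared counter is given below. It reuses the AM_Header and SendReply conventions of Table 1; the field layout and the byte accounting in the comments are our reading of the sizes quoted above, not code from the paper.

// Request: 10B header (8B address + 1B selector + 1B size) + 8B reply
// address + 8B delta = 26B; the reply is a 10B header + 8B value = 18B.
// AM_Header and SendReply are the API primitives from Table 1.
struct AM_incr {
    AM_Header head;
    long*     replyAddr;
    long      delta;
};

// Runs at the counter's home node. Handler serialization makes the
// read-modify-write atomic without any lock, and the counter stays in
// the home node's L1 cache.
void am_incr_handler(void* daddr, void* msg) {
    AM_incr* m       = static_cast<AM_incr*>(msg);
    long*    counter = static_cast<long*>(daddr);
    *counter += m->delta;
    SendReply(m->replyAddr, *counter);   // optional: return the new value
}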

4.2 Reduction Operations

At the end of each timestep or iteration, many programs perform a global barrier and/or reduction operation. This operation can be simple, like a barrier, or complicated, like the computation of bucket starting positions in radix sort.


Figure 5: Hierarchical reduction communication patterns. With active messages (a), all threads send a message in parallel to a barrier object. At a single stage of a reduction hierarchy without messages (b), the barrier’s cache line moves in a serial, random fashion from core to core. The active message barrier object does not move from its home node’s L1 cache, reducing the total barrier execution time.

We distinguish reductions from contention in that reduction operations require all threads to arrive at a final answer before continuing computation. The all-to-one requirement allows us to build trees both with and without active messages.

When each thread reaches the reduction, it sends an active message up the tree to an intermediate reduction object. This object stores the state of the reduction between handler executions. When a message arrives at the reduction object, it updates the reduction variable and an incoming message count. When the message count indicates local completion, the handler sends a message up the tree to a higher-level reduction object. When the root of the tree reaches completion, it sends a release message to each of its children. Each child, in turn, forwards the release message to each of its children. Ultimately, every thread participating in the reduction receives a release message. Figure 5(a) shows the reduction communication pattern. When initializing the reduction, we place each intermediate object near its children.

We compare this active message reduction to a hierarchical reduction using shared memory. At each level, the cache line that holds the state of the reduction moves from core to core via the cache coherence protocol as the line is updated. This serial, random movement (Figure 5(b)) adds latency and network energy. Not shown are a number of coherency messages spawned by each line movement. Hardware barriers as in [2, 7] lack the flexibility to perform arbitrary reduction operations. Complex reduction operations require locks and atomic sections, even with a hardware barrier.
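The sketch below shows one way an intermediate reduction object and its handler could look for a sum reduction. The field names and the two helper send routines are assumptions made for illustration; the paper specifies the behavior, not this code.

// State kept at each intermediate node of the reduction tree.
struct ReductionNode {
    long           partial;    // running reduction value
    int            arrived;    // messages received at this level so far
    int            expected;   // children (or leaf threads) feeding this node
    ReductionNode* parent;     // nullptr at the root
};

struct AM_reduce {
    AM_Header head;            // header type from the API of Table 1
    long      contribution;
};

// Hypothetical helpers: each assembles an active message and calls AM_Send.
void SendReduceUp(ReductionNode* parent, long value);
void SendReleaseToChildren(ReductionNode* node);

void am_reduce_handler(void* daddr, void* msg) {
    ReductionNode* node = static_cast<ReductionNode*>(daddr);
    AM_reduce*     m    = static_cast<AM_reduce*>(msg);

    node->partial += m->contribution;          // update stays in the home node's L1
    if (++node->arrived == node->expected) {   // local completion
        node->arrived = 0;
        if (node->parent != nullptr)
            SendReduceUp(node->parent, node->partial);   // continue up the tree
        else
            SendReleaseToChildren(node);                 // root: fan the release back down
    }
}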

4.3 Data Walks

A data walk is when a thread iterates over co-located data to compute a substantially smaller result. For example, summing an array and walking a linked list are data walks. In each, a large amount of data is iterated over once, a smaller result is computed, and the initial data is not reused.


When a thread walks data, it fetches each line into its L1 cache, generating network traffic and polluting the cache. For example, walking a linked list of 20 elements causes 20 cache lines to traverse the network and evicts 1.28kB of data from the L1. If another thread traverses the same data, it also transfers the lines across the network and into its L1 cache. The use of prefetching and block transfers reduces the latency of such operations, but not the energy. Uncached loads do not pollute the L1 cache, but still consume network traversal energy.

Sending an active message to the data performs the data walk without data motion. Since all data accesses are made by a message handler on the data’s home node, the data remains in one location, eliminating the latency- and energy-consuming cache line transfers. Moreover, because handlers are serialized, they can modify the data with no energy or latency penalty. A shared memory implementation must first lock the data structure and invalidate other cache lines before modifying data.

Data walks where a single thread repeatedly visits the same read-only data are better performed via shared memory than with active messages. In this case, the cache is better able to exploit the reuse. Data walks where there is no reuse and no sharing with other threads also do not benefit from active messages. In this case, the data is fetched from DRAM either way, so the active message and reply represent additional overhead.
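For example, a hedged sketch of a linked-list sum performed remotely follows. It reuses the AM_Header and SendReply conventions of Table 1, and the node layout is an assumption.

struct ListNode {
    long      value;
    ListNode* next;
};

struct AM_walk {
    AM_Header head;
    long*     replyAddr;
};

// Runs at the list's home node: the 20-odd cache lines of list data never
// cross the network; only the small result does.
void am_list_sum_handler(void* daddr, void* msg) {
    AM_walk*  m    = static_cast<AM_walk*>(msg);
    ListNode* node = static_cast<ListNode*>(daddr);   // head of the co-located list
    long sum = 0;
    for (; node != nullptr; node = node->next)        // every access hits the local cache
        sum += node->value;
    SendReply(m->replyAddr, sum);
}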

5 Experimental Methodology

This section describes our simulation infrastructure, benchmarks, and hardware configuration used for our evaluation.

5.1 Micro-Benchmarks

We compare active message implementations to threads communicating via shared memory on micro-benchmarks representing the three paradigms discussed in Section 4. To test performance on contended objects, each thread randomly selects one of sixteen shared variables, locks it, and modifies it. We sweep the number of threads that share each value from 2 to 256. To evaluate reductions, every compute thread repeatedly executes a barrier, a reduction of 1 value, or a reduction of 16 values. For a data walk, we initialize a large (2^17-entry) array and then randomly access it. Each access reads or writes an aligned block of indexes from the array. We sweep the size of the read or write block from 1 to 8 cache lines.

5.2 Applications

We use four benchmark applications to demonstrate how active messages improve the efficiency, performance, and scalability of whole programs compared to communicating via shared memory. Each of the benchmarks uses one or more of the programming paradigms, as shown in Table 2. Both the baseline and active message versions of all applications were hand coded and use hierarchical reductions where possible.


Benchmark              Size                                                     Contention   Reduction   Data Walk
Breadth First Search   262k nodes                                               X                        X
Hash Map               512k operations, 80%/20% read/write mix, 2048 buckets    X                        X
Kmeans                 131k random points, calculating 16 4-D means                          X
Radix Sort             1M elements, 256 buckets                                              X

Table 2: Test applications, input sizes, and programming paradigms optimized.

Breadth first search (BFS) creates a sparse graph represented as an adjacency matrix, as done in [4]. Active messages are used to check if potentially unvisited vertices are actually unvisited and to push vertices onto the work queue. The results we present are the average of three different graphs, each with three different source nodes.

The hash table code does an 80/20 mixture of reads and writes into the table. The table uses a linked list of pairs at each hash bucket. To cast shared memory in the best possible light, the pthread version we use is unsafe and does not use locks when traversing and modifying the linked lists. The advantage of the active message implementation is due to the reduction in data movement, not the serialization caused by per-bucket locks.

Kmeans is implemented using a brute force approach, alternating between assigning each node to a group and averaging the groups. The averaging reductions are implemented with either active messages or barriers and locks.

The final application that we evaluate is a radix sort. Each core iterates over a statically partitioned piece of an initial array of 2^20 values, counting how many values it must insert into each bucket. We then do a complex reduction to find the start position for each bucket and core in the destination array. Each compute thread stores the data into the destination array and repeats the process.

5.3 Hardware Configuration

Cache Level   Cores per Cache   Caches per Chip   Size Per Core   Size Per Cache
L1            1                 256               16KB            16KB
L2            16                16                32KB            512KB
L3            256               1                 64KB            16.8MB
Total         N/A               N/A               112KB           28MB

Table 3: System data cache parameters. Every node has 112KB of coherent storage. This storage is split between the L1, L2, and L3 caches.

Our results are from simulations of a 256-core chip. Each core contains a private L1 data cache and slices of the L2 and L3 caches, summarized in Table 3.

Component                 Latency                              Energy
Core [3, 18]              1 cycle per non-memory instruction   15 pJ per 64-bit operation
Link [23]                 1 cycle per hop                      0.08 pJ per byte per mesh hop
Router [6]                2 cycles per hop                     0.78 pJ per byte routed
L1 Cache [29]             1 cycle                              10 pJ per 64B cache line
Upper Level Caches [29]   3 cycles                             35 pJ per access
DRAM [37]                 50 cycles                            32 pJ per byte read/written

Table 4: Latency and energy of operations.

A group of 16 cores shares an L2 cache, and all 256 cores share a global L3 cache. The chip is kept fully coherent using an MSI directory-based protocol. These directories are assumed to be ideal, but [24] demonstrates the ability to realistically scale coherency to 1000 cores. When data is shared between two cores, most coherency messages need only propagate to the lowest shared cache. All cores are interconnected using a mesh network. We assume a perfect instruction cache. Table 4 shows the latency and energy of each operation in our system.

5.4 Simulator

Our evaluation uses an execution-driven simulator. A PIN [28] frontend parses and functionally executes native x86 binaries. This frontend then passes RISC-like instructions to a timing simulator. The timing simulation occurs while the program executes, applying back-pressure to the frontend during long-latency operations. The timing simulator itself uses a simple core model with single-cycle non-memory instruction latencies and a four-wide issue window (our result ratios do not appreciably change with larger windows). The cache hierarchy and coherency protocol are implemented in detail. Both the memory controllers and the network use latency and bandwidth models. The timing simulator executes an idealized version of the pthread library. Barriers and locks simply write the appropriate cache lines, and we assume a perfect wake-up mechanism.

5.5 Energy Estimation

Each architectural element of the system has a corresponding energy model that we use to determine the energy usage of the applications. We assume a 22nm process based on the ITRS roadmap [22] with a supply voltage of 0.8V. The core is assumed to be highly efficient and energy optimized, much like those explored in [3]. Our core model is split into two parts: a fixed overhead representing datapath and instruction cache energy, and models for integer and floating point operations derived from [18]. The caches and SRAMs are modeled using Cacti [29]. The routers are modeled based on place-and-route results for a detailed Verilog model developed in [6]. Wires are modeled using the capacitive model described in [23] and are full swing. A full list of the energy parameters used in this study is given in Table 4. We assume fine-grained clock-gating.


Only results for the dynamic energy consumption of programs are presented. Static power, or leakage, is highly dependent on process and operation parameters. Moreover, the static power is roughly the same for both cases and hence does not substantially affect our results. Active messages reduce the amount of leakage energy expended during application execution by an amount proportional to the runtime reduction.

6 Results

This section quantitatively describes the advantages of active messages. We first present the results from our microbenchmarks and applications. Next, we use the breadth-first search application as an example of synergistic messaging, where the combination of cache coherent shared memory and active messages provides better efficiency than either one alone. Finally, we show that active messages provide significantly better scalability than standard shared-memory code.

6.1 Micro-Benchmarks

6.1.1 Contended

Figure 6 shows the speedup and energy results of running our contended micro-benchmark on 256 cores. When accessing global variables, active messages provided a total speedup of 14.6× over our baseline implementation using threads and locks. With multiple outstanding requests, the active message code hides communication latency, bounding the runtime to just the computation time of the serialized updates. As the number of sharers per group decreases, the performance of the shared memory implementation improves, due to less lock contention. The performance of the active message implementation remains constant because it is bounded by the fixed amount of serial work, not the communication latency. Hence the speedup is reduced as the number of sharers decreases. Reducing the number of sharers lowers the network energy consumed by both implementations. The baseline case sees a large gain when going from 64 to 16 sharers. This is because with 16 sharers, all sharing nodes have a common L2 cache and global communication to the L3 is largely eliminated. The active message case sees a consistent decrease in energy as the distance each message travels declines with the number of sharers.

6.1.2 Reduction

Figure 7 shows the performance, energy, and network traffic benefits of implementing reductions using active messages. A barrier operation has a speedup of 4.9×, and a reduction involving 16 variables has a speedup of 11.3×, using active messages. The reduction in network traffic is the major source of energy savings. By effectively pinning the cache lines under contention to a single node, we remove the random walk of data around the chip.

Active message barriers also scale better than hierarchical pthread barriers. As the number of cores increases, the latency of data movement from one node to another at the upper levels of the reduction tree also increases.


Figure 6: Speedup (a) and energy (b) of the contended micro-benchmark as a function of the number of sharing threads. Figure (a) shows the speedup of active messages relative to the baseline with the same number of sharers. Figure (b) is the energy consumption normalized to the baseline (BL) with 256 sharers.



Figure 7: The benefits of active messaging in reductions. Each bar represents a comparison of active messages against the baseline with the same number of reduction values. Higher is better; for example, the network traffic of a 1-variable reduction is 10.7× less with active messages than with the baseline.

For example, a baseline barrier implementation takes 2.4× longer on 1024 cores (7500 cycles) than on 256 (3100 cycles). Our version is only 1.65× more expensive when quadrupling the core count: 1045 cycles on 1024 cores and 632 cycles with 256.

6.1.3 Data Walks

Figure 8 shows the speedup, energy reduction, and network traffic reduction for the data walking microbenchmark. Active messages provide increasing savings in time and energy as the number of cache lines accessed increases, reaching a 12.6× savings when writing 8 cache lines. However, the baseline shared-memory implementation is both faster and more energy efficient than active messages when reading a single cache line.

6.2 Benchmark Results

Figure 9(a) shows the speedup of the active message versions of our four benchmark programs compared to the baseline shared-memory code on 256 cores. Speedups range from 1.16× on radix sort to 3.5× on breadth-first search. The energy breakdown of each benchmark variant is shown in Figure 9(b). The active message version of the hash-table benchmark achieves nearly 3× energy savings. The active message version of radix sort, however, consumes slightly more energy than the baseline version. Overall, the use of active messages gives a geometric mean of 2× speedup and uses 34% less energy.

BFS benefits from using active messages to update a distributed work queue, a contended object. In the baseline implementation, merging the new values into the work queue occurs only at the end of each time step.



Figure 8: Results of the data walk micro-benchmark. The benefits of active messages increase as the number of cache lines read or written increases. For example, the amount of network traffic when writing 4 cache lines is over 7× higher with the baseline than with active messages.

Despite the limited frequency of this operation, the high overhead of remote locking limits overall performance. Also, each thread in the baseline implementation accesses significantly more data than the active message version. The thrashing of data in and out of the L1 cache contributes to the significant energy and latency difference.

The active message version of the hash table benchmark reduces L1 cache misses by 2× via the in-place accessing of hash buckets. The performance increase was not from serialization or lock latency, because the baseline is unsafe and has no locks. Rather, the performance gain is due to reduced data access latency. The assembly and sending of messages does cause an increase in core energy, but this is small compared to the decrease in network energy.

Kmeans is the only benchmark where the number of instructions issued decreased with active messages. This happens because the reduction code in the baseline version makes significantly more pointer accesses than the active message version. This benchmark uses double-precision floats for its points and means. Because no hardware atomic memory operation is available to atomically update these variables, the shared memory code must acquire a lock before each update, adding delay and energy.

Radix sort shows the smallest improvement with active messages: 16% performance, -1% energy. This is because most of the time and energy is spent counting data values and storing into the destination array. This code is the same in both versions. If we reduce the amount of work done at each core by either scaling the number of cores (Section 6.4) or sorting a smaller list, active messaging becomes the best solution.



Figure 9: Speedup (a) and energy use (b) of our benchmarks on 256 cores, compared to the shared memory baseline.

6.3 Integrating Cache Coherency and AMs

In implementing our benchmarks, specifically breadth first search and radix sort, we found that an implementation that combined cache coherency and active messages yielded the best results. In radix sort, we use active messages to do the reduction and computation of destination array positions. We use standard stores, however, to put each value into the destination array. We do so because storing all 8 values to a cache line with active messages would require either 8 messages or extra computation to pack a single message. We sort an array that is large enough that every core will have at least eight elements per bucket.

Each cache line will only be written by one core, amortizing the store coherency overhead across seven future L1 store hits.

Our implementation of breadth-first search combines a shared-memory array, to record each node’s visited status, with the use of active messages for node update. The status array exhibits considerable temporal locality and thus benefits from being accessed via cache coherent shared memory. Using messages for node updates, on the other hand, eliminates the coherency messages needed to lock and update a shared variable. This hybrid version of the benchmark reduces the number of active messages by 40× compared to a version that used active messages to query the visited status of each node.
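A hedged sketch of this hybrid is given below: the visited array is read through ordinary coherent loads, while the visit itself is an active message whose handler re-checks under handler atomicity. All helper names, the visit structure, and the re-check detail are illustrative assumptions, not the authors' code.

struct AM_visit {
    AM_Header head;        // header type and AM_Send from the API of Table 1
    int       vertex;      // vertex to mark visited and enqueue
};

extern bool visited[];     // shared-memory status array, cached read-mostly

// Hypothetical helpers assumed for the sketch.
void* home_address_of_vertex(int v);   // home-node address used as the AM destination
void  push_local_work_queue(int v);    // enqueue onto the home node's work queue
void  am_visit_handler(void* daddr, void* msg);

// Sender side: cheap coherent reads filter already-visited neighbors, so far
// fewer active messages are sent.
void expand_neighbors(const int* neighbors, int degree) {
    for (int i = 0; i < degree; ++i) {
        int v = neighbors[i];
        if (!visited[v]) {                     // may be stale; the handler re-checks
            AM_visit am;
            am.head.daddr   = home_address_of_vertex(v);
            am.head.handler = am_visit_handler;
            am.head.size    = AM_VISIT_SIZE;   // assumed size constant, as in Table 1
            am.vertex       = v;
            AM_Send(am);
        }
    }
}

// Home-node side: the re-check under handler atomicity makes the stale
// shared-memory read harmless.
void am_visit_handler(void* daddr, void* msg) {
    AM_visit* m = static_cast<AM_visit*>(msg);
    if (!visited[m->vertex]) {
        visited[m->vertex] = true;
        push_local_work_queue(m->vertex);
    }
}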

6.4 Scalability Results

Benchmark    Cores   Speedup         Efficiency      Baseline Speedup   AM Speedup
                     (BL_n / AM_n)   (BL_n / AM_n)   (BL_16 / BL_n)     (AM_16 / AM_n)
BFS          16      2.6             1.1             1                  1
             64      3.8             1.2             1.6                3.2
             256     3.7             1.2             3.2                4.6
             1024    2.6             1.1             5.2                5.3
Hash Table   16      0.66            1.2             1                  1
             64      0.94            1.6             2.5                3.5
             256     1.7             3.1             6.4                17
             1024    4.2             5.6             21                 130
Kmeans       16      1.4             1.3             1                  1
             64      1.7             1.5             3.8                4.5
             256     2.0             1.6             11                 16
             1024    4.7             1.9             20                 65
Radix Sort   16      1.0             1.0             1                  1
             64      1.0             1.0             3.2                3.3
             256     1.2             1.0             9.0                11
             1024    1.8             1.1             10                 18

Table 5: Results from running our benchmarks on an increasing number of cores. Columns 3 and 4 represent the gains in speed and efficiency when compared to the baseline implementation running on the same number of cores. Columns 5 and 6 look at the relative speedup of each implementation compared to running on 16 cores. The ideal speedups are 1, 4, 16, and 64.

Table 5 shows that the active message implementations of the benchmarks provide better speedup as the number of cores is increased. As the number of cores scales, the percentage of time spent in portions of the code targeted by our programming paradigms also increases. By improving latency and energy in these situations, we increase overall scalability.

With more cores, the energy and latency penalties of cache misses into the last level cache also increase. The reduced number of last level cache accesses with active messages contributes to the improved scalability. The hash table kernel has a super-linear speedup with active messages because of cache effects. At 1024 cores, the working set for each core (2 of the 2048 buckets) fits into the L1 cache with no capacity misses. For the baseline code, the percentage of energy consumed in the network, which active messages explicitly target, steadily increases with the number of cores.

7 Discussion

This section discusses the reasons why active messages work well, their overheads, and situations where cache coherency provides better latency and energy.

7.1 The benefits of active messages

As our results show, active messages provide an advantage both in speed and energy efficiency. We summarize the particular reasons below:

• No wasteful data movement: Sending active messages to the virtual address of a shared object causes all updates to the object to occur at the same core. Typically, shared objects are kept in the home node’s L1 cache. Consistently hitting in the L1 cache limits the number of cache misses and coherency traffic.

• Increased effective cache sizes and hit rate: Objects targeted by active messages are not replicated. Without data replication, the aggregate amount of unique data stored on-chip increases. Seen primarily in the hash table and data walking examples, this can significantly reduce cache miss rates, coherency traffic, and main memory accesses.

• Handler atomicity: Queuing other incoming active messages during handler execution provides atomicity. This atomicity enables users to write lock-free code. Contention and the coherency messages caused by locks are removed from programs.

• Better latency hiding: By enabling threads to have multiple outstanding messages, programmers can overlap object updates with other computation. The latency hiding ability of shared-memory cores is limited by the number of non-dependent instructions they may issue. This is a function of the core’s issue window width and the static scheduling of code. Neither is under programmer control. Active messages allow the user to control the scheduling needed to hide the latency of long-latency operations.

• Faster serial sections: Parallel applications often have serial sections of code. Active messages decrease the amount of time spent in these sections by removing lock contention and data movement. This enables the program to scale across more cores.

7.2 Active message overheads

An active message incurs overheads due to extra executed instructions, loading and storing message arguments, and the network traversal of the packet header.

The largest source of overhead in our microbenchmarks was from the core executing additional instructions. Every load or store of message arguments is compiled into a handful of RISC instructions. The reading data walk, for example, needs to populate three values at the sender and read them at the receiver. Because so little other computation is done, the active messaging version issues double the instructions of the baseline. This is primarily an energy overhead, as most of these instructions are either L1 cache hits or single-cycle arithmetic.

Loading and storing message values results in additional data storage accesses. We use the AMRF to assemble messages because its access energy is more than an order of magnitude less than the 10pJ access energy of the L1 cache. Handlers must read message bodies from the L1 cache. Though not implemented in this paper, an inbound AMRF for queuing the first few small messages would provide further overhead reduction. Depending on the number of messages sent and the work done between messages, the additional number of storage accesses can be small, 2% in radix sort, or large, 92% in hash table. These overheads are all accesses to the AMRF or L1 cache.

The header of an active message adds network energy and bandwidth overhead. This is why we reduce the handler selector from a full 8B instruction pointer to a 1B table index. The size field is optimized to list the number of words in a message, not bytes. The 10B overhead is significant because most message bodies tend to only have a few 8B values. Further optimization would include reducing the destination value to a 2B core index and a 1B destination address selector that operates like a handler selector. This limits the number of possible destination addresses per core to 256, but cuts message network overhead in half.
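For reference, the header arithmetic implied by the field sizes above works out as follows (a sketch; keeping the 1B size field unchanged in the optimized layout is our assumption):

\[
\underbrace{8\,\mathrm{B}}_{\text{dest.\ address}} + \underbrace{1\,\mathrm{B}}_{\text{handler sel.}} + \underbrace{1\,\mathrm{B}}_{\text{size}} = 10\,\mathrm{B}
\quad\longrightarrow\quad
\underbrace{2\,\mathrm{B}}_{\text{core index}} + \underbrace{1\,\mathrm{B}}_{\text{dest.\ sel.}} + \underbrace{1\,\mathrm{B}}_{\text{handler sel.}} + \underbrace{1\,\mathrm{B}}_{\text{size}} = 5\,\mathrm{B}.
\]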

7.3 When to Use Coherent Shared Memory

Situations arise where active messages are slower and less efficient than communicating through the memory hierarchy. If data is reused locally, caching can be used to prevent sending messages. By duplicating read-shared values, coherency allows all threads to use data simultaneously with no overhead. The use of active messages would cause both extra network traffic and execution serialization. Data that is not shared, no matter the size, is better left to the cache hierarchy. Small working sets can be kept in the L1 cache. Large working sets must be loaded from main memory, and active messages would simply add overhead.

8 Related Work

In [15], active messages are used for fast communication, where one processor sends an active message containing data to a remote node. The J-Machine [13, 14, 31] based its execution model on active messages triggering computation. In contrast, we present a conventional CMP execution model augmented with active messages.


Alewife [1, 25, 26] combined shared memory and message passing mechanisms. Like our work, the authors conclude that the integration of both mechanisms is better than shared memory alone. Stanford’s FLASH [19, 20] and Wisconsin’s Typhoon [34] reach similar conclusions. These papers conclude that fine-grained dynamic communication is best managed by the cache coherency protocol. In contrast, because of our CMP context, we find that even single write-shared cache lines are better managed with active messages. Berkeley’s GASNet [8] implements network-independent communication primitives using active messages. Their active messages are then compiled into system-specific primitives, all of which are significantly higher overhead than what we present.

All of these prior investigations of active messages used the messages primarily for communication. For example, using a handler to “get the message out of the network and into the computation ongoing on the processing node,” [15] presents a communication mechanism that is faster than send-receive. These projects used active messages to move data quickly with simple handlers. In contrast, we use active messages not to send data, but to send computation to the data. Rather than moving data quickly with simple handlers, we avoid communication with remote, complex handlers. None of these prior references explored the effect of active messages on energy efficiency or on-chip communication. In contrast, our work quantifies the energy-efficiency advantages of active messages.

Processor-in-memory (PIM) architectures [16, 32, 33] also perform computation near data. In these organizations, messages are sent to the memory controller or DRAM. In contrast, our organization sends messages to processor cores in a CMP. This PIM work is complementary to our own. We focus on reducing the energy and latency caused by multiple on-chip threads accessing the same data. In situations where a single thread must iterate over a large amount of data that is stored in DRAM, moving computation to the memory may be preferable. In [16], the authors evaluate the performance of software-only messages and conclude that the overheads involved make them inferior to atomic memory operations. We present a low-overhead implementation, making active messages an attractive alternative to atomic memory operations under certain types of sharing.

The traveling thread execution model [11, 30], where threads move to data, is more coarse-grained than our approach. Sending an entire thread context consumes more network energy than active messages. It also prevents the overlap of communication and computation. Suleman’s 2009 paper [36] demonstrates that serialized critical sections should be executed at a single fast core. This is orthogonal to the use of active messages to move computation to data. Although not discussed here, active messages can be used in asymmetrical systems to execute critical code on fast cores.

Portable parallel languages that enable a compiler to generate code for different types of machines [5, 10, 21] reduce programming effort. Their code transformations are complementary to our work and can be compiled into active messages. The low overheads and usage scenarios presented here can be incorporated into the optimizers to maximize gains.


9 Conclusion

This paper has described how to integrate active messages with cache-coherent shared memory and has quantified the benefits. Active messages are sent to the virtual address of an object, triggering a remote computation. These handler functions run uninterrupted to completion. Queuing subsequent messages during handler execution enables correct lock-free code. Any code with object contention, reductions, or data walks benefits from active messages.

We present a user-level API that adds active messages to a shared-memory C++ programming environment. The API allows the programmer to assemble and send active messages, write and specify message handlers, and wait for reply messages. We sketch a low-overhead hardware implementation of active messages that includes an AMRF for composing short messages and a state machine for moving arbitrarily sized messages into and out of the network.

This paper has evaluated the utility of active messages for both microbenchmarks and full programs. Updating globally shared objects is done 14× faster and with 65% less energy than through shared memory. The execution time and energy of barriers are reduced to 25% of the baseline. Performing writing data walks in shared memory uses 12× more network energy than with active messages. Active messages provide an average 2× speedup and 35% reduction in energy consumption compared to the baseline implementations of four benchmarks. The elimination of unnecessary data movement, removal of lock contention, decreased cache misses, and latency hiding provide these gains. We also show that the advantages of active messages increase with the number of cores. Hybrid implementations that use both active messages and coherent sharing provide better performance and efficiency than either approach alone. The integrated hardware allows programmers to use the optimal communication model for any scenario.

References

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife Machine: Architecture and Performance. ISCA ’95, pages 2–13, New York, NY, USA, 1995. ACM.
[2] G. Almási, P. Heidelberger, C. J. Archer, X. Martorell, C. C. Erway, J. E. Moreira, B. Steinmacher-Burow, and Y. Zheng. Optimization of MPI Collective Communication on BlueGene/L Systems. ICS ’05, pages 253–262, New York, NY, USA, 2005. ACM.
[3] O. Azizi, J. Collins, D. Patil, H. Wang, and M. Horowitz. Processor Performance Modeling Using Symbolic Simulation. ISPASS 2008, pages 127–138, April 2008.
[4] D. Bader, J. Berry, S. Kahan, R. Murphy, E. J. Riedy, and J. Willcock. Graph 500 Specification.
[5] R. Barik, J. Zhao, D. Grove, I. Peshansky, Z. Budimlic, and V. Sarkar. Communication Optimizations for Distributed-Memory X10 Programs. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1101–1113, May 2011.
[6] D. U. Becker and W. J. Dally. Allocator Implementations for Network-on-Chip Routers. Supercomputing ’09, 2009.
[7] C. Beckmann and C. Polychronopoulos. Fast Barrier Synchronization Hardware. In Supercomputing ’90, Proceedings of, pages 180–189, November 1990.
[8] D. Bonachea. GASNet Specification, v1.1. Technical Report UCB/CSD-02-1208, University of California, Berkeley, 2002.


[9] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An Object-oriented Approach to Non-uniform Cluster Computing. OOPSLA ’05, pages 519–538, New York, NY, USA, 2005. ACM.
[10] W.-Y. Chen, C. Iancu, and K. Yelick. Communication Optimizations for Fine-Grained UPC Applications. PACT ’05, pages 267–278, Washington, DC, USA, 2005. IEEE Computer Society.
[11] M. H. Cho, K. S. Shim, M. Lis, O. Khan, and S. Devadas. Deadlock-free Fine-grained Thread Migration. NOCS ’11, pages 33–40, New York, NY, USA, 2011. ACM.
[12] UPC Consortium. UPC Language Specifications, v1.2. Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.
[13] W. J. Dally, L. Chao, A. Chien, S. Hassoun, W. Horwat, J. Kaplan, P. Song, B. Totty, and S. Wills. Architecture of a Message-Driven Processor. ISCA ’87, pages 189–196, New York, NY, USA, 1987. ACM.
[14] W. J. Dally, A. Chien, S. Fiske, W. Horwat, R. Lethin, M. Noakes, P. Nuth, E. Spertus, D. Wallach, D. S. Wills, A. Chang, and J. Keen. Retrospective: The J-Machine. ISCA ’98, pages 54–58, New York, NY, USA, 1998. ACM.
[15] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. ISCA ’92, pages 256–266, 1992.
[16] Z. Fang, L. Zhang, J. B. Carter, A. Ibrahim, and M. A. Parker. Active Memory Operations. ICS ’07, pages 232–241, New York, NY, USA, 2007. ACM.
[17] M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine Multicomputer. Pages 146–156, Nov.–Dec. 1995.
[18] S. Galal and M. Horowitz. Energy-Efficient Floating Point Unit Design. IEEE Transactions on Computers, 99(PrePrints), 2010.
[19] J. Heinlein. Optimized Multiprocessor Communication and Synchronization Using a Programmable Protocol Engine. PhD thesis, Stanford University, 1998.
[20] J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. ASPLOS-VI, pages 38–50, New York, NY, USA, 1994. ACM.
[21] M. Hill, J. R. Larus, and D. A. Wood. Tempest: A Substrate for Portable Parallel Programs. In COMPCON ’95, pages 327–332. IEEE Computer Society, 1995.
[22] International Technology Roadmap for Semiconductors.
[23] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-stage Design Space Exploration. DATE ’09, pages 423–428, Leuven, Belgium, 2009. European Design and Automation Association.
[24] J. H. Kelm, M. R. Johnson, S. S. Lumetta, and S. J. Patel. WAYPOINT: Scaling Coherence to Thousand-Core Architectures. PACT ’10, pages 99–110, New York, NY, USA, 2010. ACM.
[25] D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, and B.-H. Lim. Integrating Message-passing and Shared-memory: Early Experience. PPOPP ’93, pages 54–63, New York, NY, USA, 1993. ACM.
[26] J. Kubiatowicz and A. Agarwal. Anatomy of a Message in the Alewife Multiprocessor. ICS ’93, pages 195–206, New York, NY, USA, 1993. ACM.
[27] J. D. Kubiatowicz. Integrated Shared-Memory and Message-Passing Communication in the Alewife Multiprocessor. PhD thesis, Massachusetts Institute of Technology, 1998.
[28] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. SIGPLAN Not., 40:190–200, June 2005.
[29] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. MICRO 40, pages 3–14, Washington, DC, USA, 2007. IEEE Computer Society.
[30] R. Murphy. Traveling Threads: A New Multithreaded Execution Model. PhD thesis, University of Notre Dame, 2006.
[31] M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-Machine Multicomputer: An Architectural Evaluation. ISCA ’93, pages 224–235, New York, NY, USA, 1993. ACM.
[32] M. Oskin, F. T. Chong, and T. Sherwood. Active Pages: A Computation Model for Intelligent Memory. ISCA ’98, pages 192–203, 1998.
[33] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34–44, Mar./Apr. 1997.


[34] S. Reinhardt, J. Larus, and D. Wood. Tempest and Typhoon: User-Level Shared Memory. ISCA ’94, pages 325–336, April 1994.
[35] E. Spertus, S. C. Goldstein, K. E. Schauser, T. von Eicken, D. E. Culler, and W. J. Dally. Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5. ISCA ’93, pages 302–313, New York, NY, USA, 1993. ACM.
[36] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating Critical Section Execution with Asymmetric Multi-core Architectures. ASPLOS ’09, pages 253–264, New York, NY, USA, 2009. ACM.
[37] T. Vogelsang. Understanding the Energy Consumption of Dynamic Random Access Memories. IEEE/ACM International Symposium on Microarchitecture, pages 363–374, 2010.
[38] D. A. Wallach, W. C. Hsieh, K. L. Johnson, M. F. Kaashoek, and W. E. Weihl. Optimistic Active Messages: A Mechanism for Scheduling Communication with Computation. PPOPP ’95, pages 217–226, New York, NY, USA, 1995. ACM.
[39] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. SIGARCH Comput. Archit. News, 23:24–36, May 1995.
