[3B2-6]

mmi2009010008.3d

13/1/09

18:17

Page 2

COST-EFFICIENT DRAGONFLY TOPOLOGY FOR LARGE-SCALE SYSTEMS

IT IS MORE EFFICIENT TO USE INCREASING PIN BANDWIDTH BY CREATING HIGH-RADIX ROUTERS WITH A LARGE NUMBER OF NARROW PORTS INSTEAD OF LOW-RADIX ROUTERS WITH FEWER WIDE PORTS. BUILDING NETWORKS USING HIGH-RADIX ROUTERS LOWERS COST AND IMPROVES PERFORMANCE, BUT ALSO PRESENTS MANY CHALLENGES. THE DRAGONFLY TOPOLOGY MINIMIZES NETWORK COST BY REDUCING THE NUMBER OF GLOBAL CHANNELS REQUIRED.

John Kim, Northwestern University
William Dally, Stanford University
Steve Scott, Cray
Dennis Abts, Google

The interconnection network that connects processors and memory modules in scalable multiprocessors and large-scale systems significantly impacts the system’s overall performance and cost. As processor and memory performance continue to increase, large-scale interconnection networks are becoming even more critical because they largely determine the bandwidth and latency of remote memory access. A good interconnection network is designed around the capabilities and constraints of available technology. Previous interconnection networks have been built with low-radix routers, that is, routers with a small number of ports. As a result, these networks used low-radix topologies such as 2D or 3D mesh or torus networks. Examples of machines employing such networks include the Cray T3D, T3E, and XT3, SGI Origin2000, and Alpha 21364. Earlier work showed that, for the packaging and technology constraints

of the 1980s and 1990s, and the relatively low pin bandwidth available at that time, low-radix networks provided optimal latency for a given cost [1,2]. However, this is no longer the case. Over the past 20 years, the pin bandwidth of router chips has increased by approximately an order of magnitude every five years, a rate similar to Moore’s law (see Figure 1). This increase in bandwidth is a result of both an increase in the signaling rate and an increase in the number of signals available to a router chip. However, the best way to exploit this increasing off-chip bandwidth isn’t just to make the ports wider, that is, to build low-radix routers with fat channels. Rather, it’s more efficient to increase the number of ports, building high-radix routers with thin channels [3]. The use of high radix reduces hop count and leads to lower latency and lower cost. The zero-load latency consists of header latency, which is proportional to the

Published by the IEEE Computer Society

0272-1732/09/$25.00 © 2009 IEEE


hop count of the network, plus serialization latency, which is inversely proportional to the bandwidth of each channel. As a router’s radix or degree increases, hop count and hence header latency decrease, because more nodes can be reached in a single hop. At the same time, serialization latency increases because the bandwidth per channel decreases. The latency optimum, which occurs when these two components balance, is governed by the aspect ratio $A = (B t_r \log N)/L$, where $B$ is the total router pin bandwidth, $t_r$ is the per-hop router latency, $N$ is the network size, and $L$ is the packet size. As pin bandwidth and hence aspect ratio increase, the optimal radix also increases, as Figure 2 shows. By 2010, the optimal radix will be approximately 128.

With the reduced hop count, high-radix routers also reduce the network’s cost, which is largely determined by its total channel bandwidth. As hop count decreases, each packet consumes less channel bandwidth; hence, the same network performance can be achieved with lower total channel bandwidth. As a result, network cost decreases monotonically as radix increases, and in a similar manner, the network’s total power also decreases with the use of high-radix routers in the network.

Migration to high-radix networks presents many benefits. (See the sidebar, “Cray BlackWidow System,” for an example of a recent system using high-radix routers.) However, efficiently exploiting the benefits of high-radix routers requires rethinking conventional network topologies. We propose the dragonfly topology, which uses a group of routers as a virtual router to increase the effective radix of the network, and hence reduce network diameter, cost, and latency.
Because it reduces the number of global cables in a network, while at the same time increasing their length, the dragonfly topology is particularly well suited for implementations using emerging active optical cables—which have a high fixed cost but a low cost per unit length compared to electrical cables.
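The latency trade-off described above can be explored numerically. The following Python sketch assumes one common instantiation of the model that this article does not spell out: a hop count of 2 log_k N and a per-channel bandwidth of B/(2k), with natural logarithms. The function names, the brute-force search, and any sample values are illustrative assumptions, not the authors' method.

```python
import math


def aspect_ratio(B, t_r, N, L):
    """Aspect ratio A = (B * t_r * log N) / L (natural log assumed)."""
    return B * t_r * math.log(N) / L


def zero_load_latency(k, B, t_r, N, L):
    """Header latency plus serialization latency for radix k.

    Assumptions (not from the article): hop count is 2 * log_k(N) and
    each of the k ports gets bandwidth B / (2 * k).
    """
    hops = 2 * math.log(N) / math.log(k)
    header = hops * t_r                  # proportional to hop count
    serialization = L / (B / (2 * k))    # inversely proportional to channel bandwidth
    return header + serialization


def optimal_radix(B, t_r, N, L, k_max=1024):
    """Brute-force the integer radix in [2, k_max] with the lowest latency."""
    return min(range(2, k_max + 1),
               key=lambda k: zero_load_latency(k, B, t_r, N, L))
```

Under these assumptions, setting the derivative of the latency to zero gives k (ln k)^2 = A, so a larger pin bandwidth B (and hence a larger aspect ratio) pushes the minimum toward a higher radix, which is the trend the text attributes to Figure 2.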

Dragonfly topology

Topology is a critical aspect of any interconnection network because it sets performance bounds for the network by establishing the

[Figure 1 plot: bandwidth per router node (Gbps, log scale from 0.1 to 10,000) versus year (1985 to 2010). Systems plotted: Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, Alpha Server GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC.]

Figure 1. Router bandwidth scaling relationship.

[Figure 2 plot: optimal radix (k, log scale from 1 to 1,000) versus aspect ratio (log scale from 10 to 10,000), with labeled points for 1991, 1996, 2003, and 2010.]

Figure 2. Optimal radix as a function of the aspect ratio. The labeled points show the approximate aspect ratio for a given year’s technology and for an estimated 2010 technology.

network diameter and bisection bandwidth. The topology also largely determines the system’s cost, in terms of both capital and energy consumption—thus, using a cost-efficient network topology is critical. Existing topologies, such as folded-Clos and fat-tree, pay too high a penalty on load-balanced traffic (for example, uniform random) to provide good performance in an adversarial traffic pattern. In essence, they consume costly bandwidth to load-balance traffic that is already balanced. A conventional butterfly network, on the other hand, incurs significantly lower


JANUARY/FEBRUARY 2009


TOP PICKS

Cray BlackWidow System [1]

The Cray BlackWidow system, introduced as part of the Cray XT5h system, is one of the first systems to exploit high-radix routers. The network in the BlackWidow system uses radix-64 routers and a variant of the high-radix folded-Clos topology to scale up to 32 K processing nodes with a maximum diameter of only seven hops. The network is a significant departure from previous Cray machines, which relied on low-radix networks such as 2D or 3D mesh or torus networks. The Cray XT MPP series, introduced in 2004, used the 7-ported SeaStar router with a total bandwidth of more than 460 Gbps [2]. The more recent BlackWidow system is built from the Cray YARC router, which provides a total bandwidth of 2.4 Tbps and is divided into 64 ports. The YARC router uses an 8 × 8 array of tiles (see Figure A) [3]. Each YARC tile consists of an input and an output port, an 8 × 8 crossbar subswitch providing connectivity between the eight tiles in the row and the eight tiles in the column, several sets of buffers, and associated routing logic. The YARC router is implemented in a 90-nm CMOS standard-cell ASIC technology, with 192 6.25-Gbps serializer/deserializers (SerDes) around the periphery of the 17 × 17-mm silicon. The 192 SerDes are divided among the 64 ports to provide each channel in the network with a bandwidth of 18.75 Gbps.

The Cray BlackWidow employs a variant of the folded-Clos topology. Instead of using only uplinks and downlinks to connect the nodes, BlackWidow employs sidelinks in the topology to connect the subtrees of the networks. The use of sidelinks reduces the network’s cost and latency by reducing intermediate routers. The recently proposed flattened butterfly topology improves on this topology by creating a topology in which all links are essentially sidelinks. The dragonfly topology described in this article improves on the flattened butterfly by reducing the number of global channels required in the topology.
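The per-port bandwidth quoted in this sidebar follows from simple arithmetic, checked below in Python. The constant names are illustrative, and reading the 2.4-Tbps total as the sum of both directions is an assumption, not something the sidebar states.

```python
SERDES_LANES = 192       # SerDes around the die periphery (from the sidebar)
LANE_RATE_GBPS = 6.25    # signaling rate per SerDes (from the sidebar)
PORTS = 64               # router radix (from the sidebar)

lanes_per_port = SERDES_LANES // PORTS                    # 3 lanes per port
port_bandwidth_gbps = lanes_per_port * LANE_RATE_GBPS     # 18.75 Gbps per channel
aggregate_one_way_gbps = SERDES_LANES * LANE_RATE_GBPS    # 1200 Gbps in one direction

# Assumption: the quoted 2.4 Tbps counts both directions of every port.
aggregate_both_ways_tbps = 2 * aggregate_one_way_gbps / 1000
```

Three lanes at 6.25 Gbps give the 18.75 Gbps per channel stated above, and 1.2 Tbps in each direction is consistent with the 2.4-Tbps figure under the stated assumption.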

Figure A. Die photo of radix-64 YARC router used in the Cray BlackWidow system.

References
1. D. Abts et al., “The Cray BlackWidow: A Highly Scalable Vector Multiprocessor,” Proc. ACM/IEEE Conf. Supercomputing (SC 07), ACM Press, 2007; http://doi.acm.org/10.1145/1362622.1362646.
2. R. Brightwell et al., “SeaStar Interconnect: Balanced Bandwidth for Scalable Performance,” IEEE Micro, vol. 26, no. 3, May/June 2006, pp. 41-57.
3. S. Scott et al., “The Black Widow High-Radix Clos Network,” Proc. Int’l Symp. Computer Architecture (ISCA 06), IEEE CS Press, 2006, pp. 16-28.

cost on balanced traffic, approximately half that of a folded-Clos network. However, because a conventional butterfly has no path diversity, its performance is severely limited on adversarial traffic patterns. The recently proposed flattened butterfly approaches the cost of a conventional butterfly network on balanced traffic while matching the cost/performance of a folded-Clos topology on adversarial traffic [4]. We derive the flattened butterfly topology from a conventional butterfly network by combining or flattening each row of routers and maintaining the same inter-router connections. However, the flattened butterfly’s scalability is limited by the radix of a single router. In addition, the cost of a network is dominated by channels, especially the long global channels. The flattened


IEEE MICRO

butterfly requires each packet to traverse multiple global channels, which increases the network cost. The proposed dragonfly topology, on the other hand, effectively increases the radix of a network by combining a number of high-radix routers into a group that acts as a very-high-radix virtual router. The dragonfly also reduces the global diameter (the maximum number of expensive global channels on the minimum path between any two nodes) to one [5]. Achieving this unity global diameter requires very high-radix routers, with a radix of approximately $2\sqrt{N}$ (where $N$ is the size of the network), assuming a fully connected topology with a concentration of $\sqrt{N}$. Although radix-64 routers have been introduced [6], building machines that scale to 8 K to 1 M

[Figure 3 diagram: groups $G_0, G_1, \ldots, G_g$, each attached to its terminals ($P_0$ through $P_{N-1}$ overall) by terminal channels (tc) and joined to the other groups by an intergroup interconnection network of global channels (gc). Inset: a single group, in which routers $R_0, R_1, \ldots, R_{a-1}$ are joined by an intragroup interconnection network of local channels.]

Figure 3. A high-level block diagram of a dragonfly topology and a diagram of a group or a virtual router.

nodes with unity global diameter requires higher radices. To achieve the benefits of a very high radix, the dragonfly topology uses a group of routers connected into a subnetwork to create one very high-radix virtual router, as shown in Figure 3. This very high effective radix in turn lets us build a network in which all minimal routes traverse at most one global channel. The high effective radix also means the dragonfly topology can provide high scalability: with radix-64 routers, the topology can scale to over 256 K nodes with a network diameter of only three hops.

As Figure 3 shows, the dragonfly is a hierarchical network [7] with three levels: router, group, and system. At the bottom level, each router has connections to $p$ terminals, $a - 1$ local channels (to other routers in the same group), and $h$ global channels (to routers in other groups). Hence the radix (or degree) of each router is $k = p + a + h - 1$. A group consists of $a$ routers connected via an intragroup interconnection network formed from local channels. Each group has $ap$ connections to terminals and $ah$ connections to global channels, and all of the routers in a group collectively act as a virtual router with radix $k' = a(p + h)$. This very high radix, $k' \gg k$, lets us realize the system-level network with very low global diameter. Up to $g = ah + 1$ groups ($N = ap(ah + 1)$ terminals) can be connected with a global diameter of one. In contrast, a system-level

network built directly with radix-$k$ routers would require a larger global diameter. In a maximum-size dragonfly, with $N = ap(ah + 1)$, there is exactly one connection between each pair of groups. In smaller dragonflies, there are more global connections out of each group than there are other groups. These excess global connections are distributed over the groups, with each pair of groups connected by at least $\lfloor (ah + 1)/g \rfloor$ channels.

The dragonfly parameters $a$, $p$, and $h$ can have any values. However, to balance channel load on load-balanced traffic, the network should have $a = 2p = 2h$. Because each packet traverses two local channels along its route (one at each end of the global channel) for each global channel and each terminal channel, this ratio maintains balance. Because global channels are expensive, deviations from this 2:1 ratio should be done in a manner that overprovisions local and terminal channels, so that the expensive global channels remain fully utilized. That is, the network should satisfy $a \geq 2h$ and $2p \geq 2h$. Arbitrary networks can be used for the intragroup and intergroup networks in Figure 3.

The high-radix topology, especially the dragonfly topology, increases the global channels’ physical length; however, exploiting emerging optical signaling technology can reduce the impact of long global channel lengths. Historically, researchers have proposed many networks using optical signaling,


but because of this technology’s high cost, it hasn’t been used in large-scale systems. However, the recent advent of economical optical signaling [8] enables topologies with long channels, although optical channels are still more expensive than electrical channels. The proposed dragonfly results in a hierarchical topology that exploits economical optical signaling for the global channels but uses cheap electrical channels for short local communication [7]. Previously proposed hierarchical topologies have often been built as tree structures that introduce a bandwidth bottleneck and increase hop count as the packets travel up the hierarchy.
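The parameter relationships given earlier in this section (router radix, virtual-router radix, maximum group count, and maximum terminal count) can be checked with a short Python sketch. The class and method names are hypothetical, and the floor in `min_channels_between_groups` reflects one reading of the text's lower bound on channels per group pair.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    p: int  # terminals per router
    a: int  # routers per group
    h: int  # global channels per router

    @property
    def router_radix(self):
        """k = p + a + h - 1 (p terminals, a - 1 local, h global channels)."""
        return self.p + self.a + self.h - 1

    @property
    def virtual_radix(self):
        """k' = a(p + h): the radix of a group acting as one virtual router."""
        return self.a * (self.p + self.h)

    @property
    def max_groups(self):
        """g = ah + 1 groups can be joined with a global diameter of one."""
        return self.a * self.h + 1

    @property
    def max_terminals(self):
        """N = ap(ah + 1) terminals in a maximum-size dragonfly."""
        return self.a * self.p * self.max_groups

    @property
    def is_balanced(self):
        """True when a = 2p = 2h, the balanced ratio from the text."""
        return self.a == 2 * self.p == 2 * self.h

    def min_channels_between_groups(self, g):
        """Lower bound floor((ah + 1) / g) on channels per group pair."""
        if not 2 <= g <= self.max_groups:
            raise ValueError("g must be between 2 and ah + 1")
        return (self.a * self.h + 1) // g


def flat_radix_for_unity_diameter(n):
    """Approximate radix 2 * sqrt(N) needed without virtual routers."""
    return 2 * math.sqrt(n)
```

With p = h = 4 and a = 8, this gives k = 15, k' = 64, g = 33, and N = 1,056, which corresponds to the roughly 1 K-node configuration used in the evaluation. With p = h = 16 and a = 32, the router radix is 63 and N = 262,656, consistent with the claim that radix-64 routers scale to over 256 K nodes; a flat fully connected network of that size would instead need a radix of roughly 1,025.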

Indirect adaptive routing

Minimal routing in a dragonfly, from source node $s$ attached to router $R_s$ in group $G_s$ to destination node $d$ attached to router $R_d$ in group $G_d$, traverses a single global channel and is accomplished in three steps:

1. If $G_s \neq G_d$ and $R_s$ does not have a connection to $G_d$, route within $G_s$ from $R_s$ to $R_a$, a router that has a global channel to $G_d$.
2. If $G_s \neq G_d$, traverse the global channel from $R_a$ to reach router $R_b$ in $G_d$.
3. If $R_b \neq R_d$, route within $G_d$ from $R_b$ to $R_d$.

This minimal routing works well for load-balanced traffic, but results in poor performance on adversarial traffic patterns. To load-balance adversarial traffic patterns, we can apply Valiant’s algorithm [9] at the system level, routing each packet first to a randomly selected intermediate group $G_i$ and then to its final destination $d$. Applying Valiant’s algorithm to groups suffices to balance load on both the global and local channels. This randomized nonminimal routing traverses at most two global channels and requires five steps:

1. If $G_s \neq G_i$ and $R_s$ doesn’t have a connection to $G_i$, route within $G_s$ from $R_s$ to $R_a$, a router that has a global channel to $G_i$.
2. If $G_s \neq G_i$, traverse the global channel from $R_a$ to reach router $R_x$ in $G_i$.
3. If $G_i \neq G_d$ and $R_x$ doesn’t have a connection to $G_d$, route within $G_i$ from $R_x$ to $R_y$, a router that has a global channel to $G_d$.
4. If $G_i \neq G_d$, traverse the global channel from $R_y$ to router $R_b$ in $G_d$.
5. If $R_b \neq R_d$, route within $G_d$ from $R_b$ to $R_d$.

The dragonfly topology’s benefits can’t be fully exploited without adaptive routing, that is, adapting between minimal and nonminimal routing on the basis of the network’s state. Although the topology provides high path diversity, it needs nonminimal global adaptive routing to properly exploit the diverse paths. Achieving good performance on a wide range of traffic patterns on a dragonfly topology requires a routing algorithm that can effectively balance load across the global channels. Global adaptive routing (UGAL) [10] can perform such load balancing if the load of the global channels is available at the source router, where the routing decision is made. With the dragonfly topology, however, the source router is most often not connected to the global channel in question. Hence, the adaptive routing decision must be made on the basis of remote or indirect information, relying on backpressure through the queue to sense downstream congestion. With conventional UGAL, the indirectness of this decision (using local queue occupancy to make routing decisions) degrades both latency and throughput. Thus, the dragonfly topology requires indirect adaptive routing to load-balance the global channels.

We use UGAL-L (UGAL local) as the baseline routing algorithm where the routing decision is based on local queue information at the current router node. With two modifications, the UGAL-L routing algorithm can overcome its limitation with regard to the dragonfly topology, and indeed yield performance results approaching an ideal implementation using global information. Adding selective virtual-channel discrimination to UGAL (UGAL-LVC-H) eliminates bandwidth degradation due to local-channel sharing between minimal and nonminimal paths.
Using credit round-trip latency both to sense global-channel congestion and to propagate this congestion information upstream (UGAL-LCR) eliminates latency degradation by providing much stiffer


backpressure than when the algorithm uses only queue occupancy for congestion sensing.

We compared these two routing algorithms to two other UGAL implementations: the baseline UGAL-L and UGAL-G, which uses queue information for all the global channels within the source group. Although UGAL-G is difficult to implement, it represents an ideal implementation of UGAL because it requires load balancing of the global channels, not the local channels. We also compared these routing algorithms to minimal routing (Min) as well as Valiant’s routing (Val), using synthetic traffic patterns that included uniform random traffic and the worst-case traffic pattern, where all nodes in group $G_i$ send their traffic to $G_{i+1}$. Load-balancing the worst-case traffic pattern requires nonminimal routing, which spreads the bulk of the traffic across the other global channels.

To evaluate the performance of the different routing algorithms, we used cycle-accurate simulations. We simulated a single-cycle, input-queued router switch but provided sufficient speedup to generalize the results and ensure that routers don’t become the network’s bottleneck. We injected packets using a Bernoulli process. We warmed up the simulator under load without taking measurements until it reached steady state. Then, we labeled a sample of injected packets during a measurement interval and ran the simulation until all labeled packets exited the system. Figure 4 shows simulation results for a dragonfly of size 1 K nodes, using the parameters $p = h = 4$ and $a = 8$. Simulations of other network sizes follow the same trend. We use single-flit (flow control unit) packets to separate the routing algorithm from flow-control issues such as the use of wormhole or virtual cut-through flow control. The input buffers are assumed to be 16 flits deep. A 1D flattened butterfly topology is assumed for both intragroup and intergroup topology.
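The minimal and Valiant routing steps listed earlier in this section can be sketched in Python for a maximum-size dragonfly with fully connected groups. The wiring rule in `gateway` (global port j of a group leads to the group j + 1 positions ahead, and router r owns ports rh through rh + h - 1) is a hypothetical arrangement chosen only to make the example concrete; the article does not prescribe one. All function names are likewise illustrative.

```python
import random
from typing import List, Tuple

Router = Tuple[int, int]  # (group index, router index within the group)


def gateway(src_group: int, dst_group: int, a: int, h: int) -> int:
    """Router in src_group that owns the global channel to dst_group.

    Assumed wiring (hypothetical): with g = ah + 1 groups, global port j
    of a group connects to group (src_group + j + 1) mod g, and router r
    owns ports r*h .. r*h + h - 1.
    """
    g = a * h + 1
    return ((dst_group - src_group - 1) % g) // h


def minimal_route(src: Router, dst: Router, a: int, h: int) -> List[Router]:
    """Routers visited by the three-step minimal route."""
    (gs, rs), (gd, _) = src, dst
    path = [src]
    if gs != gd:
        ra = gateway(gs, gd, a, h)
        if rs != ra:
            path.append((gs, ra))                  # step 1: local hop to R_a
        path.append((gd, gateway(gd, gs, a, h)))   # step 2: global hop to R_b
    if path[-1] != dst:
        path.append(dst)                           # step 3: local hop to R_d
    return path


def valiant_route(src: Router, dst: Router, a: int, h: int,
                  rng: random.Random) -> List[Router]:
    """Routers visited when routing via a random intermediate group."""
    gs, rs = src
    gi = rng.randrange(a * h + 1)                  # random intermediate group G_i
    path = [src]
    if gs != gi:
        ra = gateway(gs, gi, a, h)
        if rs != ra:
            path.append((gs, ra))                  # step 1
        path.append((gi, gateway(gi, gs, a, h)))   # step 2: arrive at R_x
    # Steps 3 to 5 are a minimal route from the current router to dst.
    return path + minimal_route(path[-1], dst, a, h)[1:]


def global_hops(path: List[Router]) -> int:
    """Number of hops in the path that cross between groups."""
    return sum(1 for u, v in zip(path, path[1:]) if u[0] != v[0])
```

Under this wiring, a minimal route visits at most four routers and crosses one global channel, while a Valiant route crosses at most two, matching the bounds stated in the text.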
With Min routing, the network achieves optimal performance (low latency and high throughput) for benign traffic such as uniform random traffic. However, because Min doesn’t exploit the topology’s path diversity, the throughput on worst-case

[Figure 4 plots: latency (cycles, 0 to 25) versus offered load for Val, UGAL-L, UGAL-LVC-H, UGAL-G, UGAL-LCR, and Min. Panel (a): offered load from 0 to 1. Panel (b): offered load from 0 to 0.6.]

Figure 4. Routing algorithm performance comparison for uniform random traffic (a) and worst-case traffic pattern (b).

traffic is severely limited, at $1/(ah)$. For uniform random traffic, Val achieves approximately half the network capacity, because its load-balancing doubles the load on the global channels; it also achieves similar throughput on adversarial traffic patterns. Thus, global adaptive routing aims to achieve the performance of Min on uniform random traffic while matching the performance of Val on adversarial or worst-case traffic, as illustrated with UGAL-G. By relying only on local information, UGAL-L matches Min on throughput for uniform random traffic but, on worst-case traffic, leads to both limited throughput and high average packet latency at intermediate load. UGAL-LVC-H leads to higher throughput on the worst-case traffic pattern by differentiating between minimal and nonminimal traffic to be routed

[Figure 5 plot: cost per node (US dollars, 0 to 200) versus network size N (0 to 20,000) for the 3D torus, folded-Clos, flattened butterfly, and dragonfly topologies.]

Figure 5. Cost comparison of alternative topologies.

through the same local output port. Although UGAL-LVC-H achieves throughput comparable to UGAL-G, it still leads to high intermediate latency because it needs to route many packets minimally before it can sense the congestion downstream and, in response, route packets nonminimally for load-balancing. In other words, soft backpressure between the congestion and the source router creates high intermediate latency. UGAL-LCR, which uses credit round-trip latency, overcomes this high intermediate latency by providing the appearance of a shallow buffer to stiffen backpressure and propagate the global congestion information.
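To make the indirectness problem concrete, the following Python sketch contrasts a decision based on the source router's local queue with one based on the queue at the global channel itself. The comparison rule (queue occupancy multiplied by hop count) is the form commonly described for UGAL in the load-balanced routing literature [10]; it is not stated in this article and is an assumption here, as are the class, field, and function names and the sample numbers.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    hops: int          # hop count of this candidate path
    local_queue: int   # occupancy of the source router's output queue for this path
    global_queue: int  # occupancy at the global channel this path will use


def choose(q_min: int, hops_min: int, q_non: int, hops_non: int) -> str:
    """Pick the path with the smaller estimated delay (queue x hops).

    Ties go to the minimal path.
    """
    return "min" if q_min * hops_min <= q_non * hops_non else "nonmin"


def ugal_l(minimal: Candidate, nonminimal: Candidate) -> str:
    """Decision from local queue occupancy only (UGAL-L style)."""
    return choose(minimal.local_queue, minimal.hops,
                  nonminimal.local_queue, nonminimal.hops)


def ugal_g(minimal: Candidate, nonminimal: Candidate) -> str:
    """Decision from global-channel occupancy (idealized UGAL-G style)."""
    return choose(minimal.global_queue, minimal.hops,
                  nonminimal.global_queue, nonminimal.hops)


# Illustrative snapshot: the minimal path's global channel is congested,
# but backpressure has not yet reached the source router's local queue.
minimal = Candidate(hops=3, local_queue=0, global_queue=16)
nonminimal = Candidate(hops=5, local_queue=0, global_queue=2)
```

In this snapshot `ugal_l` still routes minimally because both local queues are empty, while `ugal_g` diverts to the nonminimal path. That gap is what stiffer backpressure, as in UGAL-LCR, is meant to close.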

Cost comparison

Figure 5 compares costs of the dragonfly topology to alternative topologies using a detailed cost model [4]. By reducing global channels, a dragonfly reduces cost by 20 percent compared to a flattened butterfly, and by 52 percent compared to a folded-Clos network in configurations with more than 16 K nodes. Compared to a 3D torus topology, which requires relatively short electrical cables, the dragonfly still provides a cost savings of up to 60 percent, because it significantly reduces the number of cables (or channels) required. The reduction of network cost in the dragonfly also translates


to a reduction of power, as prior work has shown [4]. (To provide accurate cost comparisons, we implicitly normalize the throughput, or the performance, of the alternative topologies.)

Over time, interconnection networks will become more critical to system performance, and their size will continue to increase. Thus, high-radix routers and networks will become even more significant. Our work on the dragonfly topology is relevant to the networks used in all types of large-scale systems: server clusters, Internet routers, storage area networks, and supercomputers. Our work will also be particularly relevant for data centers.

Most computer architecture research, in both academia and industry, has focused on processor architecture and, more recently, multicore architectures. However, with the increasing importance of large-scale Internet services and the large-scale systems required for their support, computer architects must also focus on “warehouse sized computing systems, made up of thousands of computing nodes, [and] their associated storage hierarchy and interconnection infrastructure” [11,12]. Studies show that a data center’s capital cost is matched by the energy (cooling) cost within the first three years of purchase. Thus, having a cost- and energy-efficient topology and interconnection network such as the dragonfly will be critical in future data centers. In addition, data center networks will require a highly scalable topology to accommodate their increasing number of terminals (or nodes). Through its use of virtual routers, the dragonfly topology provides this scalability. Ultimately, we expect to see the dragonfly topology and variations of it employed widely in future large-scale systems.

References
1. W.J. Dally, “Performance Analysis of k-ary n-cube Interconnection Networks,” IEEE Trans. Computers, vol. 39, no. 6, June 1990, pp. 775-785.
2. A. Agarwal, “Limits on Interconnection Network Performance,” IEEE Trans. Parallel Distributed Systems, vol. 2, no. 4, Oct. 1991, pp. 398-412.
3. J. Kim et al., “Microarchitecture of a High-Radix Router,” Proc. Int’l Symp. Computer Architecture (ISCA 05), IEEE CS Press, 2005, pp. 420-431.
4. J. Kim, W.J. Dally, and D. Abts, “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” Proc. Int’l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 126-137.
5. J. Kim et al., “Technology-Driven, Highly-Scalable Dragonfly Topology,” Proc. Int’l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 77-88.
6. S. Scott et al., “The Black Widow High-Radix Clos Network,” Proc. Int’l Symp. Computer Architecture (ISCA 06), IEEE CS Press, 2006, pp. 16-28.
7. A.K. Gupta et al., “Scalable Opto-Electronic Network (SOENET),” Proc. 10th Symp. High-Performance Interconnects (HOTI 02), IEEE CS Press, 2002, pp. 71-76.
8. Luxtera, “Fiber Will Displace Copper Sooner Than You Think,” white paper, 2005; http://www.luxtera.com/white-papers.html.
9. L.G. Valiant, “A Scheme for Fast Parallel Communication,” SIAM J. Computing, vol. 11, no. 2, 1982, pp. 350-361.
10. A. Singh, “Load-Balanced Routing in Interconnection Networks,” PhD thesis, Dept. of Electrical Engineering, Stanford Univ., 2005; http://cva.stanford.edu/publications/2005/thesis_arjuns.pdf.
11. L.A. Barroso, J. Dean, and U. Hölzle, “Web Search for a Planet: The Google Cluster Architecture,” IEEE Micro, vol. 23, no. 2, Mar./Apr. 2003, pp. 22-28.
12. X. Fan, W.-D. Weber, and L.A. Barroso, “Power Provisioning for a Warehouse-Sized Computer,” Proc. Int’l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 13-23.

John Kim is a research assistant professor at Northwestern University. His research interests include computer architecture and interconnection networks. Kim has a PhD in electrical engineering from Stanford University. He’s a member of the IEEE and the ACM.

William Dally is the Willard R. and Inez Kerr Bell Professor of Engineering and the chair of the Department of Computer Science at Stanford University. He’s also cofounder, chair, and chief scientist of Stream Processors. His research interests include computer architecture, network architecture, and programming systems. Dally has a PhD in computer science from the California Institute of Technology. He’s a Fellow of the IEEE, the ACM, and the American Academy of Arts and Sciences.

Steve Scott is the chief technology officer at Cray. His research interests include processor and system architecture and interconnection networks. Scott has a PhD in computer architecture from the University of Wisconsin. He’s a member of the IEEE and ACM.

Dennis Abts is a member of the technical staff at Google, where he’s a technical lead for a next-generation large-scale network. His research interests include parallel computer architecture, interconnection networks, memory system design, robust system design, and fault tolerance. Abts has a PhD in computer science from the University of Minnesota. He’s a member of the IEEE, ACM, and IEEE Computer Society.

Direct questions and comments about this article to John Kim at Northwestern Univ., 2145 Sheridan Rd., Evanston, IL 60208; [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.

Proportion of costs for energy will continue to grow, since. Moore's law keeps ... Challenge: Are there alternative designs that would .... semi-structured sources.