Karthik Sethuraman

Department of Electrical & Computer Engineering Louisiana State University Baton Rouge, LA 70803-5901, USA

fvaidy,[email protected]

Abstract: Optics is widely acknowledged as the most viable means to meet the bandwidth needs of interconnects of the near future. While the optical medium itself can easily deliver huge bandwidths, this bandwidth is difficult to harness. This is because of engineering and technological constraints associated with accommodating a large number of extremely high-speed lasers (transmitters) and photodetectors (receivers) within a small confine. We address an aspect of this issue of bridging the gap between the system and medium bandwidths of optical interconnects. In this paper we consider the problem of mapping an interconnection topology on an optical system. Specifically, our results are for mapping weak multidimensional tori on slab waveguides. Our approach hinges on the fact that not all edges of a weak topology are used simultaneously; it uses this fact to employ a single laser/detector to work in multiple capacities at different times. We introduce the notion of aggregates to capture the cost of mapping a topology by our approach. We derive a non-trivial lower bound on this cost for a class of mappings and construct mappings, all of which surpass a naive method and some of which match the lower bound.

Keywords: Optical interconnects, interconnection topology, multidimensional torus, lower bounds, slab waveguides

Technical Areas:

Algorithms & Applications, Architecture

Supported in part by the U.S. National Science Foundation under grant CCR-0310916.

1 Introduction

It is expected that in the near future CMOS-based transistors will be capable of supporting data rates of around 20 Gbps [28]. At these (and even lower) data rates, the bandwidth of electrical interconnects is severely limited by RC delay and frequency-dependent losses such as skin effect and dielectric loss [14, 15, 16]. Other considerations such as signal distortion, crosstalk and reflections also come into play at these high data rates. Furthermore, in structures such as buses in which multiple taps are made to a wire, capacitive loading limits the rate at which data can be transmitted reliably [3]. Thus, there is a wide gap between the data rates electrical interconnects can deliver and the computing needs of the future. The promise of optical interconnects in filling this gap is well recognized [8, 14, 23]. Indeed, optical communication has been studied at various levels of the computing system hierarchy ranging from fibers [19, 20, 27], to communications over much shorter distances such as board-level/backplane buses [2, 7, 10, 16, 17], inter-chip/module connections [1, 18, 28] and intra-chip integrated optics (for example [11]). Optics has also been used to implement traditional topologies such as the k-ary n-cubes (including special cases and variants), and reconfigurable models; these have useful computational properties, but are difficult to realize with electrical interconnects [4, 6, 21, 22, 24, 26]. In this paper we consider optical interconnects over short distances and used within small confines (up to 30 cm), for example at the board, inter-chip or backplane levels. Such interconnects are usually based on slab waveguides, multimode fibers or free-space optics [1, 11, 13, 22]. In this setting the "system bandwidth" is often constrained by the speed of light sources (lasers) and photodetectors and the size of the optical apparatus needed to insert and extract the signal from the medium (slab or fiber in a waveguided system, or air in a free-space optical system).
The optical medium itself is relatively unconstrained, and often has a much higher "medium bandwidth" than what the electrical and optical components can deliver within the constraints of available technology. For example, a slab waveguide about a millimeter in cross section can easily carry over 10,000 100-Gbps channels (at different modes and wavelengths [3]), provided these 10,000 high-speed signals can be inserted into and extracted from the slab. If one factors in size, cost and other engineering constraints involved with inserting and detecting the signals, the figure is likely to be at most 50 10-Gbps channels with current technology [28]. The medium bandwidth (1000 Tbps in this illustration) is much higher than the system bandwidth (0.5 Tbps). This paper addresses the problem of bridging this gap between the medium and system bandwidths. Broadly speaking, the idea is to reduce the size of the system by using the fact that not all edges of an interconnection topology are employed simultaneously. Put differently, for a given system size, this allows for a larger sized topology (requiring more channels) to be mapped on to the optical medium.

More specifically, we examine the problem of implementing a "weak" d-dimensional torus topology on an optical system using a slab waveguide. Informally, a weak topology is one in which at any given point in time, each node (or processor) receives information from at most one of its neighbors and sends information to at most one of its neighbors [12, 25]. Our approach exploits this slack to reduce the number of lasers and detectors as well as the cost and size of the optical apparatus. The idea is to map edges of the topology to channels of the waveguide so that a single tunable laser can transmit on several channels (each at a different time) and a single photodetector can detect several channels (that are used at different times). Although we discuss our results for a slab waveguide, many elements of these results also apply to fiber and free-space optical systems.

Let $G = (V, E)$ be an N-node directed graph representing a weak interconnection topology; as usual, nodes represent processors and edges represent communication links; mapping the topology entails assigning each edge of $E$ to a channel of the slab waveguide. Assuming G to be strongly connected, $N \le |E| \le N(N-1)$. For any processor $p \in V$, let $\delta_o(p)$ and $\delta_i(p)$ denote its out-degree and in-degree, respectively. At any given time, at most N of the edges are in use (no more than one transmission and reception per processor). Therefore, any mapping of the topology requires at least one laser and one photodetector per processor or a total of N lasers and N photodetectors. At the other extreme, processor p does not need more than $\delta_o(p)$ lasers and $\delta_i(p)$ photodetectors, resulting in a total of at most $\sum_{p \in V} \delta_o(p) = \sum_{p \in V} \delta_i(p) = |E|$ lasers and $|E|$ photodetectors. That is, the total number of lasers and photodetectors can each be between N and $|E|$. For a topology that is not very sparse, this gap between N and $|E|$ can be large.
For an N-processor, d-dimensional torus, a naive mapping uses as many as 2dN lasers and 2dN detectors. All the mappings presented in this paper use N photodetectors, the smallest number possible, which also reduces the optical hardware as described later. For a 1-dimensional torus (or a ring), our mapping uses at most N + 2 lasers. For a d-dimensional torus (where d > 1), our mapping employs $2N\left(d - 1 + \frac{1}{L}\right)$ lasers, where $L \ge N^{1/d}$. We also derive a non-trivial lower bound to show that a weak d-dimensional torus (with $d \ge 2$) that uses N detectors must use at least $2N(d-1) + 2$ lasers. Thus, our result exceeds the lower bound only by the quantity $\frac{2N}{L} - 2$. For some cases, L = N and our mapping is optimal. Although our results are directed towards slab waveguides, many elements are relevant to free-space optical systems as well. To our knowledge, ours is the first method that approaches the problem of reducing the gap between the system and medium bandwidths by tuning the mapping. Our approach does not rely on device properties and, consequently, its advantages add on independently to benefits from advances in technology.

Figure 1: A weak communication on a 3 × 4 torus. For clarity, part (a) shows a pair of oppositely directed edges as an undirected edge. In part (b), only the edges used in the communication are shown as directed edges, and the nodes are numbered in row-major order for use in subsequent discussion.

2 Preliminaries

2.1 Weak Multidimensional Torus

For integers $d \ge 1$, $0 \le i < d$ and $N_i \ge 2$, an $N_0 \times N_1 \times \cdots \times N_{d-1}$ d-dimensional torus is a directed graph with $N = \prod_{i=0}^{d-1} N_i$ nodes and edges as described below. Let each vertex v be indexed $(v_0, v_1, \ldots, v_{d-1})$, where $0 \le v_j < N_j$, for each $0 \le j < d$. There is a directed edge from vertex v to vertex $w = (w_0, w_1, \ldots, w_{d-1})$ iff there is a dimension $0 \le k < d$ such that $v_k = (w_k \pm 1) \pmod{N_k}$ and for all $j \ne k$, $v_j = w_j$. That is, there is an edge iff there is exactly one dimension k in which the indices of v and w differ by 1 (modulo $N_k$). The edges corresponding to dimension k are called dimension-k edges. Figure 1(a) shows a 3 × 4 2-dimensional torus. Each vertex of a d-dimensional torus has 2d outgoing edges and 2d incoming edges. If $N_j = 2$ for some dimension j, then the two dimension-j edges of a vertex v are to the same point (as $v_j + 1 \equiv v_j - 1 \pmod 2$). In general, we will assume that each dimension has a size of at least 3. The case of size-2 dimensions is touched upon in Section 6.

Consider an N-processor parallel processing system† with set $V = \{p_0, p_1, \ldots, p_{N-1}\}$ of processors and connected by a topology represented by a directed graph $G = (V, E)$.
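The torus definition above can be made concrete with a short sketch. The function name `torus_edges` and the representation of vertices as index tuples are illustrative choices, not part of the paper.

```python
from itertools import product
from math import prod

def torus_edges(dims):
    """Directed edges of a torus whose k-th dimension has size dims[k].

    A vertex is a tuple (v_0, ..., v_{d-1}); there is an edge between two
    vertices that differ by +1 or -1 (modulo dims[k]) in exactly one
    dimension k. Using a set merges the two coinciding edges that arise
    when a dimension has size 2.
    """
    edges = set()
    for v in product(*(range(n) for n in dims)):
        for k, n_k in enumerate(dims):
            for step in (1, -1):
                w = list(v)
                w[k] = (v[k] + step) % n_k
                edges.add((v, tuple(w)))
    return edges

dims = (3, 4)                       # the 3 x 4 torus of Figure 1
edges = torus_edges(dims)
num_nodes = prod(dims)              # N = 12
num_edges = len(edges)              # 2dN = 48 when every size is >= 3
```

For the 3 × 4 torus this yields 12 nodes and 48 directed edges, in line with the out-degree of 2d = 4 per vertex.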

† Though this work does not require the parallel system to be synchronous (in which all the processors proceed in lock step), it is convenient to assume that it is. We also assume that each edge (communication link) of the system is a 1-bit link. This assumption is without loss of generality as the ideas developed in this paper readily extend to higher link widths.

Figure 2: Structure of a simple slab waveguide.

Definition 1 A communication among the processors of a topology $G = (V, E)$ is any non-empty subset of $E$. A communication $S \subseteq E$ is a weak communication iff every pair of distinct directed edges $e = (p_i, p_j)$, $e' = (p_{i'}, p_{j'}) \in S$ satisfies $p_i \ne p_{i'}$ and $p_j \ne p_{j'}$. A weak topology is one in which every communication is weak.
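As a minimal sketch of Definition 1, the check below tests whether a set of directed edges is a weak communication. The function name and the encoding of an edge as a (source, destination) pair of processor indices are assumptions made for illustration.

```python
def is_weak_communication(comm):
    """Return True iff comm is a weak communication.

    comm is an iterable of directed edges (source, destination).
    The communication must be non-empty, and no two distinct edges may
    share a source or share a destination.
    """
    edges = set(comm)                       # distinct edges only
    if not edges:
        return False                        # a communication is non-empty
    sources = [s for s, _ in edges]
    destinations = [t for _, t in edges]
    return (len(set(sources)) == len(sources)
            and len(set(destinations)) == len(destinations))
```

For example, `is_weak_communication([(0, 1), (1, 2), (2, 0)])` holds, whereas `[(0, 1), (0, 3)]` fails because processor 0 would send twice in the same step.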

Remark: In a weak communication, each processor sends at most one message and receives at most one message at any given step. Figure 1(b) shows a weak communication on a 3 × 4 torus. We note that communications are assumed to be weak; that is, a processor can determine (through the application or algorithm) which of its neighbors has communicated with it at a given point in time. The mappings we present will only deliver the information received to the appropriate processor without any additional interpretation.

2.2 Slab Waveguides

An optical slab waveguide is a piece of transparent material (of appropriate geometry) through which light can be transmitted. As in an optical fiber, the light is confined within a slab by total internal reflection. The main difference between a fiber and a slab is in their cross-section dimensions; a slab is generally much larger, with a cross-section area on the order of 1 mm². This larger area allows light to be transmitted in many modes‡ within the slab. In contrast, fibers are usually single-mode waveguides; indeed, this single-mode feature is indispensable for transmission over large distances, but is not needed for interconnects over short distances such as those considered in this paper. We distinguish a slab from a multimode waveguide by assuming that a slab's geometry is designed to preserve modes. That is, if two signals are transmitted in separate modes, then they can be received separately. This allows one to employ mode-division multiplexing (or MDM) [3] to multiplex signals in different modes on the same slab. To make these ideas more concrete we discuss them in the setting of a simple slab waveguide shaped as a rectangular parallelepiped (henceforth referred to simply as the slab) shown in Figure 2. Though many waveguide geometries are possible (for example see Feldman et al. [3]), the slab in Figure 2 captures all the waveguide properties needed for our discussion.

‡ For this paper, each mode corresponds to an angle at which light is inserted into the waveguide. For more details see, for example, Fowles [5].

Figure 3: The top view of the slab showing MDM. The figure shows the principal beams of three signals with a band of collimated (parallel) light shown shaded for one case. We note that such a band typically fills the entire cross-section of the waveguide.

Multiplexing: The slab can multiplex information independently in several modes and wavelengths. To understand how the slab performs mode-division multiplexing it is useful to view it from the "top" (see Figure 3). This figure shows three signals (labeled a, b and c) that are transmitted in different modes (angles relative to the axis) and that are still separated at the output of the slab (shown as a′, b′ and c′). The light, as viewed from the top of the slab, is collimated to preserve the modes during transmission. In contrast, the light (as viewed from the side of the slab) need not be collimated. Using MDM only on the "top plane" of the slab provides room for demultiplexing hardware (see Figure 4). A slab that is about 1 mm wide in the top view can easily support a few hundred distinct modes (sufficiently separated to be useful in practice [3]). In addition (and independently of MDM) it can also distinguish channels by their wavelength (wavelength division multiplexing or WDM [19, 20, 27]). (Dense Wavelength Division Multiplexing or DWDM is routinely used to transmit around 80 channels in a fiber.) Thus, the slab can carry several thousand channels, each with a unique wavelength and mode combination. In general, if M modes and W wavelengths are possible, then the slab has $MW$ channels.

Input and Output Optics: Typically, light is inserted into the slab by lasers. The output of a laser must be collimated (for MDM) and inserted at the appropriate mode (angle). The cost and size of the input optics is determined primarily by the number of lasers. At the output, the channels must be demultiplexed into spatially separated spots, each incident on a photodetector. Figure 4 shows a schematic of the output optics. The output optical hardware consists of two main parts in series that are responsible for demultiplexing wavelengths and modes. These demultiplexers separate the information (along different dimensions) into spots of light on a detector plane. By placing photodetectors on these spots, the signals can be detected and sent to the appropriate processors. For an M-mode, W-wavelength system, the detector plane can be viewed as an $M \times W$ channel array, representing $MW$ channels (see Figure 4(c)). Note that adjacent rows (resp., adjacent columns) of the channel array correspond to adjacent modes or MDM angles (resp., adjacent wavelengths). The cost and size of the output optics depend to a large extent on the demultiplexer hardware and the number of detectors used.

Figure 4: A schematic of the demultiplexing of signals on the slab. The side view (a) shows the demultiplexing of wavelengths while the top view (b) illustrates the demultiplexing of modes. In these figures, the demultiplexer that plays an active role is shown solid. Part (c) shows the structure of the channel array. Note that the light exiting the slab in part (a) (resp., part (b)) represents a bundle of signals at different wavelengths (resp., modes).

3 Mapping Topologies on a Slab Waveguide

For any topology $G = (V, E)$ that is to be mapped on a slab waveguide, at least $|E|$ channels are required (as no edge of the topology is redundant). Each channel corresponds to a unique mode and wavelength combination. The task is to suitably map each edge to a channel.

Representing the Mapping: We will assume that M modes and W wavelengths are available and that $MW \ge |E|$. Let the available modes and wavelengths be $\nu_u$ (for $0 \le u < M$) and $\lambda_v$ (for $0 \le v < W$). We will specify the mapping using two $M \times W$ arrays called the source and destination arrays and denoted by Src and Dst, respectively. Each element of these arrays is a processor index; recall that the N processors are denoted by $p_i$ (where $0 \le i < N$). If an edge $(p_i, p_j)$ from processor $p_i$ to processor $p_j$ is mapped to a channel with mode $\nu_u$ and wavelength $\lambda_v$, then the entry Src(u, v) at row u and column v of the source array is i, and similarly, Dst(u, v) = j. For example, consider the torus shown in Figure 1(b) that has 12 processors and 48 directed edges. Figures 5 and 6 show the source and destination arrays corresponding to two different mappings of these edges that use 4 modes and 12 wavelengths; for now ignore the ovals in Figure 6.

Source Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8  λ9 λ10 λ11
ν0     0   4   8   0   4   8   0   4   8   0   4   8
ν1     1   5   9   1   5   9   1   5   9   1   5   9
ν2     2   6  10   2   6  10   2   6  10   2   6  10
ν3     3   7  11   3   7  11   3   7  11   3   7  11

Destination Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8  λ9 λ10 λ11
ν0     1   8  11   8   5   4   3   0   9   4   7   0
ν1     0   1  10   5   4   1   2   9   8   9   6   5
ν2    10   5   6   3  10  11   6   7   2   1   2   9
ν3     7   4   3   2   3  10  11   6   7   0  11   8

Figure 5: Mapping 1

Source Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8  λ9 λ10 λ11
ν0     1   1   2   2   5   5   3   3   0   0   7   7
ν1     4   4  11  11   8   8   6   6   9   9  10  10
ν2     3   6   6   0   0   1   1   4   4   2   2   3
ν3     8   9   9   7   7  10  10  11  11   5   5   8

Destination Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8  λ9 λ10 λ11
ν0     0   5  10   3   4   9   2   7   8   1   6  11
ν1     0   5  10   3   4   9   2   7   8   1   6  11
ν2     0   5  10   3   4   9   2   7   8   1   6  11
ν3     0   5  10   3   4   9   2   7   8   1   6  11

Figure 6: Mapping 2

Aggregates: We now introduce the notion of aggregates that are used to define the cost of a mapping. Recall that each channel $(\nu_u, \lambda_v)$ of the slab corresponds to a mode $\nu_u$ and a wavelength $\lambda_v$. Fix a mapping of edges of the weak topology to channels of the slab.

Definition 2 Let C be a set of channels of the slab. For any processor index $0 \le i < N$, the doublet (C, i) is called a source (resp., destination) mode aggregate iff the following conditions hold.
(a) $\forall (\nu_u, \lambda_v), (\nu_{u'}, \lambda_{v'}) \in C$, $\nu_u = \nu_{u'}$; that is, all channels of C have the same mode.
(b) Let $0 \le v < v' < W$. If $(\nu_u, \lambda_v), (\nu_u, \lambda_{v'}) \in C$, then $\forall v''$ with $v < v'' < v'$, $(\nu_u, \lambda_{v''}) \in C$; that is, the channels of C have adjacent wavelengths.
(c) If $(\nu_u, \lambda_v) \in C$, then Src(u, v) = i (resp., Dst(u, v) = i); that is, an entry of the source (resp., destination) array corresponding to a channel of C contains the value i.
Similarly, (C, i) is called a source (resp., destination) wavelength aggregate iff (a) all channels of C have the same wavelength, (b) the channels of C have adjacent modes and (c) an entry of Src (resp., Dst) corresponding to a channel of C has value i.

The source and destination arrays of Figure 6 show mode and wavelength aggregates. An aggregate (C, i) is trivial iff C contains only one channel. An aggregate (C, i) is maximal iff there is no other aggregate (C′, i) such that $C \subset C'$. All the circled aggregates of Figure 6 are maximal and non-trivial. The four uncircled aggregates in the source array of this figure are both trivial and maximal.

The significance of a source mode aggregate (C, i) is that a single tunable laser controlled by processor $p_i$ can generate signals for all the channels of C. Since all entries of the source array corresponding to channels of C contain index i, each channel of C has processor $p_i$ as source. For a weak topology, only one of these channels can be used at a time. Therefore, a single laser can take turns to tune to different wavelengths. The fact that these wavelengths are adjacent makes this tuning feasible§. For example, consider the aggregate in the top left corner of the source array of Mapping 2 (Figure 6). Here a tunable laser of processor $p_1$ tunes between wavelengths $\lambda_0$ and $\lambda_1$. The laser itself is positioned to transmit in mode $\nu_0$. We assume that a tunable laser can tune rapidly between wavelengths in the aggregate. A source wavelength aggregate (C, i) has a similar interpretation only if the laser can switch between the modes of the channels in C.

At the destination array, a mode or wavelength aggregate (C, i) has a similar interpretation. A single photodetector covering all the spots corresponding to C suffices to detect information destined for processor $p_i$. Tuning (to a mode or a wavelength) is not necessary; an inactive channel has no light and therefore, a logical OR of the signals of all channels in the aggregate provides the correct value at any fixed communication step. The leftmost aggregate in the destination array of Figure 6 corresponds to a single photodetector that detects all channels destined for $p_0$. The above example presents another interesting aspect of the mapping. Note that all the destination aggregates cover all four modes. This obviates the need to separate the modes. That is, the mode demultiplexing hardware shown in Figure 4 is no longer needed. We now state some assumptions that define a "standard mapping."

1. For any vertex v of the graph, define its in-degree $\delta_i(v)$ (resp., out-degree $\delta_o(v)$) to be the number of incoming (resp., outgoing) edges of v. Define the in-degree $D_i$ (resp., out-degree $D_o$) of the graph to be $\max\{\delta_i(v) : v \in V\}$ (resp., $\max\{\delta_o(v) : v \in V\}$). Define the degree D of the graph to be $\max\{D_i, D_o\}$. For an N-node topology, we will assume that the source and destination arrays are $D \times N$ arrays. That is, the mapping will use D modes and N wavelengths. For a d-dimensional torus, D = 2d. In Section 6 we touch upon the possibility of using different numbers of wavelengths and modes.

2. We will assume that each column of the $D \times N$ destination array forms a single wavelength aggregate as shown in Figure 6. This obviates the need for mode demultiplexing and requires N detectors (the smallest number possible for an N-node topology).

§ Currently, tunable lasers are quite expensive and their ability to change wavelengths quickly is also limited. However, this work establishes the formalism needed to exploit the possibility of reusing components across channels.

Definition 3 Let G be an N-node graph with degree D. A standard mapping for G is one that uses $D \times N$ arrays such that each column of the destination array is covered by a single aggregate.

Mapping 2 (Figure 6) is a standard mapping, whereas Mapping 1 (Figure 5) is not. In this paper we will only consider standard mappings.

Cost of a Mapping: Because all mappings we consider are standard, the number of detectors is the best possible (N) and the output optics is considerably simplified. Thus, the cost of a standard mapping is directly related to the number of lasers used. From the earlier discussion, the number of lasers equals the number of aggregates in the source array. In other words, the number of source aggregates is a good measure of the cost of the mappings presented in this paper. Mapping 1 of Figure 5, a non-standard mapping, uses 48 lasers and 48 photodetectors, whereas Mapping 2 (Figure 6), a standard mapping, employs only 26 lasers and 12 detectors. It also obviates the need for mode demultiplexing. Thus, the manner in which edges of the topology are mapped to channels of the slab can have a big impact on the cost and complexity of the system.
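The laser counts quoted above can be reproduced with a small sketch. It assumes that lasers tune only across wavelengths, so the cost is the number of maximal source mode aggregates (maximal runs of equal adjacent entries in each row of Src). The function names and the list-of-rows encoding of the arrays are illustrative, not from the paper.

```python
def count_mode_aggregates(array):
    """Number of maximal runs of equal adjacent entries, summed over rows."""
    total = 0
    for row in array:
        # A non-empty row has one run plus one more for every change of value.
        total += 1 + sum(1 for a, b in zip(row, row[1:]) if a != b)
    return total

def is_standard(dst):
    """True iff every column of the destination array holds a single index."""
    return all(len(set(column)) == 1 for column in zip(*dst))

# Mapping 1 (Figure 5): row u of Src cycles through u, u + 4, u + 8.
SRC_1 = [[u + 4 * (v % 3) for v in range(12)] for u in range(4)]
DST_1 = [[1, 8, 11, 8, 5, 4, 3, 0, 9, 4, 7, 0],
         [0, 1, 10, 5, 4, 1, 2, 9, 8, 9, 6, 5],
         [10, 5, 6, 3, 10, 11, 6, 7, 2, 1, 2, 9],
         [7, 4, 3, 2, 3, 10, 11, 6, 7, 0, 11, 8]]

# Mapping 2 (Figure 6).
SRC_2 = [[1, 1, 2, 2, 5, 5, 3, 3, 0, 0, 7, 7],
         [4, 4, 11, 11, 8, 8, 6, 6, 9, 9, 10, 10],
         [3, 6, 6, 0, 0, 1, 1, 4, 4, 2, 2, 3],
         [8, 9, 9, 7, 7, 10, 10, 11, 11, 5, 5, 8]]
DST_2 = [[0, 5, 10, 3, 4, 9, 2, 7, 8, 1, 6, 11] for _ in range(4)]

lasers_1 = count_mode_aggregates(SRC_1)   # 48
lasers_2 = count_mode_aggregates(SRC_2)   # 26
```

With these arrays, Mapping 1 needs 48 lasers and is not standard, while Mapping 2 is standard and needs 26.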

4 Cost Lower Bounds for Mapping Weak Tori

In this section we derive non-trivial lower bounds on the cost of a standard mapping of a d-dimensional, N-processor, weak torus on a slab waveguide. Here the arrays Src and Dst will be of size $2d \times N$, with rows and columns numbered $0, 1, \ldots, 2d-1$ and $0, 1, \ldots, N-1$, respectively. We assume that each dimension has size at least 3. This is only to avoid the complications presented by multiple edges.

Lemma 1 In a standard mapping of a torus, the source array does not contain any non-trivial wavelength aggregates.

Proof: Suppose there is a non-trivial, source, wavelength aggregate (C, i). Then there is a wavelength $\lambda_v$ and two modes $\nu_u$ and $\nu_{u+1}$ such that Src(u, v) = Src(u + 1, v) = i. Because the mapping is standard, Dst(u, v) = Dst(u + 1, v) = j (say). This implies that there are two edges from node $p_i$ to $p_j$ in the torus. This is impossible with the assumption that each dimension has size at least 3.

Theorem 2 A standard mapping of an N-node, 1-dimensional, weak torus (ring) requires N detectors and at least $2\lceil N/2 \rceil$ lasers.

Proof: We only need to prove the bound on the number of lasers. By Lemma 1, all source aggregates are mode aggregates. Since each node of a ring has an out-degree of 2, each processor index appears exactly twice in the array Src. Therefore, all source aggregates have a size of at most 2. If N is even, all 2N elements of Src can be covered by $N = 2\lceil N/2 \rceil$ non-trivial aggregates (see Figure 7(a)). For odd N, at least two of the entries must have trivial aggregates (see Figure 7(b)), giving a total of $2 \cdot \frac{N-1}{2} + 2 = 2\lceil N/2 \rceil$ aggregates (or lasers).

Figure 7: An illustration of the proof of Theorem 2. Part (a) shows a source array with modes ν0, ν1 and wavelengths λ0 to λ7; part (b) shows one with wavelengths λ0 to λ6.

Definition 4 Let X be a source or destination array. For any $0 \le v_1 \le v_2 < N$, the $[v_1, v_2]$-restriction of X is the part of X consisting only of columns $v_1, v_1 + 1, \ldots, v_2$.

Figure 8: A restriction of an array (columns $v_1$ to $v_2$ of columns 0 to N − 1) showing complete and incomplete aggregates $A_1$, $A_2$ and $A_3$.

Figure 8 shows a $[v_1, v_2]$-restriction of an array. Let X be a source or destination array and let Y be a $[v_1, v_2]$-restriction of X. An aggregate of X that lies entirely within Y is called a complete aggregate of Y (for example, aggregate $A_1$ of Figure 8). An aggregate of X that lies partially within Y is called an incomplete aggregate of Y (for example, aggregate $A_2$ of Figure 8). An incomplete aggregate must cross at least one of the two borders (columns $v_1$ and $v_2$) of Y. If it crosses column $v_1$ (resp., column $v_2$), then it is called a $v_1$-incomplete aggregate (resp., $v_2$-incomplete aggregate); aggregate $A_2$ of Figure 8 is a $v_1$-incomplete aggregate. Of course, an aggregate of X may not overlap with Y (for example, aggregate $A_3$ of Figure 8).

Lemma 3 For a standard mapping of an N-processor torus, let $S_v$ be a [v, v + 1]-restriction of the array Src, where $0 \le v < N - 1$. Then $S_v$ has at most two non-trivial aggregates.

Proof: By Lemma 1, all non-trivial aggregates of $S_v$ are mode aggregates. Moreover, since $S_v$ has only two columns, each non-trivial aggregate must be of size 2. Suppose $S_v$ has three non-trivial aggregates in rows $u_1$, $u_2$ and $u_3$ (say). Let the processor indices in these aggregates be $i_1$, $i_2$ and $i_3$, respectively. That is, $\mathrm{Src}(u_1, v) = \mathrm{Src}(u_1, v+1) = i_1$, $\mathrm{Src}(u_2, v) = \mathrm{Src}(u_2, v+1) = i_2$ and $\mathrm{Src}(u_3, v) = \mathrm{Src}(u_3, v+1) = i_3$ (see Figure 9(a)). Keeping in mind that each column of the array Dst forms an aggregate, let Dst(u, v) = j and Dst(u, v + 1) = k, for all rows u (see Figure 9(b)). This implies that the graph of Figure 9(c) is a subgraph of the torus. It is easy to prove that this is not possible.

Figure 9: An illustration of the proof of Lemma 3. Part (a) shows columns v and v + 1 of Src, with rows $u_1$, $u_2$, $u_3$ holding $i_1$, $i_2$, $i_3$; part (b) shows the same columns of Dst, holding j and k; part (c) shows the resulting graph on $i_1$, $i_2$, $i_3$, j and k.

Lemma 4 For $d \ge 2$ and a standard mapping of an N-processor, d-dimensional, weak torus, let $S_{v_1,v_2}$ be a $[v_1, v_2]$-restriction of the array Src, where $0 \le v_1 \le v_2 < N$. Then,
(a) If $S_{v_1,v_2}$ contains $n_1$ $v_1$-incomplete aggregates, then $n_1 \le 2$.
(b) If $S_{v_1,v_2}$ contains $n_2$ $v_2$-incomplete aggregates, then $n_2 \le 2$.
(c) $S_{v_1,v_2}$ contains at least $2(v_2 - v_1)(d - 1) + 2d - (n_1 + n_2)$ complete aggregates.

Proof: By Lemma 1, all non-trivial aggregates of $S_{v_1,v_2}$ are mode aggregates. If $v_1 = 0$, then $S_{v_1,v_2}$ has no $v_1$-incomplete aggregate. If $v_1 \ge 1$, then by Lemma 3 there can be at most two non-trivial aggregates in the $[v_1 - 1, v_1]$-restriction of Src. This implies that $S_{v_1,v_2}$ can have at most two $v_1$-incomplete aggregates. This establishes part (a) and a similar argument establishes part (b).

For part (c) we proceed by induction on $v_2 - v_1$. Clearly, $0 \le v_2 - v_1 < N$. For $v_1 = v_2$, there is only one 2d-element column in the restriction and at least $2d - (n_1 + n_2) = 2(v_2 - v_1)(d - 1) + 2d - (n_1 + n_2)$ complete aggregates. Assuming part (c) to hold for any restriction of x columns (where $1 \le x < N$), consider a restriction of x + 1 columns, that is, $v_2 - v_1 = x \ge 1$ (see Figure 10). Here $v_2 > v_1$. With the induction hypothesis, let the $[v_1, v_2 - 1]$-restriction, $S'$, of Src have $n_1$ $v_1$-incomplete aggregates and $n'$ $(v_2 - 1)$-incomplete aggregates. Therefore, $S'$ has at least $2(v_2 - v_1 - 1)(d - 1) + 2d - (n_1 + n')$ complete aggregates (that are also complete aggregates of $S_{v_1,v_2}$). Clearly, $S_{v_1,v_2}$ has $n_1$ $v_1$-incomplete aggregates. Let it have $n_2$ $v_2$-incomplete aggregates. If any of the $n'$ $(v_2 - 1)$-incomplete aggregates of $S'$ does not become a complete aggregate of $S_{v_1,v_2}$, then it must also be a $v_2$-incomplete aggregate of $S_{v_1,v_2}$. Let $n'' \le n'$ of the $(v_2 - 1)$-incomplete aggregates of $S'$ become complete aggregates of $S_{v_1,v_2}$. The $n' - n''$ remaining incomplete aggregates of $S'$ are included in the $n_2$ $v_2$-incomplete aggregates of $S_{v_1,v_2}$. The number of complete aggregates in column $v_2$, not including the $n''$ mentioned above, is $2d - n_2 - n''$. So the total number of complete aggregates in $S_{v_1,v_2}$ is at least

$$\underbrace{2(v_2 - v_1 - 1)(d - 1) + 2d - (n_1 + n')}_{\text{complete in } S'} + \underbrace{n''}_{\text{incomplete in } S'} + \underbrace{2d - n_2 - n''}_{\text{column } v_2}.$$

This quantity equals $2(v_2 - v_1)(d - 1) + 2d - (n_1 + n_2) + (2 - n') \ge 2(v_2 - v_1)(d - 1) + 2d - (n_1 + n_2)$, as $n' \le 2$ by part (b) of this lemma.

Figure 10: An illustration of the proof of Lemma 4, showing columns $v_1$, $v_2 - 1$ and $v_2$ and the aggregate counts $n_1$, $n'$, $n''$ and $n_2$.
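As an illustrative check (not a proof), the sketch below evaluates Lemma 4 on every restriction of the source array of Mapping 2 (Figure 6), taking the aggregates to be the maximal runs of equal adjacent entries in each row. The helper names are hypothetical.

```python
# Source array of Mapping 2 (Figure 6): a standard mapping of the 3 x 4 torus.
SRC = [[1, 1, 2, 2, 5, 5, 3, 3, 0, 0, 7, 7],
       [4, 4, 11, 11, 8, 8, 6, 6, 9, 9, 10, 10],
       [3, 6, 6, 0, 0, 1, 1, 4, 4, 2, 2, 3],
       [8, 9, 9, 7, 7, 10, 10, 11, 11, 5, 5, 8]]

def runs(row):
    """Maximal runs of equal adjacent entries as (first_col, last_col)."""
    out, start = [], 0
    for v in range(1, len(row) + 1):
        if v == len(row) or row[v] != row[start]:
            out.append((start, v - 1))
            start = v
    return out

def classify(src, v1, v2):
    """Return (complete, n1, n2) for the [v1, v2]-restriction of src."""
    complete = n1 = n2 = 0
    for row in src:
        for a, b in runs(row):
            if b < v1 or a > v2:
                continue                  # no overlap with the restriction
            crosses_left, crosses_right = a < v1, b > v2
            n1 += crosses_left
            n2 += crosses_right
            complete += not (crosses_left or crosses_right)
    return complete, n1, n2

d, n = len(SRC) // 2, len(SRC[0])
lemma_holds = True
for v1 in range(n):
    for v2 in range(v1, n):
        complete, n1, n2 = classify(SRC, v1, v2)
        bound = 2 * (v2 - v1) * (d - 1) + 2 * d - (n1 + n2)
        lemma_holds &= (n1 <= 2 and n2 <= 2 and complete >= bound)
```

For the full array ($v_1 = 0$, $v_2 = 11$) the count of complete aggregates and the bound of part (c) both come to 26.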

Theorem 5 For $d \ge 2$, a standard mapping of an N-processor, d-dimensional, weak torus requires N detectors and at least $2N(d - 1) + 2$ lasers.

Proof: The array Src is its own [0, N − 1]-restriction and contains no incomplete aggregates. The result follows from Lemma 4 with $v_2 - v_1 = N - 1$ and $n_1 = n_2 = 0$.

In the next section we present standard mappings that "nearly achieve" these lower bounds.
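The two lower bounds (Theorems 2 and 5) can be collected in a small helper for quick arithmetic. This is a sketch; the function names are illustrative, and every dimension is assumed to have size at least 3, as in this section.

```python
from math import prod

def laser_lower_bound(dims):
    """Lower bound on lasers for a standard mapping of a weak torus.

    dims lists the dimension sizes (each assumed to be at least 3).
    A ring (d = 1) needs at least 2 * ceil(N / 2) lasers (Theorem 2);
    for d >= 2 the bound is 2N(d - 1) + 2 (Theorem 5).
    """
    n, d = prod(dims), len(dims)
    if d == 1:
        return 2 * ((n + 1) // 2)
    return 2 * n * (d - 1) + 2

def naive_laser_count(dims):
    """One laser per edge: 2dN lasers."""
    return 2 * len(dims) * prod(dims)
```

For the 3 × 4 torus the bound is 26 lasers against 48 for the naive mapping; Mapping 2 of Figure 6 meets the bound.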

5 Mappings for Multidimensional Tori

Throughout this section, we assume an $N_0 \times N_1 \times \cdots \times N_{d-1}$, d-dimensional torus, with $N = N_0 N_1 \cdots N_{d-1}$. As we noted before, the mapping will be a standard one employing 2d modes and N wavelengths, and using N detectors. We now outline the mapping for a ring, a 2-dimensional torus and a multidimensional torus. For brevity, our discussion relies more on examples than on formal proofs; details are available in Sethuraman [9]. We will specify the mapping by first assigning a processor index to each column of the array Dst; recall that in a standard mapping, each column of Dst forms a wavelength aggregate. Then we will specify element Src(u, v) by specifying which neighbor of Dst(u, v) it is.

5.1 1-Dimensional Torus (Ring)

Figure 11 shows the Src and Dst arrays for rings with 7 and 8 processors. Intuitively, the first $\lceil N/2 \rceil$ columns of Dst are assigned the even indices (in ascending order), and the last $\lfloor N/2 \rfloor$ columns are assigned the odd indices. The element Src$(u, v)$ is assigned the left neighbor of Dst$(u, v)$ if $u + v$ is odd, and its right neighbor if $u + v$ is even.

(a) 7-processor ring

Source Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6
ν0     1   1   5   5   4   4   6
ν1     6   3   3   0   0   2   2

Destination Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6
ν0     0   2   4   6   1   3   5
ν1     0   2   4   6   1   3   5

(b) 8-processor ring

Source Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7
ν0     1   1   5   5   2   2   6   6
ν1     7   3   3   7   0   4   4   0

Destination Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7
ν0     0   2   4   6   1   3   5   7
ν1     0   2   4   6   1   3   5   7

Figure 11: Mapping a ring.
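As a rough illustration of the ring rule described above, the following Python sketch builds the two arrays and counts maximal runs of equal adjacent entries in each row of Src. It assumes that the left and right neighbors of processor $i$ are $(i-1) \bmod N$ and $(i+1) \bmod N$, and that each such run can be served by a single tunable laser. The function names are hypothetical and do not come from the paper.

```python
def ring_mapping(n):
    """Build Src and Dst (2 modes x n wavelengths) for an n-processor ring."""
    # Columns of Dst: even indices in ascending order, then odd indices.
    columns = list(range(0, n, 2)) + list(range(1, n, 2))
    dst = [columns[:], columns[:]]  # both rows of Dst hold the same indices
    src = [[0] * n for _ in range(2)]
    for u in range(2):
        for v in range(n):
            i = dst[u][v]
            if (u + v) % 2 == 1:
                src[u][v] = (i - 1) % n  # assumed left neighbor
            else:
                src[u][v] = (i + 1) % n  # assumed right neighbor
    return src, dst


def count_runs(row):
    """Number of maximal runs of equal adjacent entries in a row."""
    return sum(1 for k, x in enumerate(row) if k == 0 or x != row[k - 1])


def ring_lasers(n):
    """Laser count, assuming one tunable laser per run in each row of Src."""
    src, _ = ring_mapping(n)
    return sum(count_runs(row) for row in src)
```

Under these assumptions, `ring_lasers(7)` and `ring_lasers(8)` evaluate to 8 and 10, in line with the $N + 1$ and $N + 2$ counts of Theorem 6.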

Theorem 6 An $N$-processor ring can be mapped on a slab with 2 modes and $N$ wavelengths using $N$ detectors (without mode demultiplexing hardware) and $N + 2$ (resp., $N + 1$) tunable lasers, if $N$ is even (resp., odd).

Remark: This mapping is optimal for odd $N$ and within 2 of the optimal for even $N$.

5.2 2-Dimensional Torus

Number the processors of the torus from 0 to $N - 1$ in row major order (as in Figure 1(b)). For any processor $i$ in row $r$ and column $c$ of the torus, its neighbors are defined by the functions $f_0, f_1, f_2, f_3$ as follows. Processor $f_0(i)$ is the left neighbor of $i$ in row $r$ and column $(c - 1) \pmod{N_1}$ (consistent with the wraparound connections of the torus). Similarly, $f_1(i)$, $f_2(i)$ and $f_3(i)$ are the neighbors of $i$ to its right, top and bottom, respectively. Figure 12 shows the Src and Dst arrays for mapping $3 \times 4$ and $3 \times 3$ tori.

(a) $3 \times 4$ torus

Source Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8  λ9  λ10 λ11
ν0     3   9   9   7   7   1   1  11  11   5   5   3
ν1     1   1  11  11   5   5   3   3   9   9   7   7
ν2     8   6   6   0   0  10  10   4   4   2   2   8
ν3     4   4   2   2   8   8   6   6   0   0  10  10

Destination Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8  λ9  λ10 λ11
ν0     0   5  10   3   4   9   2   7   8   1   6  11
ν1     0   5  10   3   4   9   2   7   8   1   6  11
ν2     0   5  10   3   4   9   2   7   8   1   6  11
ν3     0   5  10   3   4   9   2   7   8   1   6  11

(b) $3 \times 3$ torus

Source Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8
ν0     2   7   7   0   8   8   1   6   6
ν1     1   1   6   2   2   7   0   0   8
ν2     6   5   5   7   3   3   8   4   4
ν3     3   3   2   4   4   0   5   5   1

Destination Array
      λ0  λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8
ν0     0   4   8   1   5   6   2   3   7
ν1     0   4   8   1   5   6   2   3   7
ν2     0   4   8   1   5   6   2   3   7
ν3     0   4   8   1   5   6   2   3   7

Figure 12: Mapping a 2-dimensional torus.

Let $I = \{0, 1, \ldots, N - 1\}$. Consider the function $\pi : I \to I$ described by the procedure below. Initialize $\pi(0) \leftarrow 0$. Proceed as described below until $\pi(i)$ has been assigned for all indices $i \in I$. Let the procedure have just assigned $\pi(j) \leftarrow v$. Let $k$ be the processor that is diagonally to the bottom-right of $j$ (consistent with the wraparound connections of the torus). That is, $k = f_3(f_1(j))$. If $\pi(k)$ has not yet been assigned, then set $\pi(k) \leftarrow v + 1$; otherwise, pick the smallest $\ell$ such that $\pi(\ell)$ has not been assigned and set $\pi(\ell) \leftarrow v + 1$.

The function $\pi$ is a permutation of the set $I$ and therefore, it partitions $I$ into cycles. For positive integers $a, b$, let $\ell cm(a, b)$ denote their lowest common multiple; that is, $\ell cm(a, b)$ is the smallest positive integer that is divisible by both $a$ and $b$. It can be shown that $\pi$ partitions $I$ into $\frac{N}{\ell cm(N_0, N_1)}$ cycles, each of size $\ell cm(N_0, N_1)$. This quantity will impact the cost of the mapping.

Column $v$ of Dst is assigned processor index $i$ iff $\pi(i) = v$. Let Dst$(u, v) = i$. If $v$ is odd, then Src$(u, v) = f_{3-u}(i)$; if $v$ is even, then Src$(u, v) = f_u(i)$. Each of the $\frac{N}{\ell cm(N_0, N_1)}$ cycles of $\pi$ induces the pattern of aggregates shown in Figure 12(a) for the $3 \times 4$ torus.
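The construction above can be sketched in Python as follows. This is only an illustration under stated assumptions: the torus has $N_0$ rows and $N_1$ columns numbered in row major order, "top" and "bottom" mean rows $r-1$ and $r+1$ (modulo $N_0$), and each maximal run of equal adjacent entries in a row of Src is taken to need one tunable laser. All function names are hypothetical.

```python
from math import gcd


def torus_mapping(n0, n1):
    """Sketch of Src and Dst (4 modes x N wavelengths) for an n0 x n1 torus."""
    n = n0 * n1

    def neighbor(which, i):
        r, c = divmod(i, n1)  # row major numbering (assumed)
        if which == 0:
            return r * n1 + (c - 1) % n1      # f0: left
        if which == 1:
            return r * n1 + (c + 1) % n1      # f1: right
        if which == 2:
            return ((r - 1) % n0) * n1 + c    # f2: top
        return ((r + 1) % n0) * n1 + c        # f3: bottom

    # Follow the bottom-right diagonal; when it returns to an assigned
    # processor, restart from the smallest unassigned index.
    column_of = {0: 0}
    j = 0
    for v in range(1, n):
        k = neighbor(3, neighbor(1, j))
        if k in column_of:
            k = min(i for i in range(n) if i not in column_of)
        column_of[k] = v
        j = k

    order = sorted(column_of, key=column_of.get)  # processor in each column
    dst = [order[:] for _ in range(4)]
    # Odd columns use f_{3-u}, even columns use f_u.
    src = [[neighbor(3 - u if v % 2 else u, order[v]) for v in range(n)]
           for u in range(4)]
    return src, dst


def count_runs(row):
    """Number of maximal runs of equal adjacent entries in a row."""
    return sum(1 for k, x in enumerate(row) if k == 0 or x != row[k - 1])


def torus_lasers(n0, n1):
    """Laser count, assuming one tunable laser per run in each row of Src."""
    src, _ = torus_mapping(n0, n1)
    return sum(count_runs(row) for row in src)


def predicted_lasers(n0, n1):
    """Closed form 2N(1 + 1/lcm(n0, n1)), evaluated with integers."""
    n = n0 * n1
    lcm = n0 * n1 // gcd(n0, n1)
    return 2 * n + 2 * n // lcm
```

For the $3 \times 4$ torus this sketch reproduces the arrays of Figure 12(a) and counts 26 lasers; for the $3 \times 3$ torus it counts 24. Both values agree with `predicted_lasers`.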

Theorem 7 An $N$-processor, $N_0 \times N_1$ torus can be mapped on a slab with 4 modes and $N$ wavelengths using $N$ detectors (without mode demultiplexing hardware) and $2N\left(1 + \frac{1}{\ell cm(N_0, N_1)}\right)$ tunable lasers.

Remark: This mapping is optimal when $N_0$ and $N_1$ are relatively prime (see Theorem 5). Even in the worst case of $N_0 = N_1 = \sqrt{N}$, the number of lasers is $2N + 2\sqrt{N}$, which can be substantially smaller than $4N$, the number required with a naive mapping.

5.3 Higher Dimensional Tori

The basic idea is to express the $d$-dimensional torus (where $d > 2$) as a set of $N_2 N_3 \cdots N_{d-1}$ tori (each of size $N_0 \times N_1$) and proceed as in the 2-dimensional case. The Src and Dst arrays have $2d$ rows, of which only the first 4 are assigned as in the 2-dimensional case. This also specifies all the entries of Dst. The remaining $2d - 4$ rows of Src are assigned arbitrarily (without any assurance of forming non-trivial aggregates). It is now easy to derive the following result.

Theorem 8 For $d \ge 2$, an $N$-processor, $N_0 \times N_1 \times \cdots \times N_{d-1}$ torus can be mapped on a slab with $2d$ modes and $N$ wavelengths using $N$ detectors (without mode demultiplexing hardware) and $2N\left(d - 1 + \frac{1}{L}\right)$ tunable lasers, where $L = \max\{\ell cm(N_i, N_j) : 0 \le i < j < d\}$.

Remark: This mapping exceeds the lower bound of Theorem 5 by $\frac{2N}{L} - 2$. Since $L \ge \max\{N_i : 0 \le i < d\} \ge N^{1/d}$, the deviation from the optimal is at most $2\left(N^{1 - 1/d} - 1\right) < 2(N(d + 1) - 1)$, the deviation from the optimal for a naive mapping.
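To make the arithmetic in Theorem 8 and the remark concrete, the sketch below evaluates the laser count $2N(d - 1 + 1/L)$, the lower bound $2N(d-1) + 2$ of Theorem 5, and their difference for a given list of dimension sizes. The function names and the example dimensions are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations
from math import gcd, prod


def lcm(a, b):
    """Lowest common multiple of two positive integers."""
    return a * b // gcd(a, b)


def laser_counts(dims):
    """Return (mapping lasers, lower bound, deviation) for a torus with sizes dims."""
    d = len(dims)
    n = prod(dims)
    # L is the largest pairwise lcm of the dimension sizes (needs d >= 2).
    big_l = max(lcm(a, b) for a, b in combinations(dims, 2))
    # L divides N, so 2N/L is an integer.
    mapping = 2 * n * (d - 1) + 2 * n // big_l
    lower_bound = 2 * n * (d - 1) + 2
    return mapping, lower_bound, mapping - lower_bound
```

For example, `laser_counts((2, 3, 4))` has $N = 24$, $d = 3$ and $L = 12$, giving 100 lasers against a lower bound of 98, so the deviation is $2N/L - 2 = 2$.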

6 Concluding Remarks

In this paper we have addressed the problem of bridging the gap between the system and medium bandwidths of optical interconnects. Specifically, our results are for mapping weak tori on slab waveguides. We have proposed the idea of aggregates to capture the cost of our mapping and derived a non-trivial lower bound on this cost. We have presented methods that are all less expensive than a naive mapping and some of which match the lower bound.

We note that for all the results presented in this paper there exist dual results in which Src and Dst are interchanged and/or modes and wavelengths are interchanged. The standard mappings we have proposed always use $N$ wavelengths and $2d$ modes. This somewhat restricts the choices available for the system designer. It is possible to use our results to trade modes off for wavelengths and vice versa (in many cases without any additional cost) [9].

This work opens up several possible directions of future research, including "non-standard mappings" and the introduction of additional (phantom) nodes and edges to the topology to reduce the cost. For example, by converting a $\sqrt{N} \times \sqrt{N}$ torus to a $\sqrt{N} \times (\sqrt{N} + 1)$ torus, the value of $\ell cm(N_0, N_1)$ increases from $\sqrt{N}$ to $N + \sqrt{N}$. This could substantially reduce the cost of the proposed mapping.

References

[1] M. Châteauneuf, A. G. Kirk, D. V. Plant, T. Yamamoto, and J. D. Ahearn, "512-Channel Vertical-Cavity Surface-Emitting Laser Based Free-Space Optical Link," Appl. Optics, vol. 41, (2002), pp. 5552-5561.

[2] S.-Y. Cho, M. A. Brooke, and N. M. Jokerst, "Optical Interconnections on Electrical Boards Using Embedded Active Optoelectronic Components," IEEE J. Selected Topics in Quantum Electronics, vol. 9, no. 2, (March/April 2003), pp. 465-476.

[3] M. Feldman, R. Vaidyanathan, and A. El-Amawy, "High Speed, High Capacity Bused Interconnects Using Optical Slab Waveguides," Proc. Workshop on Optics and Comp. Science (Springer-Verlag Lecture Notes in Comp. Science vol. 1586), (1999), pp. 924-937.

[4] H. Forsberg, M. Jonsson, and B. Svensson, "A Scalable and Pipelined Embedded Signal Processing System Using Optical Hypercube Interconnects," SPIE Opt. Networks Magazine, vol. 4, no. 4, (July/August 2003), pp. 35-49.

[5] G. R. Fowles, Introduction to Modern Optics, Dover Publications, 2nd edition, 1989.

[6] J.-H. Ha and T. M. Pinkston, "SPEED DMON: Cache Coherence on an Optical Multichannel Interconnect Architecture," J. Parallel and Distrib. Comput., vol. 41(1), (1997), pp. 78-91.

[7] X. Han, G. Kim, and R. T. Chen, "Demonstration of the Centralized Optical Backplane Architecture in a Three-Board Microprocessor-to-Memory Interconnect System," Optics and Laser Tech., vol. 35, (2003), pp. 127-131.

[8] International Technology Roadmap for Semiconductors, 2003, http://public.itrs.net

[9] K. Sethuraman, "Mapping Weak Multidimensional Torus Communications on Optical Slab Waveguides," M.S. Thesis, Dept. of Electrical & Computer Engineering, Louisiana State University, (2005).

[10] G. Kim, X. Han, and R. T. Chen, "Crosstalk and Interconnection Distance Considerations for Board-to-Board Optical Interconnects Using 2-D VCSEL and Microlens Array," IEEE Photonics Tech. Letters, vol. 12, no. 6, (June 2000), pp. 743-745.

[11] J.-S. Kim and J.-J. Kim, "Stacked Polymeric Multimode Waveguide Arrays for Two-Dimensional Optical Interconnects," J. Lightwave Tech., vol. 22, no. 3, (March 2004), pp. 840-844.

[12] C. G. Plaxton, "Load Balancing, Selection and Sorting on the Hypercube," Proc. Symp. on Parallel Algs. and Arch., (1989), pp. 64-73.

[13] W. Mao and J. M. Kahn, "Free-Space Heterochronous Imaging Reception of Multiple Optical Signals," IEEE Trans. Comm., vol. 52, no. 2, (February 2004), pp. 269-279.

[14] J. D. Meindl, J. A. Davis, P. Zarkesh-Ha, C. S. Patel, K. P. Martin, and P. A. Kohl, "Interconnect Opportunity for Gigascale Integration," IBM J. Research and Development, vol. 46, no. 2/3, (March/May 2002), pp. 245-263.

[15] D. A. B. Miller and H. M. Ozaktas, "Limit to the Bit-Rate Capacity of Electrical Interconnects from the Aspect Ratio of the System Architecture," J. Parallel and Distrib. Comput., vol. 41, issue 1, (February 1997), pp. 42-52.

[16] E. Mohammed, A. Alduino, T. Thomas, H. Braunisch, D. Lu, J. Heck, A. Liu, I. Young, B. Barnett, G. Vandentop, and R. Mooney, "Optical Interconnect System Integration for Ultra-Short-Reach Applications," Intel Tech. J., vol. 8, issue 2, (May 2004), pp. 115-128.

[17] A. Naeemi, A. V. Mule, and J. D. Meindl, "Partition Length Between Board-Level Electrical and Optical Interconnects," Proc. IEEE Int. Interconnection Technology Conf., (June 2003), pp. 230-232.

[18] B. E. Nelson, G. A. Keeler, D. Agarwal, N. C. Helman, and D. A. B. Miller, "Wavelength Division Multiplexed Optical Interconnect Using Short Pulses," IEEE J. Selected Topics in Quantum Electronics, vol. 9, no. 2, (March/April 2003), pp. 496-491.

[19] K. Noguchi, Y. Koike, H. Tanobe, K. Harada, and M. Matsuoka, "Field Trial of Full-Mesh WDM Network (AWG-STAR) in Metropolitan/Local Area," J. Lightwave Tech., vol. 22, no. 2, (February 2004), pp. 329-336.

[20] C. Ou, J. Zhang, H. Zhang, L. H. Sahasrabuddhe, and B. Mukherjee, "New and Improved Approaches for Shared-Path Protection in WDM Mesh Networks," J. Lightwave Tech., vol. 22, no. 5, (May 2004), pp. 1223-1232.

[21] M. Raksapatcharawong and T. M. Pinkston, "An Optical Interconnect Model for k-ary n-cube Wormhole Networks," Proc. Int. Parallel Proc. Symp., (1996), pp. 666-672.

[22] M. Raksapatcharawong and T. M. Pinkston, "Modeling Free-Space Optical k-ary n-Cube Wormhole Networks," J. Parallel and Distrib. Comput., vol. 55(1), (1998), pp. 60-93.

[23] N. Savage, "Linking with Light," IEEE Spectrum, vol. 39, issue 8, (August 2002), pp. 32-36.

[24] K. J. Symington, J. F. Snowdon, and H. Schroeder, "High Bandwidth Dynamically Reconfigurable Architectures using Optical Interconnects," Proc. Int. Workshop on Field-Programmable Logic and Appls. (Springer-Verlag Lecture Notes in Comp. Science vol. 1673), (1999), pp. 411-416.

[25] R. Vaidyanathan and A. Padmanabhan, "Bus-Based Networks for Fan-in and Uniform Hypercube Algorithms," Par. Comput., vol. 21, (1995), pp. 1807-1821.

[26] R. Vaidyanathan and J. L. Trahan, Dynamic Reconfiguration: Architectures and Algorithms, Kluwer Academic/Plenum Publishers (Series in Computer Science), January 2004.

[27] Y. Yang and J. Wang, "Designing WDM Optical Interconnects with Full Connectivity by Using Limited Wavelength Conversion," Proc. Int. Parallel and Distrib. Proc. Symp., (2004). http://csdl.computer.org/comp/proceedings/ipdps/2004/2132/01/213210035aabs.htm

[28] I. A. Young, "Introducing Intel's Chip-to-Chip Optical I/O Technology," Technology@Intel Magazine, April 2004, www.intel.com/update/departments/initech/ito4o41.pdf.