IEEE TRANSACTIONS ON COMPUTERS


Modeling Yield, Cost, and Quality of a Spare-enhanced Multi-core Chip

Saeed Shamshiri, Student Member, IEEE, and Kwang-Ting (Tim) Cheng, Fellow, IEEE

Abstract—It becomes increasingly difficult to achieve a high manufacturing yield for multi-core chips due to larger chip sizes, higher device densities, and greater failure rates. By adding a limited number of spare cores and wires to replace defective cores and wires either before shipment or in the field, the effective yield of the chip and its overall cost can be significantly improved. In this paper, we first model the yield of a multi-core chip that incorporates both spare cores and spare wires. Then, we propose a quality metric for an NoC and model the system yield subject to a given quality constraint. We also model the manufacturing and service costs of a multi-core chip and show that a spare scheme can significantly improve the quality, increase the yield, reduce the overall cost, and substitute for the burn-in process. We illustrate that, in a spare-enhanced system-on-a-chip with high-quality in-field recovery capability, the reliance on high-quality manufacturing testing can be significantly reduced. We also demonstrate that the overall quality of a mesh-based NoC depends more on the reliability of the inner links than on the outer links; therefore, a non-uniform spare-wire distribution is sometimes more effective and cost-efficient than a uniform approach.

Index Terms—Fault tolerance, redundant design, reliability, system on a chip, yield and cost modeling.


1 INTRODUCTION

MULTI-CORE processors and systems-on-a-chip (SoCs) are quickly becoming the dominant architectures, driven by shrinking processes and diminishing returns from additional per-core complexity [1-5]. While individual core complexity is tapering off, larger die sizes, increasing device counts, and higher defect rates are leading to lower yields [6, 7]. A common approach to solving this yield problem has been the addition of redundancy [8]. This approach has become ubiquitous in memories, whose highly repetitive structures and high densities both enable and demand a high degree of defect tolerance [9, 10]. It has also been proposed for use in logic devices that have similarly repetitive structures, such as programmable logic arrays (PLAs) [8], field-programmable gate arrays (FPGAs) [11], and systolic array processors [12]. More recently, it has been used in multi-core chips [13-15]. Integrating spare cores into a multi-core chip is a proven method for improving the overall yield and reliability of the chip [13, 14]; however, it is crucial to consider the reliability of the network in addition to the cores [15]. For example, in the Intel 80-core processor [5], routers consume roughly 17% of the die area. The reliability of the mesh is even more important than this number implies, because if a router does not work, there is no access to the core that is connected solely to the failed router. Such cores are useless even if they are defect-free. Another failure scenario that

may happen in a network-on-a-chip (NoC) architecture is that the mesh is broken into two or more separate regions without any defect-free communication link to connect them. In this case, even if all cores and routers work perfectly, the chip may still fail.

In an on-chip network, roughly 80% of the communication faults are transient [16]. Different fault-tolerance approaches, such as Forward Error Control (FEC), Automatic Repeat reQuest (ARQ), and multi-path routing, have been used and compared in the literature for reliable on-chip transmission [17-19]. These approaches tolerate transient faults very well, but they are not efficient in the presence of permanent faults. Permanent faults on wires occur during manufacturing or in the field, and may cause the system to fail. Such system failures increase the manufacturing cost if they happen before shipment, or incur a large service cost if they happen in the field. These extra costs can be reduced by adding spare wires that can replace permanently faulty wires before and/or after shipment [15]. For the spare-wire mechanism to work, one can employ a hop-by-hop error-detection scheme, such as a Hamming code, that can also locate the faulty wire. Other fault-tolerant approaches, such as multi-path routing [19], end-to-end error control, or hop-by-hop detection-only checking, cannot locate the failed wire and thus cannot help with wire replacement. The idea of using spare wires in conjunction with Hamming codes was proposed in [17], but no yield and cost analysis of the system was provided.

In this paper, we provide a model for the yield and cost of a multi-core chip that is enhanced with spare cores and wires. We use an exemplary nine-core processor, as well as an Intel 80-core processor [5], to study the improvement of the overall yield and the reduction of the total cost due to the use of spare cores and wires. Our experimental results show that, with good in-field testing and sufficient spares for recovery, we can loosen the manufacturing test requirements and eliminate the burn-in process as well. Several papers have already shown that high-quality in-field testing is feasible, either by built-in self-test or by online testing [20-25]. Many of these approaches are software-based and incur negligible area overheads. Spare processors may even serve as a hardware checking device in a roving emulation scheme, as discussed in [26].

Previous studies have advocated the elimination of burn-in, in particular replacing it with high-stress quiescent supply current (IDDQ) testing [27-29]. In [27], the authors analyze the benefits of burn-in elimination and conclude that the substantial burn-in cost, reportedly 5% to 40% of the chip sale price, exceeds the IDDQ test cost plus the increased service cost from test escapes. Unfortunately, this technique becomes less useful in nanometer-scale devices, where high sub-threshold leakage current makes it difficult to identify defective chips. To eliminate the effect of the sub-threshold leakage, a differential IDDQ test technique was proposed [30]. However, the existing leakage-reduction techniques are not as effective in aggressively scaled technologies, where the intrinsic leakage of transistors and circuits depends strongly on the clock frequency, the temperature, and the optimal body bias (OBB) [31]. Therefore, IDDQ testing needs to be adapted to an environment of decreasing signal-to-noise ratio [32].

One of the contributions of this paper is an evaluation framework for assessing the feasibility of applying existing defect-tolerance methods to eliminate burn-in. We also propose a metric for network quality and investigate the trends of the yield and cost of a spare-enhanced NoC subject to a quality constraint. Some previous studies have modeled the yield and cost of multi-core chips [6, 8], but none have included burn-in or quality in their analysis. In the last part of this paper, we analyze how much each link of a mesh contributes to the overall reliability of an NoC. The center of the network affects the overall reliability of the mesh more significantly than do the edges. Based on this analysis, we propose a non-uniform distribution of spare wires throughout the mesh, and show that this option is, in general, better than the uniform scheme.

————————————————
• S. Shamshiri is with the Electrical and Computer Engineering Department, University of California, Santa Barbara, CA 93106. E-mail: [email protected]
• K.-T. (Tim) Cheng is with the Electrical and Computer Engineering Department, University of California, Santa Barbara, CA 93106. E-mail: [email protected]
Manuscript received May 31, 2009.

1.1 Paper Organization
The rest of the paper is organized as follows: Section 2 explains the components of a spare-enhanced NoC-based multi-core chip to be analyzed by our method. In Section 3, using a bottom-up approach, we analyze the yield of a block and a link based on the input parameters associated with their components. This bottom-up model, as well as the complete flow of the analysis and simulation described in this work, is shown in Fig. 1. In the proposed yield model, for different chip components such as cores, level-1 and level-2 caches, network interfaces, routers, and wires, we consider parameters such as the raw yield, defect density [33, 34], and defect coverage of manufacturing test. There may be different defect-coverage values for different components, depending on the testing strategy used. After calculating the yield of a block and a link, we measure the observed yield of the chip using a Monte Carlo simulation that checks the connectivity of fault-free cores. In Section 4, we define a quality metric and explain the implementation of the Monte Carlo simulation for quality measurement. In Section 5, using the shape and scale parameters of the in-field failure curve [36, 37] of each component and the defect coverage for the faults that happen in the field, we calculate the failure rate of each component within the warranty time. Using these analytical results, the Monte Carlo simulation injects faults into each component of the chip accordingly and simulates the in-field failure rate for a chip under a given quality constraint. The outputs of the simulation are used by another set of equations to compute the manufacturing and service costs of the chip. We also investigate the possibility of burn-in elimination for spare-enhanced multi-core chips. In Section 6, we propose a spare scheme in which the spare wires are not uniformly distributed. We compare the reliability and the cost of this new approach with those of a uniformly distributed spare scheme [15]. Finally, the paper concludes in Section 7.

Fig. 1. Analysis and modeling flow; a complex set of equations for the yield and cost models was previously developed in [14] and [15], and a complete set of equations and formulas is available online [35]. (The figure depicts the bottom-up yield model from components to blocks and links; the bathtub life-cycle curve of a chip [36], with its infant mortality, grace period, and breakdown phases, used for in-field failures; the Monte Carlo simulation for connectivity checking and quality measurement after manufacturing and in the field; and the cost models, in which the manufacturing cost CMan = [n·(CK + Ctest) + w·Clink + CL2] / y’soc and the service cost CSer = (1 – RF)·CF add up to the total cost.)

1.2 Contributions
The contributions of this paper include:
1. Modeling the burn-in process in our yield and cost analysis framework, and illustrating that the burn-in process can be eliminated and the manufacturing test constraints can be loosened for a spare-enhanced multi-core chip with an in-field reconfiguration mechanism.
2. Proposing a quality metric for an NoC and exploring the tradeoff between the yield and cost of a chip versus its quality.
3. Demonstrating that the contribution of each link of a mesh to the overall network reliability varies with the location of the link. We propose a non-uniform distribution of spare wires, which maximizes the overall reliability of the chip.

2 STRUCTURE OF A SPARE-ENHANCED MULTI-CORE CHIP

An NoC-based multi-core chip with a 2-D mesh architecture, as modeled in this paper, is shown in Fig. 2.a. The chip consists of a shared memory plus an array of processors, including spares, connected by a network-on-chip. The shared memory can also be a network of memory blocks, as is found in the TRIPS processor [38]. The reliability of the on-chip memory can be enhanced by employing error-correcting codes such as Hamming codes or higher-order codes [39]. Each tile of the array consists of a processor and a router, as illustrated in Fig. 2.b. Each router can be further divided into a switch fabric, control logic, four global ports connected to the four neighboring routers, and one local port connected to the local core. Each port has an input and an output buffer; the size of each buffer depends on the specific flow-control implementation (e.g., send-and-wait, sliding window) [16]. For the sake of analysis, we consider the global ports of the routers (i.e., p1, p2, p3, and p4 in Fig. 2.b) as parts of the communication links and exclude them from the computation block (see Figs. 2.c and 2.d). This makes all blocks identical, regardless of their locations on the mesh (i.e., in the middle, on the side, or in the corner). In addition to an output and an input port, a communication link consists of a bus, which is a group of data, control (e.g., req, ack, probe, etc.), parity (e.g., even/odd parity, CRC error detection, Hamming codes, or other error-correction approaches [40, 41]), and spare wires.

Fig. 2. A spare-enhanced system-on-a-chip.

To allocate the spare wires effectively, we need to distinguish between permanent and transient errors. We define a permanent fault as a fault that occurs a specific number of times, k, consecutively on the same wire. For k=2, using a register for each link that keeps track of the last transmission's faulty wire is sufficient to recognize a permanent fault. More hardware overhead is required to distinguish permanent faults for a larger k.
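The k-consecutive classification described above can be sketched in a few lines. The sketch below assumes k = 2 and one per-link register, as in the text; the class and method names are illustrative, not from the paper:

```python
# Sketch of the k = 2 permanent-fault classifier: one register per link
# remembers which wire (if any) the hop-by-hop error detector flagged in
# the previous transmission. Names here are illustrative, not the paper's.
class PermanentFaultDetector:
    def __init__(self):
        self.last_faulty_wire = None  # the per-link register

    def observe(self, faulty_wire):
        """faulty_wire: index of the wire located as faulty by the
        hop-by-hop Hamming decoder, or None for an error-free transfer.
        Returns the wire index when the fault is classified as permanent."""
        permanent = None
        if faulty_wire is not None and faulty_wire == self.last_faulty_wire:
            permanent = faulty_wire  # same wire failed twice in a row
        self.last_faulty_wire = faulty_wire
        return permanent
```

With this scheme, a transient glitch on a wire followed by a clean transfer is not flagged, while two consecutive errors on the same wire trigger replacement by a spare.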

3 MODELING THE OBSERVED YIELD

After decomposing the chip into small components, as explained in the previous section (i.e., Fig. 2), we use a bottom-up approach to model the yield of a block and a link. Fig. 1 shows the hierarchy of the components for the bottom-up yield model. To calculate the yield of a block, we need the yields of the processor and the router; for each of them, we first need to model the yields of their subcomponents (i.e., the core, network interface, and instruction and data caches for the processor, and the switch fabric, control logic, and ports for the router). The true yield of a core can be modeled by the clustering parameter (α), the defect density (λc), and the area of the core (Ac) using a negative binomial yield equation [33, 34, 42, 43]:

yc = (1 + λc·Ac/α)^(–α)    (1)

The clustering parameter (α) determines the degree to which the defects are clustered. A smaller value of α corresponds to a stronger clustering of defects, so a greater yield would be expected. On the other hand, a larger α corresponds to weaker defect clustering, resulting in a lower yield. The range of α for current technologies is between 0.3 and 5.0 [34]. Equation (1) shows the true yield of the core, but not the observed yield. The observed core yield is the probability of a core passing the manufacturing test, which is greater than the true yield due to the imperfect testing process. Ideally, if the defect coverage of the manufacturing test is 100%, then the observed yield is equal to the true yield. We can model the observed yield of the core by adding the defect coverage of the manufacturing testing (Ωc) into (1):

y’c = (1 + λc·Ac·Ωc/α)^(–α)    (2)
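As a numerical sketch, the negative binomial expressions (1) and (2) translate directly into code; the function and parameter names are ours, and any values plugged in are illustrative rather than the paper's:

```python
# Sketch of the negative binomial yield model of equations (1) and (2).
def true_yield(defect_density, area, alpha):
    """Eq. (1): yc = (1 + lambda_c * A_c / alpha) ** (-alpha)."""
    return (1.0 + defect_density * area / alpha) ** (-alpha)

def observed_yield(defect_density, area, alpha, defect_coverage):
    """Eq. (2): the defect coverage Omega_c scales the effective defect
    count, so imperfect testing (Omega_c < 1) raises the observed yield
    above the true yield."""
    return (1.0 + defect_density * area * defect_coverage / alpha) ** (-alpha)
```

With perfect coverage (Ωc = 1) the two functions agree, and a smaller α (stronger clustering) yields a higher value, matching the discussion above.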

A similar formula can be used for other subcomponents. With these formulas, we can model the yield of the processor based on the yield of its components: core (y’c), instruction cache (y’i), data cache (y’d), and network interface (y’ni). One way of doing this is to simply multiply them and ignore their correlation:

yproc  yc . yi. yd . yni 



Another method replaces the core area in (2) with the total area of the processor, which is the sum of the areas of its components (i.e., Ac + Ai + Ad + Ani):

y’proc = (1 + λproc·(Ac + Ai + Ad + Ani)·Ωproc/α)^(–α)    (4)

Equation (4) assumes that these components are completely correlated and that they share the same defect density (λproc), clustering parameter (α), and defect coverage (Ωproc). Neither of these two equations makes a fair assumption about the correlation of the components. A better and more flexible model considers a correlation factor (κ) for each subset of the components:

y’proc = (1 + xc + xi + xd + xni
        + κc-i·xc·xi + κc-d·xc·xd + κc-ni·xc·xni + κi-d·xi·xd + κi-ni·xi·xni + κd-ni·xd·xni
        + κc-i-d·xc·xi·xd + κc-i-ni·xc·xi·xni + κc-d-ni·xc·xd·xni + κi-d-ni·xi·xd·xni
        + κc-i-d-ni·xc·xi·xd·xni)^(–α)    (5)

where

xc = λc·Ac·Ωc/α    (6)

and xi, xd, and xni can be computed similarly. Note that the true yield can be calculated by removing all defect-coverage parameters from the equations. In (5), κi-j, the correlation factor between components i and j, is a number between zero and one; zero means maximum correlation and one means no correlation. If we substitute all of the correlation factors in (5) with one, it becomes the same as (3); and if we substitute them with zero, it becomes the same as (4). Based on the same approach, we can model the yield of the router. Then, with the yields of the router and the processor, and their correlation, we can calculate the observed block yield, y’block.

The yield of a link (y’link) is the product of the yield of the source output port (y’p), the yield of the point-to-point bus (y’bus), and the yield of the destination input port (y’p):

y’link = y’p · y’bus · y’p    (7)

A bus works when the control and parity wires are fault-free and the number of permanently faulty data wires does not exceed the number of available spare wires:

y’bus = (y’w_ctrl)^kctrl · (y’w_parity)^kparity · Σ (i = mw to nw) C(nw, i)·(y’w)^i·(1 – y’w)^(nw – i)    (8)

In the above equation, mw is the number of data wires, nw is the total number of spare and data wires, and kctrl and kparity are the numbers of control and parity wires, respectively. y’w, y’w_ctrl, and y’w_parity are the observed yields of a data/spare wire, a control wire, and a parity wire, which can be calculated from the failure rate of the corresponding type of wire. Any reliability scheme used for the control or parity wires, such as voting, spacing, or shielding, can be accounted for in our model by reducing the failure rates of those wires accordingly.

After measuring y’block and y’link through the analytical equations, we can measure the observed yield of the chip using Monte Carlo simulation. We start with a network of n blocks connected in a 2-D mesh; then, we inject faults into each block and each link based on the calculated probabilities (i.e., y’block and y’link) and count the maximum number of connected fault-free blocks, cnt. If this number is greater than or equal to the minimum number of expected working cores, m (i.e., cnt ≥ m), then the chip passes the manufacturing test. The connectivity analysis of the Monte Carlo simulation, in addition to the aforementioned parameters, depends on the probability, f’conn, that a faulty block cannot be used as a routing hub. A faulty block cannot be used for routing if and only if the fault occurs in the switch fabric or the control logic of the router. So, the probability of not being able to use a known faulty block as a router is:

f’conn = (1 – y’sw-ctrl) / (1 – y’block)    (9)

where y’sw-ctrl is the observed yield of the switch fabric and the control logic combined, which can be calculated from the yields of the switch fabric and the control logic, and their correlation, through equations similar to (5) and (6). More details about the analytical models are available online [35]. In the following, we use two different examples to illustrate the yield improvement that results from integrating spare cores and wires.

3.1 Example 1: A Nine-core Processor
In this example, we use a nine-core chip similar to that in Fig. 2, with cores connected in a 3-by-3 mesh. Each router connects to any of its neighbors with a 76-bit bus for incoming traffic and another 76-bit bus for outgoing traffic. These 76 bits consist of 64 data bits, 8 parity bits, and 4 control signals. Table 1 lists the values of the various parameters used for this experiment. Using the input parameters of Table 1, the observed yield is 94% for a block and 72% for a link, making the total observed yield of the system just 21%. This means that the probability that at least one of the blocks is faulty, or that some blocks are completely separated from the other blocks due to faulty links, is 79%.

TABLE 1
INPUT VALUES IN EXAMPLE 1.

If we increase the number of cores to 12, 16, or 20 (i.e., changing the mesh organization to 3-by-4, 4-by-4, or 4-by-5), then there is a better chance of finding nine working blocks with fault-free connections, and the yield increases from 21% to 66%, 88%, and 95%, respectively (the leftmost set of columns in Fig. 3). Note that for the 9-out-of-20 case, although more than 100% overhead is introduced, the yield is still not 100%. On the other hand, if we introduce spare wires instead of spare cores, the yield improves to 57% with one spare wire per link and saturates at 58% as more spare wires per link are added (the leftmost column of each set in Fig. 3). These results indicate that the bottleneck is no longer the links, but the cores. The conclusion is that neither the spare-core scheme alone nor the spare-wire scheme alone can effectively address the yield problem. We need spares for both cores and links to maximize the yield improvement. Increasing the number of cores to 12 and adding one spare wire to each link increases the system yield to 99% (the second bar of the one-spare-wire group in Fig. 3), which is significantly better than what we could get without applying both schemes at the same time. We can further increase the system yield to 99.99% if we integrate more spare cores and wires into the system. However, the increased manufacturing cost associated with the additional area overhead would likely exceed the savings from increasing the system yield from 99% to 99.99%. Note that the final objective is not to optimize the system yield, but to improve the total cost, as will be discussed later in this paper.
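The bus-yield expression in (8) is straightforward to evaluate numerically. The sketch below uses our own function names and illustrative per-wire yields, not the values of Table 1:

```python
from math import comb

# Sketch of the bus-yield computation in (8): all control and parity wires
# must work, and at least m_w of the n_w data-plus-spare wires must work.
def bus_yield(y_w, y_w_ctrl, y_w_parity, m_w, n_w, k_ctrl, k_parity):
    surviving_data = sum(comb(n_w, i) * (y_w ** i) * ((1.0 - y_w) ** (n_w - i))
                         for i in range(m_w, n_w + 1))
    return (y_w_ctrl ** k_ctrl) * (y_w_parity ** k_parity) * surviving_data
```

For a 76-bit link as in Example 1 (64 data, 8 parity, 4 control wires) with s spare wires, nw = 64 + s; with no spares (nw = mw = 64) the binomial sum collapses to (y’w)^64.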


Fig. 3. Improving the system yield by integrating spare cores and wires in the system.

3.2 Example 2: An Intel 80-core Processor
The Intel 80-core processor has 80 cores organized in a 10-by-8 mesh. Each router has five ports that connect to the four neighboring routers and to the local core. Each port connects to a 39-bit incoming and a 39-bit outgoing link. Fig. 4 shows the components within each tile of the processor. We assume that the true yield is 97% for each floating point unit (i.e., FPMAC0 and FPMAC1), 100% for the registers (i.e., RF and RIB), and 99.5% for any of the other components (i.e., data memory, instruction memory, and router). Other parameters are assumed to have the same values as those shown in Table 1. We assume that 32 of the 39 bits of each link are for data and the other seven bits are control signals. Based on these assumptions, the yield of each tile is 93% and the yield of each link is 85%, which makes the total yield of the processor 0.15%. If we add 10 spare cores to this chip and make the configuration 10-by-9 instead of the current 10-by-8, the system yield jumps to 90%. Adding 20 spare cores with a 10-by-10 configuration further increases the system yield to 99.9%.

Fig. 4. One tile of the Intel 80-core processor [5].

There is another way to look at this problem: for a chip manufactured with 80 cores, we want to know the probability that the chip can be marketed as a 70+ or 60+ multi-core processor. With the same set of parameter values for defect densities and component yields, 94% of the chips can be sold as 70+ cores, and the remaining 6% can be marketed as 60+ cores. The complete trends are shown in Fig. 5. Without any spare wires in the system, 54% of the chips have at least 74 defect-free and connected cores. By integrating just one spare wire per link, this number increases to 65%. If we need a high yield for 74+ cores, we should have placed more cores in the original design. As this example illustrates, our analysis helps identify the necessary number of spare cores and wires for optimizing either the yield or the total cost.


Fig. 5. Percentage of chips ready to market.

In the next section, we further discuss how to measure the system yield under a quality constraint.

4 MODELING THE QUALITY OF THE NETWORK

To gain better insight into the communication network, we are interested not only in the connectivity of the cores but also in the quality of their connections. Connectivity analysis gives us a black-and-white (zero/one) perspective of the communication infrastructure: a core is either connected to or disconnected from the other cores. Quality analysis gives us an in-depth understanding of the communication system. The first step toward measuring the quality of a network is to define quality. For a group of cores connected in an NoC, we use their Average Minimum Distance (AMD) to define the quality of their connections.

4.1 Average Minimum Distance (AMD)
In a 2-D mesh, the minimum distance from node a to node b is the minimum number of edges (i.e., links) that a packet starting from node a must pass through to reach node b. An i-by-j mesh has a total of 2·[i·(j–1) + j·(i–1)] links; for example, a 3-by-3 mesh, as shown in Fig. 6.a, has 24 links. In a fault-free mesh, all of these links are available; we call such a mesh a complete mesh. In a complete mesh, the minimum distance from node a to node b is |xa–xb| + |ya–yb|, also known as the Manhattan distance between the two nodes, where (xa, ya) and (xb, yb) are the coordinates of nodes a and b. When some of the links are faulty, we treat them as if they do not exist; the mesh is then no longer complete, as shown in Fig. 6.b. In a non-complete mesh, the minimum distance can be computed using the Dijkstra algorithm [15]. Since we are dealing here with an unweighted graph, the minimum distance can be obtained simply by using Breadth-First Search (BFS), which has a lower time complexity than the Dijkstra algorithm. For a graph with V nodes and E edges, the time complexity of the Dijkstra algorithm is O(E + V·logV), while the time complexity of BFS is just O(E + V) [44]. In the case of a 2-D mesh with n nodes, O(E) = O(V) = O(n), so the time complexity is O(n·logn) for Dijkstra and O(n) for BFS. Examples of the minimum route and the minimum distance in complete and non-complete 2-D meshes are shown in Fig. 6. Notice that the minimum route is not necessarily unique, but the minimum distance is. A good routing algorithm tries to route each packet through one of the available minimum routes. Routing protocols such as Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS), which are widely used in computer networks and Internet communications, use Dijkstra-based methods to find the minimum route. In a complete mesh, the total number of minimum routes from a to b is (xd + yd)!/(xd!·yd!), where xd = |xa–xb| and yd = |ya–yb| are the horizontal and vertical distances between the two nodes. In a non-complete mesh, these ideal routes might not be available, and the BFS algorithm finds the shortest available route, which involves some rerouting with respect to an ideal route. For a complete i-by-j mesh, we can prove by induction over i and j that the Average Minimum Distance (AMD) from every node a to every other node b (a ≠ b) is (i + j)/3. We can also prove that, for a complete i-by-i mesh with n nodes (i.e., n = i²), the AMD is of order O(i) = O(√n). Similarly, the average latency of stochastic communication [45] in a mesh is O(√n) (not O(log²n) as implied in [45]). The logarithmic latency of a gossip-based broadcast [46] holds for a complete graph; since a complete mesh has far fewer edges than a complete graph, the logarithmic order does not apply to it.

Fig. 6. Minimum route and minimum distance between two nodes in (a) a complete and (b) a non-complete 2-D mesh.
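The BFS-based minimum-distance and AMD computations described above can be sketched as follows; the adjacency-dict mesh representation and all names are ours:

```python
from collections import deque
from itertools import combinations

# A mesh is an adjacency dict mapping (row, col) -> list of neighbors;
# faulty links are simply absent from the lists.
def bfs_distances(adj, src):
    """Minimum hop counts from src to every reachable node (unweighted BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def amd(adj, nodes):
    """Average Minimum Distance over all pairs of the given nodes."""
    total = sum(bfs_distances(adj, a)[b] for a, b in combinations(nodes, 2))
    pairs = len(nodes) * (len(nodes) - 1) // 2
    return total / pairs

def complete_mesh(i, j):
    """Fault-free i-by-j mesh with all 2*[i*(j-1) + j*(i-1)] links present."""
    adj = {(r, c): [] for r in range(i) for c in range(j)}
    for (r, c) in adj:
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in adj:
                adj[(r, c)].append(nb)
    return adj
```

For a complete 3-by-3 mesh, amd() over all nine nodes returns (3 + 3)/3 = 2, consistent with the closed form above; removing a link can only increase the AMD.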

4.2 Quality Measurement
Fig. 7 shows the steps of our Monte Carlo simulation for NoC yield and quality estimation. The simulator injects faults into each component and wire based on the analytical values, counts the maximum number of connected fault-free cores, cnt, and checks whether it meets the minimum requirement, m (i.e., Steps 1-5). Here, m is the required number of working processors in the chip for shipment. If there exist at least m connected fault-free cores, we then measure the quality of the connections. Instead of simply measuring the quality of the connections among the cnt connected cores, we need to measure the quality of the connections among the m cores, out of the cnt cores, whose connections are of the highest quality. For a chip with 16 cores (i.e., n = 16), for example, we require 9 of them to be defect-free (i.e., m = 9) for shipment. The other seven cores are spare cores that can replace a faulty core after either manufacturing testing or in-field testing [14]. Suppose two cores fail the manufacturing test and, among the other 14 cores, just 12 have fault-free connections (i.e., cnt = 12). Since cnt is greater than m, we have a sufficient number of connected fault-free cores; therefore, the mesh is a connected mesh. The next step is to determine how well these cores are connected and whether the quality of those connections exceeds a required threshold. Since the chip will be sold as a nine-core chip, we should select the 9 out of the 12 working cores that result in the highest connection quality. Steps 6 and 7 of the simulation find this group and report its associated AMD. For an m-out-of-n chip (i.e., an n-core chip that can be shipped only if it has at least m fault-free connected cores after manufacturing testing), the quality is zero if the number of connected, fault-free cores is less than m (i.e., cnt < m); otherwise, the quality is defined as:

Quality = Ideal-AMD(m) / Measured-AMD(m)    (10)

Monte Carlo Simulation
1. Start with n fault-free blocks connected in a 2-D mesh.
2. With probability 1 – y’block, inject a fault into each block.
3. With probability f’conn, disconnect all links of a faulty block.
4. Disconnect every link of the mesh with probability 1 – y’link.
5. Traverse the mesh and count the maximum number of connected fault-free blocks. Save this number as cnt. If cnt is greater than or equal to m, it is a connected mesh.
6. If cnt is greater than m, find a group of m nodes out of these cnt nodes such that they have the strongest connections among all combinations of m out of cnt.
7. In the selected group of m nodes, for every pair of nodes, measure the minimum distance using BFS and then calculate the Average Minimum Distance (AMD) for the whole group.
8. Calculate the quality by dividing the ideal AMD (the minimum achievable AMD) by the measured AMD.
9. If the quality is greater than the minimum requirement, q, the chip passes the manufacturing test; otherwise it is a failed chip, and the simulation is complete.
10. For a shipped chip, within the warranty time, inject a fault into every component and every wire based on the calculated probability of failure.
11. Check whether at least m fault-free connected cores exist in the field; if so, for the best m fault-free connected cores, calculate the quality (similar to Steps 5-8). Otherwise, the chip has failed and the service cost applies.
12. If the in-field quality falls below the in-field minimum quality requirement, q-in-field, the chip fails. (This constraint can be removed by setting q-in-field = 0.)
13. Repeat Steps 1-12 many times and return the average values for the quality of the mesh, the manufacturing yield, and the in-field system failure probability under the quality constraint.

Fig. 7. Monte Carlo simulation for quality analysis.
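Steps 1-5 of Fig. 7 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mesh dimensions, yields, and trial count are illustrative placeholders, and `one_trial` is a hypothetical helper name.

```python
import random
from collections import deque

def one_trial(rows, cols, y_block, y_link, f_conn, rng):
    """One trial of steps 1-5: inject block and link faults into a
    rows-by-cols mesh and return cnt, the size of the largest connected
    group of fault-free blocks."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]

    def neighbors(r, c):
        for r2, c2 in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= r2 < rows and 0 <= c2 < cols:
                yield (r2, c2)

    # Step 2: each block fails with probability 1 - y'_block.
    faulty = {v: rng.random() < 1.0 - y_block for v in nodes}
    # Undirected mesh links, initially all up.
    up = dict.fromkeys({frozenset((u, w)) for u in nodes
                        for w in neighbors(*u)}, True)
    # Step 3: with probability f'_conn, a faulty block takes down all its links.
    for v in nodes:
        if faulty[v] and rng.random() < f_conn:
            for w in neighbors(*v):
                up[frozenset((v, w))] = False
    # Step 4: every surviving link fails with probability 1 - y'_link.
    for e in up:
        if up[e] and rng.random() < 1.0 - y_link:
            up[e] = False
    # Step 5: BFS over fault-free blocks through surviving links.
    best, seen = 0, set()
    for s in nodes:
        if faulty[s] or s in seen:
            continue
        size, queue = 0, deque([s])
        seen.add(s)
        while queue:
            v = queue.popleft()
            size += 1
            for w in neighbors(*v):
                if not faulty[w] and w not in seen and up[frozenset((v, w))]:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best

# Manufacturing yield of a 9-out-of-16 chip, ignoring the quality constraint:
rng = random.Random(1)
trials = 2000
yield_est = sum(one_trial(4, 4, 0.99, 0.99, 0.5, rng) >= 9
                for _ in range(trials)) / trials
```

Averaging the pass indicator over many trials, as in step 13, gives the connectivity-only yield estimate; the quality constraint of steps 6-9 is layered on top of this.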

In (10), Ideal AMD(m) is the minimum achievable AMD for any arrangement of m cores, and Measured AMD(m) is the number we actually measure through the simulation, which is always equal to or greater than the ideal AMD. Therefore, 0 ≤ Quality ≤ 1. For a given m, there is just one ideal AMD; however, there might be several arrangements of nodes that achieve that ideal AMD. Table 2 shows the ideal AMD values associated with 1 ≤ m ≤ 20. Fig. 8 shows some examples of core arrangements that may or may not result in the ideal AMD. It is interesting to observe that the best arrangement for 16 cores is not a 4-by-4 square; nor is 3-by-4 the best arrangement for 12 cores.

TABLE 2
IDEAL AMD VALUES ASSOCIATED WITH 1 ≤ m ≤ 20

m:             1     2     3     4     5     6     7     8     9     10
Ideal AMD(m):  0     1     1.33  1.33  1.6   1.67  1.81  1.93  2     2.13

m:             11    12    13    14    15    16    17    18    19    20
Ideal AMD(m):  2.25  2.30  2.41  2.49  2.59  2.65  2.75  2.83  2.92  2.96
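The AMD of step 7 and the quality of (10) can be reproduced for a concrete arrangement. This sketch assumes pair distances are measured by BFS within the selected group itself; the 2-by-2 square below reproduces the ideal AMD of 1.33 for m = 4 in Table 2.

```python
from collections import deque
from itertools import combinations

def amd(cores):
    """Average Minimum Distance (step 7): mean over all pairs of cores of
    the shortest hop distance, found by BFS restricted to the group."""
    cores = set(cores)

    def bfs(src):
        dist, queue = {src: 0}, deque([src])
        while queue:
            r, c = queue.popleft()
            for v in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if v in cores and v not in dist:
                    dist[v] = dist[(r, c)] + 1
                    queue.append(v)
        return dist

    pairs = list(combinations(cores, 2))
    return sum(bfs(u)[v] for u, v in pairs) / len(pairs)

# A 2-by-2 square of 4 cores: four pairs at distance 1 and two diagonal
# pairs at distance 2, so AMD = (4*1 + 2*2)/6 = 1.33, matching Table 2.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
quality = 1.33 / amd(square)   # Quality = Ideal AMD(m) / Measured AMD(m)
```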

Fig. 9 shows the improvement in the quality of a 9-out-of-n chip by using spare cores and spare wires. This is the result of executing steps 1-8 of the Monte Carlo simulation for 20,000 iterations. Table 3 shows the values of the input parameters used in this experiment. For a 9-out-of-9 chip (i.e., no spare cores), the quality cannot even reach 50%, no matter how many spare wires are added to each link, because some spare cores are needed to replace faulty cores. When we increase the number of cores to 12 (i.e., 3-by-4), then 16 (i.e., 4-by-4), or even 20 (i.e., 4-by-5), the quality curve shifts up. The number of spare wires also makes a significant contribution to the chip's quality: adding the first spare wire to each link improves the quality drastically, but for the second and third spare wires the amount of improvement declines. Depending on where a design stands on the trend curve, we can decide whether extra cores and/or extra wires would best improve the quality. For example, in this experiment, for a 9-out-of-9 system with one spare wire per link (point A in Fig. 9), adding three more cores improves the quality more than adding one more spare wire to every link. However, in a 9-out-of-12 system with one spare wire (point B in Fig. 9), the quality improvement of adding one more spare wire per link is more significant than that of adding four more cores.

4.3 Observed Yield under a Quality Constraint
With the ability to measure the connection quality, we can now place a constraint on quality and require each chip to meet this requirement to qualify for shipment. A chip passes the test, and thus can be shipped, if the number of connected fault-free


processors is no less than m (i.e., cnt ≥ m) and the quality is no less than a given threshold, q (i.e., quality ≥ q). This results in the observed yield of the system, which is computed in Step 9 of the simulation.

Fig. 10 shows the trends of the observed yield with respect to the number of spare wires and the minimum quality requirement. When the minimum quality requirement q increases, the yield becomes lower and more spare wires are required to achieve a high yield. This change becomes significant for a stricter constraint such as q = 0.9, which means that the measured AMD of the best fault-free connected cores should not be more than 10% greater than the ideal AMD.


Fig. 10. The observed yield vs. the number of spare wires for different quality requirements in a 9-out-of-12 system.

Fig. 8. Some examples of the core arrangements that may or may not result in an ideal AMD.


Fig. 9. Chip quality improves significantly by integrating more spare cores and wires.

TABLE 3
INPUT VALUES USED IN THE EXPERIMENTS OF SECTION 4

Parameter                                                          : Value
Number of required working cores (m)                               : 9
Number of data wires per link (data bus width)                     : 64
Number of control wires per link                                   : 4
Number of parity wires per link                                    : 6
Defect coverage of manufacturing test for each component and wire  : 0.95
True yield of each component                                       : 0.99
True yield of each wire                                            : 0.99
True yield of the shared memory                                    : 1
Correlation among any two components                               : 0.5
Clustering parameter                                               : 2


Notice that the quality of the connections also translates into chip performance: a higher-quality chip has less communication overhead (in terms of time) among cores and thus operates faster. In fact, quality is a mixture of yield and performance metrics. Fig. 11 gives another perspective on the tradeoffs among yield, quality, and the number of spare wires. This figure shows the exponential decrease in yield as the minimum quality requirement increases. The results indicate that, for a higher quality threshold, more spare wires are required to maintain the yield.

Fig. 11. The observed yield vs. required quality for different numbers of spare wires in a 9-out-of-12 chip.

All these experimental results were produced by executing steps 1-9 of the Monte Carlo simulation. The rest of the simulation steps (i.e., Steps 10-12) simulate the chips in the field by injecting in-field faults into different instances of cores, wires, and other components (i.e., network interface, data cache, etc.). It also estimates the connectivity and quality of the system in the field. These simulation results will be used later in the process of calculating the service cost.

5 COST ANALYSIS

TABLE 4
INPUT VALUES USED IN THE EXPERIMENTS OF SECTIONS 5.1 TO 5.4

The total cost of a chip is the sum of its manufacturing and service costs. We discuss each of these cost aspects separately and present some experimental results in the next two sub-sections.

5.1 Manufacturing Cost
We model the manufacturing cost as:

C_Man = [n·(C_K + C_test) + w·C_link + C_L2 + C_BI] / y'_soc     (11)



In (11), n is the number of blocks (tiles), C_K is the manufacturing cost of a block, and C_test is the total cost of the testing methods applied to all of the components of a block. Similarly, w is the number of links in the mesh, and C_link is the manufacturing cost of a link (including parity and spare wires, Hamming-code logic, and wire-replacement circuitry). C_L2 is the manufacturing cost of the shared level-2 cache, C_BI is the cost of the burn-in process, and y'_soc is the observed yield of the chip. Integrating spare cores or spare wires into the system increases both the numerator and the denominator of (11); we are interested in identifying the numbers of spares for which the cost is minimal. Fig. 12 shows the manufacturing cost of the exemplary chip described in Example 1 for different numbers of spare wires and spare cores. Table 4 shows the input parameters used in this experiment. We assume that burn-in is not applied in this experiment (i.e., C_BI = 0). We use a normalization base to normalize the cost values in this experiment and in all experiments in the rest of the paper. The normalization base is the cost of a chip without any spare or testing overhead and with a perfect yield.
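Equation (11) can be sketched directly; the cost constants below are illustrative placeholders standing in for C_K, C_test, C_link, C_L2, and C_BI, not the Table 4 values.

```python
def manufacturing_cost(n, w, y_soc, c_block=1.0, c_test=0.1,
                       c_link=0.05, c_l2=1.0, c_burn_in=0.0):
    """Eq. (11): per-good-chip manufacturing cost. c_block, c_test, c_link,
    c_l2, and c_burn_in stand in for C_K, C_test, C_link, C_L2, and C_BI;
    the default values here are illustrative placeholders."""
    return (n * (c_block + c_test) + w * c_link + c_l2 + c_burn_in) / y_soc

# A 4-by-4 mesh has w = 2*4*3 = 24 links. Dividing by the observed yield
# y'_soc spreads the cost of failed dies over the good ones, so a higher
# yield directly lowers the effective per-chip cost.
cost_low_yield = manufacturing_cost(n=16, w=24, y_soc=0.5)
cost_high_yield = manufacturing_cost(n=16, w=24, y_soc=0.9)
```

Adding spares raises n and the per-link cost (numerator) while raising y'_soc (denominator), which is exactly the tradeoff minimized in Fig. 12.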


5.2 Service Cost
We model the service cost because a chip could still become defective after shipment, during the warranty period, either due to a test escape of an imperfect manufacturing test process or due to a new in-field failure (such as infant mortality or early wearout). A defective chip deployed in the field incurs a large service cost. If we make the system recoverable from in-field failures by incorporating spare cores and wires and an automatic mechanism to detect, locate, and reconfigure the chip in the field, the service cost can be minimized, but at the expense of a higher manufacturing cost resulting from the hardware overhead required for these features. We use the Weibull distribution function [36] (as shown in the bottom of Fig. 1) to model the field failure rate of each component and each wire. Based on this model, we can analyze the probability of a system failure in the presence of spares and in-field defects. Fig. 13 illustrates the total cost of the chip specified by the input parameters of Table 4 in Section 5.1. For this experiment, the best configuration is 9-out-of-16 with one spare wire per link.
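The Weibull lifetime model referenced above can be sketched as follows; the scale and shape values are illustrative, not fitted to any real failure data.

```python
import math

def weibull_fail_prob(t, scale, shape):
    """Probability that a part has failed by time t under a Weibull
    lifetime model, F(t) = 1 - exp(-(t/scale)**shape). A shape parameter
    below 1 gives the decreasing failure rate of the bathtub curve's
    infant-mortality region."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# Illustrative numbers only: chance that one component fails within a
# two-year warranty window under an infant-mortality-dominated model.
p_fail = weibull_fail_prob(t=2.0, scale=100.0, shape=0.5)
```

A per-component probability of this form is what step 10 of the Monte Carlo simulation uses to inject in-field faults.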

Fig. 12. Improving the manufacturing cost by integrating spare cores and wires in the system.


As illustrated in Fig. 12, the lowest possible manufacturing cost, in this example, is achieved when three spare blocks and one spare wire per link are included in the chip. In comparison with a chip without any spares, the corresponding cost of this optimal scheme is 69% lower.


Fig. 13. Reducing the total cost by integrating spare cores and wires.


The manufacturing cost of the chip is compared to its service cost in Fig. 14 for the case of one spare wire per link. As shown in this figure, the service cost drops more sharply than the manufacturing cost and becomes very close to zero as we increase the number of spare cores, while the manufacturing cost of a shipped chip drops to a minimal point and then starts to increase linearly.

Fig. 14. Improving the manufacturing and service costs by integrating spare cores in the system.

5.3 Burn-in Elimination
The burn-in process fast-forwards the life cycle of a chip before shipment to avoid the service costs typically associated with the infant mortality period. As with any manufacturing test procedure, burn-in adds to the manufacturing expense but reduces the service cost. If, however, we have some spare cores in the system, it could be more cost efficient to eliminate the burn-in process and let the in-field recovery procedure use the available spare cores to replace the faulty ones. Fig. 15 compares the total cost of the chip specified by Table 4, with and without burn-in. In this experiment, we assume that the cost of burn-in is 20% of the normalization base. As Fig. 15 shows, a 9-out-of-9 or 9-out-of-12 system has a lower cost with burn-in applied, but a 9-out-of-16 or 9-out-of-20 system has a lower cost without burn-in. This experiment shows that the burn-in process can be eliminated when there are sufficient spares in the system and a high-quality recovery mechanism is available in the field. For all of the later experiments in this paper, we assume no burn-in is applied; thus, the infant mortality period of the bathtub curve is part of the in-field period for our analysis.

Fig. 15. Comparing the total cost with and without burn-in.

5.4 Manufacturing vs. In-field Testing
In this section, we investigate the effects of the defect coverages of manufacturing testing and in-field testing on the total cost of a spare-enhanced chip. In Fig. 16, we plot the total cost versus the number of spare cores for different manufacturing-testing defect coverages; the in-field defect coverage is fixed at 99% in this experiment. If there are no spares in the system, and thus no in-field recovery, the defect coverage of manufacturing testing is a critical factor in determining the total cost. With some spare cores added to the chip, the curves for the different defect coverages become very close to each other. These results indicate that, with an in-field recovery capability using spares, the criticality of the manufacturing defect coverage to the overall cost is significantly reduced. In other words, we can employ a simpler, faster, and cheaper solution for manufacturing testing to reduce the cost. For the next experiment, we fix the defect coverage of manufacturing testing at 99% and vary the in-field defect coverage to evaluate its effect on the total cost. Fig. 17 shows that, with some spare cores included in the chip, the in-field defect coverage is a significant factor in the total cost. To focus on the effect of the spare-core count on the cost, in both experiments we alleviate the effect of spare wires by assuming that each wire has a perfect yield and that there are four spare wires per link to recover from in-field failures on wires. From these two experiments we conclude that, in a spare-enhanced chip, the in-field test quality is much more important than the manufacturing test quality. An in-field test solution, either on-line checking or off-line BIST, with a high defect coverage is necessary.

Fig. 16. Total cost vs. the number of spares for different manufacturing defect coverage values.

Fig. 17. Total cost vs. the number of spares for different in-field defect coverage values.


5.5 Total Cost Under a Quality Constraint

In the previous sections, we considered the chip cost without imposing any quality constraint. The total chip cost increases as the manufacturing and in-field quality constraints become more stringent, which in turn requires more spare wires to be integrated into the system. For a 9-out-of-12 chip, the trend of total cost versus the required quality is shown in Fig. 18. Table 5 shows the input values used in this experiment and in the rest of the paper. In the experiment shown in Fig. 18, the in-field quality requirement is the same as the manufacturing quality constraint. In this case, the total cost increases as the quality constraint increases. In general, the in-field quality requirement is not necessarily the same as the manufacturing quality requirement. A special case is that there is only a manufacturing quality requirement but no additional in-field requirement on quality: it is acceptable as long as the chip deployed in the field has m connected fault-free cores. Fig. 19 shows the total cost when there is no in-field quality requirement (i.e., q-in-field = 0 in Step 12 of the Monte Carlo simulation). In contrast to Fig. 18, when the manufacturing quality requirement increases in Fig. 19, the cost decreases to a minimal point before it increases. This is because imposing a greater quality requirement in manufacturing testing reduces the probability that the cores become disconnected in the field in the presence of in-field failures; therefore, the service cost is reduced, although the manufacturing cost increases due to the lower yield.

Fig. 18. Total cost vs. required quality for different numbers of spare wires in a 9-out-of-12 chip. In this experiment, the in-field quality requirement is the same as the manufacturing quality constraint.

Fig. 19. Total cost vs. required quality for different numbers of spare cores. In this experiment, there is a manufacturing quality requirement but no additional quality requirement in the field.

TABLE 5
INPUT VALUES USED IN THE EXPERIMENTS OF THIS SECTION AND FOR THE REMAINDER OF THE PAPER

6 NON-UNIFORM SPARE DISTRIBUTION

Thus far, we have considered the effect of uniformly distributed spare wires on the reliability, quality, and cost of an NoC. This uniformly distributed spare scheme assumes that all links in the mesh are equally important to the system reliability. In the following, we measure the contribution of each link to the overall reliability, which is equivalent to the probability that the link is used in a random packet transmission and depends on the routing algorithm. In XY-routing (i.e., routing first in the x direction and then in the y direction), there is only one route from a source to a destination (e.g., Fig. 20.a), but with min-route routing (routing through any route of minimum length), there are usually many routes from a source to a destination. Fig. 20 shows an example of a route from a source to a destination that is three hops away in each direction. In this case, for a min-route, there are (3+3)!/(3!3!) = 20 routes from the source to the destination. Fig. 20.b displays the number of routes that pass through each link. If we divide these numbers by the total number of routes (i.e., 20 in this example), we obtain the probability that each link is used for a packet transmission from the source to the destination (Fig. 20.c). For example, 6 routes pass through link AB, so there is a 30% chance that a min-route algorithm uses this link to send a packet from the source to the destination.

Fig. 20. Number of routes passing through each link, and the probability of each link being used in an XY-route and min-route of one packet transmitted from node Src to Dest.
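The route counting of Fig. 20 can be reproduced with a short lattice-path computation. Which grid edge corresponds to link AB is our assumption here: an inner link one hop diagonal from the source reproduces the 6-out-of-20 count quoted in the text.

```python
from math import comb

def routes_through_link(src, dst, link):
    """Number of minimum-length routes from src to dst that use the
    directed link u->v: (#min-routes src->u) * (#min-routes v->dst).
    Sketch for the monotone case of Fig. 20, where dst lies up-and-right
    of src."""
    def paths(a, b):
        dx, dy = b[0] - a[0], b[1] - a[1]
        return comb(dx + dy, dx) if dx >= 0 and dy >= 0 else 0
    u, v = link
    return paths(src, u) * paths(v, dst)

src, dst = (0, 0), (3, 3)
total = comb(3 + 3, 3)            # (3+3)!/(3!3!) = 20 min-routes
# Taking link AB to be the inner link (1,1)->(2,1) (our assumption):
# 2 routes reach (1,1) and 3 leave (2,1), so 2*3 = 6 of the 20 routes
# use it -- a 30% usage probability, as in the text.
p_ab = routes_through_link(src, dst, ((1, 1), (2, 1))) / total
```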


Fig. 21. The average usage probability of links in a random packet transmission for a 4-by-4 and a 5-by-5 complete mesh.
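The per-link averages of Fig. 21 follow from averaging the same path-count ratio over all source-destination pairs. This sketch assumes a min-route chosen uniformly at random among all shortest paths; it reproduces the qualitative inner-versus-outer trend rather than the exact published numbers.

```python
from itertools import product
from math import comb

def manh(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def n_routes(a, b):
    """Number of minimum-length routes between two nodes of a full mesh."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return comb(dx + dy, dx)

def usage(src, dst, u, v):
    """Probability that a uniformly random min-route from src to dst uses
    the directed link u->v."""
    if manh(src, u) + 1 + manh(v, dst) != manh(src, dst):
        return 0.0
    return n_routes(src, u) * n_routes(v, dst) / n_routes(src, dst)

rows, cols = 4, 4
nodes = list(product(range(rows), range(cols)))
pairs = [(s, d) for s in nodes for d in nodes if s != d]

def avg_link_usage(u, v):
    """Average usage of the bidirectional link {u, v} over all ordered
    source-destination pairs (the quantity plotted in Fig. 21)."""
    return sum(usage(s, d, u, v) + usage(s, d, v, u)
               for s, d in pairs) / len(pairs)

inner = avg_link_usage((1, 1), (2, 1))   # an inner link of the 4-by-4 mesh
outer = avg_link_usage((0, 0), (1, 0))   # a boundary link: used less often
```

A useful sanity check on `usage` is that, for any fixed pair, the usage probabilities summed over all directed links equal the Manhattan distance, since every min-route has exactly that many hops.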


Fig. 22. The average usage probability of the horizontal links of an 8-by-8 mesh in a random packet transmission for the XY-route and min-route approaches.

Fig. 23. Two examples of non-uniform distributions of spare wires.


Fig. 24. Average Minimum Distance (AMD) and quality associated with different spare-wire patterns.


For each link of the mesh, if we compute the usage probability of the link for every pair of source and destination nodes and average them, we obtain the probability that the link is used in a packet transmission from a random source to a random destination. Fig. 21 shows the average usage probability for each link of a 4-by-4 and of a 5-by-5 mesh for a min-route algorithm. On a packet transmission from a random source to a random destination, for example, there is an 8.38% chance that the packet goes through link AB. Due to the symmetry of the problem, the values of the vertical edges are the same as the values of the corresponding horizontal edges; for example, there is an equal chance that a packet goes through link AB or CD. In this experiment, we consider each link to be bidirectional. Note that, if we break each link into two unidirectional links, because of the symmetry of random packet transmission, the chance of using each of the two unidirectional links is half the usage probability of the bidirectional link. Fig. 22 shows the results of similar experiments on an 8-by-8 mesh in a 3-D view. This figure plots the values of the horizontal links of the mesh; the values of the vertical links can be obtained by transposing the values of the horizontal links. Fig. 22.a shows an XY-route, and Fig. 22.b shows a min-route. As this figure illustrates, for the min-route approach, the closer a link is to the center of the mesh, the more important it is in a packet transmission, and so the more protection it deserves. This motivates the idea of allocating more spares to the inner links than to the outer links. Fig. 23 shows two non-uniform spare-wire distribution patterns, named P-1-2 and P-2-3. In these patterns, we allocate one more spare wire to those links whose usage probability is more than 0.1, based on the experimental data of Fig. 21. Fig. 24 compares the AMD and the connection quality of meshes with uniform (e.g., P-1: one spare wire per link) and non-uniform (e.g., P-1-2) patterns of redundancy. Similarly, Fig. 25 illustrates the effect of the uniform and non-uniform patterns on the yield of a 9-out-of-16 chip subject to different quality constraints. From these two figures, we observe several remarkable jumps in the AMD, quality, and system yield when we switch from a uniform pattern to the next non-uniform pattern. However, there is also a cost associated with the extra spares in the non-uniform patterns, so, for a more meaningful comparison, we need to compare the total cost of the chips that use a uniform or non-uniform distribution. Fig. 26 shows this comparison and, as we can see, in many cases the cost is minimal for a non-uniform pattern.
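Given measured per-link usage probabilities, the P-k-(k+1) allocation described above reduces to a threshold rule. The usage values and link names below are illustrative placeholders, not the measurements of Fig. 21.

```python
# Illustrative usage probabilities for four links (hypothetical names and
# values -- not the measurements of Fig. 21):
link_usage = {"inner-1": 0.12, "inner-2": 0.11,
              "edge-1": 0.06, "corner-1": 0.03}

def spare_pattern(usage_by_link, base, threshold=0.1):
    """Non-uniform allocation sketch: every link gets `base` spare wires,
    and links whose average usage probability exceeds `threshold` get one
    extra -- producing a P-base-(base+1) style pattern."""
    return {link: base + (1 if p > threshold else 0)
            for link, p in usage_by_link.items()}

p_1_2 = spare_pattern(link_usage, base=1)   # a P-1-2-style pattern
```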


Fig. 25. Yield associated with different spare-wire patterns under a quality constraint. In many cases, changing from a uniform pattern to the next non-uniform pattern improves the yield significantly.



Fig. 26. Total cost associated with different spare-wire patterns subject to a quality constraint. In many cases, the total cost is minimal for a non-uniform pattern.

7 CONCLUSION

In this paper, we model the yield and cost of an NoC-based multi-core chip and show that it can be very beneficial to add extra cores and wires to enable repair either in the manufacturing line or in the field after shipment. With this in-field recovery built into the system, one can even eliminate the burn-in process and loosen manufacturing testing requirements to further reduce the total product cost. Furthermore, we propose a metric for the communication quality of an NoC and show that quality analysis reveals information about the system that is hidden in connectivity analysis. The communication quality represents the communication speed and can be used, along with frequency binning, as a metric to price the chip in the market. Using the proposed quality definition, one can optimize the overall yield as well as the cost of the chip under a quality constraint. We also propose to distribute the spare wires based on the contribution of each link to the overall network quality, which leads to a non-uniform distribution of spare wires.

ACKNOWLEDGMENT The authors acknowledge the support of the National Science Foundation Center for Domain-Specific Computing (CDSC) and the Gigascale Systems Research Center (GSRC), one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.

REFERENCES
[1] L. Hammond, B.A. Nayfeh, and K. Olukotun, “A Single-Chip Multiprocessor,” IEEE Computer, vol. 30, no. 9, pp. 79-85, 1997.
[2] M. Gschwind, H.P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki, “Synergistic Processing in Cell’s Multicore Architecture,” IEEE Micro, vol. 26, no. 2, pp. 10-24, 2006.
[3] R. Kumar, D.M. Tullsen, N.P. Jouppi, and P. Ranganathan, “Heterogeneous Chip Multiprocessors,” IEEE Computer, vol. 38, no. 11, pp. 32-38, 2005.
[4] D. Pham et al., “The Design and Implementation of a First-Generation CELL Processor – A Multi-Core SoC,” Proc. IEEE International Conference on Integrated Circuit and Technology, pp. 49-52, 2005.
[5] S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” Proc. IEEE Int. Solid-State Circuits Conference, pp. 98-99, 589, 2007.
[6] T. Hsieh, K. Lee, and M.A. Breuer, “An Error-Oriented Test Methodology to Improve Yield with Error-Tolerance,” Proc. IEEE VLSI Test Symposium, pp. 130-135, 2006.
[7] International Technology Roadmap for Semiconductors, http://www.itrs.net/Links/2006Update/2006UpdateFinal.htm, 2006.
[8] I. Koren and Z. Koren, “Defect Tolerance in VLSI Circuits: Techniques and Yield Analysis,” Proceedings of the IEEE, vol. 86, no. 9, pp. 1819-1837, 1998.
[9] R.T. Smith, J.D. Chlipala, J.F.M. Bindels, R.G. Nelson, F.H. Fischer, and T.F. Mantz, “Laser Programmable Redundancy and Yield Improvement in a 64K DRAM,” IEEE Journal of Solid-State Circuits, vol. 16, no. 5, pp. 506-514, 1981.
[10] I. Kim, Y. Zorian, G. Komoriya, H. Pham, F.P. Higgins, and J.L. Lewandowski, “Built in Self Repair for Embedded High Density SRAM,” Proc. IEEE Int. Test Conference, pp. 1112-1119, 1998.
[11] F. Hatori et al., “Introducing Redundancy in Field Programmable Gate Arrays,” Proc. IEEE Custom Integrated Circuits Conference, pp. 7.1.1-7.1.4, 1993.
[12] J.H. Kim and S.M. Reddy, “On the Design of Fault-Tolerant Two-Dimensional Systolic Arrays for Yield Enhancement,” IEEE Transactions on Computers, vol. 38, no. 4, pp. 515-525, 1989.
[13] S. Makar, T. Altinis, N. Patkar, and J. Wu, “Testing of Vega2, a Chip Multi-Processor with Spare Processors,” Proc. IEEE Int. Test Conference, pp. 1-10, 2007.
[14] S. Shamshiri, P. Lisherness, S.-J. Pan, and K.-T. (Tim) Cheng, “A Cost Analysis Framework for Multi-Core Systems with Spares,” Proc. IEEE Int. Test Conference, pp. 1-8, 2008.
[15] S. Shamshiri and K.-T. (Tim) Cheng, “Yield and Cost Analysis of a Reliable NoC,” Proc. IEEE VLSI Test Symposium, pp. 173-178, 2009.
[16] G. De Micheli and L. Benini, Networks on Chips. Morgan Kaufmann Publishers, 2006.
[17] T. Lehtonen, P. Liljeberg, and J. Plosila, “Self-Timed NoC Links Using Combinations of Fault Tolerance Methods,” Proc. IEEE Design Automation and Test in Europe, 2007.
[18] M.C. Neuenhahn, D. Lemmer, H. Blume, and T.G. Noll, “Quantitative Cost Modeling of Error Protection for Network-on-Chip,” Proc. ProRISC Workshop, pp. 331-337, 2007.
[19] Y. Jiao, Y. Yang, M. He, M. Yang, and Y. Jiang, “Multi-Path Routing for Mesh/Torus-Based NoCs,” Proc. IEEE Int. Conference on Information Technology (ITNG), pp. 734-742, 2007.
[20] M. Gao, H.-M. Chang, P. Lisherness, and K.-T. (Tim) Cheng, “Time-Multiplexed Online Checking: A Feasibility Study,” Proc. IEEE Asian Test Symposium, pp. 371-376, 2008.
[21] A. Krstic, W.-C. Lai, L. Chen, K.-T. (Tim) Cheng, and S. Dey, “Embedded Software-Based Self-Testing for SoC Design,” Proc. IEEE Design Automation Conference, pp. 335-360, 2002.
[22] L. Chen and S. Dey, “Software-Based Self-Testing Methodology for Processor Cores,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 3, pp. 369-380, 2001.
[23] N. Kranitis, G. Xenoulis, A. Paschalis, D. Gizopoulos, and Y. Zorian, “Application and Analysis of RT-Level Software-Based Self-Testing for Embedded Processor Cores,” Proc. IEEE Int. Test Conference, pp. 431-440, 2003.
[24] M. Nicolaidis and Y. Zorian, “On-Line Testing for VLSI—A Compendium of Approaches,” Journal of Electronic Testing, vol. 12, no. 1-2, pp. 7-20, 1998.
[25] H. Al-Asaad, B.T. Murray, and J.P. Hayes, “Online BIST for Embedded Systems,” IEEE Design & Test of Computers, vol. 15, no. 4, pp. 17-24, 1998.
[26] M.A. Breuer and A.A. Ismaeel, “Roving Emulation as a Fault Detection Mechanism,” IEEE Transactions on Computers, vol. 35, no. 11, pp. 933-939, 1986.
[27] A.W. Righter, C.F. Hawkins, J.M. Soden, and P.C. Maxwell, “CMOS IC Reliability Indicators and Burn-in Economics,” Proc. IEEE Int. Test Conference, pp. 194-203, 1998.
[28] T.R. Henry and T. Soo, “Burn-in Elimination of a High Volume Microprocessor Using IDDQ,” Proc. IEEE Int. Test Conference, pp. 242-249, 1996.
[29] R. Kawahara, O. Nakayama, and T. Kurasawa, “The Effectiveness of IDDQ and High Voltage Stress for Burn-in Elimination [CMOS Production],” Proc. IEEE Int. Workshop on IDDQ Testing, pp. 9-13, 1996.
[30] M. Sachdev, “Deep Sub-Micron IDDQ Testing: Issues and Solutions,” Proc. European Design and Test Conference, 1997.
[31] K. Roy, T.M. Mak, and K.-T. (Tim) Cheng, “Test Consideration for Nanometer-Scale CMOS Circuits,” IEEE Design & Test of Computers, vol. 23, no. 2, pp. 128-136, 2006.
[32] K.-T. Cheng, S. Dey, M. Rodgers, and K. Roy, “Test Challenges for Deep Sub-Micron Technologies,” Proc. Design Automation Conference, pp. 142-149, 2000.
[33] J.T. De Sousa and V.D. Agrawal, “Reducing the Complexity of Defect Level Modeling Using the Clustering Effect,” Proc. IEEE Design, Automation and Test in Europe, pp. 640-644, 2000.
[34] W. Kuo and T. Kim, “An Overview of Manufacturing Yield and Reliability Modeling for Semiconductor Products,” Proceedings of the IEEE, vol. 87, no. 8, pp. 1329-1344, 1999.
[35] S. Shamshiri and K.-T. (Tim) Cheng, “Yield and Cost Analysis for Spare-Enhanced Network-on-Chips,” UCSB Technical Report, http://cadlab.ece.ucsb.edu, 2008.
[36] J.M. Carulli and T.J. Anderson, “The Impact of Multiple Failure Modes on Estimating Product Field Reliability,” IEEE Design & Test of Computers, vol. 23, no. 2, pp. 118-126, 2006.
[37] V.V. Kumar and J. Lach, “IC Modeling for Yield-Aware Design with Variable Defect Rates,” Proc. IEEE Reliability and Maintainability Symposium, pp. 489-495, 2005.
[38] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S.W. Keckler, and D. Burger, “On-Chip Interconnection Networks of the TRIPS Chip,” IEEE Micro, vol. 27, no. 5, pp. 41-50, 2007.
[39] S. Shamshiri and K.-T. (Tim) Cheng, “Error-Locality-Aware Linear Coding to Correct Multi-bit Upsets in SRAMs,” Proc. IEEE International Test Conference, 2010.
[40] D. Rossi, P. Angelini, and C. Metra, “Configurable Error Control Scheme for NoC Signal Integrity,” Proc. IEEE International On-Line Testing Symposium, 2007.
[41] Q. Yu and P. Ampadu, “A Flexible Parallel Simulator for Networks-on-Chip with Error Control,” IEEE Transactions on CAD, vol. 29, no. 1, 2010.
[42] J.A. Cunningham, “The Use and Evaluation of Yield Models in Integrated Circuit Manufacturing,” IEEE Transactions on Semiconductor Manufacturing, vol. 3, no. 2, pp. 60-71, 1990.
[43] I. Koren, Z. Koren, and C.H. Stapper, “A Unified Negative-Binomial Distribution for Yield Analysis of Defect-Tolerant Circuits,” IEEE Transactions on Computers, vol. 42, no. 6, 1993.
[44] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill, 2001.
[45] T. Dumitras, S. Kerner, and R. Marculescu, “Towards On-Chip Fault-Tolerant Communication,” Proc. IEEE Asia South Pacific Design Automation Conference, pp. 225-232, 2003.
[46] B. Pittel, “On Spreading a Rumor,” SIAM Journal on Applied Mathematics, vol. 47, no. 1, pp. 213-223, 1987.

Saeed Shamshiri (S’08) received his B.S. and M.S. degrees in computer engineering from the University of Tehran, Iran, in 2002 and 2005. He is currently working toward a Ph.D. degree at the Department of Electrical and Computer Engineering, UCSB. His current research interests include reliability, availability, and on-line testing of multi-core chips.

Kwang-Ting (Tim) Cheng (S’88–M’88–SM’98–F’00) received his Ph.D. in Electrical Engineering and Computer Science from the University of California, Berkeley in 1988. He worked at Bell Laboratories in Murray Hill, NJ from 1988 to 1993 and joined the faculty at the University of California, Santa Barbara in 1993, where he is now a professor in the Electrical and Computer Engineering Department. He was the founding director of UCSB’s Computer Engineering Program (1999-2002), Chair of the ECE Department (2005-2008), and Visiting Professor of National Tsing Hua University, Taiwan (1999) and the University of Tokyo, Japan (2008). His current research interests include design validation and test, multimedia computing, and mobile embedded systems. He has published more than 300 technical papers, co-authored five books, and holds 12 U.S. patents in these areas. Cheng, an IEEE Fellow, has received eight Best Paper Awards from various IEEE conferences and journals, as well as the 2004-2005 UCSB College of Engineering Outstanding Teaching Faculty Award. He served as Editor-in-Chief of IEEE Design & Test of Computers from 2006 to 2009 and is currently serving on the editorial boards of a number of journals. He has also served as general and program chair for several international conferences on design, design automation, and testing.
