IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 3, MARCH 2001

A 1-Gb/s Joint Equalizer and Trellis Decoder for 1000BASE-T Gigabit Ethernet

Erich F. Haratsch, Student Member, IEEE, and Kamran Azadet, Associate Member, IEEE

Abstract: 1000BASE-T Gigabit Ethernet employs eight-state 4-dimensional trellis-coded modulation to achieve robust 1-Gb/s transmission over four pairs of Category-5 copper cabling. This paper compares several postcursor equalization and trellis decoding algorithms with respect to performance, hardware complexity, and critical path. It is shown that parallel decision-feedback decoders (PDFD) offer the best tradeoff. The example of a 14-tap PDFD, however, shows that it is challenging to meet the required throughput of 1 Gb/s using current standard-cell CMOS technology. A modified approach is proposed which uses decision-feedback prefilters followed by a one-tap PDFD. This considerably reduces hardware complexity and improves the throughput while still meeting the bit-error-rate requirement. The critical path is further reduced by employing a look-ahead technique. The proposed joint equalizer and trellis decoder architecture has been implemented in a 3.3-V 0.25-µm standard-cell CMOS process. It achieves a throughput of 1 Gb/s with a 125-MHz clock. Compared to a 14-tap PDFD, the design improves both gate count and throughput by a factor of two, while suffering only from a 1.3-dB performance degradation.

Index Terms: Decision-feedback sequence estimation, Gigabit Ethernet 1000BASE-T, joint equalization and trellis decoding, parallel decision-feedback decoding, reduced-state sequence estimation, Viterbi algorithm.

Manuscript received August 4, 2000; revised December 6, 2000. The work of E. F. Haratsch was supported by a grant from the Deutscher Akademischer Austauschdienst, Hochschulprogramm III, and the Hans-Keller-Stiftung. E. F. Haratsch is with the Institute for Integrated Circuits, Technische Universität München, D-80290 München, Germany (e-mail: [email protected]). K. Azadet is with Lucent Technologies, Bell Laboratories, Holmdel, NJ 07733 USA (e-mail: [email protected]). Publisher Item Identifier S 0018-9200(01)01566-9.

I. INTRODUCTION

ETHERNET has become the most successful media interface for local area networks (LAN) and has seen an exponential growth in achievable data rates over the past decades [1]. Whereas 10-Mb/s Ethernet was initially introduced for coax cable in the late 1970s, the economic advantages of running 10-Mb/s Ethernet over unshielded twisted pair (UTP) soon became apparent. Thus, a new 10-Mb/s Ethernet standard, 10BASE-T, which uses Category 3 UTP, was introduced in 1990. The increasing demand for bandwidth in LAN applications prompted the development of Fast Ethernet standards in the mid 1990s, namely 100BASE-TX, 100BASE-T2, and 100BASE-T4, which specified 100-Mb/s operation over UTP cabling. More recently, the IEEE 802.3ab task force finalized the 1000BASE-T standard, which increases the data rate even further to 1 Gb/s for copper wiring over distances of at least 100 m [2]–[4].

1000BASE-T employs full-duplex baseband transmission over four pairs of Category 5 cabling. A throughput of 1 Gb/s is achieved by the transmission of 250 Mb/s over each wire pair, as shown in Fig. 1. Five-level pulse amplitude modulation (PAM5) is used as the transmission scheme on each wire pair. By grouping the four symbols transmitted on the four channels, a four-dimensional (4-D) symbol is formed which carries eight information bits. Thus, the symbol rate is 125 MBaud, which corresponds to a symbol period of 8 ns. While achieving a target bit error rate (BER) of less than $10^{-10}$, the receiver corresponding to a wire pair must operate under the following severe channel conditions:
• intersymbol interference (ISI) caused by the wire attenuation;
• echo from the local transmitter on the same wire pair;
• near-end crosstalk (NEXT) from the local transmitters corresponding to the three adjacent wire pairs;
• far-end crosstalk (FEXT) from the remote transmitters of the three adjacent wire pairs;
• noise from sources other than those listed above.

Whereas FEXT can generally be tolerated, the noise due to NEXT and echo must be mitigated using respective cancellers. To improve the noise margin, 1000BASE-T employs trellis-coded modulation (TCM) [5], [6], which can achieve an asymptotic coding gain of approximately 6 dB.

A cost-effective implementation of the 1000BASE-T media interface demands that the whole transceiver, including both the analog and digital signal processing, be integrated in a single chip. A simplified receiver architecture which would be part of a single-chip 1000BASE-T transceiver solution is shown in Fig. 2. Except for the equalizer and trellis decoder, Fig. 2 shows only the processing blocks corresponding to one wire pair. The channel output of a wire pair is first digitized using an A/D converter with a sampling rate of 125 MHz or higher. After the A/D converter, an adaptive feedforward equalizer (FFE) removes precursor ISI to make the channel minimum-phase and whiten the noise.
Echo from the transmitter corresponding to the same wire pair and NEXT from the transmitters corresponding to adjacent wire pairs are cancelled with respective adaptive cancellers. After feedforward equalization, echo and NEXT cancellation, the channel impulse response is solely postcursor and spans around 14 symbol periods. The equalizer and trellis decoder block has the task of equalizing the ISI due to the postcursor impulse response and decoding the trellis code. Its inputs are the four received signals corresponding to the four wire pairs after feedforward equalization, echo and NEXT cancellation.

To obtain a significant fraction of the coding gain provided by the trellis code, the transceiver must employ a sophisticated sequence estimation technique to perform joint equalization and trellis decoding. It will be shown in this paper that the simplest of these techniques from an implementation perspective is parallel decision-feedback decoding, which enhances the well-known Viterbi algorithm by separate decision-feedback computations for each code state [7]–[9]. However, the implementation of a parallel decision-feedback decoder (PDFD) for operation at 125 MHz is extremely challenging because of the long recursive loop encountered in this decoder structure [3], [10]. In addition, the hardware complexity of the equalizer and trellis decoder must be kept at a minimum to offer an overall cost-effective and low-power 1000BASE-T transceiver solution. Although there exists extensive literature on the VLSI implementation of high-speed Viterbi decoders which are able to achieve throughputs up to 1 Gb/s (see, e.g., [11]–[18]), these techniques for speeding up Viterbi decoding cannot be applied to PDFDs because of the different nature of the recursive loop. To the best of our knowledge, there are no publications describing techniques on the algorithm and architecture level which allow the implementation of a PDFD for 1000BASE-T Gigabit Ethernet using current CMOS standard-cell technology.

The contributions of this paper are as follows. First, it provides a comparison of equalizer and trellis decoder architectures for 1000BASE-T Gigabit Ethernet with respect to signal-to-noise ratio (SNR) performance, hardware complexity, and critical path. Second, it proposes a joint equalizer and trellis decoder structure comprising decision-feedback prefilters followed by a one-tap look-ahead PDFD capable of achieving the throughput and BER specified by the 1000BASE-T standard. Third, the paper reports the first high-speed PDFD architecture and implementation, capable of achieving a data rate of 1 Gb/s using 0.25-µm standard-cell CMOS technology.

The remainder of this paper is organized as follows. In Section II, candidate algorithms for the equalization and decoding of trellis-coded signals are compared in terms of error-rate performance and hardware implementation. Section III describes the implementation aspects of a 1000BASE-T PDFD. A low-complexity architecture for a 1000BASE-T equalizer and trellis decoder is proposed in Section IV, and the architectural details are described in Section V and Section VI, respectively. Implementation results are presented in Section VII with conclusions drawn in Section VIII.

Fig. 1. 1-Gb/s full-duplex transmission over four pairs of CAT-5 cabling.

Fig. 2. 1000BASE-T receiver block diagram.

0018–9200/01$10.00 ©2001 IEEE

II. EQUALIZATION AND DECODING OF TRELLIS-CODED SIGNALS

Fig. 3. Subset partitioning.

Fig. 4. 1000BASE-T code trellis.

1000BASE-T Gigabit Ethernet specifies 4-D TCM with eight states. For coding purposes, the symbol alphabet is partitioned into different subsets. In the 1-D signal space, which corresponds to a single wire pair, the PAM5 signal constellation is divided into the two 1-D subsets A and B, leading to a minimum squared Euclidean distance of $\Delta^2 = 4$ between symbols within each of these two 1-D subsets (see Fig. 3). By grouping different combinations of the 1-D subsets which are sent over the four wire pairs, the eight 4-D subsets S0, S1, ..., S7 are formed. Each 4-D subset consists of both A-type and B-type 4-D symbols. The 4-D subset partitioning guarantees a minimum squared Euclidean distance of $\Delta^2 = 4$ between different 4-D symbols in the same subset and of $\Delta^2 = 2$ between points of different even subsets (S0, S2, S4, S6) or odd subsets (S1, S3, S5, S7).

The eight-state radix-4 trellis code employed by 1000BASE-T is shown in Fig. 4. This trellis is generated by a rate 2/3 trellis encoder at the transmitter, which is described in [3] and [4]. Each transition in the trellis corresponds to a 4-D subset as specified in Fig. 3. Due to the subset partitioning and labeling of the transitions in the code trellis, only branches corresponding to even or odd 4-D subsets leave or enter each state. Therefore, the minimum squared Euclidean distance between allowed sequences is $\Delta^2 = 4$, which corresponds to an asymptotic coding gain of $10 \log_{10} 4 \approx 6$ dB over uncoded PAM5 in an ISI-free channel.

A. Maximum Likelihood Sequence Estimation

After feedforward equalization, echo and NEXT cancellation, the input into the equalizer and trellis decoder corresponding to wire pair $j$ is given by

$$z_{n,j} = a_{n,j} + \sum_{i=1}^{L} f_{i,j}\, a_{n-i,j} + w_{n,j} \tag{1}$$

where $a_{n,j}$ are the 1-D symbols transmitted over wire pair $j$, $f_{i,j}$ are the postcursor channel coefficients, and $w_{n,j}$ are the noise samples for this wire pair. $L$ is the postcursor channel memory and is assumed to be 14 for the rest of this paper, as the postcursor ISI, which must be considered in the equalizer and trellis decoder, spans at most 14 symbol periods in the 1000BASE-T application [3]. It can be assumed that $w_{n,j}$ is white Gaussian noise with zero mean.

When trellis coding is used over a channel that suffers from ISI, a combined code and channel state can be defined [8], which in the case of 1000BASE-T is given by

$$\xi_n = \left(\rho_n;\ a_{n-L}, \ldots, a_{n-1}\right) \tag{2}$$

where $a_n = (a_{n,1}, a_{n,2}, a_{n,3}, a_{n,4})$ is the 4-D symbol sent over the four wire pairs and $\rho_n$ is the code state at time $n$. The optimum method for the recovery of the information bits is maximum likelihood sequence estimation (MLSE) [19], which applies the Viterbi algorithm (VA) [20], [21] to the super trellis defined by the combined code and channel state. The number of states in the super trellis is given by $S \times 2^{mL}$, where $S$ is the number of code states and $m$ is the number of information bits contained in a 4-D code symbol. In 1000BASE-T, $S = 8$, $m = 8$, and $L = 14$, which makes MLSE prohibitively expensive with about $8 \times 2^{112} \approx 4 \times 10^{34}$ states. Thus, a suboptimal equalization and trellis decoding method must be found for 1000BASE-T that provides satisfactory error-rate performance with manageable hardware cost.

B. Separate Equalization and Decoding

One suboptimal but simple method to equalize and decode the trellis-coded 4-D symbols corrupted by ISI and noise according to (1) is the concatenation of four conventional decision-feedback equalizers (DFEs), one for each wire pair to remove the postcursor ISI, and a TCM decoder that runs the VA on the code trellis to decode the trellis-coded symbols (see Fig. 5). The advantage of this approach is the low implementation complexity and the fact that a throughput of 1 Gb/s can be easily achieved. In the TCM decoder, the 1-D branch metric unit (1D-BMU) computes a total of eight 1-D branch metrics, as there are four wire pairs and the two 1-D subset types A and B. The eight 4-D branch metrics corresponding to the eight different 4-D subsets (cf. Fig. 3) are obtained by adding up corresponding 1-D branch metrics in the 4D-BMU. As pipeline stages can be inserted between the DFEs, 1D-BMU, 4D-BMU, add-compare-select unit (ACSU), and survivor memory unit (SMU), a clock rate of 125 MHz can readily be achieved. The maximum throughput is limited only by the ACS loop inside the ACSU. However, for the 1000BASE-T application, this architecture cannot meet the target BER due to error propagation inside the DFEs, as the postcursor ISI is severe.

Fig. 5. Separate equalization and decoding.

C. Parallel Decision-Feedback Decoding

Error propagation can be reduced by incorporating the postcursor ISI cancellation into the TCM decoder such that separate ISI estimates are calculated for each code state using the symbols of the corresponding survivor path as feedback decisions. This is called parallel decision-feedback decoding and is the simplest case of decision-feedback sequence estimation [7] or reduced-state sequence estimation (RSSE) [8], [9]. The architecture of a PDFD for 1000BASE-T which equalizes a 14-tap long span of postcursor ISI (14-tap PDFD) is shown in Fig. 6. For each of the eight code states and four wire pairs, the decision-feedback unit (DFU) calculates a separate ISI estimate based on symbols from the corresponding survivor paths in the survivor memory unit (SMU), for a total of 32 ISI estimates. As the 1-D branch metrics are computed for each code state, wire pair, and 1-D subset type (A and B), the 1D-BMU computes a total of 64 1-D branch metrics. A total of 32 4-D branch metrics are computed in the 4D-BMU, as there are four transitions leaving each state of the code trellis (cf. Fig. 4).

Although the PDFD operates on the same number of states as the TCM decoder of Fig. 5, it requires significantly more hardware, as the DFU becomes very complex for long postcursor channel memories, and branch metrics have to be calculated for all states separately. Furthermore, the maximum throughput of a PDFD is drastically lower than the one achievable with a TCM decoder working on the same code trellis, as all processing blocks, namely the DFU, 1D-BMU, 4D-BMU, ACSU, and SMU, lie in a recursive loop which cannot be pipelined.

Fig. 6. 14-tap PDFD architecture.

D. M-Algorithm

Whereas a PDFD searches for the most likely sequence in a reduced-state trellis, the M-algorithm (MA) works on the nonreduced super trellis defined by (2), but retains only a limited number $M$ of paths with the best metrics [22]. To apply the MA to 1000BASE-T, each of the $M$ paths is extended by its four extensions at each processing step, and the resulting $4M$ paths are then tested for duplicate paths and sorted to find the $M$ best paths, which are retained.

E. Implementation and Performance Comparison

Fig. 7. Error-rate performance of equalization and trellis decoding algorithms.

Error-rate curves for the performance of separate DFE equalization and trellis decoding (DFETCM), parallel decision-feedback decoding, and the MA with $M = 4$ (MA4) are shown in Fig. 7. For comparison, simulation results for uncoded PAM5 transmission and detection with conventional DFEs are also shown. These curves were obtained using the 1000BASE-T channel model published by the IEEE 802.3ab task force, and it was assumed that the equalization algorithm cancels the ISI due to 14 postcursor channel taps, i.e., the channel memory is 14. It can be seen that DFETCM is only 1 dB better than uncoded DFE detection because of error propagation effects in the DFEs preceding the TCM decoder. However, the PDFD and MA4 are able to retain a coding gain of 5.3 and 5.6 dB, respectively. Coding gains approaching the performance of MLSE could be achieved by considering more than eight reduced states with an RSSE scheme or more than four paths in the MA [10]. Table I compares DFETCM, the PDFD, and MA4 in terms of coding gain, critical path, and hardware complexity. It can be seen that DFETCM is the most favorable architecture in terms of complexity and critical path, but its low coding gain makes it unsuitable for a robust receiver design. The MA4 uses only $M = 4$ paths to achieve a slightly better error-rate performance than a PDFD, where eight survivor paths are retained. This would make the MA4 interesting for a low-complexity joint equalizer and trellis decoder implementation. However, in the MA4 the testing and sorting of 16 paths lies in the critical path, whereas in the PDFD
only four metrics have to be compared during the same time [10]. Thus, it seems impractical to implement the MA4 at a clock rate of 125 MHz in current CMOS technology because of the critical path problem, whereas the PDFD seems to be a good choice for 1000BASE-T both in terms of coding gain and feasibility of implementation.

TABLE I. COMPARISON OF EQUALIZER AND TRELLIS DECODER ARCHITECTURES

III. 1000BASE-T PDFD IMPLEMENTATION ASPECTS

Following the overview of the architecture for a 14-tap PDFD given in Section II-C, this section addresses the implementation aspects associated with each processing block as well as their contribution to the critical path.

A. DFU

In the DFU, the ISI estimate for each code state $\rho_n$ and wire pair $j$ is calculated according to

$$u_{n,j}(\rho_n) = -\sum_{i=1}^{L} f_{i,j}\, \hat{a}_{n-i,j}(\rho_n) \tag{3}$$

where symbol $\hat{a}_{n-i,j}(\rho_n)$ is the $j$-th dimension of the 4-D survivor symbol $\hat{a}_{n-i}(\rho_n) = (\hat{a}_{n-i,1}(\rho_n), \hat{a}_{n-i,2}(\rho_n), \hat{a}_{n-i,3}(\rho_n), \hat{a}_{n-i,4}(\rho_n))$, which belongs to the survivor sequence into code state $\rho_n$ and corresponds to time $n-i$. Fig. 8 shows the structure needed to calculate an ISI estimate for a particular state and dimension. Due to the PAM5 signal constellation, the symbol multiplications can be realized using a shift operation. To limit the contribution of the DFU to the overall critical path, the final addition should be implemented using an adder tree with 4-2 compressors [23]. Then, the contribution of the DFU to the critical path is one shift operation and three additions.

Fig. 8. Calculation of an ISI estimate in the DFU.

B. 1D-BMU

The 1-D branch metrics corresponding to code state $\rho_n$ and wire pair $j$ are calculated in the 1D-BMU according to

$$\lambda_{n,j}(a_{n,j}, \rho_n) = \left(z_{n,j} + u_{n,j}(\rho_n) - a_{n,j}\right)^2 \tag{4}$$

where $a_{n,j}$ is the closest PAM5 symbol to the ISI-free signal estimate $z_{n,j} + u_{n,j}(\rho_n)$ within the 1-D subset A or B. For the 1000BASE-T trellis code, a significantly higher bit precision is needed for the 1-D branch metrics than for the decoding of convolutional codes, where only 4-bit precision is needed for the branch metrics [24]. However, it is not necessary to implement an exact squaring function to achieve the full coding gain of Viterbi-type decoders [25], [26]. It is sufficient to approximate the squaring function based on truth tables with reduced output precision. The contribution of the 1D-BMU to the critical path is one addition, the slicing operation, and random logic for the squaring function.

C. 4D-BMU

The 4-D branch metric corresponding to state $\rho_n$ and 4-D symbol $a_n$ is obtained by adding up the input 1-D branch metrics for the four wire pairs:

$$\lambda_n(a_n, \rho_n) = \sum_{j=1}^{4} \lambda_{n,j}(a_{n,j}, \rho_n) \tag{5}$$

For each of the eight 4-D subsets, the branch metrics for both the best A-type and B-type 4-D symbol must be computed (see Fig. 3), and then the 4-D symbol with the minimum metric among these two has to be found through a compare-select operation. The architecture for the computation of the 4-D branch metrics corresponding to transitions from state zero is shown in Fig. 9. The contribution of the 4D-BMU to the critical path is two additions, one subtraction, and a 2:1 multiplexer.

D. ACSU

The best survivor path into state $\rho_{n+1}$ from the four predecessor states $\{\rho_n\}$ is determined by performing the four-way ACS operation

$$\Gamma_{n+1}(\rho_{n+1}) = \min_{\{\rho_n\} \to \rho_{n+1}} \left(\Gamma_n(\rho_n) + \lambda_n(\rho_n \to \rho_{n+1})\right) \tag{6}$$

where $\Gamma_n(\rho_n)$ denotes the path metric of state $\rho_n$ and $\lambda_n(\rho_n \to \rho_{n+1})$ the 4-D branch metric of the corresponding transition.
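The four-way ACS operation for a single state can be sketched behaviorally as follows. This is a minimal illustration, not the implemented hardware: the function name `acs4` and its arguments are hypothetical, floating-point metrics stand in for fixed-point ones, and no overflow handling is modeled.

```python
from typing import Sequence, Tuple


def acs4(path_metrics: Sequence[float],
         branch_metrics: Sequence[float]) -> Tuple[float, int]:
    """Add-compare-select over the four transitions entering one state.

    path_metrics[k]   : path metric of the k-th predecessor state
    branch_metrics[k] : 4-D branch metric of the transition from that state
    Returns the surviving path metric and the index of the chosen predecessor.
    """
    if len(path_metrics) != 4 or len(branch_metrics) != 4:
        raise ValueError("a radix-4 trellis has exactly four predecessors")
    # Add: extend each of the four candidate paths by its branch metric.
    candidates = [p + b for p, b in zip(path_metrics, branch_metrics)]
    # Compare and select: keep the candidate with the smallest metric.
    best = min(range(4), key=candidates.__getitem__)
    return candidates[best], best


metric, predecessor = acs4([3.0, 1.0, 4.0, 2.0], [0.5, 2.0, 0.25, 0.5])
```

The returned index is what a survivor memory would use to decide which predecessor's survivor sequence to extend.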

Fig. 9. Calculation of 4-D branch metrics for transitions from state 0 in the 4D-BMU.

Fig. 10. ACSU architecture (only state 0 is shown).

To avoid an overflow of the path metrics, the modulo scheme described in [27] should be used, as it does not require any normalization operations which would increase the critical path. Minimum delay through a 4-way ACS cell can be achieved by employing the radix-4 architecture in [15], where all four metrics for the path extensions are compared to each other in parallel, resulting in six simultaneous pairwise comparisons, and the minimum metric is selected based on the signs of these comparisons (see Fig. 10). In such an implementation, the contribution to the overall critical path is given by one addition, one subtraction, the selection logic, and a 4:1 multiplexer.

E. SMU

The survivor sequences for the eight code states are stored in the survivor memory. It is well known that survivor sequences merge with a high probability after some decoding depth $D$, and our simulations have shown that this merge depth is 14 for 1000BASE-T. Usually, the trace-back architecture (TBA) is the preferred architecture for the survivor memory, as it has considerably lower power consumption than the register-exchange architecture (REA) [28]. However, as the TBA introduces latency, it cannot be used to store the survivor symbols which are needed in the DFU with zero latency. Thus, a hybrid survivor memory arrangement seems to be favorable for a PDFD implementation with an $L$-tap DFU, where the survivors corresponding to the $L$ past decoding cycles are stored in a REA, and survivors corresponding to later decoding cycles in a TBA. However, due to the tight latency budget specified for the entire receiver by the 1000BASE-T standard, the REA is used for the entire survivor memory in this work. The contribution of a REA SMU to the critical path is a 4:1 multiplexer and the delay due to the register storing a survivor symbol.

F. 1000BASE-T Implementation Results

A 14-tap PDFD has been implemented in VHDL considering the above implementation aspects and synthesized in 3.3-V 0.25-µm standard-cell CMOS. The pre-layout results are shown in Table II. It can be seen that the hardware complexity is dominated by the DFU (51%) and branch metric computations (26%), whereas the ACSU and SMU consume only 10% and 13%, respectively, of the PDFD hardware. Although the fastest known architectures for each of the processing blocks and the fastest arithmetic components available in the Synopsys DesignWare library were used, the critical path is almost 16 ns, which corresponds to a clock rate of only about 62.5 MHz, or 500 Mb/s throughput. This clearly demonstrates that innovations at the architectural and/or algorithmic level are needed to meet the data rate requirement of 1000BASE-T.
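The relation between critical path, clock rate, and throughput used in this section can be checked with a few lines of arithmetic. The sketch below assumes one 4-D symbol (eight information bits) is decoded per clock cycle, as stated in Section I; the function names are illustrative only.

```python
BITS_PER_4D_SYMBOL = 8  # information bits carried by one 4-D symbol


def max_clock_mhz(critical_path_ns: float) -> float:
    """Highest clock rate (MHz) at which a loop with this delay can run."""
    return 1e3 / critical_path_ns


def throughput_mbps(critical_path_ns: float) -> float:
    """Decoder throughput (Mb/s) when one 4-D symbol is decoded per cycle."""
    return max_clock_mhz(critical_path_ns) * BITS_PER_4D_SYMBOL


# A critical path of roughly 16 ns versus the 8-ns symbol period of 1000BASE-T.
pdfd_14tap = (max_clock_mhz(16.0), throughput_mbps(16.0))
required = (max_clock_mhz(8.0), throughput_mbps(8.0))
```

Under these assumptions a 16-ns loop supports half the required symbol rate, which is why a speedup of about two is sought in the following sections.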

TABLE II. IMPLEMENTATION RESULTS FOR A 14-TAP PDFD

IV. PROPOSED JOINT EQUALIZER AND TRELLIS DECODER ARCHITECTURE

Extensive literature exists on the VLSI implementation of high-speed Viterbi decoders with data rates up to 1 Gb/s. However, bit-level pipelining of the ACS loop as suggested in [12] and [16] cannot be applied to a PDFD implementation, as in this decoder structure the throughput is limited by a substantially longer and more complicated recursive loop. Precomputing branch metrics for multiple-step transitions according to [13], [15], [16], [29], or [30] outside the ACS loop is also not possible, as in a PDFD the branch metrics depend on decisions of the previous decoding step. The sliding window method described in [14], [17], [18], or [31] may be extended to PDFDs, leading to an architecture which may meet the throughput requirement of 1 Gb/s, but it would come with an extensive hardware overhead that is not justified when only a speedup of two is needed. Also, it would introduce significant latency which would violate the tight latency budget set by the 1000BASE-T standard.

Table II shows that most of the hardware complexity of a 14-tap PDFD is in the DFU. To reduce the PDFD hardware complexity, it has been suggested in [32] to compute separate ISI estimates only for a subset of the states with the best path metrics, smaller than the total number of code states, and to use the ISI estimate of the state with the overall best path metric for the remaining states. However, this technique would worsen the critical path problem in the PDFD, as the selection of the states with the best metrics lies in the critical path.

This paper proposes the equalizer and trellis decoder architecture shown in Fig. 11, where decision-feedback prefilters (DFP) are used for each wire pair to shorten the postcursor channel memory to one, and 1-D branch metrics are computed in a look-ahead fashion in the one-tap look-ahead PDFD (LA-PDFD). When compared to a 14-tap PDFD, this approach doubles the processing speed and at the same time halves the hardware complexity while still meeting the target BER requirement.

Fig. 11. Proposed and implemented equalizer and trellis decoder architecture.

The proposed architecture has been implemented and is part of a single-chip 1000BASE-T transceiver. Inputs into the chip core are the four received signals after feedforward equalization, echo and NEXT cancellation, and the postcursor channel coefficients; the outputs are the decoded bits. The training and adaptation of the channel coefficients is not the focus of this work and can be performed in a block outside the presented chip core. The details of the proposed architecture are described in the following two sections.

Fig. 12. Truncation of the postcursor channel impulse response through a prefilter.

V. DECISION-FEEDBACK PREFILTER

The FFEs for the four wire pairs (cf. Fig. 2) have the task of shaping the channel such that the postcursor channel impulse response seen by the equalizer and trellis decoder is minimum-phase [33]. This means that most of the postcursor channel energy will be concentrated in the beginning of the impulse response, and the tail will only have a minor contribution. This property can be exploited to reduce the complexity and critical path of a PDFD with little degradation of the error-rate performance. When the less significant tail of the postcursor channel impulse response is truncated on each wire pair with low-complexity prefilters, the DFU in the PDFD has to deal only with a short postcursor channel, therefore reducing the overall complexity of the equalizer and trellis decoder and the critical path of the PDFD.

The postcursor channel impulse response could be truncated using linear filters [34], but this would require full multiplications, which are expensive in terms of hardware complexity. Also, linear prefiltering enhances and colors the noise, degrading the performance of the PDFD. Thus, a decision-feedback structure is proposed as a prefilter for channel memory truncation, as it does not lead to noise enhancement. Such an arrangement had been proposed in [35] to reduce the number of states of an MLSE receiver. However, in our work DFPs are used to reduce the computational complexity of the DFU of a PDFD, while the number of processed states is not changed.

To truncate the postcursor channel memory on wire pair $j$ to one as shown in Fig. 12, the DFP drawn in the top of Fig. 13 can be used. It resembles the structure of a DFE, as it uses tentative decisions obtained by its own slicer to remove the tail of the postcursor channel impulse response. From the top of Fig. 13, the output of the DFP is

$$y_{n,j} = z_{n,j} - \sum_{i=2}^{L} f_{i,j}\, \bar{a}_{n-i,j} \tag{7}$$

where $\bar{a}_{n,j}$ is a tentative decision obtained by slicing $x_{n,j}$ to the closest PAM5 symbol and $x_{n,j}$ is given as

$$x_{n,j} = y_{n,j} - f_{1,j}\, \bar{a}_{n-1,j} \tag{8}$$

Fig. 13. Decision-feedback prefilter architecture.

The DFP architecture in the top of Fig. 13 contains a long critical path due to the recursive loop comprising a shift operation for the symbol multiplication, 14 additions, and the slicer. However, an architecture with improved timing characteristics can be obtained through the cut-set transformation [36]. The architecture which was implemented is shown in the bottom of Fig. 13, where the critical path comprises just one shift operation (for the symbol multiplication), one addition, and the slicer.

Fig. 14 shows that truncating the postcursor channel impulse response to one using DFPs results in a coding gain of 4 dB, compared to the maximum achievable gain of about 5.3 dB using a 14-tap PDFD. This performance loss is tolerable, as the target BER of $10^{-10}$ is still met under worst-case 1000BASE-T channel conditions. The performance loss is not greater because error propagation effects in the DFPs are reduced, as they only cancel the ISI caused by the less significant tail of the postcursor impulse response. Also drawn in Fig. 14 is the error rate for the case that DFPs shorten the postcursor memory to two and a 2-tap PDFD takes care of the ISI due to the first two postcursor taps. This receiver structure comes very close to the performance of a 14-tap PDFD, but it would require significantly more hardware when applying the look-ahead technique presented in Section VI.

Fig. 14. Error-rate performance for prefiltering.

Implementation and synthesis of an equalizer and trellis decoder architecture that uses DFPs to truncate the channel memory to one and a one-tap PDFD without look-ahead computations show that the hardware complexity is reduced by 45% to 89 kGates. The critical path is reduced by 27% to 11.38 ns, as there is no final addition in the one-tap DFU. However, this structure still does not meet timing, and a look-ahead method is needed to speed up the design even further.

TABLE III. PRECOMPUTED 1-D BRANCH METRICS VERSUS POSTCURSORS
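The prefilter recursion for one wire pair can be sketched in software as below. This is a behavioral illustration under stated assumptions, not the implemented architecture: PAM5 is taken as the levels {-2, -1, 0, 1, 2}, the coefficient list `f` holds the postcursor taps starting with the first one, decisions before the start of the sequence are taken as zero, and all function names are hypothetical. The filter removes the tail (taps 2 and higher) from its output and uses the full decision-feedback estimate only at the slicer input.

```python
from typing import List, Sequence, Tuple

PAM5 = (-2, -1, 0, 1, 2)  # assumed PAM5 signal levels


def slice_pam5(x: float) -> int:
    """Return the PAM5 level closest to x."""
    return min(PAM5, key=lambda s: abs(x - s))


def decision_feedback_prefilter(
        z: Sequence[float], f: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Truncate the postcursor memory of one wire pair to a single tap.

    z : received samples; f : postcursor taps, f[0] being the first tap.
    Returns the prefilter outputs and the tentative decisions.
    """
    outputs: List[float] = []
    decisions: List[int] = []
    for n, z_n in enumerate(z):
        # ISI of the tail (taps 2..L), rebuilt from past tentative decisions.
        tail = sum(f[i - 1] * decisions[n - i]
                   for i in range(2, len(f) + 1) if n - i >= 0)
        y_n = z_n - tail                      # output: only tap 1 remains
        previous = decisions[n - 1] if n >= 1 else 0
        x_n = y_n - f[0] * previous           # slicer input: ISI-free estimate
        decisions.append(slice_pam5(x_n))
        outputs.append(y_n)
    return outputs, decisions


def postcursor_channel(a: Sequence[int], f: Sequence[float]) -> List[float]:
    """Noise-free channel: each symbol plus its postcursor ISI."""
    return [a[n] + sum(f[i - 1] * a[n - i]
                       for i in range(1, len(f) + 1) if n - i >= 0)
            for n in range(len(a))]


symbols = [1, -2, 0, 2, -1, 1]
taps = [0.5, 0.25, -0.125]          # illustrative 3-tap postcursor response
y, tentative = decision_feedback_prefilter(postcursor_channel(symbols, taps), taps)
```

In this noise-free example the tentative decisions equal the transmitted symbols, so each output sample contains only the current symbol plus the ISI of the first postcursor tap, which is what a one-tap PDFD would then handle.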

VI. LOOK-AHEAD PDFD ARCHITECTURE

It can be seen from Table II that the DFU and 1D-BMU contribute more than 50% to the overall critical path. This suggests that precomputing the 1-D branch metrics in a look-ahead fashion to move the DFU and 1D-BMU out of the critical path may lead to a design which is operational at 125 MHz. The concept of speculative precomputations to improve throughput has already been applied in other applications such as quantizer loops [37] and decision-feedback equalizers [38]. This paper extends the look-ahead concept to PDFDs.

As there is only a finite number of 1-D branch metrics possible for a particular PDFD input, they can be precomputed. Then the calculation of the 1-D branch metrics becomes independent from previous decoding steps and can be pipelined. The correct 1-D branch metrics for a particular state are determined by the corresponding survivor symbols. Therefore, once the survivor symbols become available, the correct 1-D branch metrics can be selected among the precomputed ones. The number of possible 1-D branch metrics depends on the number of possible symbol combinations in the channel memory of the respective wire pair. As PAM5 modulation is used, there are $5^L$ symbol combinations possible per wire pair. As there are four dimensions, and the two 1-D subset types (A and B) must be considered for the current symbol corresponding to a particular dimension, the total number of 1-D branch metrics which must be precomputed in the look-ahead method is $4 \times 2 \times 5^L = 8 \cdot 5^L$ [39].

Although this number grows exponentially with the channel memory $L$, it can be seen from Table III that speculative precomputation of 1-D branch metrics is feasible for a channel with up to two postcursors. However, the postcursor channel memory in 1000BASE-T is 14, making the look-ahead technique prohibitively expensive. Nevertheless, it has been shown in Section V that truncating the channel memory to one using DFPs (see Fig. 11) leads to an acceptable loss of coding gain, where the BER requirement is still met. Precomputing 1-D branch metrics in a look-ahead fashion for a postcursor channel with memory one leads to the lowest complexity PDFD structure with 40 precomputed branch metrics, whereas a one-tap PDFD without look-ahead computation would require the computation of 64 1-D branch metrics.

Fig. 15. One-tap LA-PDFD architecture.

The architecture for a one-tap LA-PDFD is shown in Fig. 15. Speculative 1-D branch metrics are precomputed in the 1D-LA-BMU, and the multiplexer unit (MUXU) selects for each state the appropriate 1-D branch metrics based on past survivor symbols which correspond to this state. Compared to Fig. 6, only the MUXU, 4D-BMU, ACSU, and SMU are in the critical path in the proposed one-tap LA-PDFD architecture, but not the 1D-BMU and DFU, as there is a pipeline stage between the 1D-LA-BMU and the MUXU. It should be noted that bringing not only the 1D-BMU, but also the 4D-BMU out of the critical loop would be impractical even for a channel with memory one. It can be easily shown that there are too many possibilities to consider [39]. In the following, the architectures of the 1D-LA-BMU, MUXU, and SMU are described. The 4D-BMU and ACSU are implemented as described in Section III.
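The precompute-then-select idea for a channel with memory one can be sketched for a single wire pair as follows. Several details are assumptions made for illustration and are not taken from the text: the 1-D subsets are taken as A = {-1, 1} and B = {-2, 0, 2}, `y` stands for the prefiltered input sample, `f1` for the one remaining postcursor tap, and all names are hypothetical.

```python
from typing import Dict, Tuple

PAM5 = (-2, -1, 0, 1, 2)                          # assumed PAM5 levels
SUBSETS = {"A": (-1, 1), "B": (-2, 0, 2)}         # assumed 1-D subset partition


def subset_metric(x: float, subset: str) -> float:
    """Squared distance from x to the closest symbol of the given 1-D subset."""
    return min((x - s) ** 2 for s in SUBSETS[subset])


def precompute_metrics(y: float, f1: float) -> Dict[Tuple[int, str], float]:
    """Speculative 1-D metrics for every possible previous symbol and subset.

    This step does not depend on any survivor symbol, so it could run one
    pipeline stage ahead of the recursive loop.
    """
    return {(prev, name): subset_metric(y - f1 * prev, name)
            for prev in PAM5 for name in SUBSETS}


def select_metrics(table: Dict[Tuple[int, str], float],
                   survivor_symbol: int) -> Dict[str, float]:
    """Pick the metrics that match one state's past survivor symbol (5:1 mux)."""
    return {name: table[(survivor_symbol, name)] for name in SUBSETS}


def num_precomputed_metrics(memory: int, pairs: int = 4,
                            subset_types: int = 2, levels: int = 5) -> int:
    """Total number of speculative 1-D metrics for a given channel memory."""
    return pairs * subset_types * levels ** memory


table = precompute_metrics(y=0.75, f1=0.5)
chosen = select_metrics(table, survivor_symbol=-2)
```

The selection step only indexes into the table, which mirrors why a multiplexer, rather than the decision-feedback and metric computations, ends up in the recursive loop.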

Fig. 16. Precomputation of 1-D branch metrics in the 1D-LA-BMU.
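The datapath of Fig. 16 (addition, slicing, and squaring) and the subsequent 5:1 selection in the MUXU can be sketched in software for a single wire pair as shown below. This is only an illustrative model under stated assumptions: the contents of the 1-D subsets (A = {-1, +1}, B = {-2, 0, +2}), the coefficient value, and all function and variable names are assumptions and are not taken from the paper.

```python
# Software model of speculative 1-D branch metric precomputation and selection.
# Subset contents and all names below are assumptions made for illustration.
PAM5 = (-2, -1, 0, 1, 2)     # PAM5 symbol alphabet
SUBSETS = {
    "A": (-1, 1),            # assumed contents of 1-D subset A
    "B": (-2, 0, 2),         # assumed contents of 1-D subset B
}


def precompute_branch_metrics(y, f1):
    """Return the ten speculative 1-D branch metrics for one wire pair.

    y  : input sample after decision-feedback prefiltering (hypothetical name)
    f1 : the single remaining postcursor coefficient
    Keys are (subset type, hypothesized previous symbol).
    """
    metrics = {}
    for prev in PAM5:                      # five hypotheses for the channel memory
        estimate = y - f1 * prev           # addition: ISI-free estimate
        for name, symbols in SUBSETS.items():
            # slicing: closest symbol of this subset to the estimate
            nearest = min(symbols, key=lambda s: abs(estimate - s))
            # squaring: squared distance to the closest symbol
            metrics[(name, prev)] = (estimate - nearest) ** 2
    return metrics


def select_branch_metrics(metrics, survivor_symbol):
    """Model the two 5:1 multiplexers of one state and one wire pair."""
    return {name: metrics[(name, survivor_symbol)] for name in SUBSETS}


if __name__ == "__main__":
    speculative = precompute_branch_metrics(y=1.25, f1=0.5)
    print(len(speculative))
    print(select_branch_metrics(speculative, survivor_symbol=1))
```

In hardware, the precomputation does not depend on any survivor symbol and can therefore sit behind a pipeline register. Only the dictionary lookup in `select_branch_metrics`, which stands in for the multiplexers, depends on the survivor symbol and remains in the recursive loop.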

A. Precomputation of 1-D Branch Metrics

As the DFPs truncate the postcursor channel memory to one, the speculative branch metric for wire pair $j$ under the assumption that the channel memory contains $\tilde{a}_{n-1,j}$ is

$$\tilde{\lambda}_{n,j}(y_{n,j}, a_{n,j}, \tilde{a}_{n-1,j}) = \left(y_{n,j} - a_{n,j} - f_{1,j}\,\tilde{a}_{n-1,j}\right)^2 \tag{9}$$

where $a_{n,j}$ is the closest $A$-type or $B$-type 1-D symbol to the ISI-free estimate $y_{n,j} - f_{1,j}\,\tilde{a}_{n-1,j}$. As there are five possibilities for $\tilde{a}_{n-1,j}$ and two possibilities for $a_{n,j}$ ($A$-type or $B$-type 1-D symbol), a total of ten 1-D branch metrics must be precomputed per wire pair. The architecture for the speculative precomputation of 1-D branch metrics for wire pair $j$ is shown in Fig. 16, where the slicers calculate the difference between the slicer input and the closest symbol in the 1-D subsets $A$ or $B$, respectively. There is one clock period for addition, slicing, and squaring.

B. Selection of 1-D Branch Metrics

For each state $\rho_n$, wire pair $j$, and 1-D subset type $A$ or $B$, five 1-D branch metrics are possible, corresponding to the five possible symbol values $\tilde{a}_{n-1,j}$ in the channel memory. The actual 1-D branch metrics corresponding to the two different 1-D subset types $A$ and $B$, but the same code state and wire pair, are selected with 5:1 multiplexers based on the corresponding past 1-D survivor symbol $\hat{a}_{n-1,j}(\rho_n)$ available from the SMU, as shown in Fig. 17. As there are eight states, four dimensions, and two 1-D subset types, in total 64 such multiplexers are needed. The MUXU adds a 5:1 multiplexer to the critical path of the one-tap LA-PDFD, but this multiplexer has a significantly lower delay than the DFU and 1D-BMU inside the critical loop of a conventional PDFD (see Section III).

Fig. 17. Selection of 1-D branch metrics in the MUXU.

C. Survivor Memory

As the channel memory seen by the one-tap LA-PDFD is one, only the most recent 4-D survivor symbols are needed to select the correct 1-D branch metrics in the MUXU. The implemented REA with merge depth 14 is shown in Fig. 18, where only the first row storing the survivor sequence corresponding to state zero is drawn to improve the clarity of the

TABLE IV CHIP CHARACTERISTICS

TABLE V IMPLEMENTATION COMPARISON

Fig. 18. SMU architecture (only state 0 is shown).

Fig. 19. Chip core layout.

illustration. The signals labeled in Fig. 18 are the 4-D symbol decision corresponding to a particular subset and a transition from a particular state, the eight information bits which correspond to a 4-D survivor symbol, and the 2-bit ACS decision for a state. Only the first column in Fig. 18 stores the 4-D symbols which are fed into the MUXU, and they are represented by twelve bits (three bits per PAM5 symbol). After this first column, the 4-D survivor symbols are mapped into the respective eight information bits to reduce the SMU complexity. Note that in a 14-tap PDFD with merge depth 14, all 4-D survivor symbols are needed for the ISI computations in the DFU and thus cannot be mapped into information bits, leading to an SMU that is approximately 45% more complex than that of the proposed one-tap LA-PDFD approach [40].

VII. IMPLEMENTATION RESULTS

The proposed equalizer and trellis decoder architecture depicted in Fig. 11 and described in detail in Sections V and VI has been implemented and is part of a single-chip 1000BASE-T transceiver. The chip core was designed using an HDL/synthesis methodology based on standard cells in a 3.3-V 0.25-µm CMOS process with four layers of metal. A gate-level netlist was synthesized from VHDL using Synopsys. To reduce the delay through arithmetic components in the critical path such as adders and subtractors, the fast structures available in the Synopsys DesignWare library were used. A flat layout was generated using the automatic place and route tools from Avanti. The layout of the chip core measures 1.7 mm × 1.7 mm and is shown in Fig. 19. Parasitics were back-annotated to the gate-level netlist for timing and power simulations. According to post-layout simulations, the critical path is 7.64 ns, which is shorter than the 8-ns clock period, so that the chip core is operational at 125 MHz. Gate-level simulations suggest a power consumption of less than 200 mW. Table IV summarizes the implementation results.

The complexity and critical paths of the discussed equalizer and trellis decoder architectures, namely a 14-tap PDFD, a one-tap PDFD preceded by DFPs, and our proposed DFPs and one-tap LA-PDFD configuration, are compared in Table V. It can be seen that compared to a 14-tap PDFD, which was the starting point of this work, both the complexity and critical path have been reduced by a factor of two.

VIII. CONCLUSION

It has been shown that the PDFD offers a good tradeoff between hardware complexity, critical path, and error-rate performance for 1000BASE-T Gigabit Ethernet. However, it is extremely challenging to implement a 125-MHz 1-Gb/s PDFD because the throughput of this joint equalizer and trellis decoder architecture is limited by a long recursive loop. Also, the computation of the ISI estimates in a PDFD consumes a significant amount of hardware. To overcome these implementation obstacles, an architecture was proposed and implemented in a 3.3-V 0.25-µm standard-cell CMOS process where a 14-tap-long postcursor channel impulse response is shortened by decision-feedback prefiltering, and a one-tap look-ahead PDFD equalizes the remaining ISI and decodes the code trellis. This design achieves the required throughput of 1 Gb/s while still meeting the target BER. Compared to a conventional 14-tap PDFD implementation, the hardware complexity and critical path are both reduced by a factor of two.

ACKNOWLEDGMENT

The authors wish to thank J. Williams for his help in the physical design, M. Yu and A. Blanksby for valuable discussions, and B. Ackland for his support of the project.

REFERENCES

[1] R. Seifert, Gigabit Ethernet. Reading, MA: Addison-Wesley, 1998.
[2] Physical Layer Parameters and Specifications for 1000 Mb/s Operation Over 4-Pair of Category-5 Balanced Copper Cabling, Type 1000BASE-T, IEEE Standard 802.3ab-1999, 1999.
[3] M. Hatamian et al., “Design considerations for Gigabit Ethernet 1000BASE-T twisted pair transceivers,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), Santa Clara, CA, May 1998, pp. 335–342.

[4] K. Azadet, “Gigabit Ethernet over unshielded twisted pair cables,” in Proc. Int. Symp. VLSI Technology, Systems, and Applications, Taipei, Taiwan, June 1999, pp. 167–170.
[5] G. Ungerboeck, “Trellis-coded modulation with redundant signal sets, Part I: Introduction,” IEEE Commun. Mag., vol. 25, pp. 5–11, Feb. 1987.
[6] G. Ungerboeck, “Trellis-coded modulation with redundant signal sets, Part II: State of the art,” IEEE Commun. Mag., vol. 25, pp. 12–21, Feb. 1987.
[7] A. Duel-Hallen and C. Heegard, “Delayed decision-feedback sequence estimation,” IEEE Trans. Commun., vol. 37, pp. 428–436, May 1989.
[8] P. R. Chevillat and E. Eleftheriou, “Decoding of trellis-encoded signals in the presence of intersymbol interference and noise,” IEEE Trans. Commun., vol. 37, pp. 669–676, July 1989.
[9] M. V. Eyuboglu and S. U. Qureshi, “Reduced-state sequence estimation for coded modulation on intersymbol interference channels,” IEEE J. Select. Areas Commun., vol. 7, pp. 989–995, Aug. 1989.
[10] E. F. Haratsch, “High-speed VLSI implementation of reduced complexity sequence estimation algorithms with application to Gigabit Ethernet 1000BASE-T,” in Proc. Int. Symp. VLSI Technology, Systems, and Applications, Taipei, Taiwan, June 1999, pp. 171–174.
[11] G. Fettweis and H. Meyr, “High-speed parallel Viterbi decoding: Algorithm and VLSI architecture,” IEEE Commun. Mag., vol. 29, pp. 46–55, May 1991.
[12] G. Fettweis and H. Meyr, “A 100-Mbit/s Viterbi decoder chip: Novel architecture and its realization,” in Proc. IEEE Int. Conf. Commun. (ICC), vol. 2, Apr. 1990, pp. 463–467.
[13] G. Fettweis and H. Meyr, “High-rate Viterbi processor: a systolic array solution,” IEEE J. Select. Areas Commun., vol. 8, pp. 1520–1534, Oct. 1990.
[14] G. Fettweis, H. Dawid, and H. Meyr, “Minimized method Viterbi decoding: 600-Mb/s per chip,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), vol. 3, Dec. 1990, pp. 1712–1716.
[15] P. Black and T. Meng, “A 140-Mb/s 32-state radix-4 Viterbi decoder,” IEEE J. Solid-State Circuits, vol. 27, pp. 1877–1885, Dec. 1992.
[16] A. K. Yeung and J. Rabaey, “A 210-Mb/s radix-4 bit-level pipelined Viterbi decoder,” in ISSCC Dig. Tech. Papers, Feb. 1995, pp. 88–89.
[17] H. Dawid, G. Fettweis, and H. Meyr, “A CMOS IC for Gb/s Viterbi decoding: System design and VLSI implementation,” IEEE Trans. VLSI Syst., vol. 4, pp. 17–31, Mar. 1996.
[18] P. Black and T. Meng, “A 1-Gb/s four-state sliding block Viterbi decoder,” IEEE J. Solid-State Circuits, vol. 32, pp. 797–805, June 1997.
[19] G. D. Forney Jr., “Maximum-likelihood sequence estimation of digital sequences in the presence of intersymbol interference,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 363–378, May 1972.
[20] G. D. Forney Jr., “The Viterbi algorithm,” Proc. IEEE, vol. 61, pp. 268–278, Mar. 1973.
[21] H.-L. Lou, “Implementing the Viterbi algorithm,” IEEE Signal Processing Mag., vol. 12, pp. 42–52, Sept. 1995.
[22] N. Seshadri and J. B. Anderson, “Decoding of severely filtered modulation codes using the (M,L) algorithm,” IEEE J. Select. Areas Commun., vol. 7, pp. 1006–1016, Aug. 1989.
[23] N. H. Weste and K. Eshraghian, Principles of CMOS VLSI Design, 2nd ed. Reading, MA: Addison-Wesley, 1994.
[24] J. A. Heller and I. M. Jacobs, “Viterbi decoding for satellite and space communications,” IEEE Trans. Commun. Technol., vol. COM-19, pp. 835–848, Oct. 1971.
[25] A. Eshraghi, T. Fiez, K. Winters, and T. Fisher, “Design of a new squaring function for the Viterbi algorithm,” IEEE J. Solid-State Circuits, vol. 29, pp. 1102–1107, Sept. 1994.
[26] A. A. Hiasat and A. S. Abdel-Aty-Zohdy, “Combinational logic approach for implementing an improved approximate squaring function,” IEEE J. Solid-State Circuits, vol. 34, pp. 236–240, Feb. 1999.
[27] A. P. Hekstra, “An alternative to metric rescaling in Viterbi decoders,” IEEE Trans. Commun., vol. 37, pp. 1220–1222, Nov. 1989.
[28] R. Cypher and C. B. Shung, “Generalized trace-back techniques for survivor memory management in the Viterbi algorithm,” J. VLSI Signal Processing, vol. 5, pp. 85–94, 1993.
[29] G. Fettweis and H. Meyr, “Parallel Viterbi algorithm implementation: breaking the ACS-bottleneck,” IEEE Trans. Commun., vol. 37, pp. 785–790, Aug. 1989.
[30] K. K. Parhi, “High-speed VLSI architectures for Huffman and Viterbi decoders,” IEEE Trans. Circuits Syst. II, vol. 39, pp. 385–391, June 1992.
[31] G. Fettweis and H. Meyr, “Feedforward architectures for parallel Viterbi decoding,” J. VLSI Signal Processing, vol. 3, pp. 105–119, 1991.
[32] R. Raheli, G. Marino, and P. Castoldi, “Per-survivor processing and tentative decisions: What is in between?,” IEEE Trans. Commun., vol. 44, pp. 127–129, Feb. 1996.
[33] E. A. Lee and D. G. Messerschmitt, Digital Communication, 2nd ed. Norwood, MA: Kluwer, 1994.
[34] D. D. Falconer and F. R. Magee, “Adaptive channel memory truncation for maximum-likelihood sequence estimation,” Bell Syst. Tech. J., vol. 52, pp. 1541–1562, Nov. 1973.
[35] W. U. Lee and F. S. Hill, “A maximum-likelihood sequence estimator with decision-feedback equalization,” IEEE Trans. Commun., vol. 25, pp. 971–979, Sept. 1977.
[36] P. Pirsch, Architectures for Digital Signal Processing. New York: Wiley, 1998.
[37] K. K. Parhi, “Pipelining in algorithms with quantizer loops,” IEEE Trans. Circuits Syst. II, vol. 38, pp. 745–754, July 1991.
[38] P. S. Bednarz, S. C. Lin, C. S. Modlin, and J. M. Cioffi, “Design, performance, and extensions of the RAM-DFE architecture,” IEEE Trans. Magn., vol. 31, Mar. 1995.
[39] E. F. Haratsch and K. Azadet, “High-speed reduced-state sequence estimation,” in Proc. IEEE Int. Symp. Circuits and Systems, vol. 3, Geneva, Switzerland, May 2000, pp. 387–390.
[40] E. F. Haratsch and K. Azadet, “A low complexity joint equalizer and decoder for 1000BASE-T Gigabit Ethernet,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), Orlando, FL, May 2000, pp. 465–468.

Erich F. Haratsch (S’93) received the Dipl.-Ing. degree in electrical engineering and information technology from the Technische Universität München, Germany, in 1997, where he is currently working toward the Ph.D. degree in electrical engineering at the Institute for Integrated Circuits. From November 1996 until July 1997, he was with AT&T Labs-Research, Red Bank, NJ, working on video coding and face animation in the MPEG-4 framework. From August until October 1997, he was with Fuji Xerox Corporate Research, Kanagawa, Japan, investigating algorithms and VLSI architectures for image processing. Since November 1997, he has been with the DSP & VLSI Systems Research Department, Bell Labs, Lucent Technologies, Holmdel, NJ, for his Ph.D. research. His research interests include signal processing and VLSI design for high-speed communications.

Kamran Azadet (S’92–A’95) was born in Paris, France, in 1966. He received the engineering degree from Ecole Centrale de Lyon in 1990, and the Ph.D. degree from ENST Paris in 1994. From 1990 to 1994, he was a Research Engineer with Matra MHS, Saint Quentin en Yvelines, France, where he was involved in the design of video filters for acquisition systems. Since 1994, he has been with Bell Laboratories, Holmdel, NJ, where he has worked in the area of color digital CMOS cameras and high-speed transceivers. He was a member of the IEEE 802.3ab Gigabit Ethernet 1000BASE-T standard, and more recently a member of the IEEE Ethernet high-speed study group for 10-Gigabit Ethernet. He is currently Director of the High-Speed Communications VLSI Research Department in Holmdel. Dr. Azadet was a co-recipient of the 1998 IEEE JOURNAL OF SOLID-STATE CIRCUITS best paper award for a paper on a color digital CMOS camera.
