
Applying Low Power Techniques in AES MixColumn/InvMixColumn Transformations

George N. Selimis, Apostolos P. Fournaris, and Odysseas Koufopavlou
VLSI Lab, Department of Electrical & Computers Engineering, University of Patras, Patras, Greece
{gselimis, apofour, odysseas}@ee.upatras.gr

Abstract. In resource-constrained environments with increased security needs, such as smart cards or RFID tags, power consumption plays a crucial role in system efficiency. Since the AES algorithm is widely used in these applications, a power efficient design of this algorithm is essential. However, few researchers have studied this issue extensively; most focus on high throughput designs. In this paper the low power techniques of Resource Sharing and Power Management are applied in a 32-bit architecture for the MixColumn/InvMixColumn transformation of the Advanced Encryption Standard. The proposed architecture performs multiplication in the $GF(2^8)$ field of a byte $S_{i,j}$ with specific constants, using a common data path. Low power consumption is also achieved by deactivating the unused parts of the data path when the MixColumn transformation is performed. The proposed architecture achieves low power consumption and low area resources compared to other designs.

I. INTRODUCTION

Cryptography plays an important role in the security of data transmission. NIST selected Rijndael as the AES algorithm [1] in October 2000. The AES algorithm has broad applications, including smart cards, cellular phones, WWW servers and automated teller machines (ATMs). Compared to software implementations, hardware implementations of the AES algorithm provide more physical security as well as higher speed [6]. Existing implementations of AES do not focus on the problem of power consumption but instead present high throughput architectures [6]. In addition, most existing implementations of AES treat MixColumn and InvMixColumn independently. However, in low resource environments, power consumption and area resources are the main efficiency factors. In this paper, we analyze the basic operations used in the MixColumn and InvMixColumn transformations of AES and, by applying power management and resource sharing techniques, we propose a low power architecture for these transformations. The proposed architecture compares favourably with other known designs in terms of power consumption and area resources.

The mathematical background of Galois Fields is presented in Section II. The basic structure of the standard AES is given in Section III. In Section IV, we analyze the MixColumn/InvMixColumn transformations. In Section V, the proposed system is presented in detail. Comparisons with other works are given in Section VI and the paper is concluded in Section VII.

II. MATHEMATICAL BACKGROUND

This article uses the same notations and conventions as in the AES specifications [1].

Bytes. The basic data unit of AES is the byte: $a = \{a_7, a_6, a_5, a_4, a_3, a_2, a_1, a_0\}$. A byte can represent an element of the Galois Field $GF(2^8)$ in polynomial representation:

$$a(x) = \sum_{i=0}^{7} a_i x^i = a_7 x^7 + a_6 x^6 + a_5 x^5 + \dots + a_1 x + a_0$$

defined over the irreducible polynomial $p(x) = x^8 + x^4 + x^3 + x + 1$. For example, the binary value {01100011} is {63} in hexadecimal notation and represents the polynomial $x^6 + x^5 + x + 1$.

Addition: The addition of two bytes representing polynomials $a(x), b(x) \in GF(2^8)$ is achieved by adding their corresponding coefficients modulo 2, which is an XOR operation usually denoted by $\oplus$:

$$a(x) \oplus b(x) = \sum_{i=0}^{7} (a_i \oplus b_i) x^i .$$

The additive inverse of a byte is the byte itself: $-b(x) = b(x)$. Due to this, subtraction is identical to addition: $a(x) - b(x) = a(x) + b(x)$.

Multiplication: The multiplication of $a(x), b(x) \in GF(2^8)$, denoted as $a(x) \bullet b(x)$, uses the irreducible polynomial $p(x)$ of degree 8 defining the Galois Field. The multiplication $c(x) = a(x) \bullet b(x)$ in $GF(2^8)$ is done by multiplying the polynomials $a(x), b(x)$, which yields a polynomial $t(x)$ with degree less than 15. This step is followed by a modular reduction step $c(x) = t(x) \bmod p(x)$ to ensure that the result is an element of $GF(2^8)$.
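The two-step multiplication described above (polynomial product followed by modular reduction) can be sketched in Python as follows. This is only an illustrative software model; the function names `gf_add`, `poly_mul`, `poly_mod` and `gf_mul` are assumptions of the sketch and do not appear in the paper.

```python
# Illustrative model of GF(2^8) arithmetic; names are hypothetical.
AES_POLY = 0x11B  # p(x) = x^8 + x^4 + x^3 + x + 1


def gf_add(a, b):
    """Addition (and subtraction) in GF(2^8) is a bitwise XOR."""
    return a ^ b


def poly_mul(a, b):
    """Carry-less product t(x) = a(x) * b(x) over GF(2); degree is below 15."""
    t = 0
    for i in range(8):
        if (b >> i) & 1:
            t ^= a << i
    return t


def poly_mod(t, p=AES_POLY):
    """Reduce t(x) modulo p(x) by cancelling bits 14 down to 8."""
    for deg in range(14, 7, -1):
        if (t >> deg) & 1:
            t ^= p << (deg - 8)
    return t


def gf_mul(a, b):
    """c(x) = (a(x) * b(x)) mod p(x)."""
    return poly_mod(poly_mul(a, b))
```

For instance, `gf_mul(0x57, 0x83)` reproduces the worked example {57} • {83} = {c1} given in the AES specification [1].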

III. ADVANCED ENCRYPTION STANDARD

The AES algorithm is a symmetric-key cipher, in which both the sender and the receiver use a single key for encryption and decryption. The data block length is fixed at 128 bits, while the key length can be 128, 192, or 256 bits. In addition, the AES algorithm is an iterative algorithm. Each iteration is called a round, and the total number of rounds, $N_r$, is 10, 12, or 14 when the key length is 128, 192, or 256 bits, respectively. The 128-bit data block is divided into 16 bytes. These bytes are mapped to a 4 × 4 array called the State, and all the internal operations of the AES algorithm are performed on the State. Each byte in the State is denoted by $S_{i,j}$ $(0 \le i, j < 4)$ and is considered as an element of $GF(2^8)$. The irreducible polynomial used in the AES algorithm to construct the $GF(2^8)$ field is $p(x) = x^8 + x^4 + x^3 + x + 1$.

In the encryption of the AES algorithm, each round except the final round consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The final round does not have the MixColumns transformation. The Cipher transformations can be inverted and then implemented in reverse order to produce a straightforward Inverse Cipher for the AES algorithm. The individual transformations used in the Inverse Cipher are InvShiftRows, InvSubBytes, InvMixColumns, and AddRoundKey.

IV. MIXCOLUMN/INVMIXCOLUMN TRANSFORMATIONS

The MixColumn transformation operates on the State column-by-column, treating each column as a four-term polynomial (Fig. 1).

Fig. 1. The MixColumn Transformation.

The columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with a fixed polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$. Let $s'(x) = a(x) \otimes s(x)$. As a result of this multiplication, the four bytes in a column are replaced by the following:

$$\begin{aligned}
S'_{0,c} &= (\{02\} \bullet S_{0,c}) \oplus (\{03\} \bullet S_{1,c}) \oplus S_{2,c} \oplus S_{3,c} \\
S'_{1,c} &= (\{02\} \bullet S_{1,c}) \oplus (\{03\} \bullet S_{2,c}) \oplus S_{0,c} \oplus S_{3,c} \\
S'_{2,c} &= (\{02\} \bullet S_{2,c}) \oplus (\{03\} \bullet S_{3,c}) \oplus S_{0,c} \oplus S_{1,c} \\
S'_{3,c} &= (\{02\} \bullet S_{3,c}) \oplus (\{03\} \bullet S_{0,c}) \oplus S_{1,c} \oplus S_{2,c}
\end{aligned}$$

In the InvMixColumn transformation, the columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with the fixed polynomial $a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}$.

V. PROPOSED ARCHITECTURE

A. Top Level Architecture

The proposed system has a 32-bit input. Each input stream is a column of the AES State. The 1-bit signal en/dec determines the encryption/decryption mode. The system includes four Multiplier blocks. The Multiplier blocks multiply, in the $GF(2^8)$ field, the bytes of the 32-bit stream with the constants {01}, {01}, {02} and {03} in encryption mode and with the constants {09}, {0B}, {0D} and {0E} in decryption mode. Therefore, the output of each Multiplier is four 8-bit streams (a 32-bit value). With the wiring block, the 8-bit products follow the appropriate XOR tree determined by the matrix multiplication for encryption and decryption of Section IV. At the end of the process, the 32-bit output stream is column $i$ of the State after the MixColumn/InvMixColumn operation, where $0 \le i < 4$. In Figure 2, the top level architecture of the proposed system is presented.

Fig. 2. The top level architecture of the proposed system
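As a software reference for the datapath described above, the per-column matrix multiplication of Section IV can be sketched in Python. This is a behavioural model only, not the proposed hardware. The names `gf_mul`, `ENC_ROW`, `DEC_ROW` and `mix_column` are assumptions of the sketch, and the InvMixColumn coefficient row {0e}, {0b}, {0d}, {09} is taken from the AES specification [1].

```python
# Behavioural reference model of MixColumn/InvMixColumn on one State column.
# All names here are hypothetical; this is not the hardware datapath.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    c = 0
    for i in range(7, -1, -1):
        c <<= 1
        if c & 0x100:
            c ^= 0x11B
        if (b >> i) & 1:
            c ^= a
    return c


ENC_ROW = (0x02, 0x03, 0x01, 0x01)  # first matrix row for MixColumn
DEC_ROW = (0x0E, 0x0B, 0x0D, 0x09)  # first matrix row for InvMixColumn (FIPS 197)


def mix_column(col, decrypt=False):
    """Transform one 4-byte column; decrypt plays the role of the en/dec signal."""
    row = DEC_ROW if decrypt else ENC_ROW
    out = []
    for r in range(4):
        acc = 0
        for k in range(4):
            # Each matrix row is the first row rotated right by r positions.
            acc ^= gf_mul(col[k], row[(k - r) % 4])
        out.append(acc)
    return out
```

Applying `mix_column` with `decrypt=True` to the output of `mix_column` should return the original column, which gives a simple consistency check for a hardware test bench.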

B. $GF(2^m)$ Multiplication Preliminaries

In this subsection, the design of a hardware circuit for multiplication in a $GF(2^m)$ field is discussed. In Algorithm 1 the multiplication of $a, b \in GF(2^m)$ is presented. In this algorithm the bits of $b$ are processed from left (most significant) to right (least significant). The resulting multiplier is called a most significant bit first (MSB) multiplier. The MixColumn transformation is a multiplication of $GF(2^8)$ field elements defined over the irreducible polynomial $p(x) = x^8 + x^4 + x^3 + x + 1$. Taking into consideration the MixColumn/InvMixColumn transformation specification, Algorithm 1 is reformed as presented in Algorithm 2. The input is an 8-bit signal S that is multiplied by one of several constant values (signal Con). The multiplication process is concluded after 8 rounds.

Algorithm 1. Most significant bit first (MSB) multiplier for $GF(2^m)$
INPUT: $a = (a_{m-1}, \dots, a_1, a_0) \in GF(2^m)$, $b = (b_{m-1}, \dots, b_1, b_0) \in GF(2^m)$ and reduction polynomial $p(x) = x^m + r(x)$.
OUTPUT: $c = a \bullet b$
1. Set $c \leftarrow 0$
2. For $i$ from $m - 1$ downto 0 do
   2.1 $c \leftarrow leftshift(c) + c_{m-1} r$
   2.2 $c \leftarrow c + b_i a$
3. Return $(c)$

Algorithm 2. MixColumn/InvMixColumn MSB multiplication in $GF(2^m)$, $m = 8$
INPUT: $S = (s_7, \dots, s_1, s_0) \in GF(2^m)$, $Con = (con_7, \dots, con_1, con_0) \in GF(2^m)$, and reduction polynomial $p(x) = x^8 + x^4 + x^3 + x + 1 = x^8 + r(x)$
OUTPUT: $c = S \bullet Con$
1. Set $c \leftarrow 0$
2. For $i$ from 7 downto 0 do
   2.1 $c \leftarrow leftshift(c) + c_7 r$
   2.2 $c \leftarrow c + con_i S$
3. Return $(c)$

The rounds of Algorithm 2 take one of three forms, depending on the current bit $con_i$. A Type I equation corresponds to finding the first non-zero $con_i$ value and has the form $c \leftarrow c + con_i S$; since $c = 0$ and $con_i = 1$, it becomes $c \leftarrow S$. A Type II equation corresponds to the case of $con_i = 0$ occurring after the first non-zero bit of Con has been found, and has the form $c \leftarrow leftshift(c) + c_7 r$. A Type III equation corresponds to the case of $con_i = 1$ occurring after the first non-zero bit of Con has been found, and has the form $c \leftarrow leftshift(c) + c_7 r + S$.

Fig. 3. The tree structure of the multiplication process.
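A minimal Python sketch of Algorithm 1, with Algorithm 2 obtained by fixing $m = 8$ and $r(x) = x^4 + x^3 + x + 1$, is given below. The function names `msb_first_mul` and `mixcolumn_mul`, and the representation of field elements as Python integers, are assumptions of the sketch.

```python
# Sketch of the MSB-first multiplier; names are hypothetical.

def msb_first_mul(a, b, m, r):
    """Algorithm 1: c = a * b in GF(2^m) with reduction polynomial x^m + r(x)."""
    mask = (1 << m) - 1
    c = 0
    for i in range(m - 1, -1, -1):
        msb = (c >> (m - 1)) & 1
        # Step 2.1: shift left inside m bits, add r if the old top bit was set.
        c = ((c << 1) & mask) ^ (r if msb else 0)
        # Step 2.2: add a when bit i of b is set.
        if (b >> i) & 1:
            c ^= a
    return c


R_AES = 0x1B  # r(x) = x^4 + x^3 + x + 1


def mixcolumn_mul(s, con):
    """Algorithm 2: the same loop specialised to m = 8 and the AES polynomial."""
    return msb_first_mul(s, con, 8, R_AES)
```

All eight rounds are executed here. The rounds that precede the first non-zero bit of `con` leave `c` at zero, which is the observation the resource sharing technique builds on.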

C. Resource Sharing Hardware Technique

Resource sharing can be employed in order to speed up the calculations and reduce hardware area and power consumption. Observing Algorithm 2, it can be noted that until the first non-zero $con_i$ is used, in round $i$, the intermediate value $c$ does not change: it remains zero. Therefore, if the index $i$ of the first non-zero $con_i$ is known, then $(7 - i)$ rounds of Algorithm 2 can be omitted. As shown in Table I, for all the constant values (Con values) used in MixColumn/InvMixColumn, bits $con_7$ to $con_4$ are zero. It can be concluded that $(7 - 3) = 4$ rounds can be omitted in each multiplication, so calculating the multiplication product requires at most 4 rounds. The required number of rounds for multiplying the S input by each constant value Con is also shown in Table I.

The constant values, along with the number of rounds required to multiply the input S by each constant value, are known. Using this remark, the whole multiplication process can be represented by the tree structure of Fig. 3. Every level of this tree corresponds to one round of Algorithm 2. Each node of the tree corresponds to a Type I, II, or III equation, depending on the current value of $con_i$. The root of the tree (Level 1) represents $con_3$ and one multiplication round. Level 2 represents $con_2$ and two multiplication rounds. Level 3 represents $con_1$ and three multiplication rounds. Level 4 represents $con_0$ and four multiplication rounds.

Fig. 4. Proposed Type II multiplier slices

TABLE I. BINARY REPRESENTATION OF CONSTANT VALUES AND REQUIRED ROUNDS FOR EACH MULTIPLICATION

Constant Value    Binary      Req. Rounds
{01}              00000001    1
{02}              00000010    2
{03}              00000011    2
{09}              00001001    4
{0B}              00001011    4
{0D}              00001101    4
{0E}              00001110    4

Each round of Algorithm 2 can be represented as a recursive equation that varies according to the $con_i$ and $r$ inputs. However, the Con and $r$ values are known, so a recursive equation can be specified for every possible input. There are three types of such equations: Type I, Type II and Type III.

Fig. 5. Proposed Type III multiplier slices

Each product is taken at the appropriate level according to the required number of rounds for each multiplication, shown in Table I. Using the above tree, all the required results can be computed. For example, to obtain the result $S_{i,j} \bullet \{09\}$ we follow the path 1-0-0-1, while for the result $S_{i,j} \bullet \{0D\}$ we follow the path 1-1-0-1.

Each equation can be modelled by an 8-bit hardware slice. Such slices are presented in Figures 4 and 5 for Type II and Type III equations, respectively. The Type I slice is a rearrangement of wires not requiring any gate. The Type II slice, implementing the equation $c \leftarrow leftshift(c) + c_7 r$, uses only three XOR gates, since the value $r$ is a known constant ($r = \{11011\}$) and the least significant bit of $c$ after left shifting is always zero, as shown in Fig. 4. The Type III slice, implementing $c \leftarrow leftshift(c) + c_7 r + S$, utilizes 11 XOR gates, as shown in Fig. 5.
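The shared tree can be modelled in software by treating each slice as a small function and reusing intermediate nodes along common path prefixes. The sketch below is an interpretation, not the authors' netlist. The node names, the dictionary output, and the reading that each path starts at the leading non-zero bit of the constant (so {02} is path 1-0 and {03} is path 1-1) are assumptions, inferred from the example paths 1-0-0-1 and 1-1-0-1 in the text.

```python
# Software model of the shared multiplication tree; names are hypothetical.
R = 0x1B  # r = {11011}, i.e. r(x) = x^4 + x^3 + x + 1


def type1(s):
    """Type I slice: c <- S (wiring only)."""
    return s


def type2(c):
    """Type II slice: c <- leftshift(c) + c7 * r."""
    return ((c << 1) & 0xFF) ^ (R if c & 0x80 else 0)


def type3(c, s):
    """Type III slice: c <- leftshift(c) + c7 * r + S."""
    return type2(c) ^ s


def multiplier_block(s):
    """Products of byte s with every MixColumn/InvMixColumn constant."""
    n1 = type1(s)          # level 1, path 1      -> s * {01}
    n10 = type2(n1)        # level 2, path 1-0    -> s * {02}
    n11 = type3(n1, s)     # level 2, path 1-1    -> s * {03}
    n100 = type2(n10)      # level 3, path 1-0-0
    n101 = type3(n10, s)   # level 3, path 1-0-1
    n110 = type2(n11)      # level 3, path 1-1-0
    n111 = type3(n11, s)   # level 3, path 1-1-1
    return {
        0x01: n1,
        0x02: n10,
        0x03: n11,
        0x09: type3(n100, s),  # level 4, path 1-0-0-1
        0x0B: type3(n101, s),  # level 4, path 1-0-1-1
        0x0D: type3(n110, s),  # level 4, path 1-1-0-1
        0x0E: type2(n111),     # level 4, path 1-1-1-0
    }
```

Because every product reuses the nodes on its path prefix, the seven products come from eleven slice evaluations. Separate multipliers would recompute the shared prefixes for each constant.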

D. Power Management Technique

Switching activity [5] is the major cause of energy dissipation in most CMOS digital systems. The switching activity of area resources that do not contribute to a specific operation at a given time can be reduced. The basic principle is to identify logical conditions under which some inputs of a logic circuit do not affect the output. When the system operates in encryption mode, some parts of the Multiplier block do not contribute to the result. We can shut down Levels 3 and 4 of the proposed tree (Fig. 3) by introducing AND gates that stop the propagation of the S(i), C(i) signals (Fig. 4, 5). Applying this methodology, only the appropriate parts (Level 1 and Level 2) are operational in encryption mode. Figure 6 shows the active and inactive parts of a Multiplier block during the MixColumn operation, controlled by the En/Dec signal.
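The effect of the AND gating can be illustrated with a small behavioural model that counts bit flips on the Level 3 and Level 4 nodes over a stream of input bytes. This is a rough sketch under stated assumptions: the exact placement of the AND gates follows Figs. 4 and 5, which are not reproduced here, so the gating points, the names `gated_block` and `toggles`, and the use of bit flips as a stand-in for switching activity are all hypothetical.

```python
# Behavioural sketch of operand isolation; gating points are assumed.
R = 0x1B


def type2(c):
    return ((c << 1) & 0xFF) ^ (R if c & 0x80 else 0)


def type3(c, s):
    return type2(c) ^ s


def gated_block(s, dec):
    """dec = 1 for InvMixColumn (all levels active), 0 for MixColumn."""
    en = 0xFF if dec else 0x00  # byte-wide AND mask driven by En/Dec
    n1 = s
    n10 = type2(n1)
    n11 = type3(n1, s)
    # AND gates stop S and C from propagating into Levels 3 and 4.
    s_g, c10_g, c11_g = s & en, n10 & en, n11 & en
    n100, n101 = type2(c10_g), type3(c10_g, s_g)
    n110, n111 = type2(c11_g), type3(c11_g, s_g)
    level4 = (type3(n100, s_g), type3(n101, s_g),
              type3(n110, s_g), type2(n111))
    return (n1, n10, n11), (n100, n101, n110, n111) + level4


def toggles(stream, dec):
    """Count bit flips on Level 3 and Level 4 nodes across consecutive inputs."""
    prev = gated_block(stream[0], dec)[1]
    flips = 0
    for s in stream[1:]:
        cur = gated_block(s, dec)[1]
        flips += sum(bin(p ^ q).count("1") for p, q in zip(prev, cur))
        prev = cur
    return flips
```

In this model the Level 3 and Level 4 nodes stay at zero in encryption mode, whatever the input data, so they register no bit flips. The Level 1 and Level 2 outputs are unaffected by the gating.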

Fig. 6. The proposed Power Management technique

VI. COMPARISONS WITH OTHER WORKS

Applying the power management technique at the system level, data paths that do not contribute to the required results of the system are deactivated. In this case, a part of the system is inactive and does not contribute to the total power consumption. As a result, a large number of gates is inactive and the power savings are significant. In general, low power implementations of the AES algorithm are rarely proposed, with the exception of [2], where low power implementations of the SubBytes transformation are presented. We compare the proposed system with two detailed architectures for the MixColumn transformation. In [4], no resource sharing is used and a different multiplication architecture is proposed for each constant. This technique achieves about the same power consumption as our proposed design but occupies about 35% more area resources. The work in [3] has area resources similar to our proposed design. However, during the MixColumn transformation, the design of [3] has about 170% more active gates, and hence higher power consumption, than our proposed design. Comparison results are shown in Table II. In order to achieve fair comparisons, the 8-bit architectures of [3], [4] are normalized to 32 bits in Table II.

TABLE II. COMPARISONS WITH OTHER WORKS

Implementations                     [4]                   [3]                   proposed
Bit Length                          32-bit                32-bit                32-bit
Area Resources                      592 XOR               424 XOR + 128 AND     432 XOR + 104 AND
Throughput                          4 byte/clock cycle    4 byte/clock cycle    4 byte/clock cycle
Resource Sharing                    no                    yes                   yes
Active Gates (Power Consumption)
  MixColumn oper.                   152 XOR               424 XOR               152
  InvMixColumn oper.                440 XOR               424 XOR               432

VII. CONCLUSIONS

In this paper, by applying the resource sharing technique, we find common data paths between the desired operations in order to limit the area resources of the system. In addition, by applying the power management technique, data paths that do not contribute to the required results of the system are deactivated. The resulting architecture, which combines these techniques, achieves efficient results in terms of power consumption and area when compared with other known designs.

REFERENCES

[1] FIPS 197: Advanced Encryption Standard, 2001.
[2] S. Tillich, M. Feldhofer, and J. Großschädl, “Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box”, in Embedded Computer Systems: Architectures, Modeling, and Simulation, vol. 4017 of Lecture Notes in Computer Science, pp. 457–466, Springer Verlag, 2006.
[3] J. Wolkerstorfer, “An ASIC implementation of the AES MixColumn operation”, in Proc. Austrochip 2001, Vienna, Austria, Oct. 12, 2001, pp. 129-132.
[4] P. Noo-Intara, S. Chantarawong, and S. Choomchuay, “Architectures for MixColumn Transform for the AES”, Proc. of (ICEP2004), University (Phuket Campus), January 2004.
[5] P.J.M. Havinga, “Mobile Multimedia Systems”, Ph.D. thesis, University of Twente, February 2000.
[6] X. Zhang and K. Parhi, “High-speed VLSI architectures for the AES algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 12, Issue 9 (September 2004).