Accelerating Blowfish Encryption using NIOS II C2H Compiler

Gokulavasan G.1, Murali Krishnan Nair2*, Ravi Saini3, Raj Singh4

Abstract

A large number of algorithms and applications today are computationally intensive. When unused logic elements are available on the FPGA, such processor-intensive operations can be accelerated through hardware implementation. This paper discusses hardware acceleration of the Blowfish encryption algorithm using the NIOS II C2H Acceleration Compiler. Blowfish, a symmetric block cipher, has been seen as an alternative to existing encryption techniques. Altera’s NIOS II Embedded Processor C-to-Hardware (C2H) Acceleration Compiler is a tool well suited to accelerating time-critical functions. We implemented the Blowfish encryption algorithm with a key size of 64 bits on a NIOS II/e soft core in a Cyclone II FPGA. We achieved a speed gain of 19.5x with optimized C2H acceleration. The increase in the number of logic elements was about 77%.

Keywords: Blowfish Encryption, C2H Compiler, Hardware Acceleration, FPGA

1. Introduction

The complexity of the algorithms implemented on embedded systems has been increasing. Running these algorithms directly on the processor core can be very time consuming and therefore calls for a different approach. Hardware acceleration is an excellent option in most cases. In this method, the computationally intensive operations are performed by separate hardware accelerators. In an FPGA, the unused programmable logic can be used to create hardware accelerators that offload operations otherwise performed on the processor core. This results in a drastic decrease in computation time, at the cost of an increase in the number of logic elements used.

Altera’s NIOS II C-to-Hardware Acceleration compiler is a tool that aids the development of hardware accelerators. The tool comes as a part of the NIOS II IDE, which is used for software development for NIOS II processor based systems. Many applications can be accelerated to attain the desired speedup. In this case, we have considered the Blowfish encryption algorithm. Blowfish was designed by Bruce Schneier in 1993 and is considered an alternative to the existing encryption algorithms.

2. Blowfish Encryption

Blowfish is a symmetric block cipher that accepts a variable-length key [Schneier (1994)]. The encryption algorithm is unpatented and is considered an alternative to the existing cryptographic algorithms. Few works have been carried out on the cryptanalysis of the algorithm. An FPGA implementation of Blowfish has been discussed [Honig (2000)], and an ASIC implementation has also been carried out [Michael et al. (2000)]. In this case, we have considered the implementation of the encryption algorithm on a reconfigurable Cyclone II FPGA.

Blowfish is a 16-round Feistel cipher that operates on 64-bit blocks. Blowfish uses a number of sub-keys, which have to be calculated before encrypting or decrypting the data. These values are key-dependent and have to be computed every time a new key is used. The algorithm uses an 18-entry P-array of 32-bit words and four 256-entry S-boxes of 32-bit words. The algorithm for initializing the P-array and the S-boxes is given below [Schneier (1994)]:

1. The P-array and the S-boxes are initialized with a fixed string that consists of the hexadecimal digits of pi.

1. Student, Birla Institute of Technology and Science, Pilani (Email: [email protected])
2. Student, Birla Institute of Technology and Science, Pilani (Email: [email protected]) (Phno. 09950675867) *Presenting the paper if accepted.
3. Scientist, IC Design Group, CEERI Pilani (Email: [email protected])
4. Head, IC Design Group, CEERI Pilani (Email: [email protected])

2. P[1] is xor-ed with the first 32 bits of the key, P[2] is xor-ed with the second 32 bits of the key, and so on for all bits of the key. The cycle is repeated through the key bits until the entire P-array has been xor-ed with key bits.
3. The Blowfish algorithm is used to encrypt the all-zero string using the sub-keys described above.
4. P[1] and P[2] are replaced with the output of step (3).
5. The output of step (3) is encrypted using the Blowfish algorithm with the modified sub-keys.
6. P[3] and P[4] are replaced with the output of step (5).
7. The process is continued, replacing all entries of the P-array, and then all four S-boxes in order, with the output of the continuously changing Blowfish algorithm.

After the initialization, the algorithm given below is used for the encryption or decryption of the given data. Blowfish takes its input as a 64-bit block, which is divided into two 32-bit halves, Xl and Xr. The pseudocode of encryption is then as follows [Schneier (1994)]:

    For i = 1 to 16:
        Xl = Xl xor P[i]
        Xr = F(Xl) xor Xr
        Swap Xl and Xr

After the sixteenth round, Xl and Xr are swapped again to undo the last swap. Then, Xr = Xr xor P[17] and Xl = Xl xor P[18]. Finally, Xl and Xr are recombined to get the ciphertext.

Function F: Divide Xl into four eight-bit quarters: a, b, c, and d. Then,

    F(Xl) = ((S1,a + S2,b mod 2^32) xor S3,c) + S4,d mod 2^32

The procedure for decryption is exactly the same as encryption, except that the P-array elements are used in the reverse order. The Blowfish algorithm is license free and the source code is available [1].

3. NIOS II C2H Compiler

The NIOS II C-to-Hardware Acceleration compiler is a tool introduced by Altera to help users of NIOS II processor based systems accelerate their time-critical and processor-intensive applications [2]. Hardware acceleration is used to offload the processor-intensive functions onto a coprocessor/accelerator system. The hardware implementation decreases the computation time; the tradeoff for the speed gain is real estate on the FPGA. Usually, the algorithm is captured in RTL code and then implemented on the reconfigurable system. The distinguishing feature of the C2H compiler is that it generates the hardware accelerators directly from ANSI C functions. C2H-generated accelerators are automatically inserted into the SOPC Builder tool and connected to the rest of the system. NIOS II based systems use the Avalon interconnect fabric, which connects the components of the processor system into a memory-mapped register system. C2H accelerators are coprocessors: they can accept arguments and can access memory and other peripherals. The block diagram of a NIOS II system with an accelerator is shown in figure 1 [3].
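Because C2H builds accelerators from plain ANSI C functions, it may help to see the Section 2 algorithm written in that form. The following is a minimal sketch under stated assumptions, not the authors' code and not Schneier's reference source [1]. The names bf_F, bf_crypt, bf_encipher, bf_decipher and bf_init are hypothetical, and the tables are filled with an arbitrary placeholder sequence instead of the hexadecimal digits of pi, so its output is not interoperable with real Blowfish. It only illustrates the F function, the 16-round structure, and the sub-key generation steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BF_ROUNDS 16

static uint32_t bf_P[BF_ROUNDS + 2];  /* 18-entry P-array */
static uint32_t bf_S[4][256];         /* four 256-entry S-boxes */

/* F: split x into bytes a (most significant) to d and combine S-box entries.
   uint32_t addition wraps around, which gives the "mod 2^32" of the text. */
static uint32_t bf_F(uint32_t x)
{
    uint32_t d = x & 0xFF; x >>= 8;
    uint32_t c = x & 0xFF; x >>= 8;
    uint32_t b = x & 0xFF; x >>= 8;
    uint32_t a = x & 0xFF;
    return ((bf_S[0][a] + bf_S[1][b]) ^ bf_S[2][c]) + bf_S[3][d];
}

/* Sixteen Feistel rounds walking the P-array from index `first` in
   direction `step` (+1 to encrypt, -1 to decrypt). */
static void bf_crypt(uint32_t *xl, uint32_t *xr, int first, int step)
{
    uint32_t l = *xl, r = *xr, t;
    int i = first;
    for (int n = 0; n < BF_ROUNDS; ++n, i += step) {
        l ^= bf_P[i];
        r ^= bf_F(l);
        t = l; l = r; r = t;          /* swap Xl and Xr */
    }
    t = l; l = r; r = t;              /* undo the last swap */
    r ^= bf_P[i];                     /* P17 when encrypting, P2 when decrypting */
    l ^= bf_P[i + step];              /* P18 when encrypting, P1 when decrypting */
    *xl = l; *xr = r;
}

void bf_encipher(uint32_t *xl, uint32_t *xr) { bf_crypt(xl, xr, 0, 1); }
void bf_decipher(uint32_t *xl, uint32_t *xr) { bf_crypt(xl, xr, BF_ROUNDS + 1, -1); }

/* Sub-key generation following steps 1 to 7 of Section 2. */
void bf_init(const uint8_t *key, size_t keylen)
{
    assert(key != NULL && keylen > 0);

    /* Step 1 (PLACEHOLDER): real Blowfish loads the hex digits of pi here.
       An arbitrary LCG sequence stands in to keep the sketch short. */
    uint32_t v = 1;
    for (int i = 0; i < BF_ROUNDS + 2; ++i)
        bf_P[i] = v = v * 1664525u + 1013904223u;
    for (int b = 0; b < 4; ++b)
        for (int i = 0; i < 256; ++i)
            bf_S[b][i] = v = v * 1664525u + 1013904223u;

    /* Step 2: XOR the P-array with the key, cycling through the key bytes. */
    size_t k = 0;
    for (int i = 0; i < BF_ROUNDS + 2; ++i) {
        uint32_t w = 0;
        for (int j = 0; j < 4; ++j) {
            w = (w << 8) | key[k];
            k = (k + 1) % keylen;
        }
        bf_P[i] ^= w;
    }

    /* Steps 3 to 7: encrypt repeatedly, starting from the all-zero block,
       overwriting the P-array and then the S-boxes two words at a time. */
    uint32_t l = 0, r = 0;
    for (int i = 0; i < BF_ROUNDS + 2; i += 2) {
        bf_encipher(&l, &r);
        bf_P[i] = l; bf_P[i + 1] = r;
    }
    for (int b = 0; b < 4; ++b)
        for (int i = 0; i < 256; i += 2) {
            bf_encipher(&l, &r);
            bf_S[b][i] = l; bf_S[b][i + 1] = r;
        }
}
```

Calling bf_init with an 8-byte (64-bit) key, then bf_encipher followed by bf_decipher on the same pair of 32-bit words, should return the original words, since decryption only reverses the order in which the P-array entries are applied.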

Fig. 1. NIOS II System with Hardware Accelerator

The C2H compiler maps each element of the accelerated C function to a hardware structure, and these structures are instantiated as part of the accelerator. A few of these mappings are discussed below; the complete list, along with the supported ANSI C functions, is given in [3]. A multiplication operator maps to a multiplier of the appropriate size and an addition maps to an adder circuit. Some operations, such as shifts, might not need a hardware unit. Loops are transformed into state machines, and pointer dereferences map to Avalon master ports and can thus access memory.

To get the best results from the C2H Acceleration tool, one needs to restructure the code, since the code structure directly determines the accelerator generated. The most important parameters of a hardware accelerator generated by the C2H Acceleration tool are the loop latency and the CPLI (Cycles per Loop Iteration), which together determine the speed gain achieved. Loop latency is the number of clock cycles taken to fill the pipeline of the accelerator. CPLI is an estimate of the effective number of cycles needed to complete one loop iteration.

The Blowfish encryption algorithm was implemented on the NIOS II/e soft-core system, and the C2H Acceleration tool was then used to generate the hardware accelerator for the system. The optimization steps carried out and the results achieved are discussed below.

4.1 NIOS II Processor System

The Quartus II IDE aids in the development of the NIOS II based system. The SOPC (System-on-a-Programmable-Chip) Builder tool lists a number of IP cores that can be chosen and instantiated in the system. In the tool, a NIOS II soft-core processor of type ‘e’ was chosen (cpu_0). This core has no instruction or data cache and is one of the most basic cores available. It occupies about 700 logic elements of the FPGA device. A system timer (timer1) was added to measure the computation time of different operations in the processor system. The timer was accessed in software using the corresponding HAL function [4]. An SDRAM controller was chosen from the SOPC Builder IP core list. The processor code resides in the SDRAM chip on the Cyclone II FPGA development kit.
A PLL (pll_0) was used to phase-shift the clock signal used by the SDRAM, as required by the chip. A JTAG UART (jtag_uart_0) was also selected, which enables the JTAG features of programming and debugging. The system also includes an on-chip memory module for the storage of the sub-keys. The hardware accelerator (accelerator_nios_c2h_Blowfish_encipher) created by the C2H Acceleration tool is added to the SOPC Builder list by the tool and is connected to the rest of the NIOS II processor system. The System ID (sysid) IP core enables the IDE to verify the core onto which the software is being downloaded and hence avoids mismatches. The SOPC Builder system with the chosen IP cores is given in figure 2.

Fig. 2. SOPC Builder system

After the NIOS II processor system is created, it is instantiated in the block diagram of the project and the corresponding pins are added. These pins are mapped to the required device pins through the pin assignment tool. The system as it appears on the block diagram is shown in figure 3. The processor system was connected to the pins of the SDRAM and to the clock source on the development board. The direction of each pin is chosen accordingly.

Fig. 3. NIOS II Processor System

4.2 Software Implementation

The Blowfish encryption algorithm was implemented in software on the NIOS II/e core. The system was clocked at 50 MHz. A 64-bit key was used for encryption. The algorithm consists of an initialization part and an encryption part. The decryption operation is not discussed, since it performs similar operations and will achieve the same speedup as encryption. The time taken for the initialization of the sub-keys was 230 ms, and the time taken for the encryption of a 64-bit block was about 0.43 ms. The total number of logic elements used was 2050.

4.3 Hardware Accelerator

The encryption function was analyzed using the C2H Compiler. The loop latency was reported to be 28 and the CPLI turned out to be 17. The assignment of operations to the different states was studied. The sub-key arrays (S-boxes and the P-array) resided in SDRAM during this analysis. To improve the memory access time, the sub-keys were moved to a low-latency memory, in this example the on-chip memory module. The arrays can be placed in the on-chip memory using the following attribute in the array definition:

    unsigned long parray[] __attribute__ ((section (".onchip_memory_0"))) = {….};
    unsigned long sbox0[] __attribute__ ((section (".onchip_memory_0"))) = {….};
    unsigned long sbox1[] __attribute__ ((section (".onchip_memory_0"))) = {….};
    unsigned long sbox2[] __attribute__ ((section (".onchip_memory_0"))) = {….};
    unsigned long sbox3[] __attribute__ ((section (".onchip_memory_0"))) = {….};

The function ‘F’ was also implemented directly in the encryption function instead of as a call to a sub-function, since many of its operations, such as masking and shifting, do not need any hardware blocks for realization. For example, the following operations in the ‘F’ function can be implemented directly by routing the corresponding wires:

    d = x & 0x00FF;
    x >>= 8;
    c = x & 0x00FF;
    x >>= 8;
    b = x & 0x00FF;
    x >>= 8;
    a = x & 0x00FF;

These operations therefore do not add to the CPLI of the accelerator.

Moving the sub-keys to a low-latency memory does not by itself change the C2H scheduling, since the C2H compiler assumes the worst-case latency among all the memories connected to a particular master. This can be avoided by using the ‘pragma’ specification, which limits the connections of a master port of the accelerator to the specified slave ports only and thus reduces the arbitration logic between the ports. This reduces both the number of logic elements and the execution time. In this case, the sub-keys (S-boxes and P-array) reside in the on-chip memory and the data to be encrypted is in the SDRAM. Hence, the pragma references are used as given below:

    #pragma altera_accelerate connect_variable Blowfish_encipher/xl to sdram
    #pragma altera_accelerate connect_variable Blowfish_encipher/xr to sdram
    #pragma altera_accelerate connect_variable Blowfish_encipher/sbox0 to onchip_memory_0
    #pragma altera_accelerate connect_variable Blowfish_encipher/sbox1 to onchip_memory_0
    #pragma altera_accelerate connect_variable Blowfish_encipher/sbox2 to onchip_memory_0
    #pragma altera_accelerate connect_variable Blowfish_encipher/sbox3 to onchip_memory_0
    #pragma altera_accelerate connect_variable Blowfish_encipher/parray to onchip_memory_0

A few statements can also be combined so as to reduce the number of state assignments:

    y = (y^sbox2[c])+sbox3[d]; //The two operations were combined to a single step

Using the above ‘pragma’ references reduced the loop latency to 16 and the CPLI to 11, a considerable improvement over the previous results. The option to build the accelerator was then chosen. The C2H Acceleration tool generated the accelerator, added it to the system in the SOPC Builder tool, and generated the configuration file for the FPGA. The file was downloaded onto the FPGA and test runs were carried out. With the same parameters as in the software implementation (4.2), the initialization time was 18 ms and the encryption time was 0.022 ms.

5. Results

Runs of the software implementation and of the hardware-accelerated implementation were carried out. Figure 4 compares the initialization time (Init) and the encryption time (Encr) for one thousand 64-bit blocks. The number of logic elements used in the software implementation was 2050; with hardware acceleration the number rose to 3679. The speed gain achieved through hardware acceleration is 12.7x for the initialization function and 19.5x for encryption. The increase in the number of logic elements utilized is about 77%. The results indicate a large decrease in computation time, and the work is also offloaded onto a coprocessor.
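The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text: the 50 MHz clock, 230 ms and 18 ms for initialization, 0.43 ms and 0.022 ms per block, 2050 and 3679 logic elements, and the loop latency and CPLI before and after optimization. The helper names are hypothetical. The loop-cycle estimate, latency + CPLI x iterations, is an assumed first-order model of a pipelined loop and is not a formula taken from this paper.

```c
#include <assert.h>

/* Ratio of software time to accelerated time (both in the same unit). */
double speedup(double sw_time, double hw_time)
{
    assert(hw_time > 0.0);
    return sw_time / hw_time;
}

/* Clock cycles elapsed in time_ms milliseconds at clock_mhz megahertz. */
double cycles(double time_ms, double clock_mhz)
{
    return time_ms * 1e-3 * clock_mhz * 1e6;
}

/* Percentage growth in logic elements from `before` to `after`. */
double le_increase_pct(int before, int after)
{
    assert(before > 0);
    return 100.0 * (after - before) / before;
}

/* ASSUMED model: fill the pipeline once (loop latency), then complete
   one iteration every CPLI cycles. */
long loop_cycles(long latency, long cpli, long iterations)
{
    return latency + cpli * iterations;
}
```

With the quoted values, speedup(0.43, 0.022) is about 19.5 and speedup(230.0, 18.0) is about 12.8, while cycles(0.43, 50.0) and cycles(0.022, 50.0) give 21,500 and 1,100 clock cycles per block. With the logic-element counts as printed, le_increase_pct(2050, 3679) comes to roughly 79%, close to the "about 77%" stated. Under the assumed loop model, the 16 rounds take 300 cycles before the optimizations (latency 28, CPLI 17) and 192 cycles after them (latency 16, CPLI 11).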

Fig. 4. Execution time for different operations

6. Conclusion

The Blowfish encryption algorithm was implemented in software and also in hardware with the aid of the C2H Acceleration tool. The execution times in both cases were measured, and the hardware-accelerated case gives a speedup of 19.5x. The tool can be used effectively to obtain speedups in many cases. An understanding of the hardware mapping and of the optimization techniques used in the tool is essential to get the most out of it. The development time was cut down drastically, and the performance metrics give a clear idea of the acceleration obtained. For further speedup, the accelerator can be clocked faster, so that it works in a different clock domain from that of the NIOS II processor system. Care should then be taken with data transfer among modules operating in different clock domains.

7. References

[1] Bruce Schneier, Source Code in C of Blowfish Encryption Algorithm, http://www.schneier.com/code/bfsh-sch.zip
[2] Altera, Hardware Acceleration, http://www.altera.com/products/ip/processors/nios2/benefits/performance/ni2-acceleration.html#hardware_accelerators
[3] NIOS II C2H Compiler Ver 7.1 User Guide Ver. 1.2, May 2007, Altera
[4] NIOS II Software Developer's Handbook NII5V2-7.1, Altera, Pg. 11-38
[5] Bruce Schneier, ‘Description of a New Variable-Length Key, 64-Bit Block Cipher (Blowfish)’, Fast Software Encryption, Cambridge Security Workshop Proceedings (December 1993), Springer-Verlag, 1994, pp. 191-204
[6] David Honig, ‘Blowfish & IDEA in Silicon’, http://citeseer.ist.psu.edu/rd/13648578%2C458870%2C1%2C0.25%2CDownload http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/23076/http:zSzzSzwww.geocities.comzSzasicipherzSzfishnchips.pdf/honig00blowfish.pdf
[7] Michael C.-J. Lin, Youn-Long Lin, ‘A VLSI implementation of the Blowfish encryption/decryption algorithm’, Design Automation Conference, Proceedings of the ASP-DAC 2000, Asia and South Pacific, 2000.
