Implementation of Keccak hash function in Tree hashing mode on Nvidia GPU

Guillaume Sevestre
[email protected]

Abstract. This paper presents a Graphics Processing Unit implementation of the KECCAK cryptographic hash function, in a parallel tree hash mode that exploits the parallel compute capacity of graphics cards. The Nvidia Cuda language has been used to access the specificities of the GPU hardware precisely (memory hierarchy, host-device memory transfers). After optimizing the cooperation between GPU and CPU, a top speed of more than 1 GB/s (including data transfers) has been reached using an entry-level GTS 250 card, for a 256-bit security target and hash length. A stream cipher mode has also been implemented, which can find applications in high-speed encryption or pseudo-random number generation.

Keywords: Cryptographic hash function, KECCAK, GPU, Cuda, Tree hashing, Sponge functions, SHA-3 proposal.

1 Introduction

GPU computing applied to cryptography has already been explored, for example by porting the AES block cipher to this platform, as in Manavski [5] and Osvik et al. [4]. This paper presents a Graphics Processing Unit (GPU) implementation of the cryptographic hash function KECCAK [1], a candidate in the SHA-3 competition. To use the parallelism of the GPU architecture efficiently, a dedicated tree hashing mode is proposed for this implementation. This mode is inspired by the tree mode proposals in the KECCAK specification and in the MD6 hash function proposal [8].

2 Targeted hardware and language framework

The targeted hardware for this implementation is Nvidia GPUs in general; a G92-based card (GTS 250) is used by the author for benchmarks. The Nvidia Cuda API was chosen in order to access the specificities of the hardware precisely (memory hierarchy, host-device memory transfers). The Cuda API is an extension of the C language, so programmers familiar with C can easily use this framework.

The parallel programming model used is Single-Instruction Multiple-Thread (SIMT), which can be viewed as an extension of the Single-Instruction Multiple-Data (SIMD) model (used by the SSE and Altivec vector instruction sets). In this model each thread processes the same instructions on different data, in contrast to the standard Multiple-Instruction Multiple-Data (MIMD) model used on multi-processor architectures, in which each thread can process different instructions on different data.

The G92 has 128 Cuda cores, each individually able to perform 32-bit integer operations, with 1024 32-bit registers available per core. Cores are grouped by 8 to form a Streaming Multiprocessor (SM), which can also access some extra shared memory, 16 kB per SM. Each Streaming Multiprocessor can run on the order of a thousand concurrent threads, and it is recommended to use at least hundreds of threads per SM. The limiting factor is the size of the data processed by each thread, which should fit into the available registers for best performance.

Fig. 1. Nvidia G92 hardware

Different memory areas are used by Cuda hardware; they are called device memory, in contrast to host memory, which resides on the host computer. The device global memory resides in VRAM; it is the largest memory available, but the slowest for cores to access. Shared memory is accessible by every core in an SM and is faster than global memory. Shared memory can be used to store cooperation data exchanged between threads during the computation.

Registers are private memory for each thread, and are very fast.

The typical workflow of a Cuda program is as follows:

1. Allocate memory workspace on host and device,
2. Transfer data from host to device global memory,
3. Load working memory into shared memory or registers,
4. Perform the core computation on the GPU,
5. Transfer results back to host memory,
6. Perform CPU finalization of the computation (if needed).

Details on the programming model, the Cuda API, hardware configurations and capabilities, and performance recommendations can be found in the Cuda programming guide provided by Nvidia [7]. Details on how to install the Cuda SDK and toolkits can be found on the Cuda developer zone web site.

3 Choosing the hash mode

3.1 Tree hash mode proposed

Several hash functions proposed for the SHA-3 competition are designed to offer inner parallelism, in the sense that within the execution of a single hash function call, different computation steps can be done in parallel. For example, in the MD6 design [8], 16 steps of the round function are independent and can be computed in parallel. One can try to exploit this inner parallelism on 'many-thread' Cuda hardware.

Another way to parallelize any hash function is to use a tree hash mode, also called a Merkle tree [6]. Here the parallelism lies outside the hash function: several instances of the hash function run concurrently, and the results of the instances are gathered by hashing them at an upper level of a tree.

The author chose to implement a tree hash mode, with many leaves (reusing the vocabulary of the KECCAK paper [1], §6.4) hashing input message parts on the GPU, and a top node hashing the leaf results serially on the host CPU. A leaf interleaving mode has been chosen, but with a fixed input message size for each leaf. The GPU thus hashes fixed-size input data 'big blocks' and returns hash results to be processed by the sequential top hash node on the CPU.
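As a structural illustration, the mode can be sketched in a few lines of Python, with SHA-256 standing in for KECCAK and a deliberately small leaf count (the actual implementation uses 192 leaf threads per block); all names here are hypothetical:

```python
# Sketch of the basic tree hash mode: leaves hash interleaved 32-bit-word
# shares of a fixed-size big block; a top node hashes the leaf digests.
# SHA-256 is only a stand-in for Keccak-f[800][r=256,c=544].
import hashlib

LEAVES = 4   # illustrative; the paper uses 192 leaves per thread block
WORD = 4     # leaves consume the input as 32-bit words

def leaf_interleave(big_block, leaves):
    """Distribute the 32-bit words of the input round-robin over the leaves."""
    words = [big_block[i:i + WORD] for i in range(0, len(big_block), WORD)]
    return [b"".join(words[i::leaves]) for i in range(leaves)]

def tree_hash(big_block):
    # Each leaf (a GPU thread in the paper) hashes its interleaved share...
    digests = [hashlib.sha256(p).digest()
               for p in leaf_interleave(big_block, LEAVES)]
    # ...and the top node (on the CPU) hashes the concatenated leaf digests.
    return hashlib.sha256(b"".join(digests)).hexdigest()
```

The fixed big-block size is what keeps the GPU side stateless: every call to `tree_hash` is independent of the previous one.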

Fig. 2. Basic TreeHash mode. Within each thread block on the GPU, Keccak threads 0 to 192 each hash their interleaved share of the input words W into a digest H; the Keccak top node on the CPU hashes all leaf digests into the final hash.
Each leaf is computed by a thread on the GPU, and threads are grouped in thread blocks. Thread outputs are transferred to the CPU for the top hash node. This choice allows the definition of a streaming mode on big blocks of data, with the advantage of being stateless on the graphics card. On the other hand, the size of the big blocks (several MB) restricts this tree mode to big files.

3.2 Memory ordering

To improve memory read/write speed, the tree hash mode has been designed to ensure coalesced memory access: when threads read 32-bit words from global memory, in an array like W[], W[0] is read by thread 0, W[1] by thread 1, W[i] by thread i, and so on. This memory ordering ensures best performance on older Nvidia hardware (it depends on the compute capability of the GPU).

3.3 KECCAK version used

As the targeted hardware is limited in register memory, the author chose to implement the KECCAK-f[800] permutation. Using a compact implementation, a working hash state (hash state plus the extra working memory needed for computation) fits into 41 32-bit registers; with 8 x 1024 = 8192 registers per SM, up to 192 hash states can fit in an SM, and 192 threads can be launched in parallel.

The following parameters have been chosen for the KECCAK sponge function: KECCAK-f[800][r=256,c=544], with 256 bits of output. The capacity of 544 bits and the output size make it possible to claim 256-bit security for the hash function. The input rate of 256 bits, equal to the output size, simplifies the tree construction, as output data blocks are input data blocks of upper nodes.

Note that with the bitrate divided by 4 and the work factor divided by 2 (approximately), KECCAK-f[800][r=256,c=544] is expected to be 2 times slower than KECCAK-f[1600][r=1088,c=512] or KECCAK-f[1600][r=1024,c=576], which are respectively the KECCAK proposal for SHA-3 256 and the default KECCAK function proposed by the KECCAK authors.
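The parameter choices and the register budget above can be checked arithmetically (the registers-per-SM figure follows from the per-core count given in Section 2):

```python
# Sanity checks on the Keccak-f[800][r=256, c=544] choice and register budget.
STATE_BITS = 800
RATE, CAPACITY = 256, 544
assert RATE + CAPACITY == STATE_BITS   # sponge: rate + capacity = state width
assert CAPACITY // 2 >= 256            # capacity/2 supports the 256-bit security claim

REGS_PER_STATE = 41       # compact state + working memory, in 32-bit registers
REGS_PER_SM = 8 * 1024    # 8 cores per SM x 1024 32-bit registers per core
THREADS = 192
assert THREADS * REGS_PER_STATE <= REGS_PER_SM   # 7872 <= 8192: 192 states fit
```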

4 Performance results

4.1 First performance results

This section presents first performance results, using the basic tree mode described above. In addition to the GTS 250 card, the author also uses a laptop Cuda card (Quadro FX 370M), similar to the GTS 250 but with only 8 Cuda cores (16 times fewer than the desktop card). Comparing performance on these two cards allows the scaling behavior of the software to be studied. A full CPU implementation is also used for benchmarking, running in 32-bit mode on one core (no multithreading) and without SIMD (SSE) instructions. This implementation is also used to improve confidence in the correctness of the GPU implementation.

Table 1. First performance results

                              Core2 Duo 2.6 GHz    Core i5-750 2.6 GHz
  System                      Quadro FX 370M       Nvidia GTS 250
  CPU hash speed (MB/s)       25                   61
  CPU + GPU speed (MB/s)      15                   682

4.2 Improving performance

The first way to improve performance is to overlap GPU and CPU computation. As kernel launches (a kernel is a function executed on the GPU) are asynchronous in the Cuda API, the CPU can compute the Keccak top node over the previous GPU results while the current GPU work is running.

In the basic Cuda execution model, memory transfers (between host and device) and computation on the GPU are done in sequence. Data transfers are slow and can be the bottleneck of the program. Another way to improve performance, if the hardware supports it, is therefore to overlap data transfers with computation on the GPU. This can be done using page-locked memory and Cuda streams (successions of data transfers and computations that are issued in order within each stream), as explained in the Cuda programming guide [7].

The combination of overlapping GPU and CPU computation with overlapping data transfers and GPU computation gives the best results. It is interesting to note that these improvements do not much affect the slower GPU configuration, as GPU computation time remains the dominant cost there.

To optimize performance using overlapping, the data transfers and GPU work should be divided into smaller pieces that are independent (in terms of computation and data transfers), so that they can be overlapped. But to fully occupy the many GPU cores, those pieces should not be too small. A good trade-off must be found between smaller work sizes for overlapping and bigger work sizes that occupy all the GPU cores.
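The CPU/GPU overlap can be sketched with a one-worker thread pool standing in for asynchronous kernel launches; SHA-256 stands in for KECCAK, and all names are hypothetical:

```python
# Sketch of overlapping "GPU" leaf hashing with CPU top-node hashing.
# A worker thread mimics asynchronous CUDA kernel launches.
from concurrent.futures import ThreadPoolExecutor
import hashlib

def gpu_leaf_hash(block):
    # Stands in for the CUDA kernel hashing one big block of leaves.
    return hashlib.sha256(block).digest()

def hash_stream(blocks):
    top = hashlib.sha256()  # Keccak top node, running on the CPU
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = None
        for block in blocks:
            fut = gpu.submit(gpu_leaf_hash, block)  # launch "kernel" for block i
            if pending is not None:
                top.update(pending.result())        # CPU absorbs result of block i-1
            pending = fut
        if pending is not None:
            top.update(pending.result())            # absorb the last result
    return top.hexdigest()
```

The key point is that the top node consumes result i-1 while "kernel" i is in flight, so the final digest is identical to the fully serial computation.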

Fig. 3. Overlapping data transfers, GPU and CPU computations. Three timelines are compared: no overlapping (memcpy host-to-device, GPU work i, memcpy device-to-host, then CPU work i, all in sequence); overlapping GPU & CPU (CPU work on result i-1 runs during GPU work i); and overlapping GPU & memcpy & CPU (work split into smaller pieces so that host-device transfers, GPU kernels and CPU work all proceed concurrently).

Table 2. Improved performance results

                                          Core2 Duo 2.6 GHz    Core i5-750 2.6 GHz
  System configuration                    Quadro FX 370M       Nvidia GTS 250
  CPU hash speed (MB/s)                   25                   61
  CPU + GPU speed (MB/s)                  15                   682
  CPU + GPU overlapped (MB/s)             63                   1032
  CPU + GPU overlapped + streams (MB/s)   64                   1219

5 Enhanced Tree hash mode

5.1 Second Tree hash mode proposed

The first tree hash mode uses only 256 bits of chaining value between tree nodes. The aim of the second proposed tree hash mode is to use a double-sized chaining value, as required in [2] as a necessary condition for a sound tree hash mode. As doubling the chaining value size in the first mode would double the load of the CPU top node, a tree of height 2 is used in the second mode. Outputs of the height-1 nodes are stored in shared memory, which enables the fewer threads computing the height-2 nodes to access the height-1 results. Since shared memory visibility is limited to the thread block, each thread block implements a subtree, and the results of the subtrees are still hashed by the CPU top node.
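Structurally, the height-2 mode looks like the following sketch, with SHA-512 standing in for KECCAK with a doubled (512-bit) chaining value; the shared-memory mechanics are not modeled, and the names are hypothetical:

```python
# Sketch of the second (height-2) tree hash mode with 512-bit chaining values.
import hashlib

def subtree_hash(parts):
    """One thread block: height-1 leaves, then one height-2 node."""
    # Height-1 leaves (GPU threads); their digests land in "shared memory".
    h1 = [hashlib.sha512(p).digest() for p in parts]
    # Height-2 node (fewer threads per block) hashes the shared results.
    return hashlib.sha512(b"".join(h1)).digest()

def second_mode(blocks_per_subtree):
    # The CPU top node still hashes the subtree outputs serially.
    top = hashlib.sha512()
    for parts in blocks_per_subtree:
        top.update(subtree_hash(parts))
    return top.hexdigest()
```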

Fig. 4. Second TreeHash mode. Within each thread block, 192 height-1 leaf threads hash input words W into digest pairs (H0, H1) stored in shared memory; a smaller number of height-2 threads (Keccak 0 to 32 in the figure) hash these shared results, and the Keccak top node on the CPU hashes each subtree output into the final hash.

5.2 Performance of the second Tree hash mode

Performance of the second tree hash mode is slightly below that of the first proposed mode, but this is a security/performance trade-off. All the asynchronous improvements of the first mode are also used in this mode.

Table 3. Performance results of the second Tree hash mode

                                          Core2 Duo 2.6 GHz    Core i5-750 2.6 GHz
  System configuration                    Quadro FX 370M       Nvidia GTS 250
  CPU hash speed (MB/s)                   23                   14
  CPU + GPU overlapped + streams (MB/s)   59                   1183

5.3 Future work

A proper streaming hashing API could be built over the functions implemented in this project. This API should handle the padding of the input message, and should be designed to fulfill the requirements of [2] for a sound tree hash mode.

6 Stream Cipher and Pseudo Random Number Generator (PRNG)

6.1 Stream Cipher

The parallelism of the GPU can be fully used by implementing KECCAK in a stream cipher mode, with independent streams generated by each thread on the GPU. In this mode, a 256-bit key and a 256-bit (or shorter) nonce are transferred from the host to the device. Each thread then hashes the key, the nonce, its thread id and its thread block number. Output streams are generated using the arbitrary output length mode of KECCAK, also called squeezing mode. In this mode the whole computation is done on the GPU.
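The per-thread stream derivation can be sketched with SHAKE-256 standing in for the KECCAK squeeze mode (the thread/block encoding and helper names here are hypothetical):

```python
# Sketch of the stream cipher mode: each "thread" derives an independent
# keystream by absorbing key, nonce, thread id and block id, then squeezing.
import hashlib

def keystream(key, nonce, thread_id, block_id, nbytes):
    sponge = hashlib.shake_256()
    sponge.update(key + nonce
                  + thread_id.to_bytes(4, "little")
                  + block_id.to_bytes(4, "little"))
    return sponge.digest(nbytes)   # arbitrary-length squeeze

def encrypt(key, nonce, plaintext, threads=4):
    # Split the plaintext across "threads"; XOR each share with its stream.
    out = bytearray(len(plaintext))
    chunk = -(-len(plaintext) // threads)   # ceiling division
    for t in range(threads):
        part = plaintext[t * chunk:(t + 1) * chunk]
        ks = keystream(key, nonce, t, 0, len(part))
        out[t * chunk:t * chunk + len(part)] = bytes(a ^ b for a, b in zip(part, ks))
    return bytes(out)
```

Since encryption is a pure XOR with the keystream, applying `encrypt` twice with the same key and nonce recovers the plaintext.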

Fig. 5. Stream Cipher mode. The 256-bit key and 256-bit nonce are broadcast to all threads; each Keccak thread in squeeze mode absorbs the key, the nonce, its thread id and its block id, then squeezes an independent output stream (Stream 0,0 to Stream X,192 within each thread block on the GPU).

6.2 Performance of Stream Cipher mode

Table 4. Performance of StreamCipher mode

                                  Core2 Duo 2.6 GHz    Core i5-750 2.6 GHz
  System configuration            Quadro FX 370M       Nvidia GTS 250
  GPU using Cuda streams (MB/s)   62                   1183

The performance of the stream cipher mode is quite similar to that of the first tree hash mode, since it involves the same number of permutation calls to absorb X blocks of data into a KECCAK state in hashing mode as to squeeze X blocks of output in arbitrary output length mode. The performance figures do not take into account the XOR of the key stream with the clear text. It remains to be tested whether XORing the clear text with the key stream is more efficient on the GPU (which requires transferring the clear text to the GPU and the cipher text back to the host) or on the CPU.

6.3 Encryption and Authentication

If authentication is needed, a combination of the stream cipher mode and tree hashing to compute a MAC over the encrypted data can easily be built. The sequence of operations could be:

1. Transfer the key and nonce to the GPU and start computing the cipher key streams,
2. Asynchronously transfer clear text data to the GPU,
3. XOR the clear text with the key streams on the GPU,
4. Start computing the MAC in tree mode on the GPU,
5. Asynchronously transfer the cipher text back to the CPU,
6. When the MAC is computed, transfer the resulting MAC to the CPU.

This can be viewed as a parallel variant of the new duplex sponge mode proposed in [3].
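A serial Python rendering of this sequence, with SHAKE-256 standing in for the KECCAK key stream and SHA-256 for the tree-mode MAC (the helper names and the flat keyed-hash MAC are simplifications, not the paper's construction):

```python
# Encrypt-then-MAC sketch following the six-step sequence above.
import hashlib

def encrypt_and_mac(key, nonce, plaintext):
    # Steps 1-2: derive the key stream from key and nonce.
    ks = hashlib.shake_256(key + nonce).digest(len(plaintext))
    # Step 3: XOR the clear text with the key stream.
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, ks))
    # Steps 4-6: compute a keyed MAC over the cipher text and return both.
    mac = hashlib.sha256(key + ciphertext).digest()
    return ciphertext, mac
```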

6.4 High speed PRNG

The stream cipher mode can be used as the building block of a high-speed PRNG. The key and the nonce can be replaced by true random data, or data from a random pool, to form a seed. The stream cipher function then acts as a random expander, returning 'cryptographic' random data derived from the seed.

7 Notes about overall performance and optimizations

7.1 Optimizations of the KECCAK implementation and usage

Further optimizations of KECCAK are possible in this project's implementation. For example, lane complementing (described in [1], §7.2) has not been used yet. The overall performance figures in this paper could be greatly improved by using a better bitrate for KECCAK-f[800], and/or by using KECCAK-f[1600]. For KECCAK-f[1600], as GPUs are 32-bit platforms, optimizations for 32-bit words should be studied (bit interleaving).

7.2 Optimizations using new GPU hardware

The optimizations described in this paper, explicit asynchronous data transfers and coalesced memory access, are bound to the hardware used. New GPU hardware, in addition to generally having more Cuda cores and more memory (registers, shared memory), can improve performance through implicit optimizations (for example, implicit asynchronous memory transfers between host and GPU). The author extrapolates that using the new GTX 4XX Nvidia cards and KECCAK-f[1600], speeds of more than 3 GB/s should be within reach.
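Bit interleaving, mentioned above but not implemented in this project, represents each 64-bit lane as two 32-bit words holding its even- and odd-indexed bits, so that any 64-bit lane rotation becomes two 32-bit rotations. A sketch of the representation (function names are illustrative):

```python
# Sketch of bit interleaving for Keccak-f[1600] lanes on 32-bit hardware.

def interleave(lane):
    """Split a 64-bit lane into two 32-bit words: even bits and odd bits."""
    even = odd = 0
    for i in range(32):
        even |= ((lane >> (2 * i)) & 1) << i
        odd |= ((lane >> (2 * i + 1)) & 1) << i
    return even, odd

def deinterleave(even, odd):
    """Rebuild the 64-bit lane from its even/odd 32-bit halves."""
    lane = 0
    for i in range(32):
        lane |= ((even >> i) & 1) << (2 * i)
        lane |= ((odd >> i) & 1) << (2 * i + 1)
    return lane

def rot32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF if n else x

def rot64_interleaved(even, odd, n):
    """Rotate the represented 64-bit lane left by n using 32-bit rotations:
    an even rotation 2k rotates both halves by k; an odd rotation 2k+1
    additionally swaps the halves."""
    k, r = divmod(n, 2)
    if r == 0:
        return rot32(even, k), rot32(odd, k)
    return rot32(odd, k + 1), rot32(even, k)
```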

8 Disclaimer

This work is a proof of concept of using GPUs for cryptographic hashing, and it is still at an alpha stage. Neither the correctness nor the cryptographic strength of this software is guaranteed.

9 Acknowledgments

The author thanks the KECCAK team for their comments and answers about this project.

References

1. G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. Keccak sponge function family main document. Submission to NIST (updated), 2009.
2. G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. Sufficient conditions for sound tree and sequential hashing modes. 2009.
3. G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche. Duplexing the sponge: single-pass authenticated encryption and other applications. 2010.
4. J.W. Bos, D.A. Osvik, and D. Stefan. Fast implementations of AES on various platforms. Cryptology ePrint Archive, Report 2009/501, November 2009. http://eprint.iacr.org.
5. S.A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In Signal Processing and Communications, 2007 (ICSPC 2007), IEEE International Conference on, pages 65-68. IEEE, 2008.
6. R. Merkle. A certified digital signature. In Advances in Cryptology - CRYPTO '89 Proceedings, pages 218-238. Springer, 1990.
7. NVIDIA. CUDA Programming Guide, Version 3.0. NVIDIA Corporation, 2010.
8. R.L. Rivest, B. Agre, D.V. Bailey, C. Crutchfield, Y. Dodis, K.E. Fleming, A. Khan, J. Krishnamurthy, Y. Lin, L. Reyzin, et al. The MD6 hash function: a proposal to NIST for SHA-3. Submission to NIST, 2008.

Implementation of Keccak hash function in Tree ...

opposition to host memory which resides on the host computer. The device global ... ing results of leaves in a serial way on the Host CPU. A leaf interleaving ...

149KB Sizes 8 Downloads 185 Views

Recommend Documents

VLSI IMPLEMENTATION OF THE KEYED-HASH ...
Every client and application server provider must be authenticated, in order both ... WAP an extra layer, dedicated to the security, is needed. The security layer of ...

permutation grouping: intelligent hash function ... - Research at Google
The combination of MinHash-based signatures and Locality-. Sensitive Hashing ... of hash groups (out of L) matched the probe is the best match. Assuming ..... [4] Cohen, E. et. al (2001) Finding interesting associations without support pruning.

Attacking the Tav-128 Hash function - IIIT-Delhi Institutional Repository
Based RFID Authentication Protocol for Distributed Database Environment. In. Dieter Hutter and Markus Ullmann, editors, SPC, volume 3450 of Lecture Notes.

Attacking the Tav-128 Hash function
Date: 28-July-2010. Abstract. Many RFID protocols use cryptographic hash functions for their security. The resource constrained nature of RFID systems forces the use of light weight ... weight primitives for secure protocols in § 5. 2 Notation and .

Optimising the SHA-512 cryptographic hash function on FPGA.pdf ...
Infrastructure [3], Secure Electronic Transactions [4] and. communication protocols (e.g. SSL [5]). Also, hash. functions are included in digital signature ...

Implementation of a Thread-Parallel, GPU-Friendly Function ...
Mar 7, 2014 - using the method shown in Snippet 10. Here, obsIter is a ..... laptop, GooFit runs almost twice as quickly on a gamer GPU chip as it does using 8 ...

State-of-the-Art Implementation of SHA-1 Hash ...
to a maximum of 2 Gbps. In this paper, a new implementation comes to exceed this limit improving ... Moreover year-in year-out Internet becomes more and more ...

1 New Hash Algorithm: jump consistent hash -
function mapping tuple to a segment. Its prototype is. Hash(tuple::Datum ... Big Context: To compute the Hash function, they both need a very big context to save.

Hash Tables
0. 12. 15. 1. 2 ... 10. Hash. Function banana apple cantaloupe mango kiwi pear apple banana cantaloupe kiwi mango pear. Hash Tables ...

Hash Rush Whitepaper.pdf
Page 2 of 22. Table of Contents. Legal Disclaimer. Introduction. Overview of HASH RUSH Project. RUSH COIN Tokens. Beginning of the Game. Game Modes and Events. Look and Feel. Monetization. Earn Minable Crytocurrencies. Game World and Rules. Factions.

4.Implementation of Just-In-Time in Romanian Small Companies.pdf ...
There was a problem loading more pages. Retrying... Whoops! There was a problem previewing this document. Retrying... Download. Connect more apps... Try one of the apps below to open or edit this item. 4.Implementation of Just-In-Time in Romanian Sma

Implementation of Energy Management in Designing Stage of ...
Implementation of Energy Management in Designing Stage of Building.pdf. Implementation of Energy Management in Designing Stage of Building.pdf. Open.

Appointment of Consultant for implementation of GST in BCPL..pdf ...
Appointment of Consultant for implementation of GST in BCPL..pdf. Appointment of Consultant for implementation of GST in BCPL..pdf. Open. Extract. Open with.

IMPLEMENTATION OF MIS Implementation of MIS ... -
space occupied by computers, terminals, printers, etc., as also by people and their movement. ... These classes are not necessarily exclusive, as they quite often.

Structure and function of mucosal immun function ...
beneath the epithelium directly in contact with M cells. Dendritic cells in ... genitourinary tract, and the breast during lactation (common mucosal immune system).

The Function of the Introduction in Competitive Oral Interpretation
One of the performance choices confronting an oral inter- preter is reflected in .... statement of title and author is by no means sufficient in introduc- ing literature. .... Imprisoned in a world of self-deceit, extreme vulnerability, and the confi

COMPUTATIONAL ACCURACY IN DSP IMPLEMENTATION ...
... more logic gates are required to implement floating-point. operations. Page 3 of 13. COMPUTATIONAL ACCURACY IN DSP IMPLEMENTATION NOTES1.pdf.

Predictors of immune function in space flight
Oct 19, 2006 - (2000) 405–426. [25] R.A. Vilchez, C.R. Madden, C.A. Kozinetz, S.J. Halvorson,. Z.S. White, J.L. Jorgensen, et al., Association between simian.

Complexity and the Function of Mind in Nature.pdf
ISBN 0-521-45166-3 hardback. Page 3 of 320. Complexity and the Function of Mind in Nature.pdf. Complexity and the Function of Mind in Nature.pdf. Open.