Journal of Structural Biology 170 (2010) 570–575


Technical Note

Vectorization with SIMD extensions speeds up reconstruction in electron tomography

J.I. Agulleiro a, E.M. Garzón a, I. García a, J.J. Fernández a,b,*

a Dept. Computer Architecture, University of Almería, 04120 Almería, Spain
b National Center for Biotechnology (CSIC), Cantoblanco, 28049 Madrid, Spain

Article info

Article history: Received 26 November 2009; Received in revised form 14 January 2010; Accepted 14 January 2010; Available online 18 January 2010.

Keywords: Electron tomography; Three-dimensional reconstruction; Weighted backprojection; WBP; Simultaneous iterative reconstruction technique; SIRT; Code optimization; Vectorization

Abstract

Electron tomography allows structural studies of cellular structures at molecular detail. Large 3D reconstructions are needed to meet the resolution requirements, and the processing time to compute these large volumes may be considerable, so high-performance computing techniques have traditionally been used. This work presents a vector approach to tomographic reconstruction that relies on the exploitation of the SIMD extensions available in modern processors, in combination with other single-processor optimization techniques. This approach succeeds in producing full-resolution tomograms with a substantial reduction in processing time, as evaluated with the most common reconstruction algorithms, namely WBP and SIRT. The main advantage is that this approach runs on standard computers without the need for specialized hardware, which facilitates the development, use and management of the programs. Future trends in processor design open excellent opportunities for vector processing with the processor's SIMD extensions in the field of 3D electron microscopy.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Electron tomography (ET) is the leading imaging technique for the three-dimensional (3D) visualization of cellular structures (Lucic et al., 2005; Fernández et al., 2006). From a set of projection images taken from a single individual specimen, the 3D structure can be obtained by means of tomographic reconstruction algorithms. Because of the resolution needs, large projection images are used and considerable processing time is thus required. Parallel and distributed computing systems have turned out to be key to coping with such requirements (Perkins et al., 1997; Fernández et al., 2002, 2004; Fernández, 2008). Current stand-alone computers have tremendous power thanks to a number of technological and architectural factors (Hennessy and Patterson, 2007) and, moreover, are easily configured and managed. Thus, an interesting way to tackle computationally complex problems is code optimization, a technique that aims to accelerate programs by rewriting the code to fully exploit the platform's power (Hennessy and Patterson, 2007; Wadleigh and Crawford, 2000). In this work, we address the computational demands of tomographic reconstruction by exploiting the vector instructions

* Corresponding author. Address: National Center for Biotechnology (CSIC), Campus Cantoblanco, C/Darwin 3, 28049 Madrid, Spain. Fax: +34 91 585 4506. E-mail addresses: [email protected], [email protected] (J.J. Fernández).

1047-8477/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jsb.2010.01.008

within modern processors. These instructions, which allow several operations to be executed at a time, are present in virtually all processors available nowadays. In conjunction with other optimization techniques, they have the potential to allow the development of efficient and fast 3D reconstruction programs that run on standard computers without the need for specialized hardware. This will facilitate their use and management in structural biology laboratories.

2. Tomographic reconstruction

A series of projection images is typically acquired from the specimen following the single tilt axis geometry. This so-called tilt series usually contains a number of images ranging from 60 to 200. Depending on the resolution requirements, the image size ranges between 512 × 512 and 4096 × 4096 pixels, or even 8192 × 8192. The reconstruction problem is then to obtain the 3D structure of the specimen from the tilt series. Weighted backprojection (WBP) (Radermacher, 2006) is currently the standard algorithm in ET. The method uniformly distributes the specimen mass present in the projection images over computed backprojection rays. A high-pass filter (i.e., weighting) is previously applied to the projection images in order to compensate for the transfer function of the backprojection process; hence the term "weighted backprojection".


The 3D problem in single tilt axis ET can be decomposed into a set of independent 2D reconstruction sub-problems corresponding to the slices perpendicular to the tilt axis (Perkins et al., 1997; Fernández, 2008) (Fig. 1(c)). Each 2D slice of the volume can then be computed by WBP, now working in 2D, from the corresponding set of 1D projections (usually referred to as a sinogram). The algorithm in Fig. 1(a) is a high-level description of the 3D reconstruction by WBP (for simplicity, the weighting of the projection data is left out). The outer loop sweeps across the slices of the volume in order to perform the backprojection process, slice by slice. Its body represents the 2D-WBP algorithm. It first computes the point r at which the current pixel (x, y) is projected at an angle α. The density value to be backprojected is then computed by linear interpolation from the 1D projection. proj[s] denotes the set of 1D projections (i.e., the sinogram) corresponding to the current slice slice[s]. Iterative methods are attracting increasing interest in the field due to their better behaviour under the conditions found in ET (Lucic et al., 2005; Fernández et al., 2002). The simultaneous iterative reconstruction technique (SIRT) (Gilbert, 1972) is one of the most common iterative methods. This algorithm (Fig. 1(b)) iteratively refines the 3D reconstruction by backprojecting the error between the experimental projections and projections calculated from the current reconstruction. In that sense, the body of the loop s of the algorithm in Fig. 1(a) is also the core of the SIRT algorithm, though with the proper modifications to the input and output arrays. In each iteration, it is used to calculate the projections from


the reconstruction, and it is used again to backproject the error. In between, there is a step consisting of the computation of the error between the experimental projections and the corresponding calculated projections and, afterwards, a weighting by a geometric factor (see error and gfactor in Fig. 1(b)). cproj represents the 1D projections calculated from the current slice slice[s].

3. Vector processing with SSE instructions

Vector processing was the basis of most supercomputers in the 1980s. Today it is becoming important again because modern CPUs include a wide variety of vector instructions, typically known as SIMD (Single Instruction, Multiple Data) instructions, intended for multimedia applications (Hassaballah et al., 2008) (e.g. MMX, SSE, AltiVec). They are designed to considerably accelerate integer and floating-point operations, as multiple computations can be performed with a single instruction. A typical SIMD instruction achieves higher performance by operating on multiple data elements of the same type at the same time. Fig. 2 illustrates the SIMD execution model. Intel introduced the Streaming SIMD Extensions (SSE) instruction set in 1999 with its Pentium III. Soon afterwards, AMD incorporated it as well. Since then, this instruction set has been continuously improved and extended, and Intel and AMD processors have always included the most advanced version at the time of manufacture. In addition to other data types (byte, short, integer,

Fig. 1. Three-dimensional reconstruction based on WBP and SIRT. (a) Sequential algorithm for WBP. (b) Sequential algorithm for SIRT. (c) Decomposition of a volume into slices or slabs of slices.


double), these instructions work with single-precision floating-point numbers (32 bits). SSE registers are 128 bits long, so four floating-point numbers are stored in each one. Some compilers, such as Intel ICC, allow activation of automatic vectorization but, in practice, it is limited to small pieces of code containing rather simple loops. Therefore, to fully exploit the potential of vector instructions, programming with explicit SSE instructions is necessary (Hassaballah et al., 2008).

Fig. 2. SIMD execution model. The data processed in a single step by a SIMD instruction build a vector. The number of elements in the vector depends on the size of the registers. X and Y are the source operand vectors. In a single step, an operation is carried out with all the elements in the vector.

In this work we propose a vectorized SIMD approach to tomographic reconstruction with WBP and SIRT. Our SIMD approach decomposes the global 3D problem into sub-problems of slabs of four slices, whose reconstruction is then carried out with vector processing (Fig. 1(c)). Our approach reconstructs the four slices of a slab at a time by taking advantage of the SSE instructions that operate on vectors of four floating-point numbers. Note that four is the maximum number of floating-point values (32 bits each) that fit in an SSE register (128 bits). The global vectorized WBP algorithm for 3D reconstruction is shown in Fig. 3(a), where SSE(.) indicates that all the operations therein are performed with vector processing by means of SSE instructions. Several important parts can be identified in the algorithm. First, vectorized Slab WBP represents the reconstruction of a slab with WBP using vector processing. Beforehand, however, it is necessary to build a vector from the projection data so that vector processing can be applied. Fig. 3(c) illustrates how this works. In the algorithm in Fig. 3(a) this is denoted by Vector{.}, which creates a vector containing the sinograms for the four slices in the slab. After vectorized reconstruction, the slices must be extracted from the reconstructed vector slab; this operation is represented by UnVector{.}.

For the weighting process involved in WBP, the FFTW (Fastest Fourier Transform in the West) library has been used (Frigo and Johnson, 2005). This library uses SSE instructions internally, so vectorization is used throughout the WBP procedure. In the case of SIRT, the strategy just described is applied to the projection and backprojection steps, as shown in Fig. 3(b). For the computation and weighting of the error between the experimental projections and the corresponding calculated projections, vectorization with SSE instructions has been used as well. These calculations are carried out for the four components of a slab at the same time. In this figure, Vcproj{.} and Verror{.} represent vectors containing the calculated projections and the errors corresponding to the four slices in the slab.

4. General optimizations

Single-processor code optimization techniques have proved to improve performance in scientific computing (Wadleigh and Crawford, 2000). These techniques have also been included in our implementations of the reconstruction methods, both in the vectorized and in the sequential, non-vectorized versions.

4.1. Efficient use of the cache memory

During the reconstruction process, both slices and sinograms are divided into blocks to make the most of the processor cache. Fig. 4 shows the mechanism implemented, which is similar to the blocking technique widely used in scientific computing (Wadleigh and Crawford, 2000). The goal of this technique is to minimize the exchange of information with main memory by reusing the data kept in cache memory as much as possible. To this end, the data are split into small blocks that fit into cache memory, and the code is reorganized so as to operate on a block as much as possible before proceeding with a different block. As sketched in Fig. 4, the blocking procedure consists of reconstructing the slice in steps of R rows while reading the sinogram in small blocks of P projections. With this procedure, the block of R rows of the slice is kept in cache memory until completely processed before switching to another block. R and P are configurable, so the mechanism can be adapted to any cache size. In our experience, small powers of two (2, 4, 8) yield good performance for current cache sizes (from 512 KB to 12 MB).

4.2. Projection symmetry

This optimization takes advantage of the symmetry existing in the projection of a slice: if a point (x, y) of the slice is projected to a point r = x cos(θ) + y sin(θ) in the projection corresponding to the tilt angle θ, it is easy to see that the point (−x, −y) of the slice is then projected to rs = −r in that projection. Therefore, for a given tilt angle θ, the projection point r only needs to be computed for half of the points (x, y) in the slice, hence obtaining a gain in speed. To further increase cache efficiency (see previous section), in this work symmetric points are put together in the data structures.

4.3. Other optimizations

A wide spectrum of single-processor optimizations has been applied to further accelerate the code (Wadleigh and Crawford, 2000).
Among them, we highlight (1) increasing instruction-level parallelism, (2) pre-calculation of data used extensively during the reconstruction process (e.g. sines, cosines, rays, limits for projections), (3) inlining of functions, (4) replacement of power-of-two divisions and multiplications by shifts, (5) replacement of some divisions by multiplications, (6) loop unrolling and (7) removal of conditionals.

5. Results

The performance evaluation of our vectorized implementations of WBP and SIRT was done on a state-of-the-art computer based on an Intel Core 2 Q9550 at 2.83 GHz, with 8 GB RAM and 12 MB L2 cache, running Linux with the ICC (Intel C/C++ Compiler) installed. To obtain a fair assessment of the gain provided by our vectorization approach and account for the abilities of the compiler to exploit SSE instructions, automatic vectorization was activated in ICC. This was done for both implementations, the vectorized and the sequential, non-vectorized one. Two groups of datasets were used. The first one was based on a synthetic mitochondrion phantom (Fernández et al., 2002) and was conceived to carry out an extensive performance evaluation of the


Fig. 3. 3D reconstruction based on vectorized WBP and SIRT. (a) Vectorized algorithm for WBP. (b) Vectorized algorithm for SIRT. (c) Construction of vectors. Given a point (x, y) that is projected to the ray r, the left of the image sketches how the four points of four different slices are processed by the sequential version of backprojection. On the right, the four slices and the four projections are arranged so that related points are together. Then, they can be processed in parallel using SSE instructions.
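The slab arrangement of Fig. 3(c) can be sketched with SSE intrinsics as follows (x86 only). The interleaved layout, the function name and the use of unaligned loads are assumptions for illustration, not the paper's code; the point is that one interpolated ray is backprojected into the four slices of the slab with single SSE operations.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Sketch of the vectorized backprojection of a slab of four slices
 * (Fig. 3).  vproj is the Vector{.} layout: vproj[4*r .. 4*r+3] holds
 * bin r of the four interleaved sinograms.  vslab is laid out the
 * same way, four slice values per pixel.  Illustrative only. */
static void backproject_ray_sse(float *vslab, const float *vproj,
                                int pixel, double r)
{
    int   i = (int)r;            /* integer part of the ray coordinate */
    float f = (float)(r - i);    /* fractional part                    */

    __m128 p0 = _mm_loadu_ps(&vproj[4 * i]);        /* bin i, 4 slices   */
    __m128 p1 = _mm_loadu_ps(&vproj[4 * (i + 1)]);  /* bin i+1, 4 slices */
    __m128 w0 = _mm_set1_ps(1.0f - f);
    __m128 w1 = _mm_set1_ps(f);

    /* val = (1-f)*p0 + f*p1, interpolated for the four slices at once */
    __m128 val = _mm_add_ps(_mm_mul_ps(w0, p0), _mm_mul_ps(w1, p1));

    /* one _mm_add_ps accumulates the ray into the four slices */
    __m128 acc = _mm_loadu_ps(&vslab[4 * pixel]);
    _mm_storeu_ps(&vslab[4 * pixel], _mm_add_ps(acc, val));
}
```

Compared with the scalar loop, the four independent slices of the slab replace the four lanes of the register, which is exactly the decomposition that makes the slab width match the SSE register width.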


Fig. 4. Mechanism implemented to reduce cache misses. The slice is reconstructed at steps of R rows while reading the sinogram in small blocks of P projections. The order of processing is given by 1, 2, 3, 4, etc. This way, R1 is first processed with P1, then with P2, then with P3 and lastly with P4. This is repeated for R2, and so forth.
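The processing order of Fig. 4 can be sketched as follows. The sketch only records which strip of R rows is combined with which block of P projections; all sizes are illustrative assumptions, and the real code performs backprojection inside the inner loop.

```c
#include <assert.h>

/* Sketch of the blocking scheme of Fig. 4: the slice is processed in
 * strips of R rows, and for each strip the sinogram is read in blocks
 * of P projections, so a strip stays in cache while every projection
 * block is applied to it (R1 with P1, then P2, ..., then R2, etc.). */
#define M  8   /* rows in the slice            */
#define NP 4   /* projections in the sinogram  */
#define R  2   /* rows per strip               */
#define P  2   /* projections per block        */

static int order[(M / R) * (NP / P)][2];  /* visited (strip, block) pairs */

static int run_blocked(void)
{
    int step = 0;
    for (int r0 = 0; r0 < M; r0 += R)          /* strips of R rows    */
        for (int p0 = 0; p0 < NP; p0 += P) {   /* blocks of P projs   */
            /* rows r0..r0+R-1 stay cached while block p0 is applied */
            order[step][0] = r0;
            order[step][1] = p0;
            step++;
        }
    return step;
}
```

The recorded order shows strip 0 being combined with every projection block before strip 1 is touched, which is the data reuse the text describes.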

implementations. We tested different sizes of the tilt series and of the reconstructions. Tilt series of 60, 90, 120, 150, and 180 angles were generated. Two image sizes were used: (a) 512 × 512 and (b) 1024 × 1024. For those image sizes, reconstructions with thicknesses of (a) 64, 128, 256 and 512 and (b) 128, 256, 512 and 1024, respectively, were computed. The second group consisted of experimental datasets. First, a dataset of Vaccinia virus (Cyrklaff et al., 2005) with 61 images of 1024 × 1024 in the tilt range [−60°, +60°] was used to reconstruct a tomogram of

1024 × 1024 × 480. Second, a dataset of medium spiny neuron dendrites, taken from the Cell Centered Database (Martone et al., 2008), accession code 40, consisting of 71 images of 1024 × 1024 in the tilt range [−70°, +70°], was used to yield a tomogram of 1024 × 1024 × 140. All the experiments were carried out five times, and the average computation times were then calculated. For SIRT, 30 iterations were used. The computation times and speedups for the synthetic datasets processed with WBP and SIRT are shown in the Supplementary material (Tables S1–S4). The speedup factors prove to be higher than 3 in all cases. To obtain a global value of the acceleration factor provided by our vectorized approach, the average speedups for each method–dataset pair were computed. Fig. 5 clearly shows that the average acceleration factor turns out to be around 3.2–3.3 out of a maximum of 4, with little variation as revealed by the standard deviation. Interestingly, the computation times demonstrate the effectiveness of the general optimizations used in the implementations. The sequential version is actually very fast, with performance that proves to be competitive with standard packages (e.g. IMOD (Kremer et al., 1996); see Table S5 in the Supplementary material). For instance, it manages to provide a huge WBP reconstruction of 4 GB in size (1024 × 1024 × 1024, float) from a tilt series of 90 images in less than 3 min (Table S2). The vectorized version is built on this sequential one, which makes it exceptionally fast. In the previous example, the huge reconstructed volume


Fig. 5. Average speedups with vectorized reconstruction. The average speedup is around 3.2–3.3. The bars represent the standard deviation (around 0.1 for WBP and 0.01 for SIRT).

would be available in less than a minute. If SIRT (30 iterations) were the method of choice in this example, the reconstruction would be ready for analysis in less than 55 min (Table S4). It is also apparent that a WBP reconstruction of size 512 × 512 × 256, similar to that used in real-time ET systems (Zheng et al., 2007), takes only 2.3 s (Table S1) without the need for parallel systems. Table 1 shows the computation times and effective speedups for the experimental datasets. The speedups for WBP are in the range 3.1–3.2, and for SIRT they are around 3.3. The speedups are slightly larger for tilt series with a higher number of images, as happened with the synthetic case. These values confirm the behaviour and the acceleration factors seen for the synthetic datasets. Importantly, the reconstructions of the experimental datasets with vectorized WBP are extremely fast (6 and 20 s, respectively). The times with SIRT (30 iterations) show that the reconstructions would be available in 6 and 18 min, respectively, with times per iteration of 12 and 36 s. Taking into account the large size of the reconstruction of the VV1K dataset, these results are especially remarkable. According to the algorithmic complexity of the reconstruction methods, which depends linearly on the size of the volume to be reconstructed and on the number of tilt angles, an experimental reconstruction of 2048 × 2048 × 120 using the same number of tilt angles would be available in a computation time similar to that of the VV1K case.
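The extrapolation in the last sentence follows directly from the linear complexity: with the same number of tilt angles, equal voxel counts imply similar times. The snippet below (the helper function is illustrative) checks that the two volume sizes quoted in the text indeed contain the same number of voxels.

```c
#include <assert.h>

/* Reconstruction time scales linearly with the number of voxels and
 * the number of tilt angles, so equal voxel counts at equal tilt
 * counts predict similar computation times. */
static long long voxels(long long nx, long long ny, long long nz)
{
    return nx * ny * nz;
}
```

Both 2048 × 2048 × 120 and the VV1K tomogram, 1024 × 1024 × 480, amount to 503,316,480 voxels, which is why a similar computation time can be expected.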

6. Discussion and conclusion The vectorized approach presented here relies on the exploitation of the SIMD instructions available in modern processors (e.g. any Intel or AMD processor), a technology not extensively used in ET yet. This work has shown that the ET reconstruction process can be sped up by a factor of around 3.3, out of a maximum of 4. Combined with single processor optimization techniques, this approach succeeds in providing full resolution tomograms with WBP in a matter of seconds and with SIRT in a matter of minutes.

Table 1
Computation times (in seconds) and speedups for the experimental datasets.

             WBP                        SIRT
             G14T5       VV1K           G14T5        VV1K
Seq.         20.69       59.61          1237.18      3601.27
SSE          6.44        19.37          373.63       1100.76
Speedup      3.21        3.08           3.31         3.27

Our approach turns out to be an interesting alternative to the standard strategy for coping with the computational demands of ET reconstruction, which is based on parallel and distributed systems (Fernández, 2008). Its speed even makes it a suitable option for real-time ET systems (Zheng et al., 2007). The main advantage of the vector approach compared to parallel systems is that the technology is embedded in the processor, so no special system has to be configured or managed. Moreover, the vector approach is transparent to the user, who does not have to deal with special commands to launch the program, since it is handled like any sequential program in a Unix environment. Recently, graphics processing units (GPUs) have emerged as new computing platforms that offer massive SIMD parallelism and provide exceptional speedup factors (Castano-Diez et al., 2007). Although GPU programming has become easier in the last few years, the architectural model behind SSE instructions is simpler and very well established, and programming is somewhat easier. Despite the modest acceleration factors of SSE instructions compared to GPUs, their combination with code optimization has proved to yield processing times similar to those reported with GPUs (Castano-Diez et al., 2007). SIMD instructions are expected to be an important part of future computing systems. The SIMD instructions in the new generation of Intel processors and Intel GPUs will process more operations at a time (8–16 simultaneous floating-point operations), which will lead to an increase in the net speedup. Therefore, there are exciting prospects for exploiting vector processing with CPU SIMD instructions in the field of ET in particular, and 3D electron microscopy in general.

Acknowledgments

J.I. Agulleiro is a fellow of the Spanish FPU programme. Work funded by Grants MCI-TIN2008-01117, JA-P06-TIC-01426 and CSIC-PIE-200920I075.

Appendix A.
Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jsb.2010.01.008.

References

Lucic, V., Foerster, F., Baumeister, W., 2005. Structural studies by electron tomography: from cells to molecules. Annu. Rev. Biochem. 74, 833–865.
Fernández, J.J., Sorzano, C.O.S., Marabini, R., Carazo, J.M., 2006. Image processing and 3D reconstruction in electron microscopy. IEEE Signal Process. Mag. 23 (3), 84–94.
Perkins, G., Renken, C., Song, J., Frey, T., Young, S., Lamont, S., Martone, M., Lindsey, S., Ellisman, M., 1997. Electron tomography of large, multicomponent biological structures. J. Struct. Biol. 120, 219–227.
Fernández, J.J., Lawrence, A.F., Roca, J., García, I., Ellisman, M.H., Carazo, J.M., 2002. High performance electron tomography of complex biological specimens. J. Struct. Biol. 138, 6–20.
Fernández, J.J., Carazo, J.M., García, I., 2004. Three-dimensional reconstruction of cellular structures by electron microscope tomography and parallel computing. J. Parallel Distrib. Comput. 64, 285–300.
Fernández, J.J., 2008. High performance computing in structural determination by electron cryomicroscopy. J. Struct. Biol. 164, 1–6.
Hennessy, J., Patterson, D., 2007. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Amsterdam.
Wadleigh, K.R., Crawford, I.L., 2000. Software Optimization for High Performance Computers. Prentice Hall PTR.
Radermacher, M., 2006. Weighted back-projection methods. In: Frank, J. (Ed.), Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell, second ed. Springer, pp. 245–273.
Gilbert, P., 1972. Iterative methods for the 3D reconstruction of an object from projections. J. Theor. Biol. 36, 105–117.
Hassaballah, M., Omran, S., Mahdy, Y.B., 2008. A review of SIMD multimedia extensions and their usage in scientific and engineering applications. Comput. J. 51, 630–649.
Frigo, M., Johnson, S.G., 2005. The design and implementation of FFTW3. Proc. IEEE 93 (2), 216–231.

Cyrklaff, M., Risco, C., Fernández, J.J., Jimenez, M.V., Esteban, M., Baumeister, W., Carrascosa, J.L., 2005. Cryo-electron tomography of vaccinia virus. Proc. Natl. Acad. Sci. USA 102, 2772–2777.
Martone, M.E., Tran, J., Wong, W.W., Sargis, J., Fong, L., Larson, S., Lamont, S.P., Gupta, A., Ellisman, M.H., 2008. The cell centered database project: an update on building community resources for managing and sharing 3D imaging data. J. Struct. Biol. 161, 220–231.
Kremer, J., Mastronarde, D., McIntosh, J.R., 1996. Computer visualization of three-dimensional image data using IMOD. J. Struct. Biol. 116, 71–76.


Zheng, S.Q., Keszthelyi, B., Branlund, E., Lyle, J.M., Braunfeld, M.B., Sedat, J.W., Agard, D.A., 2007. UCSF tomography: an integrated software suite for real-time electron microscopic tomographic data collection, alignment, and reconstruction. J. Struct. Biol. 157, 138–147. Castano-Diez, D., Mueller, H., Frangakis, A.S., 2007. Implementation and performance evaluation of reconstruction algorithms on graphics processors. J. Struct. Biol. 157, 288–295.
