Isosurface Computation and Rendering with More GPU Hardware Support
CMSC740 Advanced Computer Graphics (Instructor: Dr. Amitabh Varshney), Spring 2006
Qin Wang and Jusub Kim

1. Introduction

Isosurface extraction is an important method in scientific visualization for studying features of data in areas such as physics and engineering simulation, meteorology, and medical imaging. Fast rendering of the extracted isosurfaces is clearly desirable for researchers who want to explore the data interactively. Isosurface visualization generally involves three steps: extracting the active cells, computing the surface geometry within each cell, and rendering that geometry to the screen. Previous studies have observed that most of the time in this process is spent computing the isosurface geometry primitives. Hence there is great interest in where and how to compute these primitives to achieve interactive display. In this project we tested two schemes that process vertex information directly in a GPU vertex program, so that the isosurface geometry primitives are both generated and rendered on the GPU. This removes the interpolation work from the CPU and, in one of the schemes, also reduces the data transferred between main memory and the GPU card. The performance comparison with native OpenGL rendering shows a speedup of about 70 to 100%.

2. Related Previous Work

The Marching Cubes algorithm [1], developed in the late 1980s, is the best-known technique for constructing an isosurface; it scans every data cell and uses a lookup table to generate the surface within each one. To avoid visiting every cell, various methods were developed in the following years, such as octree-based structures [2] to reduce the number of cells examined, the span space structure [3] to extract only the cells cut by the isosurface, and seed-cell propagation for minimal storage overhead [4]. However, all of these algorithms eventually use Marching Cubes to compute the isosurface geometry within each cell on the CPU and then pass the computed primitives from main memory to the GPU card for on-screen rendering. Studies have shown that this CPU computation and data transfer are not trivial and can sometimes account for the major part of the whole isosurface visualization process [5].

As GPU technology has evolved in recent years, especially in programmability, researchers have begun to explore the advantages of doing more computation on the GPU card via vertex and fragment programs. The computational efficiency of vertex and fragment programs has been studied and reported in the literature [6, 7] over the past few years. The paper most relevant to this project partitions the data into a sequence of tetrahedra and performs the interpolation that generates the isosurface directly on the GPU card [8]. It describes in some detail how the computation is done by a vertex program and shows that, with this scheme, isosurface computation and rendering can be made simpler and faster as more advanced GPU technology arrives. However, that study does not compare its performance with traditional OpenGL rendering, nor does it identify the performance bottleneck of moving the computation onto the GPU in its implementation. To explore these aspects, our project measures rendering with plain OpenGL and compares it with their scheme. We also try a new partitioning scheme in which the CPU partitions the data into cubes and the GPU itself splits each cube into tetrahedra, in order to reduce the data traffic between main memory and the GPU.

3. Isosurface Computation Methods

Isosurface visualization usually involves three basic steps: active cell extraction, surface geometry computation, and geometric primitive rendering. The data can be a structured or unstructured mesh with a scalar value at each vertex. An indexing data structure such as the interval tree or one of its variants [9, 10] can be used to obtain the active cells in asymptotically optimal time. The cell geometry can vary, but usually a tetrahedron or a hexahedron (cube) is used. In our study we assume that the active cells have already been obtained and loaded into main memory. We also adopt the tetrahedron as the simplest cell geometry. As a result, the surface cutting an active cell is a quadrilateral (quad) with four vertices; a triangle can be treated as a special case of a quad with two coincident vertices. These quads can be rendered to the screen by the GPU. We tested three schemes that differ in the kind of geometry sent to the GPU. The accompanying diagram illustrates the three schemes.
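The claim that a tetrahedral cell is cut in either a triangle or a quad can be checked with a small sketch. The Python below is illustrative only and is not the project's code; the function names `crossed_edges` and `cut_shape` are hypothetical. It counts the tetrahedron edges whose endpoint scalars straddle the isovalue: three crossed edges give a triangle, four give a quad.

```python
from itertools import combinations

def crossed_edges(scalars, iso):
    """Return the tetrahedron edges (i, j) whose endpoints straddle iso.

    scalars holds the four vertex values of one tetrahedron.
    """
    inside = [s >= iso for s in scalars]
    return [(i, j) for i, j in combinations(range(4), 2)
            if inside[i] != inside[j]]

def cut_shape(scalars, iso):
    """Classify the isosurface patch inside one tetrahedron."""
    # With k vertices on one side, k * (4 - k) edges are crossed:
    # 0 (inactive cell), 3 (triangle) or 4 (quad).
    count = len(crossed_edges(scalars, iso))
    return {0: "none", 3: "triangle", 4: "quad"}[count]

if __name__ == "__main__":
    print(cut_shape([0.0, 0.0, 0.0, 1.0], 0.5))
    print(cut_shape([0.0, 0.0, 1.0, 1.0], 0.5))
```

One vertex separated from the other three yields a triangle, and a two-against-two split yields a quad, which is why a single four-vertex primitive covers every active case.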

3.1. Rendering by OpenGL Pipeline

OpenGL is a widely used API for 3-D computer graphics rendering. It exposes a graphics pipeline on the GPU card that takes geometric primitives and quickly turns 3-D objects into pixels on the screen. The pipeline basically consists of transform/lighting, rasterization, and fragment processing stages, as shown in the accompanying diagram.

The data we tested is a 64 x 64 x 64 3-D grid. We divide each 8-vertex cube into 6 tetrahedra, so the isosurface generated in each tetrahedron is a quad whose vertex coordinates and normals are found with a lookup table and linear interpolation. The geometric primitives provided to OpenGL are therefore 4-vertex quads, rendered in immediate mode with four glVertex() calls each. Every quad has 4 vertices and 4 normal vectors, each with x, y, z components stored as floats, so 96 bytes are passed to the GPU card to render one quad. In this scheme the vertex coordinates and normals are computed on the CPU and transferred from main memory to the GPU over the PCI-Express bus. The processing model is shown in the accompanying diagram.
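The interpolation step can be sketched as follows. This is a minimal Python illustration, not the project's implementation: the function name `interpolate_edge` is hypothetical, and renormalizing the interpolated normal is an assumption added for shading rather than something the text states.

```python
from math import sqrt

def interpolate_edge(p0, p1, n0, n1, s0, s1, iso):
    """Linearly interpolate position and normal where iso crosses an edge.

    p0, p1: endpoint coordinates; n0, n1: endpoint normals;
    s0, s1: endpoint scalars, which must straddle iso (so s0 != s1).
    """
    t = (iso - s0) / (s1 - s0)
    pos = tuple(a + t * (b - a) for a, b in zip(p0, p1))
    nrm = tuple(a + t * (b - a) for a, b in zip(n0, n1))
    # Assumption: renormalize so the result is a unit normal.
    length = sqrt(sum(c * c for c in nrm))
    if length > 0.0:
        nrm = tuple(c / length for c in nrm)
    return pos, nrm

if __name__ == "__main__":
    pos, nrm = interpolate_edge((0, 0, 0), (1, 0, 0),
                                (0, 0, 1), (0, 0, 1),
                                0.0, 4.0, 1.0)
    print(pos, nrm)
```

In the CPU scheme this work runs once per quad vertex in the host program; in the two GPU schemes the same arithmetic moves into the vertex program.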

The pseudocode for rendering with OpenGL calls is shown below:

    glBegin(GL_QUADS);
    For i = 0 to NumOfTetra - 1 do:
        ComputeVertex(vertex[4], i, isoValue);   // interpolate 4 vertices
        ComputeNormal(normal[4], i, isoValue);   // interpolate 4 normal vectors
        glNormal3fv(normal[0]);                  // set 0th normal vector
        glVertex3fv(vertex[0]);                  // set 0th vertex coordinate
        glNormal3fv(normal[1]);
        glVertex3fv(vertex[1]);
        glNormal3fv(normal[2]);
        glVertex3fv(vertex[2]);
        glNormal3fv(normal[3]);
        glVertex3fv(vertex[3]);
    glEnd();

3.2. Rendering by Vertex Program with Tetrahedron

A vertex program is a piece of specialized assembly code that is loaded onto the GPU and executed each time a glVertex call is issued. It replaces the transform/lighting stage of the pipeline and computes the position, normal, color, and other properties of each vertex before they are sent to primitive assembly. A vertex program has read-only access to 16 vertex attribute registers that store the vertex properties and to 96 constant registers that store uniform variables such as the model-view-projection matrix. There are also 12 read-write temporary registers for use during program execution. For isosurface computation, the vertex program is given the 4 vertex coordinates, normal vectors, and scalar values of each tetrahedron, and it performs linear interpolation with the given isovalue to compute the vertex coordinates and normals of the cutting quad. The stream processing model is shown in the accompanying diagram.

The implementation details are presented in the paper [8]; here we list only some pseudocode to illustrate the scheme:

    glBegin(GL_QUADS);
    set_isoValue(IsoValue);
    set_ModelViewProjectionMatrix(MVP);
    For i = 0 to NumOfTetra - 1 do:
        set_TetraVertices(vertex[4]);   // set 4 vertex coordinates and scalars
        set_TetraNormals(normal[4]);    // set 4 normal vectors
        glVertex2s(0, 0);               // 0th vertex in tetrahedron
        glVertex2s(1, 1);               // 1st vertex in tetrahedron
        glVertex2s(2, 2);
        glVertex2s(3, 3);
    glEnd();

We now calculate the size of the data sent to the GPU for each quad. As stated above, each tetrahedron has 4 vertex coordinates, 4 scalar values, and 4 normal vectors, which total 112 bytes (48 + 16 + 48). To invoke the vertex program 4 times, once for each of the quad's vertices and normals, two index numbers of type short are passed in each glVertex2s() call, adding another 16 bytes. In total, 128 bytes are transferred to the GPU for each quad, about 33% more than the 96 bytes of the OpenGL scheme. However, the CPU no longer has to compute the coordinates and normal vector of each vertex, a step that takes longer than the GPU needs to render the geometry. As a result, rendering via the vertex program with tetrahedra runs much faster than native OpenGL calls, as the experimental results in a later section show.
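The per-quad byte counts for the first two schemes can be reproduced with a short calculation. This Python sketch only restates the accounting given in the text; it assumes 4-byte floats and 2-byte shorts, and the function names are hypothetical.

```python
FLOAT_BYTES = 4   # assumed size of a C float
SHORT_BYTES = 2   # assumed size of a C short

def opengl_quad_bytes():
    """Bytes per quad when the CPU sends finished vertices and normals."""
    vertices = 4 * 3 * FLOAT_BYTES   # 4 vertices, xyz each
    normals = 4 * 3 * FLOAT_BYTES    # 4 normals, xyz each
    return vertices + normals

def tetra_scheme_bytes():
    """Bytes per quad when a whole tetrahedron is sent to the vertex program."""
    coords = 4 * 3 * FLOAT_BYTES     # 4 tetrahedron vertex coordinates
    scalars = 4 * FLOAT_BYTES        # 4 scalar values
    normals = 4 * 3 * FLOAT_BYTES    # 4 normal vectors
    indices = 4 * 2 * SHORT_BYTES    # 4 glVertex2s calls, 2 shorts each
    return coords + scalars + normals + indices

if __name__ == "__main__":
    cpu = opengl_quad_bytes()
    gpu1 = tetra_scheme_bytes()
    print(cpu, gpu1, gpu1 / cpu)
```

The two totals come out to 96 and 128 bytes, a ratio of 4/3, so the tetrahedron scheme sends about one third more data per quad than the plain OpenGL path.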

3.3. Rendering by Vertex Program with Cube

Although the vertex program can compute the isosurface faster than the CPU, transferring data to the GPU might become a bottleneck in rendering. To improve the performance of the vertex program approach, we try to reduce the amount of data sent to the GPU for each quad. The scheme we propose partitions the data into cubes and gives the vertex program only 8 scalar values and 8 normal vectors per cube; the vertex program itself then determines the vertices of each of the six tetrahedra in the cube by means of a lookup table. Specifically, the coordinate (x, y, z) of one corner (the upper-left) of the cube, 8 scalar values, and 8 normal vectors are sent for each cube, 120 bytes in all. Two index numbers of type short are also sent so that the vertex program can locate each of the 6 tetrahedra inside the cube. On average, each quad computation then consumes only 38 bytes of data sent from main memory, about 30% of that in the previous scheme. However, rendering becomes slower because the GPU has to execute a larger number of instructions in the vertex program. The pseudocode for this scheme is listed below:

    glBegin(GL_QUADS);
    set_isoValue(IsoValue);
    set_ModelViewProjectionMatrix(MVP);
    For i = 0 to NumOfCube - 1 do:
        set_CubeVertices(vertex[8]);    // set cube coordinate and 8 scalars
        set_CubeNormals(normal[8]);     // set cube normal vectors
        For j = 0 to 5 do:              // go over each of the 6 tetrahedra
            glVertex2s(j, 0);           // jth tetrahedron, 0th vertex
            glVertex2s(j, 1);           // jth tetrahedron, 1st vertex
            glVertex2s(j, 2);
            glVertex2s(j, 3);
    glEnd();
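The text does not list the lookup table that maps a tetrahedron index to cube corners. As an illustration only, the Python sketch below builds one common six-tetrahedron split, in which every tetrahedron shares the cube's main diagonal; this particular split, the corner numbering (bit 0 = x, bit 1 = y, bit 2 = z), and the function names are assumptions, not the project's actual table.

```python
from itertools import permutations

def cube_tetrahedra():
    """Return 6 tetrahedra as tuples of cube-corner indices 0..7.

    Each tetrahedron walks from corner 0 to corner 7, adding one axis
    at a time in a different order (assumed decomposition).
    """
    return [(0, a, a | b, 7) for a, b, _ in permutations((1, 2, 4))]

def corner(index):
    """Unit-cube coordinates of a corner index (bit 0 = x, 1 = y, 2 = z)."""
    return (index & 1, (index >> 1) & 1, (index >> 2) & 1)

def six_times_volume(tet):
    """Return 6 * volume of a tetrahedron (an integer for unit-cube corners)."""
    p0, p1, p2, p3 = (corner(i) for i in tet)
    u = tuple(b - a for a, b in zip(p0, p1))
    v = tuple(b - a for a, b in zip(p0, p2))
    w = tuple(b - a for a, b in zip(p0, p3))
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det)

if __name__ == "__main__":
    for j, tet in enumerate(cube_tetrahedra()):
        print(j, tet, six_times_volume(tet))
```

Each tetrahedron has volume 1/6, so the six together fill the unit cube. A table of this shape, indexed by the (j, k) pair passed through glVertex2s, is the kind of data the vertex program would need in order to select the right corner scalars and normals.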

4. Experimental Results

To evaluate the performance of the GPU-based isosurface rendering techniques, we used the Five Jets data from the University of California at Davis (http://www.cs.ucdavis.edu/~ma/ITR/tvdr.html). The original dataset is a 128 x 128 x 128 grid of 4-byte floating-point values with 2000 time steps. We chose one of the 2000 time steps and downsampled it to 64 x 64 x 64 for the evaluation. The experiments were performed on a 3 GHz Xeon processor with 1 GB of main memory running Redhat 7.0 Linux. The system has an NVIDIA 6800 GPU with a 4 Gbps data transfer rate over the PCI-Express bus. For preprocessing, we decomposed each cube into 6 tetrahedra and obtained about 1.5 million tetrahedra over the dataset. We also precomputed the normal vector at each voxel, which makes the isosurface computation faster and Gouraud shading possible at the cost of more memory.

In Figure 1 we compare one CPU-based isosurface renderer and two GPU-based isosurface renderers in terms of frames per second over a set of isovalues. The frame rate varies with the isovalue; it is lowest around 253450 and saturates once the isovalue goes beyond 258700. Over the whole set of isovalues, both GPU-based renderers always perform better than the CPU-based one. More interestingly, where the isovalue produces a large number of quads, the second GPU-based renderer (GPU-2), which transmits less data to the GPU, performs worse than the first (GPU-1), which transmits more than three times as much data (128 versus 38 bytes per tetrahedron). This tells us that when there is a large number of glVertex() calls, the longer vertex program of GPU-2 is the performance bottleneck rather than the data transmission time from CPU to GPU. As the number of quads decreases, however, the number of glVertex() calls decreases, and the data transmission time from CPU to GPU begins to matter more. As Figure 1 shows, beyond isovalue 256600 the GPU-2 renderer achieves higher performance than GPU-1, which uses the shorter vertex program. As the isovalue increases further, the performance of both GPU renderers becomes limited by the CPU code that is essential for drawing.
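The tetrahedron count quoted for the preprocessing step follows directly from the grid size. As a quick check in Python (the function name is hypothetical; the six-tetrahedra-per-cube split is the one described in the text):

```python
def tetrahedra_in_grid(nx, ny, nz, tets_per_cube=6):
    """Count tetrahedra when every cube of an nx x ny x nz grid is split."""
    # A grid with n samples along an axis has n - 1 cubes along that axis.
    cubes = (nx - 1) * (ny - 1) * (nz - 1)
    return cubes * tets_per_cube

if __name__ == "__main__":
    print(tetrahedra_in_grid(64, 64, 64))
```

For the downsampled 64 x 64 x 64 grid this gives 63 x 63 x 63 x 6 = 1,500,282 tetrahedra, which matches the "about 1.5 million" figure.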

[Figure 1. Isosurface Rendering Time Comparison (Five-Jets data). Series: CPU, GPU-1, GPU-2, # of Quads. Left axis: Frames/sec. Right axis: # of Quads (x1000). Horizontal axis: Isovalue, from 251000 to 258700.]

                                   CPU (Scheme 1)   GPU-1 (Scheme 2)   GPU-2 (Scheme 3)
# of quads / sec                   296 K/s          576 K/s            500 K/s
Improvement over CPU               -                95%                69%
# of instructions                  -                91 lines           137 lines (50% more than GPU-1)
Transferred data per tetrahedron   -                128 bytes          38 bytes (70% less than GPU-1)

Table 1. Performance Comparison Summary

Table 1 summarizes the performance comparison in terms of how many quads each scheme can generate per second. As the table shows, both the GPU-1 and GPU-2 schemes achieve much higher rates than the CPU-based isosurface renderer. However, even though the GPU-2 scheme transmits 70% less data to the GPU per tetrahedron, its average throughput is about 13% lower than that of the GPU-1 scheme (500 K/s versus 576 K/s) because its vertex program has 50% more instructions.
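The relative figures can be recomputed from the raw values in Table 1. The Python sketch below uses only numbers that appear in the table; the helper name `percent_change` is hypothetical.

```python
# Values copied from Table 1.
QUADS_PER_SEC = {"CPU": 296, "GPU-1": 576, "GPU-2": 500}  # thousands per second
INSTRUCTIONS = {"GPU-1": 91, "GPU-2": 137}                # vertex program lines
BYTES_PER_TET = {"GPU-1": 128, "GPU-2": 38}               # data sent per tetrahedron

def percent_change(new, base):
    """Signed percentage change of new relative to base."""
    return (new / base - 1.0) * 100.0

gain_gpu1 = percent_change(QUADS_PER_SEC["GPU-1"], QUADS_PER_SEC["CPU"])
gain_gpu2 = percent_change(QUADS_PER_SEC["GPU-2"], QUADS_PER_SEC["CPU"])
gpu2_vs_gpu1 = percent_change(QUADS_PER_SEC["GPU-2"], QUADS_PER_SEC["GPU-1"])
extra_instructions = percent_change(INSTRUCTIONS["GPU-2"], INSTRUCTIONS["GPU-1"])
data_change = percent_change(BYTES_PER_TET["GPU-2"], BYTES_PER_TET["GPU-1"])

if __name__ == "__main__":
    print(f"GPU-1 over CPU: {gain_gpu1:+.1f}%")
    print(f"GPU-2 over CPU: {gain_gpu2:+.1f}%")
    print(f"GPU-2 vs GPU-1 throughput: {gpu2_vs_gpu1:+.1f}%")
    print(f"GPU-2 vs GPU-1 instructions: {extra_instructions:+.1f}%")
    print(f"GPU-2 vs GPU-1 data: {data_change:+.1f}%")
```

Measured this way, GPU-2 gives up roughly 13% of GPU-1's throughput in exchange for sending roughly 70% less data, which supports the reading that vertex program length, not bus traffic, dominates when many quads are drawn.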

Figure 2. CPU-GPU Isosurface Renderer

Figure 2 shows our system for CPU/GPU-based isosurface rendering. It lets users change the isovalue; rotate, translate, and zoom the isosurfaces; and toggle between the GPU and CPU renderers. The system can run on Chromium to drive a tiled display, although the GPU-2 renderer may not run on older Chromium versions because of their strict limit on vertex program length.

5. Conclusions

We have implemented three isosurface computation and rendering systems: one CPU-based and two GPU-based. Because the GPU performs floating-point operations faster, we achieved a considerable speedup in isosurface computation by moving the cost of interpolation to the GPU. The first GPU-based scheme sends 4 vertices and 4 normals per tetrahedron to the GPU and lets the vertex program compute the isosurface. The second GPU-based scheme sends only about 30% of that data and lets the vertex program first determine the 4 vertices and 4 normals of each tetrahedron and then compute the isosurface. We found that, when a non-trivial number of isosurface patches is rendered, the longer vertex program becomes the bottleneck rather than the amount of data transferred from CPU to GPU. This observation suggests a direction for future work: reducing the number of glVertex() calls per tetrahedron by carefully ordering the sequence of tetrahedra and using quad strips looks like a promising approach.

References

1. W. E. Lorensen and H. E. Cline. “Marching Cubes: A high resolution 3D surface construction algorithm.” In Maureen C. Stone, editor, Computer Graphics (SIGGRAPH '87 Proceedings), vol. 21, pp. 161-169, July 1987.
2. J. Wilhelms and A. Van Gelder. “Octrees for faster isosurface generation.” In Computer Graphics (San Diego Workshop on Volume Visualization), vol. 24, pp. 57-62, 1990.
3. H. W. Shen, C. D. Hansen, Y. Livnat and C. R. Johnson. “Isosurfacing in span space with utmost efficiency (ISSUE).” In IEEE Visualization '96, pp. 281-294, Oct 1996.
4. T. Itoh and K. Koyamada. “Automatic isosurface propagation using an extreme graph and sorted boundary cell lists.” In IEEE Transactions on Visualization and Computer Graphics, 1(4): pp. 319-327, Dec 1995.
5. Q. Wang, J. JaJa and A. Varshney. “An Efficient and Scalable Parallel Algorithm for Out-of-Core Isosurface Extraction and Rendering.” In IEEE International Parallel & Distributed Processing Symposium, 2006.
6. M. Weiler, M. Kraus and T. Ertl. “Hardware based view-independent cell projection.” In Proceedings of the 2002 IEEE Symposium on Volume Visualization and Graphics (VOLVIS-02) (Piscataway, NJ, Oct. 28-29, 2002), S. N. Spencer, editor, IEEE, pp. 13-22.
7. S. Röttger, M. Kraus and T. Ertl. “Hardware accelerated volume and isosurface rendering based on cell-projection.” In Proceedings Visualization 2000, T. Ertl, B. Hamann and A. Varshney, editors, IEEE Computer Society Technical Committee on Computer Graphics, pp. 109-116, 2000.
8. V. Pascucci. “Isosurface Computation Made Simple: Hardware Acceleration, Adaptive Refinement and Tetrahedral Stripping.” In Eurographics - IEEE TCVG Symposium on Visualization, 2004.
9. Y-J. Chiang and C. T. Silva. “I/O optimal isosurface extraction.” In Proceedings IEEE Visualization, pp. 293-300, 1997.
10. Y-J. Chiang, C. T. Silva and W. J. Schroeder. “Interactive out-of-core isosurface extraction.” In Proceedings IEEE Visualization, pp. 167-174, 1998.

resources than standard space-parallel methods with se- quential time stepping. ...... Friedhoff, S., MacLachlan, S.: A generalized predictive analysis tool for ...