A Framework to Transform In-Core GPU Algorithms to Out-of-Core Algorithms

Takahiro Harada∗
Advanced Micro Devices, Inc.

∗ e-mail: [email protected]

Figure 1: A screenshot of an out-of-core GPU path tracer, developed using the framework, rendering global illumination. The scene geometry (32GB, 428M triangles) is rendered on an AMD FirePro™ W9100 GPU using a 15GB geometry cache at 1280 × 720 resolution. The graphs on the left of the screenshot show the history of the amount of page copies, the frame time, and the physical memory usage over frames.

Keywords: GPU, out of core, global illumination, path tracing
Concepts: • Computing methodologies → Graphics systems and interfaces; Rendering;

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2016 Copyright held by the owner/author(s). I3D '16, February 26-28, 2016, Redmond, WA, USA. ISBN: 978-1-4503-4043-4/16/02. DOI: http://dx.doi.org/10.1145/2856400.2876011

1 Introduction

Porting an existing application to the GPU requires a lot of engineering effort. Moreover, because GPU memory today is smaller than host memory, an application that needs to access a large data set must implement out-of-core logic on top of the GPU implementation, which is additional engineering work [Garanzha et al. 2011]. We present a framework that makes it easy to transform an in-core GPU implementation into an out-of-core GPU implementation. In this work, we assume that out-of-core memory access is read only. The proposed method is implemented using OpenCL, so we use OpenCL terminology in this document.

2 Implementation

Overview

To make a kernel execution out of core, changes are needed in both the host and the device code. They are:

1. Data management (Host)
2. Kernel implementation (Device)
3. Kernel launch (Host)

The kernel implementation requires the most engineering effort because the development environment on the device is not as good as on the host (e.g., for debugging). We developed macros, described below, to transform an in-core kernel into an out-of-core kernel.

Data Management

We essentially need to implement a software virtual memory system that receives page requests from the device and fills each page from the source of the data, which is either host memory or, when the data does not fit in host memory, disk. We implemented it in software because we do not have full control of the GPU virtual memory. On the device side, we need to prepare buffers for the page table, the data storage (physical memory), and the page requests. Once they are prepared, we execute the kernel. The CPU is responsible for serving page requests. It copies pages to the physical memory until it is full. When no space is available in the physical memory, our implementation evicts older pages. Pseudocode for an execution is shown below.

    while True do
        Requests ← Execute kernel, passing physical memory and page table to the kernel.
        Break if there are no requests from work items.
        Serve the requests. Update physical memory and page table.
    end while

Kernel Implementation

We need to make two changes to the kernel so that it can access out-of-core data. The first is the memory access. The kernel needs to read the page table first to find out whether the requested memory resides in the physical memory. If it does not, the kernel must request the data from the system and suspend its execution. The mechanism to suspend and resume the kernel is the second change. We need to prepare storage for the work item context and add logic for the kernel to store the context to, and load it from, that buffer so that a work item can suspend and resume its execution. Since these operations are general, we developed a few macros for them, which make it easy to transform an in-core kernel into an out-of-core kernel. If the kernel has one memory access to the out-of-core buffer, we only need to change 3 lines (1 modified, 2 added) in the body of the kernel, as shown in the appendix. A few kernel examples and the macros are provided in the supplemental material.

Kernel Launch

The only change needed at kernel launch for an out-of-core execution is to pass a few additional buffers (physical memory, page table, and page requests).
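The interplay between the host loop, the page table lookup, and suspended work items can be illustrated with a small CPU-only model. This is a minimal sketch in Python, not the framework's OpenCL code. The names `SoftwareVM`, `kernel`, and `run_out_of_core` are hypothetical. The model also makes three assumptions: each work item performs a single read, "evicts older pages" is read as first-in-first-out eviction, and the host serves the most-requested pages first, as in the configuration reported in the Results section.

```python
from collections import Counter, OrderedDict


class SoftwareVM:
    """Host-side page server: a page table plus a fixed number of page frames."""

    def __init__(self, backing, page_size, num_frames):
        self.backing = backing                # full data set (host memory or disk)
        self.page_size = page_size            # elements per page
        self.physical = [None] * num_frames   # stands in for device physical memory
        self.page_table = OrderedDict()       # virtual page -> frame, oldest first
        self.copies = 0                       # number of page copies performed

    def serve(self, requests, max_served):
        """Copy the most-requested pages in, evicting the oldest pages when full."""
        assert 0 < max_served <= len(self.physical)
        for page, _count in Counter(requests).most_common(max_served):
            if len(self.page_table) < len(self.physical):
                frame = len(self.page_table)                       # free frame
            else:
                _old, frame = self.page_table.popitem(last=False)  # evict oldest
            start = page * self.page_size
            self.physical[frame] = self.backing[start:start + self.page_size]
            self.page_table[page] = frame
            self.copies += 1


def kernel(work_items, vm):
    """One kernel launch: every unfinished work item attempts its single read."""
    requests = []
    for item in work_items:
        if item["done"]:
            continue                          # finished in an earlier launch
        page, offset = divmod(item["address"], vm.page_size)
        frame = vm.page_table.get(page)       # extra read: consult the page table
        if frame is None:
            requests.append(page)             # page fault: request and suspend
            continue
        item["value"] = vm.physical[frame][offset]
        item["done"] = True
    return requests


def run_out_of_core(addresses, vm, max_served):
    """Host loop: launch, stop when no requests remain, otherwise serve them."""
    # Each dict plays the role of a saved work item context.
    work_items = [{"address": a, "done": False, "value": None} for a in addresses]
    launches = 0
    while True:
        requests = kernel(work_items, vm)
        launches += 1
        if not requests:
            break
        vm.serve(requests, max_served)
    return [item["value"] for item in work_items], launches


backing = [100 + i for i in range(16)]        # 4 pages of 4 elements each
vm = SoftwareVM(backing, page_size=4, num_frames=2)
values, launches = run_out_of_core([0, 5, 6, 13, 1, 15], vm, max_served=1)
print(values, launches, vm.copies)
```

In this toy run, the six reads touch three distinct pages while only two frames exist, so the loop needs several launches and one eviction. Every work item still ends up with the same value an in-core read of `backing` would return.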

3 Results

We transformed an in-core GPU path tracer into an out-of-core GPU path tracer that performs out-of-core accesses to the scene geometry using the framework. Fig. 1 is a screenshot of a 32GB scene rendered on a GPU using 15GB of physical memory at 720p resolution.
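To give a sense of scale for this setup, the short calculation below derives page counts from the sizes quoted above, together with the 32KB page size and the 1024 pages served per iteration that are reported later in this section. It is a back-of-the-envelope sketch, not a measurement. It assumes that "GB" and "KB" mean 2^30 and 2^10 bytes and that the whole 15GB of physical memory holds pages. The function name `paging_scale` is hypothetical.

```python
GIB = 1 << 30   # assumption: the paper's "GB" is read as 2**30 bytes
KIB = 1 << 10   # assumption: the paper's "KB" is read as 2**10 bytes


def paging_scale(scene_bytes, cache_bytes, page_bytes, pages_per_iteration):
    """Derive page counts and the per-iteration copy bound from the settings."""
    scene_pages = -(-scene_bytes // page_bytes)   # ceiling division
    cache_pages = cache_bytes // page_bytes       # whole pages that fit
    return {
        "scene_pages": scene_pages,
        "cache_pages": cache_pages,
        "resident_fraction": min(1.0, cache_pages / scene_pages),
        "max_copy_bytes_per_iteration": pages_per_iteration * page_bytes,
    }


scale = paging_scale(32 * GIB, 15 * GIB, 32 * KIB, 1024)
print(scale)
```

Under these assumptions the scene spans 1,048,576 pages, of which 491,520 (about 47%) can be resident at once, and the host copies at most 32 MiB of pages per iteration.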

Table 1: Comparison of ray casting time.

                      In core    Out of core
Primary ray cast      29.4 ms    58.3 ms
Secondary ray cast    15.8 ms    39.8 ms

The proposed method should be evaluated on (1) the simplicity of the kernel transformation and (2) execution performance. Although (1) is subjective, we believe the framework makes the transformation simple enough (please see the appendix and the supplemental material).

Although the code developed with the proposed method can render a scene that does not fit in GPU memory, the out-of-core kernel is expected to be slower than the original when all the data fits in the physical memory, because of the one additional memory read of the page table per out-of-core memory access, the added code complexity, and the loading and storing of work item contexts. We compared the ray casting time of our in-core and out-of-core GPU path tracers on an AMD FirePro W9100 GPU in a workstation with dual Intel® Xeon® CPUs. The renderer performs out-of-core accesses to the geometry data, which includes triangles (vertices, normals, texture coordinates, connectivity) and the spatial acceleration structure (BVH), while other data, such as textures, are stored in core. We set the page size to 32KB, and the VM serves the top 1024 most-requested pages per iteration. We tried several combinations of these parameters manually and selected the best among them; setting them automatically is future work.

The out-of-core implementation is about 2× slower than the in-core implementation in this experiment (58.3/29.4 ≈ 2.0× for primary rays and 39.8/15.8 ≈ 2.5× for secondary rays in Table 1), although the ratio depends on many factors such as computational complexity and memory access pattern. This is likely due mainly to the difference in GPU occupancy: 79% (using 36 VGPRs) for the in-core kernel and 30% (using 70 VGPRs) for the out-of-core kernel.

Since the modification needed for the transformation is simple, we believe that it could be done automatically, for example when compiling the code. Fully automatic transformation of the code is future work. Other future work includes optimizing the compiler to reduce VGPR usage, applying the method to read-and-write data, and extending it to multiple GPUs.

Appendix

Here we show a kernel transformation example. More examples can be found in the supplemental material. There are in-core and out-of-core implementations. The modifications made for the out-of-core implementation are highlighted in bold in the original and are marked with "// ooc" comments here. These kernels are slightly simplified from the ones used for the benchmark in the paper, for illustration purposes. The kernels compute the geometric and the shading normals for a ray that hits a triangle. The geometry data is in the out-of-core storage for implementation 2.

1. In-core implementation

    __kernel void ComputeNormalKernel(__global Ray* gRays, __global Hit* gHits,
                                      __global HitNormal* gHitNormalOut,
                                      __global Triangle* gTriStorage)
    {
        const int gIdx = GET_GLOBAL_IDX;
        Triangle t;

        if (hasHit(gHits[gIdx]))
        {
            // Read the hit triangle directly from the in-core buffer.
            t = gTriStorage[gHits[gIdx].m_idx];
            // Geometric normal from the triangle edges.
            const float4 ng = normalize3(cross3(t.v1 - t.v0, t.v2 - t.v0));
            const float4 hp = gRays[gIdx].getHitPoint();
            const float4 bCrd = calcBaryCrd(hp, t.v0, t.v1, t.v2);
            // Shading normal interpolated with the barycentric coordinates.
            const float4 ns = normalize3(bCrd.x * t.n0 + bCrd.y * t.n1 + bCrd.z * t.n2);
            gHitNormalOut[gIdx].m_ng = ng;
            gHitNormalOut[gIdx].m_ns = ns;
        }
    }

2. Out-of-core implementation

    __kernel void ComputeNormalKernel(__global Ray* gRays, __global Hit* gHits,
                                      __global HitNormal* gHitNormalOut,
                                      VM_KERNEL_ARGS) // ooc
    {
        const int gIdx = GET_GLOBAL_IDX;
        Triangle t;

        VMInitialize; // ooc (added)

        if (hasHit(gHits[gIdx]))
        {
            // Read the hit triangle through the software virtual memory.
            VMLoad(Triangle, gHits[gIdx].m_idx * sizeof(Triangle), &t, 0); // ooc (modified)
            const float4 ng = normalize3(cross3(t.v1 - t.v0, t.v2 - t.v0));
            const float4 hp = gRays[gIdx].getHitPoint();
            const float4 bCrd = calcBaryCrd(hp, t.v0, t.v1, t.v2);
            const float4 ns = normalize3(bCrd.x * t.n0 + bCrd.y * t.n1 + bCrd.z * t.n2);

            gHitNormalOut[gIdx].m_ng = ng;
            gHitNormalOut[gIdx].m_ns = ns;
        }

        VMFinalize; // ooc (added)
    }

References

GARANZHA, K., BELY, A., PREMOZE, S., AND GALAKTIONOV, V. 2011. Out-of-core GPU ray tracing of complex scenes. In ACM SIGGRAPH 2011 Talks, SIGGRAPH '11, 21:1–21:1.
