SG14 Summary

Audience

The first question is: what, or who, is it for? The study group is for engineers working in fields with low latency and real time requirements, where performance and efficiency matter: particularly games, financial applications and simulations. I work for a games company, and the clear majority of contributors to SG14 are also games developers, so this presentation will focus on that domain. However, much of what happens in the games domain transfers to the other domains I just mentioned.
Formation

The group was created in response to a simple question at CppCon in 2014: where are the games programmers on the podium?
Meetings

SG14 meets monthly for telecons: Michael Wong chairs a meeting over a conference call and we consider the issues at hand. There is a broad set of papers in flight at the moment, and I’m going to talk about some of them through this presentation. Attendance is good: for a one hour meeting we expect to have about twenty people dialling in; I’m informed that the other study groups rarely reach double figures.
Differentiation

We do things a little differently: take a look at this video. This is from my studio’s latest title, Total War: Warhammer, captured in real time at 60Hz. That means that all the activity and rendering needs to be calculated in 16ms. Let’s do the maths: on a 3GHz processor we get about 50 million cycles per frame. We have, say, 20 thousand soldiers on there, so that’s 2,500 cycles per soldier. With those cycles you have to choose what the soldier should do, and choose which animation to play as a result. Then you have to worry about cache misses, which cripple your performance while you wait for cache lines to fill. There are some tricks we can pull: we only REALLY need to do the soldier decision making at 10Hz, not 60Hz, which raises the budget to 15,000 cycles per soldier. Also, we demand a minimum of two cores from the user’s machine, so we can put some of the work on the extra core or cores.
However, we also need to render the game. We render the game on another processor in the computer, the video card, the GPU, but we need to tell the video card what to do, and all of THAT happens on the CPU. Telling a GPU what to do is not a trivial task: you have to work out what you want to draw from what the camera can see, and what resources you need to push over to the graphics card. Then we’re into heterogeneous computing, as we have two different styles of programming going on for two different types of microprocessor. Should that be nanoprocessor now? I haven’t even mentioned sound yet.

The story doesn’t end there though: Total War, unlike a console game, is designed to run on a variety of hardware specifications. There may be plenty of RAM, or just 3GB. There may be a £400 card in the machine, or an integrated laptop graphics part. All of these variations have to be accommodated, and the game still has to run at an acceptable frame rate. Optimisation is front and centre: keep the caches full and the cores busy. While this isn’t much different from advice given to programmers in all fields, we don’t have the luxury of throwing more cores at the problem: the software needs to run on a broad set of hardware as specified by the buying public, or a fixed piece of hardware as specified by the console manufacturers.

One simplification that has appeared with the current generation of consoles, the PlayStation 4 and the Xbox One, is that they are powered by x86-64 architecture processors (actually custom AMD parts), which means that all the desktop and console games development targets the same architecture. Real time programming is quite different from a typical windowed application, which waits on a message queue to do anything, for which processing takes as long as it takes, and if it takes too long for you, upgrade your hardware. SG14 also caters for embedded programmers, which also means finite RAM and limited tools, much like console game programming.
Don’t Pay For What You Don’t Use

Exceptions

The first item on the list is exceptions. The preeminent feature of C++ is deterministic destruction. When an object falls out of scope, or is explicitly deleted, you know that that is the end of the object. It has a well defined lifecycle. Exception handling makes this a rather tricky proposition. “goto” is bad enough within a single function, but throwing an exception through the call stack leads to all kinds of shenanigans, and the bookkeeping required is considerable. Whenever an exception leaves a function, it has to clean up after itself, destroying all the objects on the stack. If the exception zips up the call stack, through different compilation units and static or dynamically linked libraries, it has to clean up as it goes. That means the code has to be instrumented with all manner of cleanup code, because there are two ways of exiting a function: via a return statement or via a propagating exception.
The return statement is well defined. “Here we are,” we’re telling the compiler. “Here’s where we do all the cleanup.” Even if there are multiple return statements, the compiler knows at compile time which objects still exist and which have already been destroyed. Exception handling means cleaning up at potentially any function call. If the called function is declared in the same compilation unit, and it doesn’t leave the compilation unit via another function call, and it doesn’t throw an exception, then the compiler can relax a little, knowing that no exception is going to be thrown through the call stack. In the general case though, it needs to know which destructors to call at any point where an exception may pass through.

There are two ways a compiler can do this: either put all the exception handling code directly in the function, keeping it local and requiring a minimum of context but bloating the function and reducing instruction cache performance, or keep a table of possible exception sites and appropriate context elsewhere in the binary, bloating the entire binary considerably. The latter is more common when compiling 64-bit processes. Patrice Roy prepared a presentation for the SG14 meeting at CppCon 2015; you can find the paper at http://hdeb.clg.qc.ca/Sujets/Developpement/ExceptionsCosts.html, but I will summarise the findings now. It’s an unpleasant choice: I’ve seen fights break out for possession of a 1Mb buffer that suddenly became available during development after a piece of optimisation, so adding the burden of exception handling is a costly choice.

I asked Eric Niebler at his closing keynote at CppCon about STL2 and the place of exceptions. Here’s what he had to say. It seems a valid point: if you’re not handling exceptions, you’re dealing with errors in some other way. But are you? The deciding factor is: what are you going to do if an exception is thrown? So what?
Nobody died because of a bug in a game, so let’s simply kill the process and apologise to the user, and maybe address it in a patch if it happens to many users. Game companies have extensive QA processes, and any bug which stops a game from running is caught and squashed pretty early (in the general case). Games have a very, very limited set of nondeterministic inputs: a bit of keyboard, mouse or controller input and some save games. Any errors should be explicitly handled and recovered from where they happen. As a result, game companies write exception unsafe code. This has a profound impact on our standard library use, which I will come back to in a moment.

I have to add a postscript to this section now. At CA, we develop two products for Intel architecture machines: Total War for Windows, which has been ported to Linux and OS X, and Halo Wars for Xbox One and Windows 10. I learned very recently that the Halo Wars team have switched exception handling on for their product. It is a 64-bit only product and, as I remarked earlier, that exception implementation has less impact on the instruction cache. My opposite number on that project is Chris Gascoyne, with whom I meet monthly. I look forward to his post-mortem.
RTTI

Before I do though, let’s just touch on RTTI. It is through RTTI that the typeid operator and the dynamic_cast<> operation are made available. Interestingly, Stroustrup didn’t want RTTI in the language; it was a later addition that made it into C++98. His experience was that he usually saw the Simula mechanisms for introspection being misused. Nonetheless, he proposed dynamic_cast with Dmitry Lenkov in 1992, so it is a consideration when writing a game. Except that it isn’t: RTTI attaches type information to every polymorphic class, reachable through its vtable, whether or not you ever query it. Again, with highly deterministic code, RTTI should not be needed, and the cost of coding around it is greatly exceeded by the cost of making use of it. What’s wrong with virtual functions? We’ll come back to that.
Standard library reimplementations

But first, let’s consider the standard library. This is a mighty piece of work, one of the great achievements of the programming discipline. However, it makes full use of the language, and is exception safe. This presents a problem for games developers: large tracts of the library are unavailable to us because they attempt to catch exceptions. I should point out here that throwing isn’t a problem: if something goes wrong in your game then it is entirely acceptable to throw an exception. Even if you run your compiler with a no-exceptions flag, that simply means “don’t handle stack unwinding”, which is the expensive part. The run time still has an exception handler that is called when an exception is thrown.

In the general case, where exceptions ARE handled correctly, every time you enter a try block the exception handler is modified to reflect the process’s position in the code. If you never enter a try block, the handler is never changed, and the default action is executed: you might get a dialog from the operating system saying something like “the program terminated unexpectedly”, or if you’re running through a debugger you should get an opportunity to examine the site of the throw and see what went wrong. Even if you aren’t, you may still get the opportunity to attach a debugger and have a look.

Indeed, it is usual for games to install their own exception handler in the entry point function (WinMain, main or whatever) and compile that source file with separate flags to enable exception handling. This exception handler can “phone home” (with the user’s permission) and report back on the exceptional circumstances that the process encountered, as well as offer advice to the user about what to do next, tell them how they might avoid the problem, and apologise grovellingly. So back to the standard library.
There is very little specification of implementation in the standard, and it is quite normal for implementers to catch exceptions: theirs is a broad church of users. Sadly, if exceptions are disabled, your compiler will complain at the sight of a try/catch block, knowing that it can’t handle the unwinding in the event that an exception is thrown.
You might be fortunate: your library implementation might have a preprocessor symbol that you can define, such as _HAS_EXCEPTIONS=0, which will turn off exception handling in the standard library. However, this isn’t something you can rely on: the standard doesn’t mandate the provision of such a flag. We could all club together, submit a proposal, attend the meetings and force a vote through by sheer weight of numbers, but I imagine it wouldn’t be very good for the language. It would force the implementers down some very dark alleyways and soak up a lot of their time, hampering further development of the language.

There is in fact an implementation of the STL by Electronic Arts, EASTL, which was open-sourced in February of this year. It addresses many of the issues that plague game developers. For example, compilers are not necessarily as good at inlining as one might hope. Conversely, inlining can lead to code bloat, which is very unhelpful on a fixed memory system. This makes things like allocators rather unpopular, since they must be implemented as a class template and can generate some rather deep inlining when combined with a std::vector. EASTL also avoids exception handling entirely and generally performs better than the vendor implementations. However, it is not quite standard compliant. It remains to be seen whether it becomes part of the regular developer’s toolkit: having been open-sourced so recently, there isn’t much feedback in the wild about its performance. Historically though, rolling your own standard library containers has been a rite of passage for the journeyman game developer. Writing for performance rather than maintenance is the secret: not too many function calls, not too much inlining.
Memory constraints and inconsistent allocation patterns (std::function &c.)

Allocators are actually a very serious matter. Much discussion has been devoted to the place of the heap in game development. Run time allocation is avoided at all costs, since there is no way of knowing how long it will take: although a game has very limited nondeterministic characteristics, nondeterminism is not ENTIRELY absent. Additionally, the operating environment of the game on a desktop machine can’t be relied upon. There may be varying amounts of RAM available at different times, dependent on other processes running alongside the game. Although the modern user knows not to try streaming video from YouTube while playing a game, there are all sorts of things that can soak up RAM without the player knowing, and this can impact the allocation of memory.

It is usual to assign budgets to the various systems of the game. In Total War we got away with forwarding operator new to dlmalloc and wrapping the calls in our own memory management functions, which tracked the budget and warned us if we were overallocating. Other game designs will allocate slabs of memory for particular sized allocations and write a complete operator new implementation to accommodate that. The real problem is fragmentation of the heap. Your process starts with unspoilt sunlit uplands of untouched address space, and as time goes on holes start to appear where
allocations live. Quite often you can find yourself using vectors as part of a longer term initialisation plan, which leads to a process space that looks like a piece of lace. It’s better to collect together all the allocations which have similar lifespans. In real-time software you have a better idea about how long something is going to last: for example, the renderer which puts everything on screen at 60Hz has a particular job to do, and any allocations made while it’s doing it should be released by the end of the job. This runs at a different rate from the AI which ticks the soldiers, so it makes sense to operate two heaps, one for the renderer and one for the AI. Additionally, each of these systems runs on a loop, so it makes sense to have one heap for allocations which expire within a single iteration of the loop and one for allocations which outlast the loop. We can achieve this by overloading operator new and adding an additional parameter which identifies the heap to be used.

This kind of problem diminishes when writing 64-bit programs. You have 16 exabytes of address space to play with, although the top twenty bits are masked off on the x86-64 architecture, leaving you with a mere 16 terabytes. It’s even worse on Windows, since half the address space is reserved for the kernel, leaving you with a frankly paltry 8 terabytes. You can sensibly restrict yourself to 4GB of RAM for a game on most desktops and consoles, so you can use the most significant bits of an address to partition the remaining process space into sections for different sized allocations. This gives us linear time for allocations if we handle the tracking appropriately, with for example two containers of free and occupied addresses. Memory is one of the biggest problems of game development; indeed, any embedded system will spend an enormous amount of time minimising the amount of address space used. This is why special attention needs to be paid to standard library objects.
For example, what is the memory burden of this object:

std::function<void()> func;

Sadly, we simply can’t tell. The actual function object may allocate storage for its state on the heap rather than keeping it inline, and the behaviour will differ from implementation to implementation. In addition to that, vendor implementations use RTTI for some of the members. This actually isn’t such a problem for my team, as we now only develop for Windows using MSVC, but if you were at my talk last year you will recall that we ported one of our titles to Linux and OS X. Clang will parse the complete class template when it encounters it, rather than parsing the necessary member functions as they are called. This means that std::function is unavailable to non-RTTI projects built using Clang. There are several efforts underway from SG14 contributors to come up with a more suitable implementation. However, mandating implementation details in the standard can reduce the flexibility of the growth of the standard, so they are unlikely to make popular proposals.
Inlining and virtual functions

Finally in this section, let’s look at function calls. As already alluded to, we can’t rely on deep inlining from a compiler, which is why we often choose to reimplement chunks of the standard library as a performance optimisation. Sadly, this can cause code bloat: obviously, function calls are smaller than inlined functions in the general case, and it’s worth taking a look at the disassembly for a function call to see why. As engineers we are used to trade-offs: sometimes you want to inline the code for better performance, other times you don’t, for smaller size.

The other problematic function call is virtual dispatch. Unless the static type is known at compile time, making a virtual function call requires finding the virtual function table of the object in question, retrieving the function pointer and making the call. That results in a flushing of the data cache, then the instruction cache, then the data cache again. This isn’t something you want to be doing in time sensitive code, so we try to avoid the use of virtual functions during the main loop. However, sometimes it is unavoidable: a common pattern is a container of pointers to objects of the same base type, on which we invoke a particular function. It would be nice to be able to sort on virtual function pointers to minimise the instruction cache mangling. Adding a typeid function to the base class takes you part of the way and allows you to sort by type, but that doesn’t take into account which types have overridden which particular function. If only half the derived classes override a particular function, you want to be sure that the same overrides are gathered together.
Library extensions

Ring

Enough about messing with the language. Besides setting out the shortcomings of the language for our particular domain, SG14 contributors have also proposed new features, some of which are in flight as we speak. The first of these is the ring. I proposed a ring last year, and it has been through a couple of meetings, acquiring suggestions for modification and a co-author along the way.

Cast your mind back, back, back to the 1980s. It was a time of airbrushes, Athena posters, the ZX Spectrum and Programming the Z80 by Rodney Zaks. I first started writing C in the mid 80s when I got a hard drive for my Atari ST and discovered it came bundled with a C compiler called Lattice C and some sparse documentation in the form of readme files. I’ve learnt BASIC, Z80 and 68000, I thought to myself: I’ll try another language. Little did I know... Frankly, writing in C felt like writing in assembly but with better names for things. I suppose it was like a macro assembly language. I felt very proud of myself when I got something
compiling, linking and executing. Buffers bowed to my command, and very soon I was filling and emptying them by the magic of DMA transfer and embedding assembly in my C code. I had no shame! One thing that I was particularly proud of was processing a buffer in chunks while it was still being filled by DMA transfer. When the DMA transfer completed I could start filling the buffer again, because I had already processed part of it. I called this a ring. I came across the concept time and again over the next 30 years as I carried on coding on systems with tight hardware requirements. Everyone seemed to have a different name for it too: generally it was formed from two words, where the first was ring, cyclic, fixed or rolling, and the second was buffer, queue or FIFO.

As an aside, I never cease to be amazed how vocal people can be about identifiers, as well as how difficult it is to name things. Perhaps you have heard of Parkinson’s law of triviality, which is C. Northcote Parkinson’s 1957 argument that organisations give disproportionate weight to trivial issues. This is also known as bikeshedding, the bikeshed effect, or the bicycle-shed example. When I first heard the word “bikeshed” being used as a verb I was a little worried about what lay ahead.

This is a very useful container for SG14. As I said, it crops up everywhere, mainly either for asynchronous processing of messages or for keeping a buffer of the last n events. It is contiguous, which the current std::queue is not, since that is implemented as an adapter of list or deque. Has anyone here submitted a paper to the standards body? Here’s my story. The ring is described in P0059R2, and I first started work on it shortly after ACCU last year. I thought I was finished after about three months of tinkering and review from some friends and colleagues. After all, it’s not a tricky concept, how hard can it be? But that was just the start.
I presented it at the first SG14 face to face meeting at CppCon 2015 and received some feedback before updating it for the Kona, Hawaii meeting, where it was assessed by people from outside SG14. The reception seemed lukewarm. I wasn’t there, and the note taking was very matter of fact, which is a good thing. My implementation of pop was unpopular, and it was considered not really worth putting in the standard as it stood. Maybe if it was a circular range (oh, another name…)? Maybe owning the memory isn’t appropriate? Should it support iteration? I originally provided a header synopsis, and I was asked to send a specification instead. I found a collaborator in Arthur O’Dwyer, who had started working on an alternative implementation after seeing my presentation at CppCon. He had a non-owning version based on a span constructor, like an array_view. It solved a lot of problems, so I asked him to co-author and assist with future development. I submitted a new paper in February for the Jacksonville, Florida meeting. The review was very positive and I hope to have something ready for Oulu, Finland (these guys really travel around).
I really would encourage you to participate in the standards process. These people really care about what they’re doing, and spend a remarkable amount of their own time trying to make the language better. It has been quite humbling seeing the amount of effort going into discussing my paper.
Cache-friendly map and set

Map probably comes a close third behind vector and array in the most-used-container competition. Sadly, typical vendor implementations use a red-black tree to store the keys for “fast” access, which is about as cache-unfriendly as you can get. The standard alternative is unordered_map and unordered_set, which use hashing to store and locate elements. However, then you have to worry about good hashing algorithms, which are notoriously hard to develop: avoiding collisions is deep magic. Again, a cache-friendly map and set are staples of the game development studio. This was among the first of the topics discussed on the original low latency and game dev group that was formed after CppCon 2014.

Sean Middleditch, of Wargaming.net, is leading the charge on this effort. He has started working on the wording for four new containers: flat_map, flat_set, flat_multimap and flat_multiset. These differ from the two prior flavours in that the intention is to require the implementation to be cache-friendly. This raises some hefty design issues. Firstly, we have to worry about the interface. The unordered flavour has a very similar API to the ordered flavour. This means that the decision about which to choose can be deferred until your usage is complete and you can easily run benchmarks to decide which is the most appropriate. (I’ll have more to say about benchmarking in general later.) However, for the flat containers a design question is what element access actually returns. For map and unordered_map, a std::pair of key and value is returned. Returning a pair of references would allow the keys and values to be stored separately in memory and keep the keys more tightly packed. Alternatively, the API could be changed so that access returns keys or values. This allows many of the performance advantages of reference proxy access without incurring the problems that C++ currently suffers with reference proxies.
Then we have to decide on sorted or level-ordered storage of elements. Early indications are that level-ordered storage results in improved performance for search operations, benefiting both find and insert. Iteration must be ordered in either case, to maintain one of the useful properties of flat containers, and the interface for other operations has no dependency on the core algorithm. The only consequence of the choice of algorithm is whether the iterators are contiguous, as used by some of the hashing proposals; the use cases for contiguous iterators of a flat container are pretty thin on the ground though.
The proposed associative containers mandate contiguous allocation with growing capacity similar to vector. Iteration of elements must be ordered. The underlying algorithm is implementation defined. The interface will be compatible with existing ordered associative containers where it makes sense to do so, but will break compatibility as necessary.
Uninitialised memory algorithms

And back to memory. How do we deal with a range of uninitialised memory? Also, forgive my spelling: I will retain my native spelling of uninitialised in prose, and use the ISO C++ standard spelling for the symbols. Let’s start with a quick reminder of the existing uninitialised storage entities. Functions first:

● uninitialized_copy copies a range of objects to an uninitialised area of memory
● uninitialized_copy_n copies a number of objects to an uninitialised area of memory
● uninitialized_fill copies an object to a range of uninitialised memory
● uninitialized_fill_n copies an object to a number of elements of uninitialised memory
● get_temporary_buffer allocates uninitialised contiguous storage
● return_temporary_buffer frees uninitialised contiguous storage

There is also a class:

● raw_storage_iterator: this handy beast allows algorithms to store results in uninitialised memory

Does anyone here use these entities? I must confess they were new to me before they came up in discussion on the SG14 reflector. As it turns out I had spent a chunk of time implementing them myself with equivalent results. Either great minds think alike or fools seldom differ. The discussion on SG14 was excellent, ranging over a five week period last summer. It was a joy to read. Brent Friedman has taken this particular bull by the horns and submitted a paper that offers some additional symbols. First of all, the *_move operations:

● uninitialized_move moves a range of objects to an uninitialised area of memory
● uninitialized_move_n moves a number of objects to an uninitialised area of memory

More interestingly for container writers, he also proposes construct and destroy functions:

● uninitialized_default_construct performs default construction of objects over a range of memory
● uninitialized_value_construct performs value construction of objects over a range of memory
● destroy calls the destructor for specified elements
We now have a full set of algorithms that operate on uninitialised memory, which enables developers to write specialised array-based containers. In particular, value-initialisation can become a memset to 0, default initialisation and destruction can become a no-op, and move construction can become a memcpy. For example, an array of unique_ptr can be cheaply moved if you memcpy it to the destination and then memset the source to zero. You can add custom type traits for “is zero initialised”, “destructive move is memcpy and memset” and “destruction of zero initialised instances is a no-op”, and use SFINAE tricks and these new uninitialised storage entities to improve your custom container library.

This leads us to the idea of relocatable types. Anything that is nothrow movable and nothrow destructible can be copied if we promise to destroy the source straight away. This is relocation: it is the classic Star Trek transporter conundrum, where you destroy a selection of top ranking officers and a couple of chaps in red shirts on the transporter pad, only to reassemble them on a planet. This makes resizing vectors a trivial matter for relocatable types: one call to realloc() performs, all at once:

● Allocate new storage
● Copy/move items to new storage
● Destroy old items

I’ll happily take that saving.
Fixed point numbers

Unless you have a background in mathematics or experience with pre-486DX CPUs, you possibly take non-integer arithmetic for granted. This is not necessarily a bad thing: the point of standards is to be invisible, to simply state the way of things. However, a standard for non-integer arithmetic has been created by the IEEE: it is named IEEE 754 and it celebrated its 30th birthday last year. The C++ standard makes use of two of the basic formats of IEEE 754 and adds a third which is NOT defined by that standard: float and double correspond to binary32 and binary64, while long double is typically an 80 bit type. It’s a great piece of work, it’s a mature standard and it has been optimised to hell and back in silicon for many years now. Nowadays, floating point arithmetic is as fast as integer arithmetic; some operations are in fact faster and some are slower. Division is particularly complicated.

However, there are a couple of problems. Some systems lack native floating-point registers and must emulate them in software. And there is a problem with floating point arithmetic and simulations. Consider how a 32-bit floating point number is represented: a sign bit, an eight bit exponent and a 23 bit fraction. As operations are carried out, the magnitude of the number may grow or shrink, as evidenced by the exponent. Since the exponent is of a fixed width, we effectively have a dynamic radix point, the place that separates the integer part from the fractional part. This is what is meant by floating point. The ability to represent a vast chunk of the rational number line comes at the cost of precision. In fact, a floating point binary32 number is only accurate to about five decimal orders of magnitude.
This means that if you have a 10km distance, you can only resolve it to about 1m, i.e. one part in 10,000. This is problematic if you are simulating a large battlefield and you have blades passing within centimetres of each other, or fists landing on faces. Everything has a position coordinate, every joint in a skeleton, and solving this without moving to double precision, and doubling the size of your position data, is a tricky problem. PathEngine discusses the issue here: http://www.pathengine.com/Contents/Overview/FundamentalConcepts/WhyIntegerCoordinates/page.php

One solution is to decide that rather than have an uneven distribution of positions over a huge range of the number line, what you want is a fixed resolution over a smaller range. This is the motivation behind fixed point arithmetic. I think that in the long term this may be one of the most significant developments in the SG14 area. John McFarlane has been leading on this, with a companion proposal from Lawrence Crowl, and has offered a set of tools for defining and manipulating fixed-point types in paper P0037R1. SG6 is also considering this work: indeed, it is the more appropriate group, since it is devoted to numerics. An interesting feature of SG14 is that it is user focused rather than feature focused, so there is potential for a lot of crossover with other groups. Later we shall see cooperation with SG1, concurrency, when I talk about SIMD.

This proposal is a pure library extension, consisting of a selection of class and function templates living in a new header file. Let’s take a look. Fixed point numbers are specialisations of

template <class ReprType = int, int Exponent = 0>
class fixed_point;

The first parameter identifies the capacity and signedness of the underlying type used to represent the value. The default is int. The second parameter is the equivalent of the exponent field in a floating point type, and shifts the stored value by the requisite number of bits to produce the desired range. The default is 0.
Next we have two helper types:

    template <int IntegerDigits, int FractionalDigits> using make_fixed = /* suitable signed fixed_point specialisation */;
    template <int IntegerDigits, int FractionalDigits> using make_ufixed = /* suitable unsigned fixed_point specialisation */;

The purpose of these is to support a more intuitive description by the cardinal number of integer and fractional digits. This would probably be the most popular way of distinguishing fixed_point types. For example:
    make_fixed<2, 29> value {3.141592653};

declares a 32-bit signed fixed-point number with two integer digits and 29 fractional digits. Fixed-point numbers can be explicitly converted to and from arithmetic types. Significant digits are not lost, but rounding errors are made. For example,

    make_ufixed<4, 4>(.006) == make_ufixed<4, 4>(0)

evaluates to true and is considered an acceptable rounding error. Operator overloads are provided, performing as little runtime computation as is practically possible. With the exception of shift and comparison operators, binary operators can take any combination of one or two fixed-point arguments and zero or one arguments of any arithmetic type. When the inputs are not identical fixed-point types, a simple set of promotion-like rules is applied to determine the return type:
1. If both arguments are fixed-point, a type is chosen which is the size of the larger type, is signed if either input is signed, and has the maximum integer bits of the two inputs (so it cannot lose high significance through conversion alone).
2. If one argument is a floating-point type, then the result type is the smallest floating-point type of equal or greater size than the inputs.
3. If one argument is an integral type, then the result is the other, fixed-point, type.
For example:

    make_ufixed<5, 3>{8} + make_ufixed<4, 4>{3} == make_ufixed<5, 3>{11};
    make_ufixed<5, 3>{8} + 3 == make_ufixed<5, 3>{11};
    make_ufixed<5, 3>{8} + float{3} == float{11};

Overflow and underflow need to be taken into consideration. For instance:

    make_fixed<4, 3>(15) + make_fixed<4, 3>(1)

causes an overflow, because a type with 4 integer bits cannot store a value of 16. The result depends on how the representation type handles overflow: for built-in signed types the result is undefined, and for built-in unsigned types the value wraps around.
Regarding underflow, consider this example:

    make_fixed<6, 1>(15) / make_fixed<6, 1>(2)

This gives an accurate result of 7.5. However, this:

    make_fixed<7, 0>(15) / make_fixed<7, 0>(2)

yields a result of 7. This loss of precision is generally considered acceptable. When all bits are lost due to underflow, the value is said to be flushed. As with overflow, the
result of a flush is the same for a fixed-point type as it is for its underlying representation type: in the case of built-in integral types, the value becomes zero. Dealing with errors resulting from overflow and flush is the biggest headache in the domain. Integers are easier to deal with as they have no fractional digits, and floating-point numbers are largely shielded by their variable exponent and implicit bit. The paper presents four strategies:
1. Leave it to the user: buyer beware.
2. Allow the user to provide a custom type for ReprType.
3. Promote the result to a larger type.
4. Adjust the exponent of the result upward, preserving the most significant digits at the cost of the least significant digits.
For arithmetic operators, choice 1 is taken because it most closely follows the built-in behaviour of integer types: it causes least surprise and requires least computation. Choices 3 and 4 are reasonably robust to overflow events, but they represent different trade-offs, neither of which is the best fit in all situations. Notably, where any instance of c = a + b is replaced with a += b, results may change in surprising ways. These choices are therefore presented as named functions. Function template promote returns the same value represented by a larger fixed_point specialisation. For example,

    promote(make_fixed<5, 2>(15.5))

is equivalent to

    make_fixed<11, 4>(15.5)

A complementary function template, demote, reverses the process, returning a value of a smaller type. Finally, we have some named arithmetic functions:
Unary: trunc_reciprocal, trunc_square, trunc_sqrt, promote_reciprocal, promote_square
Binary: trunc_add, trunc_subtract, trunc_multiply, trunc_divide, trunc_shift_left, trunc_shift_right, promote_add, promote_sub, promote_multiply, promote_divide
The paper goes into further detail about using alternative types for ReprType; it's worth a read. Work is not quite finished on this paper.
There was further feedback from the presentation at GDC 2016, but it seems fixed-point arithmetic will be entering the standard.
I will be interested to see if any games adopt integer representation for their world models. Failure to find paths due to approximation artifacts is bad news and can leave agents stuck, unable to find their way around a piece of geometry. Code to deal with approximation is messy and complicated, and a burden on the processor. However, the fact of the matter is that OpenGL, Vulkan and Direct3D all use floating-point representation, because absolutely exact results are not required there: a certain amount of approximation when rendering is normal and has no adverse effects, and the range is very convenient given the amount of vector maths involved. This leads us to the thorny issue of when to convert from world to renderer types and what the cost of conversion is. Establishing that cost for existing engines against the cost of accommodating approximation is a fearsome problem. Of course, you might like to bypass fixed point and SI units entirely and measure your world in millimetres; a 32-bit range covers Greenland to Yemen.
Parallelism
Coroutines
https://11950069482448417429.googlegroups.com/attach/44f49f1fe5dde/D0XXX_CoroutinesAndGames.html?part=0.1&view=1&vt=ANaJVrEpElzO2tqymSHftN9DvXmyWyxVsePjt6rpeiPMV9lBx6iHmEJo0c7glseLf42KJMIzin5A6rehR9hNIkpI2Y40mOFGLyLiYU5lFWli7jT9tgmBS0
SIMD
Vector is second only to static for number of meanings in our domain. Besides the container and the mathematical construct, we have the idea of operating on a vector of data with a single instruction. SIMD stands for Single Instruction, Multiple Data, and the idea is that you get data-level parallelism but not concurrency: there are simultaneous computations, but only a single process at a given moment. SIMD is commonly exploited in image processing. If, for example, you want to change an attribute of an image such as the brightness or contrast, you perform the same computation on each pixel. If your pixel is a 32-bit value (RGBA is common) but your registers are 256 bits wide, as in the case of AVX SIMD instructions, you can process 8 pixels with a single instruction. 256 bits is something of a luxury at the moment: I remember when MMX was introduced in 1997, with eight 64-bit registers that could be treated as packages of 8-, 16- or 32-bit values. Unfortunately for me, MMX shared registers with the floating-point stack. That was awkward for 3D maths; indeed, fixed-point maths suddenly became a real contender. AMD set the cat among the pigeons in 1998 with the 3DNow! SIMD instruction set, which added binary32 support, and suddenly you had a processor that was really good for 3D games. As time went
on though, 3D cards started to take a lot of the computing load for games. By 1999 Intel was ready to debut Streaming SIMD Extensions, SSE, which contained excellent support for binary32 data and a new set of registers independent of the floating-point stack. Integer support was added with SSE2 in 2001, making MMX largely redundant. SSE3 arrived in 2004, adding instructions which further improved 3D maths; SSE4 in 2007 added a dot-product instruction amongst others; AVX shipped in 2011 with 256-bit registers and a new coding scheme; AVX2 shipped in 2013, extending integer instructions to 256 bits; and AVX-512 is just around the corner. This is just the Intel SIMD instruction set I have been exposed to; there are of course plenty of others to be found on most modern CPUs. This naturally leads us to the biggest problem for creating language support: there is no standard to abstract from. Unlike IEEE 754, manufacturers have been able to implement whatever behaviour they see fit for their processors. The Boost library already has a candidate for inclusion, Boost.SIMD, which looks at implementing all the transcendental functions using SIMD types. Mathias Gaunard of Bloomberg, one of the authors of this proposal, drafted a paper (D0203R0) in January of this year which suggested how explicit usage of short vectors within the type system might make use of SIMD architectures. I should point out that SG1, the concurrency subgroup, is considering this and several other papers related to SIMD proposals. The first problem to consider is that, in accommodating all the SIMD libraries available, it becomes clear that there is only a very small common set of functionality: some architectures provide instructions for some types but not others, some don't have double precision, and so on.
I'm not going to regurgitate the paper, but the suggestion is that a SIMD vector should be defined as follows:

    template <class T> constexpr int best_size_v = /* implementation-defined */;

    template <class T, int N = best_size_v<T>, class X = /* implementation-defined ABI tag */>
    struct simd_vector;

where N is any power of 2 up to a certain value. If vectors of arbitrary power-of-2 sizes can be defined, it would be useful to provide functions to combine/slice vectors into larger/smaller ones:

    template <class T, int N>
    simd_vector<T, N * 2> combine(simd_vector<T, N> a, simd_vector<T, N> b);

    template <class T, int N>
    array<simd_vector<T, N / 2>, 2> slice(simd_vector<T, N> a);

Converting between integer and floating point, and promoting/demoting types, would be achieved through the provision of some generic casting operation:

    simd_vector<int, 4> a;
    simd_vector<float, 4> b = simd_cast<float>(a);

Some SIMD architectures have entire units dedicated to permuting vector values quickly, so we could access that functionality through a shuffle function:

    template <int... Indices, class T, int N>
    simd_vector<T, N> shuffle(simd_vector<T, N> a);

    template <int... Indices, class T, int N>
    simd_vector<T, N> shuffle(simd_vector<T, N> a, simd_vector<T, N> b);

Implementing SIMD has a few nooks and crannies to be considered. Aliasing is a serious matter: Intel intrinsics, to name just one API, allow aliasing between vectors and scalars, while N4184 suggests allowing scalars to alias vectors but not the other way around. Vectors aliasing scalars is useful for code like this:

    void foo(float* aligned_data) {
        simd_vector<float, 4>* my_vector_data =
            reinterpret_cast<simd_vector<float, 4>*>(aligned_data);
        // ... do stuff
    }

This allows you to pass a vector over raw memory with maximum efficiency rather than copying the memory on the stack. As you might guess, scalars aliasing vectors looks like this:

    simd_vector<float, 4> v;
    float* p = &v[0];
    p[3] = 42.0f;

This looks nice, but it forces the vector into memory, which rather defeats the object: for the best optimisation opportunities, everything should stay in registers and only loads and stores should go to memory. sizeof seems obvious: you might argue that sizeof(simd_vector<T, N>) is the same as sizeof(T[N]), but it appears interesting to let implementations have the freedom to implement one simd_vector specialisation with the same backend as a larger one, so the suggestion is that sizeof(simd_vector<T, N>) >= sizeof(T) * N. Finally, calling conventions have an impact: simd_vector is defined as a class, and some ABIs do not allow passing such a type by value due to its alignment requirements, requiring passing by const reference instead. This hits performance very badly, defeating the object of the exercise, unless something like a force-inline attribute is used. This might point to requiring compiler support so that simd_vector types can be passed to functions with maximum efficiency.
There really is much more to consider: the concurrency subgroup are busy pulling things together for SIMD. Recently, SG14 started getting input from the financial sector and SIMD is an issue for them too: it is an underused resource and I for one look forward to standard vectorisation instructions.
Heterogeneous Computing
You possibly recall the “The Free Lunch Is Over” and “Welcome to the Jungle” blog posts from Herb Sutter. Anyone want to guess the years of publication? December 2004 and December 2011. In “The Free Lunch Is Over” Herb discussed the impending expiry of Moore's law and the need for programmers to properly get the hang of multiple cores. C++11 delivered strongly in this department by recognising threads and adding a selection of threading and synchronisation primitives to the library. In “Welcome to the Jungle” he discussed what happens after we've properly got the hang of multiple cores: we start using multiple processors of different types, on the motherboard, on add-in cards, in a cloud. What is happening here is massive parallelism, and it requires a new programming model. Again, this crosses over into SG1 territory. We have a bit of a head start in games. Let me take you back to 1996. It was quite a year for video games: Duke Nukem 3D, Resident Evil, Quake, Super Mario 64, Wipeout and Diablo were launched, as were the sequels Civilization II, Panzer Dragoon II, Warcraft II, Tekken 2, The Elder Scrolls II and Command & Conquer: Red Alert. Also released that year was Tomb Raider. It led on the SEGA Saturn, and was also released on MS-DOS and PlayStation. It's the MS-DOS release that is of interest, because another thing that happened in 1996 was the release of the Voodoo Graphics chip from 3dfx and the sudden appearance of a slew of add-in cards: the Orchid Righteous, the Diamond Monster, the Canopus Pure and many others. (The Righteous had mechanical relays that clicked when the chipset was in use.) A friend of mine popped round with his brand new Diamond Monster, put it into my machine, installed a patch for Tomb Raider, and suddenly I had an astonishing frame rate. This patch for Tomb Raider drove a huge uptake of 3dfx cards, and suddenly games development went multiprocessor.
The Voodoo Graphics chip supported two APIs: a native one called Glide, and OpenGL. Another thing that happened in 1996 was the first release of Direct3D, in June, with DirectX 2.0 for Windows 95. Glide didn't last the distance, nor did the Voodoo support Direct3D, and 3dfx went bankrupt, its assets bought up by Nvidia in 2000. The majority of the engineers stayed with Nvidia to work on the GeForce FX series of chips, while some went to work for ATI.
So for about twenty years now we have been dividing our games software into two parts: graphics, and everything else. How can we export what we have learned into the C++ standard? We are fortunate to have Michael Wong convening SG14: he is the OpenMP CEO, and his primary interest is in the development of an accelerator/GPU programming model. Over the past few months our telecons have consisted of a series of talks about accelerator design. The first was from Nvidia staff, Jared Hoberock and Michael Garland, introducing the Agency library. This contains control structures for execution named bulk_invoke, bulk_async and bulk_then. There are execution policies which parameterise control structures, execution agents which parameterise user lambdas, and executors which create execution agents and provide abstract platform facilities for execution. Let's look at some examples: <...> Agency is open source; check out the repository at https://github.com/jaredhoberock/agency. Ben Sander spoke to us about the heterogeneous C++ compiler (P0069), which produces code for the CPU and GPU. Hartmut Kaiser spoke about parallelism APIs in HPX. Andrew Richards spoke about SYCL. This is the next big frontier for C++: there needs to be a way of standardising vectorisation and parallelism. Not just SG14 but everyone will benefit.
GDC 2016 trip report
Appendices
Benchmarking
One of the problems we have when making games is ensuring that we have chosen the best algorithm or container for the job. There was a time, not so very long ago, when you could simply query the Time Stamp Counter register using RDTSC. It was an excellent, high-resolution, low-overhead way of getting CPU timing information. Unfortunately, that all changed with the introduction of multi-core and hyper-threaded CPUs. Even if you lock your code to a single core, the OS can really get in your way: it may decide to change the CPU speed as a result of power-saving measures, or the system may be hibernated and resumed, resetting the TSC and requiring it to be recalibrated periodically. Another problem is out-of-order execution, a feature of all Intel processors since the Pentium Pro, which can cause RDTSC to be executed later than expected.
So don't be old-school about benchmarking; it's much tougher now. Microsoft strongly discourages use of the TSC register and offers the API functions QueryPerformanceCounter and QueryPerformanceFrequency. Unfortunately it doesn't end there. With caches to worry about, you need a very accurate representation of real-world data to work with: a hot cache can completely mislead you, and this is as true of the disk cache as it is of the instruction and data caches. The trouble is, sometimes you need to take caching into account because of the way you are processing data. At CA, we end up doing most of our benchmarking at the end of development, when we're looking for bottlenecks. We use a profiler: Intel's VTune is best of breed for us, although we're always open to new suggestions.