GPU: NVIDIA Tesla M4 + cuDNN 5 RC.

GPU + GIE: NVIDIA Tesla M4 + GIE.

Today at ICML 2016, NVIDIA announced its latest Deep Learning SDK updates, including DIGITS 4, cuDNN 5.1 (CUDA Deep Neural Network Library) and the new GPU Inference Engine.

NVIDIA GPU Inference Engine (GIE) is a high-performance deep learning inference solution for production environments. Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service provided. GIE automatically optimizes trained neural networks for run-time performance, delivering up to 16x higher performance per watt on a Tesla M4 GPU compared to the CPU-only systems commonly used for inference today.

Figure 1 shows GIE inference performance per watt of the relatively complex GoogLeNet running on a Tesla M4. GIE can deliver 20 Images/s/Watt on the simpler AlexNet benchmark.

In this post, we will discuss how you can use GIE to get the best efficiency and performance out of your trained deep neural network on a GPU-based deployment platform.

Solving a supervised machine learning problem with deep neural networks involves a two-step process.

- The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.
- The next step–inference–uses the trained model to make predictions from new data. During this step, the best trained model is used in an application running in a production environment such as a data center, an automobile, or an embedded platform. For some applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.

To learn more about the differences between training and inference, see Michael Andersch’s post on inference with GPUs.

The target deployment environment introduces various challenges that are typically not present in the training environment. For example, if the target is an embedded device using the trained neural network to perceive its surroundings, then the forward inference pass through the model has a direct impact on the overall response time and the power consumed by the device. The key metric to optimize is power efficiency: the inference performance per watt.

Performance per watt is also the critical metric in maximizing data center operational efficiency. In this scenario, the need to minimize latency and energy used on large volumes of geographically and temporally disparate requests limits the ability to form large batches.

GIE is a high-performance inference engine designed to deliver maximum inference throughput and efficiency for common deep learning applications such as image classification, segmentation, and object detection. GIE optimizes your trained neural networks for run-time performance and delivers GPU-accelerated inference for web/mobile, embedded and automotive applications.

There are two phases in the use of GIE: build and deployment (See Figure 2). In the build phase, GIE performs optimizations on the network configuration and generates an optimized plan for computing the forward pass through the deep neural network. The plan is an optimized object code that can be serialized and stored in memory or on disk.

The deployment phase generally takes the form of a long running service or user application that accepts batches of input data, performs inference by executing the plan on the input data and returns batches of output data (classification, object detection, etc). With GIE you don’t need to install and run a deep learning framework on the deployment hardware. Discussion of the batching and pipeline of the inference service is a topic for another post; instead we will focus on how to use GIE for inference.

The GIE runtime needs three files to deploy a classification neural network:

- a network architecture file (deploy.prototxt),
- trained weights (net.caffemodel), and
- a label file to provide a name for each output class.

In addition, you must define the batch size and the output layer. Code Listing 1 illustrates how to convert a Caffe model to a GIE object. The builder (lines 4-7) is responsible for reading the network information. Alternatively, you can use the builder to define the network information if you don’t provide a network architecture file (deploy.prototxt).

GIE supports the following layer types.

- Convolution: 2D
- Activation: ReLU, tanh and sigmoid
- Pooling: max and average
- ElementWise: sum, product or max of two tensors
- LRN: cross-channel only
- Fully-connected: with or without bias
- SoftMax: cross-channel only
- Deconvolution

IBuilder* builder = createInferBuilder(gLogger); // parse the caffe model to populate the network, then set the outputs INetworkDefinition* network = builder->createNetwork(); CaffeParser parser; auto blob_name_to_tensor = parser.parse(“deploy.prototxt”, trained_file.c_str(), *network, DataType::kFLOAT); // specify which tensors are outputs network->markOutput(*blob_name_to_tensor->find("prob")); // Build the engine builder->setMaxBatchSize(1); builder->setMaxWorkspaceSize(1 << 30); ICudaEngine* engine = builder->buildCudaEngine(*network);

You can also use the GIE C++ API to define the network without the Caffe parser, as Listing 2 shows. You can use the API to define any supported layer and its parameters. You can define any parameter that varies between networks, including convolution layer weight dimensions and outputs as well as the window size and stride for pooling layers.

ITensor* in = network->addInput(“input”, DataType::kFloat, Dims4{…}); IPoolingLayer* pool = network->addPooling(in, PoolingType::kMAX, …);

After defining or loading the network, you must specify the output tensors as line 13 of Listing 1 shows; in our example the output is “prob” (for probability). Next, define the batch size (line 16), which can vary depending on the deployment scenario. Listing 1 uses a batch size of 1 but you may choose larger batch sizes to fit your application needs and system configuration. Underneath, GIE performs layer optimizations to reduce inference time. While this is transparent to the API user, analyzing the network layers requires memory, so you must specify the maximum workspace size (line 17).

The last step is to call `buildCudaEngine`

to perform layer optimization and build the engine with the optimized network based on your provided inputs and parameters. Once the model is converted to a GIE object it is deployable and can either be used on the host device or saved and used elsewhere.

GIE performs several important transformations and optimizations to the neural network graph. First, layers with unused output are eliminated to avoid unnecessary computation. Next, where possible convolution, bias, and ReLU layers are fused to form a single layer. Figure 4 shows the result of this vertical layer fusion on the original network from Figure 3 (fused layers are labeled CBR in Figure 4). Layer fusion improves the efficiency of running GIE-optimized networks on the GPU.

Another transformation is horizontal layer fusion, or layer aggregation, along with the required division of aggregated layers to their respective outputs, as Figure 5 shows. Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters, resulting in a single larger layer for higher computational efficiency. The example in Figure 5 shows the combination of 3 1×1 CBR layers from Figure 4 that take the same input into a single larger 1×1 CBR layer. Note that the output of this layer must be disaggregated to feed into the different subsequent layers from the original input graph.

GIE performs its transformations during the build phase transparently to the API user after the GIE parser reads in the trained network and configuration file, as Listing 1 shows.

The inference builder (`IBuilder`

) `buildCudaEngine`

method returns a pointer to a new inference engine runtime object (`ICudaEngine`

). This runtime object is ready for immediate use; alternatively, its state can be serialized and saved to disk or to an object store for distribution. The serialized object code is called the Plan.

As mentioned earlier, the full scope of batching and streaming data to and from the runtime inference engine is beyond the scope of this article. Listing 3 demonstrates the steps required to use the inference engine to process a batch of input data to generate a result.

// The execution context is responsible for launching the // compute kernels IExecutionContext *context = engine->createExecutionContext(); // In order to bind the buffers, we need to know the names of the // input and output tensors. int inputIndex = engine->getBindingIndex(INPUT_LAYER_NAME), int outputIndex = engine->getBindingIndex(OUTPUT_LAYER_NAME); // Allocate GPU memory for Input / Output data void* buffers = malloc(engine->getNbBindings() * sizeof(void*)); cudaMalloc(&buffers[inputIndex], batchSize * size_of_single_input); cudaMalloc(&buffers[outputIndex], batchSize * size_of_single_output); // Use CUDA streams to manage the concurrency of copying and executing cudaStream_t stream; cudaStreamCreate(&stream); // Copy Input Data to the GPU cudaMemcpyAsync(buffers[inputIndex], input, batchSize * size_of_single_input, cudaMemcpyHostToDevice, stream); // Launch an instance of the GIE compute kernel context.enqueue(batchSize, buffers, stream, nullptr); // Copy Output Data to the Host cudaMemcpyAsync(output, buffers[outputIndex], batchSize * size_of_single_output, cudaMemcpyDeviceToHost, stream)); // It is possible to have multiple instances of the code above // in flight on the GPU in different streams. // The host can then sync on a given stream and use the results cudaStreamSynchronize(stream);

At the end of the day, the success of GIE comes down to the performance it provides for inference. To measure the performance benefits we compared the per-layer timings of the GoogLeNet network using Caffe and GIE on NVIDIA Tesla M4 GPUs with a batch size of 1 averaged over 1000 iterations with GPU clocks fixed in the P0 state.

The bar graph (Figure 6) is sorted to show the 10 most computationally expensive GoogLeNet layers (as run by Caffe) ordered from left to right as light green bars. The dark green bars represent the same layers run using GIE (lower is better). Since GIE can combine layers both vertically and horizontally into a single optimized kernel, the Caffe timing shown for each bar is the sum of Caffe kernels corresponding to each fused GIE kernel, while the GIE timing for each bar is for a single fused and optimized kernel.

Bars with two layers correspond to vertically fused layers, namely CBR (convolution + bias + activation/relu). Bars with four layers (or whose names contain “||” separating two CBR layer names) correspond to two CBRs that are horizontally fused, meaning two CBRs that share the same input tensor and thus gain the advantage of cache reuse on a single-pass of the input tensor vs. two separate kernel launches with the same input tensor. Unsurprisingly, the GIE kernels with four fused layers show some of the largest relative speedups and contribute to ~30% of the overall speedup. The remainder of the speedup predominately comes from the two vertically fused CBR layers, which on average have a lower relative speedup, but comprise the bulk of the computation.

The NVIDIA GPU Inference Engine enables you to easily deploy neural networks to add deep learning based capabilities to your products with the highest performance and efficiency. GIE supports networks trained using popular neural network frameworks including Caffe, Theano, Torch and Tensorflow. During the build phase GIE identifies opportunities to optimize the network, and in the deployment phase GIE runs the optimized network in a way that minimizes latency and maximizes throughput.

If you are running web or mobile applications that are backed by data center servers, GIE’s low overhead means that you can deploy more varied and complex models to add intelligence to your product that will delight your users. If you are using deep learning to create the next generation of smart devices, GIE helps you deploy networks with high performance, high accuracy, and high energy efficiency.

Moreover, GIE enables you to leverage the power of GPUs to perform neural network inference using mixed-precision FP16 data. Performing neural network inference using FP16 can reduce memory usage by half and provide higher performance on Tesla P100 and Jetson TX1 GPUs.

GIE is currently being evaluated under an Early Access (EA) Program. To be notified when GIE is ready for public release or if you are interested in participating in the EA program, please visit the GIE product page to contact us today. To learn more about neural network inference on GPUs, see Michael Andersch’s post on inference with GPUs.

]]>In this blog post I’ll provide an overview of the Pascal architecture and its benefits to you as a developer.

At GTC, Lars Nyland and I gave a talk about details of the Tesla P100 and the Pascal GP100 architecture. The slides and recording from this talk are now available (GTC on-demand site registration required). To learn more, read the Tesla P100 white paper.

The GP100 GPU used in Tesla P100 incorporates multiple revolutionary new features and unprecedented performance. Key features of Tesla P100 include:

- Extreme performance—powering HPC, deep learning, and many more GPU Computing areas;
- NVLink™—NVIDIA’s new high speed, high bandwidth interconnect for maximum application scalability;
- HBM2—Fastest, high capacity, extremely efficient stacked GPU memory architecture;
- Unified Memory and Compute Preemption—significantly improved programming model;
- 16nm FinFET—enables more features, higher performance, and improved power efficiency.

Tesla P100 accelerators will be available in two forms: A traditional GPU accelerator board for PCIe-based servers, and an SXM2 module for NVLink-optimized servers. P100 for PCIe-based servers allows HPC data centers to deploy the most advanced GPUs within PCIe-based nodes to support a mix of CPU and GPU workloads. P100 for NVLink-optimized servers provides the best performance and strong-scaling for hyperscale and HPC data centers running applications that scale to multiple GPUs, such as deep learning. The table below provides the complete specifications of both Tesla P100 accelerators.

With every new GPU architecture, NVIDIA introduces major improvements to performance and power efficiency. The heart of the computation in Tesla GPUs is the SM, or streaming multiprocessor. The streaming multiprocessor creates, manages, schedules and executes instructions from many threads in parallel.

Like previous Tesla GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers. GP100 achieves its colossal throughput by providing six GPCs, up to 60 SMs, and eight 512-bit memory controllers (4096 bits total). The Pascal architecture’s computational prowess is more than just brute force: it increases performance not only by adding more SMs than previous GPUs, but by making each SM more efficient. Each SM has 64 CUDA cores and four texture units, for a total of 3840 CUDA cores and 240 texture units.

Delivering higher performance and improving energy efficiency are two key goals for new GPU architectures. A number of changes to the SM in the Maxwell architecture improved its efficiency compared to Kepler. Pascal builds on this and incorporates additional improvements that increase performance per watt even further over Maxwell. While TSMC’s 16nm Fin-FET manufacturing process plays an important role, many GPU architectural modifications were also implemented to further reduce power consumption while maintaining high performance.

The following table provides a high-level comparison of Tesla P100 specifications compared to previous-generation Tesla GPU accelerators.

Tesla Products |
Tesla K40 |
Tesla M40 |
Tesla P100 (NVLink) |
Tesla P100 (PCIe) |

GPU / Form Factor |
Kepler GK110 / PCIe |
Maxwell GM200 / PCIe |
Pascal GP100 / SXM2 |
Pascal GP100 / PCIe |

SMs |
15 | 24 | 56 | 56 |

TPCs |
15 | 24 | 28 | 28 |

FP32 CUDA Cores / SM |
192 | 128 | 64 | 64 |

FP32 CUDA Cores / GPU |
2880 | 3072 | 3584 | 3584 |

FP64 CUDA Cores / SM |
64 | 4 | 32 | 32 |

FP64 CUDA Cores / GPU |
960 | 96 | 1792 | 1792 |

Base Clock |
745 MHz | 948 MHz | 1328 MHz | 1126 MHz |

GPU Boost Clock |
810/875 MHz | 1114 MHz | 1480 MHz | 1303 MHz |

FP64 GFLOPs[1] |
1680 | 213 | 5304 | 4670 |

Texture Units |
240 | 192 | 224 | 224 |

Memory Interface |
384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 3072-bit HBM2 (12GB) 4096-bit HBM2 (16GB) |

Memory Bandwidth |
288 GB/s | 288 GB/s | 720 GB/s | 540 GB/s (12GB) 720 GB/s (16GB) |

Memory Size |
Up to 12 GB | Up to 24 GB | 16 GB | 12 GB or 16 GB |

L2 Cache Size |
1536 KB | 3072 KB | 4096 KB | 4096 KB |

Register File Size / SM |
256 KB | 256 KB | 256 KB | 256 KB |

Register File Size / GPU |
3840 KB | 6144 KB | 14336 KB | 14336 KB |

TDP |
235 Watts | 250 Watts | 300 Watts | 250 Watts |

Transistors |
7.1 billion | 8 billion | 15.3 billion | 15.3 billion |

GPU Die Size |
551 mm² | 601 mm² | 610 mm² | 610 mm² |

Manufacturing Process |
28-nm | 28-nm | 16-nm | 16-nm |

**[1]*** The GFLOPS in this chart are based on GPU Boost Clocks.*

GP100’s SM incorporates 64 single-precision (FP32) CUDA Cores. In contrast, the Maxwell and Kepler SMs had 128 and 192 FP32 CUDA Cores, respectively. The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA Cores, an instruction buffer, a warp scheduler, and two dispatch units. While a GP100 SM has half the total number of CUDA Cores of a Maxwell SM, it maintains the same register file size and supports similar occupancy of warps and thread blocks.

GP100’s SM has the same number of registers as Maxwell GM200 and Kepler GK110 SMs, but the entire GP100 GPU has far more SMs, and thus many more registers overall. This means threads across the GPU have access to more registers, and GP100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.

Overall shared memory across the GP100 GPU is also increased due to the increased SM count, and aggregate shared memory bandwidth is effectively more than doubled. A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).

Compared to Kepler, Pascal’s SM features a simpler datapath organization that requires less die area and less power to manage data transfers within the SM. Pascal also provides superior scheduling and overlapped load/store instructions to increase floating point utilization. The new SM scheduler architecture in GP100 improves upon the advances of the Maxwell scheduler and is even more intelligent, providing increased performance and reduced power consumption. Each warp scheduler (one per processing block) is capable of dispatching two warp instructions per clock.

Because of the importance of high-precision computation for technical computing and HPC codes, a key design goal for Tesla P100 is high double-precision performance. Each GP100 SM has 32 FP64 units, providing a 2:1 ratio of single- to double-precision throughput. Compared to the 3:1 ratio in Kepler GK110 GPUs, this allows Tesla P100 to process FP64 workloads more efficiently.

Like previous GPU architectures, GP100 supports full IEEE 754‐2008 compliant single- and double‐precision arithmetic, including support for the fused multiply‐add (FMA) operation and full speed support for denormalized values.

The rapidly growing field of deep learning is one of the fastest growing fields of computing. Deep learning has proven to provide a high-level of accuracy and adaptability in applications spanning automatic image captioning, autonomous driving object recognition, natural language understanding and translation, and even computer-generated art. For an in-depth introduction to deep learning, check out the Deep Learning in a Nutshell series here on Parallel Forall.

Unlike other technical computing applications that require high-precision floating-point computation, deep neural network architectures have a natural resilience to errors due to the backpropagation algorithm used in their training. Storing FP16 data compared to higher precision FP32 or FP64 reduces memory usage of the neural network, allowing training and deployment of larger networks. Using FP16 computation improves performance up to 2x compared to FP32 arithmetic, and similarly FP16 data transfers take less time than FP32 or FP64 transfers.

The GP100 SM ISA provides new arithmetic operations that can perform two FP16 operations at once on a single-precision CUDA Core, and 32-bit GP100 registers can store two FP16 values.

Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data. Kepler significantly increased the throughput of atomic operations to global memory compared to the earlier Fermi architecture; however, both Fermi and Kepler implemented shared memory atomics using an expensive lock/update/unlock pattern.

Maxwell improved this by implementing native hardware support for shared memory atomic operations for 32-bit integers, and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead (compared to the Fermi and Kepler methods which were implemented in software).

GP100 further improves atomics by providing an FP64 atomic `add`

instruction for values in global memory. The `atomicAdd()“ function in CUDA now applies to 32 and 64-bit integer and floating-point data. Previously, FP64 atomic addition had to be implemented using a compare-and-swap loop, which is generally slower than a native instruction.

GP100 supports the new Compute Capability 6.0. The following table compares parameters of different Compute Capabilities for NVIDIA GPU architectures.

GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 |

Compute Capability | 3.5 | 5.3 | 6.0 |

Threads / Warp | 32 | 32 | 32 |

Max Warps / Multiprocessor | 64 | 64 | 64 |

Max Threads / Multiprocessor | 2048 | 2048 | 2048 |

Max Thread Blocks / Multiprocessor | 16 | 32 | 32 |

Max 32-bit Registers / SM | 65536 | 65536 | 65536 |

Max Registers / Block | 65536 | 32768 | 65536 |

Max Registers / Thread | 255 | 255 | 255 |

Max Thread Block Size | 1024 | 1024 | 1024 |

CUDA Cores / SM | 192 | 128 | 64 |

Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K |

Many applications today are bottlenecked by memory bandwidth, especially in high-performance computing. Today, developers of high-performance software across all types of processors devote a lot of effort to optimizing code for efficient memory accesses, and to keep data in the parts of the memory hierarchy closest to the computational units. Some applications—for example in deep learning where many-layered neural networks are trained using massive data sets—are limited more by memory capacity. So memory poses two challenges to computing performance: bandwidth and capacity.

Tesla P100 tackles both memory challenges using stacked memory, a technology which enables multiple layers of DRAM components to be integrated vertically on the package along with the GPU. Tesla P100 is the first GPU accelerator to use High Bandwidth Memory 2 (HBM2). HBM2 memory provides much greater bandwidth, more than twice the capacity, and higher energy efficiency, compared to current off-package GDDR5.

Rather than requiring numerous discrete memory chips surrounding the GPU as in traditional GDDR5 GPU board designs, HBM2 includes one or more vertical stacks of multiple memory dies. The memory dies are linked using tiny wires that are called through-silicon vias and microbumps. One 8 Gb HBM2 die contains over 5,000 through-silicon via holes. A passive silicon interposer is then used to connect the memory stacks and the GPU die. The combination of HBM2 stacks, GPU die, and silicon interposer are packaged in a single 55mm x 55mm BGA package.

Tesla P100 accelerators have four 4-die HBM2 stacks, for a total of 16 GB of memory, and 720 GB/s peak bandwidth, which is 3 times higher than the Tesla M40 memory bandwidth.

Another HBM2 benefit is native support for error correcting code (ECC) funtionality, which provides higher reliability for technical computing applications that are sensitive to data corruption, such as in large-scale clusters and supercomputers, where GPUs process large datasets with long application run times.

ECC technology detects and corrects single-bit soft errors before they affect the system. In comparison, GDDR5 does not provide internal ECC protection of the contents of memory and is limited to error detection of the GDDR5 bus only: Errors in the memory controller or the DRAM itself are not detected.

GK110 Kepler GPUs offered ECC protection for GDDR5 by allocating some of the available memory for explicit ECC storage. 6.25% of the overall GDDR5 is reserved for ECC bits. In the case of a 12 GB Tesla K40 (for example), 750 MB of its total memory is reserved for ECC operation, resulting in 11.25 GB (out of 12 GB) of available memory with ECC turned on for Tesla K40. Also, accessing ECC bits causes a small decrease in memory bandwidth compared to the non-ECC case. Since HBM2 supports ECC natively, Tesla P100 does not suffer from the capacity overhead, and ECC can be active at all times without a bandwidth penalty. Like the GK110 GPU, the GP100 GPU’s register files, shared memories, L1 cache, L2 cache, and the Tesla P100 accelerator’s HBM2 DRAM are protected by a Single‐Error Correct Double‐Error Detect (SECDED) ECC code.

NVLink is NVIDIA’s new high-speed interconnect technology for GPU-accelerated computing. Supported on SXM-2 based Tesla P100 accelerator boards, NVLink significantly increases performance for both GPU-to-GPU communications, and for GPU access to system memory.

Today, multiple GPUs are common in workstations as well as the nodes of HPC computing clusters and deep learning training systems. A powerful interconnect is extremely valuable in multiprocessing systems. Our vision for NVLink was to create an interconnect for GPUs that would offer much higher bandwidth than PCI Express Gen 3 (PCIe), and be compatible with the GPU ISA to support shared memory multiprocessing workloads.

Support for the GPU ISA means that programs running on NVLink-connected GPUs can execute directly on data in the memory of another GPU as well as on local memory. GPUs can also perform atomic memory operations on remote GPU memory addresses, enabling much tighter data sharing and improved application scaling.

NVLink uses NVIDIA’s new High-Speed Signaling interconnect (NVHS). NVHS transmits data over a differential pair running at up to 20 Gb/sec. Eight of these differential connections form a “Sub-Link” that sends data in one direction, and two sub-links—one for each direction—form a “Link” that connects two processors (GPU-to-GPU or GPU-to-CPU). A single Link supports up to 40 GB/sec of bidirectional bandwidth between the endpoints. Multiple Links can be combined to form “Gangs” for even higher-bandwidth connectivity between processors. The NVLink implementation in Tesla P100 supports up to four Links, allowing for a gang with an aggregate maximum theoretical bandwidth of 160 GB/sec bidirectional bandwidth.

The figure below shows an 8-GPU Hybrid Cube Mesh that includes two fully NVLink-connected quads of GPUs, with NVLink connections between the quads, and GPUs within each quad connected to their respective CPUs directly through PCIe.

While NVLink primarily focuses on connecting multiple NVIDIA Pascal GP100 GPUs together it can also connect Pascal GP100 GPUs with IBM Power CPUs with NVLink support. The following figure highlights an example of a four-GPU system with dual NVLink-capable CPUs connected with NVLink. In this configuration, each GPU has 120 combined GB/s bidirectional bandwidth to the other 3 GPUs in the system, and 40 GB/s bidirectional bandwidth to a CPU.

Unified Memory is an important feature of the CUDA programming model that greatly simplifies programming and porting of applications to GPUs by providing a single, unified virtual address space for accessing all CPU and GPU memory in the system. Pascal GP100 features provide a significant advancement for GPU computing by expanding the capabilities and improving the performance of Unified Memory.

CUDA 6 introduced Unified Memory, which creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The CUDA system software automatically migrates data allocated in Unified Memory between GPU and CPU, so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU. For details of how Unified Memory in CUDA 6 and later simplifies porting code to the GPU, see the post “Unified Memory in CUDA 6”.

CUDA 6 Unified Memory was limited by the features of the Kepler and Maxwell GPU architectures: all managed memory touched by the CPU had to be synchronized with the GPU before any kernel launch; the CPU and GPU could not simultaneously access a managed memory allocation; and the Unified Memory address space was limited to the size of the GPU physical memory.

Expanding on the benefits of CUDA 6 Unified Memory, Pascal GP100 adds features to further simplify programming and sharing of memory between CPU and GPU, and allowing easier porting of CPU parallel compute applications to use GPUs for tremendous speedups. Two main hardware features enable these improvements: support for large address spaces and page faulting capability.

GP100 extends GPU addressing capabilities to enable 49-bit virtual addressing. This is large enough to cover the 48-bit virtual address spaces of modern CPUs, as well as the GPU’s own memory. Therefore, GP100 Unified Memory allows programs to access the full address spaces of all CPUs and GPUs in the system as a single virtual address space, unlimited by the physical memory size of any one processor.

Memory page faulting support in GP100 is a crucial new feature that provides more seamless Unified Memory functionality. Combined with the system-wide virtual address space, page faulting provides several benefits. First, page faulting means that the CUDA system software doesn’t need to synchronize all managed memory allocations to the GPU before each kernel launch. If a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to the GPU memory on-demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration). Note that Unified Memory is system-wide: GPUs (and CPUs) can fault on and migrate memory pages either from CPU memory or from the memory of other GPUs in the system.

With the new page fault mechanism, global data coherency is guaranteed with Unified Memory. This means that with GP100, the CPUs and GPUs can access Unified Memory allocations simultaneously. This was illegal on Kepler and Maxwell GPUs, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.

Finally, on supporting platforms, memory allocated with the default OS allocator (e.g. ‘malloc’ or ‘new’) can be accessed from both GPU code and CPU code using the same pointer (see code example below). On these systems, Unified Memory is the default: there is no need to use a special allocator or for the creation of a special managed memory pool. Moreover, GP100’s large virtual address space and page faulting capability enable applications to access the entire system virtual memory. This means that applications can oversubscribe the memory system: in other words they can allocate, access, and share arrays larger than the total physical capacity of the system, enabling out-of-core processing of very large datasets.

Certain operating system modifications are required to enable Unified Memory with the system allocator. NVIDIA is collaborating with Red Hat and working within the Linux community to enable this powerful functionality.

Data scientists and artificial intelligence (AI) researchers require accuracy, simplicity, and speed from their Deep Learning systems. Faster training and iteration ultimately means faster innovation and faster time to market. The NVIDIA DGX-1 is the world’s first purpose-built server for Deep Learning, with fully integrated hardware and software that can be deployed quickly and easily. Its revolutionary performance of up to 170 FP16 TFLOP/s significantly accelerates training time, making the NVIDIA DGX-1 the first “AI supercomputer in a box”.

The NVIDIA DGX-1 server is the first server using Tesla P100 accelerators interconnected with NVLink. Available in an eight (8) Tesla P100 accelerator configuration, the DGX-1 system is built with high performance/high reliability components in a 3U rack-mountable chassis for standalone use or cluster integration.

The 8-GPU configuration features two NVLink fully-connected P100 GPU quads that are tied together by four additional NVLinks in a Hybrid Cube Mesh topology (See the 8-GPU NVLink diagram above). Every GPU in a quad is also directly connected via PCIe to a PCIe switch that connects to a CPU. The bottom line is that an NVIDIA DGX-1 server with eight Tesla P100 accelerators can deliver over 12x the Deep Learning performance compared to previous GPU-accelerated solutions.

Combining powerful hardware with software tailored to Deep Learning, the NVIDIA DGX-1 enables developers and researchers with a turnkey solution for high-performance GPU-accelerated Deep Learning application development, testing, and network training.

As you can see, the new NVIDIA Tesla P100 accelerator is a performance powerhouse with revolutionary new features for technical computing and deep learning. Faster in every way than its predecessors, Tesla P100 provides massive leaps in computational throughput, memory bandwidth and capacity, interconnect performance, and programmability.

In this blog post I’ve really only scratched the surface—there are a many more new features in the Pascal GP100 architecture and in Tesla P100, including new instructions, powerful features such as compute preemption, and more that I couldn’t fit into this post. To learn more, read the Tesla P100 white paper and check out our talk “Inside Pascal” from GTC 2016.

*Note, this post was first published Apr 5, 2016, and updated June 19, 2016 with information on Tesla P100 for PCIe.*

Many applications need to partition graphs into subgraphs, or to find clusters within them. For example, graph partitioning can be used in the numerical solution of partial differential equations (PDEs) to perform more efficient sparse matrix-vector multiplications, and graph clustering can be used to identify communities in social networks and for cybersecurity (see Figure 1).

The quality of graph partitioning or clustering can have a significant impact on the overall performance of an application. Therefore it is very important not only to find the splitting into subgraphs fast by taking advantage of GPUs (our GPU spectral graph partitioning scheme performs up to 7x faster than a CPU implementation), but also to find the best possible splitting, which requires development of new algorithms.

Also, graph partitioning and clustering aims to find a splitting of a graph into subgraphs based on a specific metric. In particular, spectral graph partitioning and clustering relies on the *spectrum—*the eigenvalues and associated eigenvectors—of the Laplacian matrix corresponding to a given graph. Next, I will formally define this problem, show how it is related to the spectrum of the Laplacian matrix, and investigate its properties and tradeoffs.

Let a graph be defined by its vertex set and edge set . The vertex set represents nodes in a graph, with each node identified by a unique integer number . The edge set represents edges in a graph, with each edge from node to identified by a pair .

Applications often need to find a splitting of the graph into subgraphs of similar size connected by as few edges as possible. This statement is often formulated as the problem of finding a set of vertices that induces a minimum balanced cut of a graph in the sense of a cost function

where denotes complement of set with respect to , denotes the cardinality (number of elements) of a set, and denotes the boundary of a set.

For example, a minimum balanced cut of a graph , with and induced by a set of vertices with and boundary is shown in Figure 2.

It is important to point out that both partitioning and clustering aim to split the original graph into multiple sub-graphs. However, in partitioning the number of partitions and often their size is fixed, while in clustering the fact that there are no partitions can be a result in itself [1]. Also, the optimality of the splitting can be measured by different cost functions, including modularity, betweenness centrality, or flow.

I will focus on the popular ratio and normalized cut cost functions, which are variations of the minimum balanced cut of a graph . The ratio and normalized cut cost functions are defined as

and

respectively, where set for , the intersection of sets for , and denotes the volume of a set, where is the degree (number of edges) of node .

These cost functions are simpler than they look. Notice that the numerator refers to the number of edges cut between partitions and the denominator is related to the number of elements assigned to a particular partition. For example, in distributed sparse matrix-vector multiplication, the numerator relates to the number of elements that must be sent between partitions and the denominator to the work done per partition, measured in terms of the number of rows in the former or number of non-zero elements in the latter equation.

Also, it turns out that it is possible to express these cost functions as

and

where tall matrix , is the identity matrix and is the trace of a matrix (the sum of its diagonal elements). Here vectors take only discrete values, with non-zeroes corresponding to indices in the set , while is the Laplacian matrix, which will be defined next.

The Laplacian matrix is defined as , where is the adjacency matrix of the graph and the diagonal matrix , where vector .

For example, the Laplacian matrix for the graph shown ing Figure 2 can be written

The Laplacian matrix has very interesting properties. To illustrate it, let the set as shown in Figure 2 and vector be such that it has 1 at position and zero otherwise. Then, I can express the cardinality , volume as well as the cardinality of the boundary of the set using the vector and the Laplacian matrix in the following way.

This illustrates why in the previous section I could express all the terms in the ratio and normalized cost functions in terms of vector and the Laplacian matrix . A more detailed explanation is given in our technical report [2].

Notice that obtaining the minimum of the cost function by finding the best non-zero discrete values for the vector is no easier than finding the best indices for the set . The two formulations of the cost functions are equivalent and both are NP-hard problems.

The key idea of spectral partitioning and clustering is not to look for the discrete solution directly, but instead to do it in two steps.

First, relax the discrete constraints and let vector take real instead of discrete values. In this case, following the linear algebra Courant-Fischer theorem (sometimes referred to as the Min-Max theorem), the minimum of the cost function is obtained by the eigenvectors associated with the smallest eigenvalues of the Laplacian matrix.

Second, map the obtained real values back to the discrete ones to find the solution of interest. This step can be done using simple heuristics, such as sorting and looking for a gap in the real values, or using more advanced multi-dimensional algorithms, such as the *k*-means algorithm. In the former case all real values in between gaps, while in the latter case all real values clustered around a particular centroid, will be assigned the same discrete value and therefore will belong to the same particular partition or cluster.

There is no guarantee that the two-step approach will find the best solution, but in practice it often finds a good enough approximation and works reasonably well.

Figure 3 provides a visual outline of the process and Algorithm 1 presents the algorithm in pseudocode.

Let G=(V,E) be an input graph Let A be the adjacency matrix of G Let diagonal matrix D = diag(Ae) Set the Laplacian matrix L=D-A Solve the eigenvalue problem L u = λu Use heuristics to transform real into discrete values

The solution of the eigenvalue problem is often the most time consuming part of spectral graph partitioning /clustering. There are many eigenvalue solvers that can be used to solve it, including Lanczos, Tracemin, Jacobi-Davidson and LOBPCG. In particular, Figures 3 and 4 show experimental results comparing the performance and quality, respectively, of the Lanczos and LOBPCG methods when looking for the 30 smallest eigenvectors of a few matrices from the DIMACS graph collection. Although Lanczos is often the fastest eigenvalue solver, when incomplete-LU with 0 fill-in (ILU0) is available, the preconditioned LOBPCG eigenvalue solver can be competitive and often computes a superior-quality solution.

Now I’ll compare the spectral scheme on the GPU with the spectral scheme implemented on the CPU in the CHACO software package. The experiments are performed on a workstation with a 3.2 GHz Intel Core i7-3930K CPU and an NVIDIA Tesla K40c GPU.

The schemes are very similar, but not identical because CHACO has a slightly different implementation of the algorithms, and also attempts to provide a load-balanced cut within a fixed threshold , so that for instance . Therefore, CHACO’s cost function is similar to the ratio cut, but the clustering at the end is biased towards providing a load-balanced partitioning, while still minimizing the edge cuts. Also, CHACO implements spectral bisection, so when comparing to it I split the graph into only two partitions.

Figures 5 and 6 show the performance and quality of both spectral schemes, respectively. Notice that the GPU spectral scheme using Lanczos often obtains the solution faster, but with variable quality compared to the CPU spectral scheme in CHACO, which also uses a variation of the Lanczos method. On the other hand when using preconditioned LOBPCG the GPU implementation is usually faster, and most of the time obtains a higher quality solution as measured by the cost functions. The detailed results of these experiments can be found in our technical report [2].

Finally, as mentioned earlier there exist many different partitioning and clustering strategies. In particular, some of the popular approaches for providing a balanced cut of a graph use multi-level schemes, implemented in software packages such as METIS. Both spectral and multi-level schemes are global methods that work on the entire graph, in contrast to local heuristics, such as the Kernighan-Lin algorithm.

It is interesting to compare the quality of spectral and multi-level schemes in terms of the the edge-cut and cost function obtained by them. The numerical experiments shown in Figures 7 and 8 plot the ratio of these quantities (the cost obtained by METIS divided by the cost obtained by the GPU spectral scheme), for 30 partitions. The result trends indicate that the behavior of the spectral and multi-level schemes is starkly different for two classes of problems: (i) meshes arising from discretization of PDEs and (ii) social network graphs that often have power-law-like distributions of edges per node. My conjecture is that the difference in quality between these schemes results from the fact that multi-level schemes often rely on local information to construct a graph hierarchy that is used to partition the graph.

Notice that for PDEs the quality of the partitioning obtained by both schemes is essentially the same, while for networks with high degree nodes, such as social networks, spectral schemes can obtain significantly higher quality partitions. Even though in our experiments the time taken by the spectral schemes is often larger than that taken by the multi-level schemes, I think that spectral schemes can be a good choice in applications where quality is important. For example, in sparse linear algebra applications, even modest improvements in the quality of partitioning can lead to a significant impact on the overall application performance, so the extra partitioning cost of a spectral scheme may be worthwhile.

I hope that after reading this blog post you have learned some of the intuition behind the spectral graph partitioning/clustering scheme and how it compares to other similar algorithms. A more formal treatment of the subject, with precise derivation of the theoretical results and detailed numerical experiments, can be found in our technical report [2].

The numerical experiments show that spectral partitioning on GPUs can outperform spectral partitioning on the CPU by up to 7x. Also, it is clear that multi-level schemes are a good choice for partitioning meshes arising from PDEs, while spectral schemes can achieve high quality partitioning and clustering on network graphs with high-degree nodes, such as graphs of social networks.

If you need to accelerate graph algorithms in your applications, check out the new GPU-accelerated nvGRAPH library. You can also read more about nvGRAPH in the post “CUDA 8 Features Revealed”. We are considering adding spectral partitioning to nvGRAPH in the future. Please let us know in the comments if you would find this useful.

Finally, note that eigenvectors of the Laplacian matrix have many other applications. For example, they can be used for drawing graphs. In fact, the graph drawings in this blog were done with them. The interpretation of eigenvectors for this application has been studied in [3].

[1] M.E.J. Newman, “Modularity and Community Structure in Networks”, Proc. National Academy of Science, Vol. 103, pp. 8577–8582, 2006.

[2] M. Naumov and T. Moon, “Parallel Spectral Graph Partitioning”, NVIDIA Research Technical Report, NVR-2016-001, March, 2016.

[3] Y. Koren, “Drawing Graphs by Eigenvectors: Theory and Practice”, Computers & Mathematics with Applications, Vol. 49, pp. 1867–1888, 2005.

]]>**Brad: Can you talk a bit about your current research?**

Wei: Matrix factorization (MF) is at the core of many popular algorithms, such as collaborative-filtering-based recommendation, word embedding, and topic modeling. Matrix factorization factors a sparse ratings matrix (*m*-by-*n*, with non-zero ratings) into a *m*-by-*f* matrix (*X*) and a *f*-by-*n* matrix (Θ^{T}), as Figure 1 shows.

Suppose we obtained *m* users’ ratings on items (say, movies). If user *u* rated item *v*, we use as the non-zero element of *R* at position (*u*, *v*). We want to minimize the following cost function *J*. To avoid overfitting, we use weighted-λ-regularization proposed in [1], where and denote the number of total ratings on user *u* and item *v*, respectively.

Many optimization methods, including Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) [1] have been applied to minimize . We adopt the ALS approach, which first optimizes *X* while fixing , and then optimizes while fixing *X*. That is, in one iteration we need to solve these two equations alternatively:

**B: What are some of the biggest challenges in this project?**

W: Recently, the GPU has emerged as an accelerator for parallel algorithms. It has massive computing power (typically 10x higher floating-point operations per second (FLOP/s) versus a CPU) and memory bandwidth (typically 5x versus a CPU), but with limited amount of control logic and memory capacity. Particularly, the GPU’s success in deep learning inspired us to try GPUs for MF. In deep learning, the computation is mainly dense matrix multiplication which is **compute bound**. As a result, a GPU can train deep neural networks 10x as fast as a CPU by saturating its FLOP/s. However, unlike deep learning, a MF problem involves sparse matrix manipulation which is usually **memory bound**. Given this, we want to explore a MF algorithm and a system that can still leverage the GPU’s compute and memory capability.

Given the widespread use of MF, a scalable and speedy implementation is very important. There are two challenges:

- On a single GPU, MF deals with sparse matrices, which makes it difficult to utilize the GPU’s compute power.
- We need to scale to multiple GPUs to deal with large, industry-scale problems (hundreds of millions of non-zero ratings).

**B: What NVIDIA technologies are you using to overcome the challenges? **

W: We developed **cuMF**, a CUDA-based matrix factorization library that optimizes the alternating least square (ALS) method to solve very large-scale MF. cuMF uses CUDA 7.0+, cuBLAS and cuSPARSE, and has been tested on both Kepler (Tesla K40/K80) and Maxwell (Titan X) GPUs.

cuMF achieves excellent scalability and performance by innovatively applying the following techniques on GPUs:

(1) On a single GPU, we optimize memory access in ALS by various techniques including reducing discontiguous memory access, retaining hotspot variables in faster memory, and aggressively using registers. These techniques allow cuMF to get closer to the roofline performance of a single GPU. See Figure 2.

(2) On multiple GPUs, we add data parallelism to ALS’s inherent model parallelism. Data parallelism needs a faster reduction operation among GPUs, leading to (3).

In distributed machine learning, model parallelism and data parallelism are two common schemes. Model parallelism partitions parameters among multiple learners with each one learns a subset of parameters. Data parallelism partitions the training data among multiple learners so that each one learns all parameters from its partial observation. These two schemes can be combined when both the number of model parameters and the training data are large.

**Model parallelism**. As seen from equation 1, the solution of each is independent so model parallelism is straightforward.

**Data parallelism**. To tackle large-scale problems, on top of the existing model parallelism, we design a data-parallel approach. When is big and cannot stay in one GPU, we re-write the summation matrix in eq. 1 to its data-parallel form as:

That is, instead of transferring all s to one GPU to calculate the sum, it calculates a local sum on each GPU with only the local s, and reduces (aggregates) many local s later (See Figure 3).

(3) We also developed an innovative topology-aware, parallel reduction method to fully leverage the bandwidth between GPUs. This way, cuMF ensures that multiple GPUs are efficiently utilized simultaneously.

The experimental results demonstrate that with up to four Titan X GPUs on one machine, cuMF is (1) competitive compared with multi-core methods, on medium-sized problems; (2) much faster than vanilla GPU implementations without memory optimization; (3) 6 to 10 times as fast, and 33 to 100 times as cost-efficient as distributed CPU systems, on large-scale problems; (4) more significantly, able to solve the largest matrix factorization problem ever reported.

CuMF can be used standalone, or to accelerate the ALS implementation in Spark MLlib. We modified Spark’s ml/recommendation/als.scala (code) to detect GPUs and offload the ALS forming and solving to GPUs, while retain shuffling on Spark RDDs.

This approach has several advantages. First, existing Spark applications relying on mllib/ALS need no change. Second, we leverage the best of Spark (to scale-out to multiple nodes) and GPUs (to scale-up in one node).

**B: What is the impact of your research to the larger data science community?**

W: First, by using cuMF for matrix factorization, a large category of workloads that derive latent features from observations can be accelerated. These workloads include recommendation, topic modeling and word embedding.

Second, cuMF sheds light on accelerating machine learning algorithms involving sparse and graph data. We are glad to see that NVIDIA is also moving toward this direction by releasing nvGRAPH in the forthcoming CUDA 8.

**B: When and why did you start looking at using NVIDIA GPUs?**

W: We started looking at using NVIDIA GPUs to solve this problem in late 2014. We thought NVIDIA GPUs massive floating point throughput and large memory bandwidth would help in accelerating MF.

**What is your GPU Computing experience and how have GPUs impacted your research?**

It was a very smooth experience learning and using CUDA. I learned a lot from the two online parallel programming courses that both use CUDA — Intro to Parallel Programming and Heterogeneous Parallel Programming. There are a few excellent books that I can refer to, including *Programming Massively Parallel Processors: A Hands-on Approach*, by David Kirk and Wen-mei Hwu. In addition, I always get useful information and answers from the NVIDIA Forums.

GPUs have helped cuMF to be the state-of-the-art MF solution in terms of speed. Our research result is to be published in HPDC [2], a premier high performance computing conference.

**B: What’s next for cuMF? **

W: We have open-sourced cuMF. In future work we plan to make cuMF more user-friendly by providing a Python interface. We also want to further enhance it by using the latest hardware and software features such as NVLink, FP16 and NCCL.

**B: In terms of technology advances, what are you looking forward to in the next five years?**

W: In my personal opinion, in the next five years HPC technology is going to impact many aspects of machine learning. HPC has enabled and will keep enabling much more sophisticated machine learning, dealing with even bigger data, in data center and on devices.

You can read the HPDC paper describing cuMF linked in reference [2], below, or watch the GTC 2016 talk. If you’d like to try out cuMF, check out the source code on Github.

If you are interested in data analytics applications on GPU, you may be interested in the new nvGraph library.

[1] Y. Koren, R. M. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems. Computer, vol. 42, no. 8, pp. 30- 37, 2009.

[2] Wei Tan, Liangliang Cao, Liana Fong. Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. HPDC, Kyoto, Japan, 2016

]]>OpenAI researcher John Schulman shared some details about his organization, and how OpenAI Gym will make it easier for AI researchers to design, iterate and improve their next generation applications. John studied physics at Caltech, and went to UC Berkeley for graduate school. There, after a brief stint in neuroscience, he studied machine learning and robotics under Pieter Abbeel, eventually honing in on reinforcement learning as his primary topic of interest. John lives in Berkeley, California, where he enjoys running in the hills and occasionally going to the gym.

**What is OpenAI? What is your mission?**

OpenAI is a non-profit artificial intelligence research company. Day to day, we are working on research projects in unsupervised learning and reinforcement learning. Our mission and long-term goal is to advance artificial intelligence in the ways that will maximally benefit humanity as a whole.

**What is reinforcement learning? How is it different from supervised** **and unsupervised learning?**

Reinforcement learning (RL) is the branch of machine learning that is concerned with making sequences of decisions. It assumes that there is an *agent* that is situated in an *environment*. At each step, the agent takes an *action*, and it receives an *observation* and *reward* from the environment. An RL algorithm seeks to maximize the agent’s total reward, given a previously unknown environment, through a learning process that usually involves lots of trial and error.

The reinforcement learning problem sketched above, involving a reward-maximizing agent, is extremely general, and RL algorithms have been applied in a variety of different fields. They have been applied in business management problems such as deciding how much inventory a store should hold or how it should set prices. They have also been applied to robotic control problems, and rapid development is currently occurring in this area. The following video shows Hopper: a two-dimensional one-legged robot trained to hop forward as fast as possible with OpenAI Gym.

While reinforcement learning is focused on making good decisions, supervised and unsupervised learning are mostly focused on making predictions. However, there are lots of connections between these areas, some of which are active topics of research. Besides its different focus, the sequential nature of reinforcement learning also sets it apart from most supervised learning problems. In reinforcement learning, the agent’s decisions affect what input data it receives, i.e., what situations it ends up in. That makes it harder to develop stable algorithms and necessitates *exploration*—the agent needs to actively seek out unknown territory where it might receive large rewards.

**What is OpenAI Gym, and how will it help advance the development of AI?**

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It includes a curated and diverse collection of *environments,* which currently include simulated robotics tasks, board games, algorithmic tasks such as addition of multi-digit numbers, and more. We expect this collection to grow over time, and we expect users to contribute their own environments. These environments all expose a common interface, making it possible to write generic algorithms that can be applied to many different environments.

OpenAI Gym also has a site where people can post their results on these environments and share their code. The goal is to make it easy for people to iterate on and improve RL algorithms, and get a sense for which algorithms really work.

Just to give you a sense of what the code looks like, here’s how you can create one of our environments (the classic cart-pole task, where the aim is to balance a vertical pole on a rolling cart), simulate some random actions, and then submit the results to the scoreboard. (In reality, you’ll only want to submit your results if you’ve implemented a learning algorithm.)

import gym env = gym.make("CartPole-v0") # Creates the classic cart-pole environment env.monitor.start("/tmp/gym-results", algorithm_id="random") # ^^^ starts logging statistics and recording videos for _ in xrange(1000): action = env.action_space.sample() # random action observation, reward, done, info = env.step(action) env.monitor.close() gym.upload("/tmp/gym-results", api_key="JMXoHnlRtm86Fif6FUw4Qop1DwDkYHy0") # ^^^ upload stats to the website

This snippet does not include any learning or training—that would require much more code. Soon we are hoping to show clean implementations of various important algorithms on OpenAI Gym environments. If you’re interested, keep an eye on our website.

**How are neural networks used in reinforcement learning?**

To answer this, I’ll need to talk a little bit about what RL algorithms learn. Some reinforcement learning algorithms are focused on learning a *policy*, which is a function that takes in observations (e.g., camera images) and outputs actions (e.g., motor torques). Other algorithms are focused on learning *value functions*, which measure the goodness of states (i.e., the state of the world) and actions. Since we usually don’t have access to the full state of the world, we typically use one or more past observations instead. A Q-function (a type of value function) measures the goodness of a state-action pair , i.e., tells you “how much reward am I going to receive, if I’m in state s, and I choose action a”. Given this Q-function, you can simply choose the action with highest expected reward. In other words, the Q-function defines a policy. The following video shows a Deep Q-Network (DQN) trained to play Breakout on OpenAI Gym.

Policy-based algorithms and Q-function-based algorithms are very similar at their core, and we can use neural networks to represent the policies and Q-functions. For example, when playing Atari games, the input to these networks is an image of the screen, and there is a discrete set of actions, e.g. {NOOP, LEFT, RIGHT, FIRE}. As a Q-function for this task, you can use a convolutional neural network that takes the screen image as input and outputs a real number for each of the 4 actions, specifying the goodness of that action (Mnih et al, 2013). As a policy for this task, you can use a convolutional neural network with similar architecture, which instead outputs the probability of each action. (Guo et al., 2014, Schulman et al., 2015).

**What makes OpenAI Gym unique? Are there other similar open source environments currently available?**

There are various open-source collections of environments, including but not limited to RL-Glue, RLPy, and Arcade Learning Environment. We’ve drawn inspiration and code from these libraries. OpenAI Gym also integrates work done recently by researchers at UC Berkeley on benchmarking deep reinforcement learning algorithms. A paper describing this benchmark is available on the ArXiv and will be presented at ICML this year.

OpenAI Gym goes beyond these previous collections by including a greater diversity of tasks and a greater range of difficulty (including simulated robot tasks that have only become plausibly solvable in the last year or so). Furthermore, OpenAI Gym uniquely includes online scoreboards for making comparisons and sharing code.

**Who will use OpenAI Gym? How will AI researchers benefit from RL-Gym?**

We hope to make OpenAI Gym accessible to people with different amounts of background. Users with no background in RL can download the codebase and get started experimenting in minutes. They can visit the scoreboards for different environments and download the code linked to solutions. Then they can verify the solutions (an important and useful service!) and tinker with them.

AI researchers will be able to use the included environments for their RL research. Each environment is semantically versioned, making it possible to report results in papers and have them remain meaningful. Researchers will be able to compare their algorithms’ performance with others on the scoreboard, and find code for the leading algorithms.

**Do you have plans to accelerate OpenAI Gym with NVIDIA GPUs? How will GPUs benefit your work?**

GPUs are becoming indispensable for learning problems that involve large neural networks. We will be using GPUs for training networks on the larger-scale tasks, and we expect many of our users to do the same.

Also, our current environments include robotics simulations, and we expect to include more complicated and realistic 3D environments in the future. GPUs will enable high-quality rendering and physics simulation in these more complicated 3D environments.

** Would more realistically rendered environments be useful to transfer learning to the real-world?**

Yes, I believe that photorealistic rendering is good enough to allow robots to be trained in simulation, so that the learned policies will transfer to the real world. Lots of exciting possibilities lie ahead.

**What’s next for OpenAI and OpenAI Gym?**

We’ll soon be publishing results from some of our ongoing research projects in unsupervised learning and reinforcement learning. We’re excited to see what users do with OpenAI Gym, and plan to continue building it into a tool that’s great for the research community as well as newcomers to the field.

To learn more see the OpenAI Gym announcement blog post. For an introduction to Deep Learning, check out the Deep Learning in a Nutshell series.

]]>NCCL (pronounced “Nickel”) is a library of multi-GPU collective communication primitives that are topology-aware and can be easily integrated into your application. Initially developed as an open-source research project, NCCL is designed to be light-weight, depending only on the usual C++ and CUDA libraries. NCCL can be deployed in single-process or multi-process applications, handling required inter-process communication transparently. Finally, the API will be very familiar to anyone with experience using MPI’s collectives.

Collective communication routines are common patterns of data transfer among many processors. If you have experience with MPI, then you are probably already familiar with several collective operations. For example, all-reduce starts with independent arrays $latex *V_k$ *of *N* values on each of *K* processors and ends with identical arrays *S** *of *N* values, where $S[k] = V_0[k]+V_1[k]+…+V_k[k]$, for each processor *k*. See Figure 1.

Another common collective is all-gather, where each of K processors begins with an independent array of *N* values, and collects data from all other processors to form a result of dimension *N***K*, as Figure 2 shows.

Broadcast is a third example. Here an *N*-element buffer on one processor is copied to all other processors, as Figure 3 shows.

All of the above collectives can be performed either “out-of-place,” with separate input and output buffers, or “in-place,” with overlapping input and output.

There are numerous approaches to implementing collectives efficiently. However, it is critical that our implementation takes the topology of interconnects between processors into account. For example, consider a broadcast of data from GPU0 to all other GPUs in the PCIe tree topology pictured below.

A two-step tree algorithm is a common choice in this situation. In the first step the data is sent from the GPU0 to a second GPU, and in the second step both of these send data to the remaining processors. However, we have a choice. We can either send data from GPU0 to GPU1 in the first step and then GPU0 to GPU2 and GPU1 to GPU3 in the second, or we can perform the initial copy form GPU0 to GPU2 and then GPU0 to GPU1 and GPU2 to GPU3 in the second step. Examining the topology, it is clear that the second option is preferred, since sending data simultaneously from GPU0 to GPU2 and GPU1 to GPU3 would cause contention on the upper PCIe links, halving the effective bandwidth for this step. In general, achieving good performance for collectives requires careful attention to the interconnect topology.

To optimize Broadcast bandwidth, an even better approach is to treat the PCIe topology above as a ring.

The broadcast is then performed by relaying small chunks of the input around the ring from GPU0 to GPU3. Interestingly, ring algorithms provide near optimal bandwidth for nearly all of the standard collective operations, even when applied to “tree-like” PCIe topologies. But note that selecting the correct ring order remains important.

In order to provide maximum bandwidth, NCCL implements ring-style collectives. NCCL implicitly indexes the GPUs into the optimal ring order under the hood. This provides great performance for your application while freeing you from having to worry about specific hardware configurations.

Many collectives requires a buffer for intermediate results. In order to minimize memory overhead to a few MB on each GPU, NCCL splits large collectives into many small chunks. It would be highly inefficient to launch separate kernels and cudaMemcpy calls for every step and chunk of a collective algorithm. Instead, each collective is implemented by a monolithic CUDA kernel. NCCL makes extensive use of GPUDirect Peer-to-Peer direct access to push data between processors. Where peer-to-peer direct access is not available (e.g., when traversing a QPI interconnect), the pushed data is staged through a buffer in pinned system memory. Similarly, synchronization is performed by polling on volatile variables in device (or pinned system) memory.

Internally, NCCL implements each collective in terms of three primitives: Copy, Reduce, and ReduceAndCopy. Each of these is optimized to efficiently transfer fine-grained slices of data (4-16KB) between GPUs. The kernels have also been specifically optimized to achieve maximum bandwidth at low occupancy. As a result, NCCL can saturate a PCIe 3.0 x16 interconnect using a single block of CUDA threads. This leaves the bulk of the GPU free to execute compute tasks concurrently with the communication.

NCCL currently supports the all-gather, all-reduce, broadcast, reduce, and reduce-scatter collectives. Any number of GPUs can be used, as long as they reside in a single node.

The open-source release of NCCL is available on Github. The library should build on any common Linux distribution and is compatible with CUDA 7.0 and later. CUDA supports direct access only for GPUs of the same model sharing a common PCIe root hub. GPUs not fitting these criteria are still supported by NCCL, though performance will be reduced since transfers are staged through pinned system memory.

The NCCL API closely follows MPI. Before performing collective operations, a communicator object must be created for each GPU. The communicators identify the set of GPUs that will communicate and maps the communication paths between them. We call the set of associated communicators a clique. There are two ways to initialize communicators in NCCL. The most general method is to call `ncclCommInitRank()`

once for each GPU.

ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nGPUs, ncclUniqueId cliqueId, int rank);

This function assumes that the GPU belonging to the specified rank has already been selected using `cudaSetDevice()`

. `nGPUs`

is the number of GPUs in the clique. `cliqueId`

allows the ranks of the clique to find each other. The same `cliqueId`

must be used by all ranks. To achieve this, call `ncclGetUniqueId()`

in one rank and broadcast the resulting `uniqueId`

to the other ranks of the clique using the communication framework of your choice (for example, `MPI_Bcast`

).

The last argument to `ncclCommInitRank()`

specifies the index of the current GPU within the clique. It must be unique for each rank in the clique and in the range `[0, nGPUs)`

. This index is used, for example, to identify the source GPU for a broadcast operation, or to order the contributions to an all-gather.

Upon successful initialization, `ncclCommInitRank()`

returns `ncclSuccess`

and `*comm`

is set to the new NCCL communicator object.

Internally, `ncclInitRank()`

performs a synchronization between all communicators in the clique. As a consequence, it must be called from a different host thread for each GPU, or from separate processes (e.g., MPI ranks). Specifically, calling `ncclInitRank()`

in a single threaded loop over GPUs will result in deadlock.

The second way to initialize the communicators is to use `ncclCommInitAll()`

. This is essentially a convenience routine that spares you the effort of spawning extra host threads to initialize NCCL in an otherwise single-threaded application.

ncclResult_t ncclCommInitAll(ncclComm_t* comms, int nGPUs, int* devList);

The comms argument now refers to an array of `ncclComm_t`

objects, one for each of the `nGPUs`

ranks in the clique. `devList`

specifies which CUDA device gets associated with each rank. With the communicator object initialized, you can call the collectives through their host functions, such as `ncclAllReduce()`

.

ncclResult_t ncclAllReduce(void* sendoff, void* recvbuff, int count, ncclDataType_t type, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream);

Documentation for each collective is provided in `nccl.h`

. Most arguments have analogues in the MPI collectives API. The notable exception is the stream argument. Similar to many CUDA “Async” routines, NCCL collectives schedule the operation in a stream but may return before the collective is complete. By queuing collectives and other compute kernels in separate streams, the GPU is able to overlap collective communication with more compute-intensive workloads. To maximize overlap, schedule NCCL collectives on high-priority streams to allow them to slip in among compute intensive grids.

`ncclAllReduce()`

must be called once for each communicator in the clique, each call providing its own send and receive buffers, etc. NCCL collectives assume that the appropriate CUDA context is already current. Because NCCL is asynchronous, a simple loop can be used to initiate a collective from a single threaded application.

for (int gpu=0; gpu<nGPUs; ++gpu) { cudaSetDevice(devList[gpu]); ncclAllReduce(...); }

The performance that NCCL collectives achieve depends on the exact topology of the computer. In the best case, all GPUs share peer access. This is most common for workstations equipped with a few GPUs. Larger servers usually sport dual CPUs, with some GPUs segregated on different IO hubs. Figure 6 shows NCCL bandwidth for various collectives measured on an NVIDIA Digits DevBox equipped with 4 GeForce GTX Titan X GPUs arranged as in Figure 4 above.

The red bar at 10.4 GB/s represents the bandwidth achieved by a large cudaMemcpy between two of the GPUs (specifically GPU 0 and 2 in Figure 4). NCCL sustains a high percentage of this peak bandwidth while performing communication among all four GPUs.

The goal of NCCL is to deliver topology-aware collectives that can improve the scalability of your multi-GPU applications. By using NCCL you can get great performance without having to think about low-level hardware details.

Simple code examples for both single-process and MPI applications are distributed with NCCL. As a research project, we welcome your feedback as we continue to evolve the project. Please try it out and let us know what you think! For an update on the latest developments, come see my NCCL talk at GTC.

]]>This week at GTC 2016, we announced the latest update to NVIDIA Deep Learning SDK, which now includes cuDNN 5. Version 5 offers new features, improved performance and support for the latest generation NVIDIA Tesla P100 GPU. New features in cuDNN 5 include:

- Faster forward and backward convolutions using the Winograd convolution algorithm;
- 3D FFT Tiling;
- Spatial Transformer Networks;
- Improved performance and reduced memory usage with FP16 routines on Pascal GPUs;
- Support for LSTM recurrent neural networks for sequence learning that deliver up to 6x speedup.

One of the new features we’ve added in cuDNN 5 is support for Recurrent Neural Networks (RNN). RNNs are a powerful tool used for sequence learning in a number of fields, from speech recognition to image captioning. For a brief high-level introduction to RNNs, LSTM and sequence learning, I recommend you check out Tim Dettmers recent post Deep Learning in a Nutshell: Sequence Learning, and for more depth, Soumith Chintala’s post Understanding Natural Language with Deep Neural Networks Using Torch.

I’m excited about the RNN capabilities in cuDNN 5; we’ve put a lot of effort into optimizing their performance on NVIDIA GPUs, and I’ll go into some of the details of these optimizations in this blog post.

cuDNN 5 supports four RNN modes: ReLU activation function, tanh activation function, Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM). In this case study I’ll look at the performance of an LSTM network, but most of the optimizations can be applied to any RNN.

The following equations govern the forward propagation of data through an LSTM unit. Figure 2 shows a diagram of an LSTM unit.

From a computational perspective this boils down to eight matrix-matrix multiplications (GEMMs)—Four with input i, four with input h—and lots of point-wise operations.

The starting point for this case-study is an LSTM implemented operation-by-operation. For each iteration, for each layer, the implementation calls cuBLAS `sgemm`

to perform each of the eight GEMMs, and hand-written CUDA kernels to call each of the point-wise operations. Pseuduocode for the method follows.

for layer in layers: for iteration in iterations: perform 4 SGEMMs on input from last layer perform 4 SGEMMs on input from last iteration perform point-wise operations

As a benchmark I measure run time per step, per layer on a Tesla M40 GPU. My benchmark LSTM has 512 hidden units and computes mini batches of size 64. The performance of this baseline implementation is fairly poor, achieving approximately 350 GFLOPS on the M40. The peak performance of this GPU is around 6000 GFLOPs, so there is a lot of room to improve. Let’s get started.

GPUs have very high peak floating-point throughput, but they need a lot of parallelism to approach this peak. The more parallel work you give them, the higher the performance they can achieve. Profiling this LSTM code shows that the GEMM operations use significantly fewer CUDA thread blocks than there are SMs on the GPU, indicating the GPU is massively underused.

GEMM is typically parallelized over the output matrix dimensions, with each thread computing many output elements for maximum efficiency. In this case each of the eight output matrices comprises 512×64 elements, which results in only four thread blocks. Ideally can run significantly more blocks than the GPU has SMs, and to maximize the theoretical occupancy for this kernel at least four blocks per SM (or 96 in total) are needed. (See the CUDA Best Practices guide for more on occupancy.)

If *n* independent matrix multiplications share the same input, then they can be combined into one larger matrix multiplication with an output *n* times larger. The first optimization is therefore to combine the four weight matrices operating on the recurrent step into one weight matrix, and to combine the four weight matrices operating on the input into another. This gives us two matrix multiplications instead of eight, but each is four times the size and has four times the parallelism (16 blocks per GEMM). This optimization is fairly common in most framework implementations: it’s a very easy change that leads to a good speedup: the code runs roughly 2x faster.

Even with the larger combined GEMMs, performance is still limited by lack of parallelism: there are 16 blocks instead of four, but the target is to have at least 96. The two remaining GEMMs are independent so they can be computed concurrently using CUDA streams. This doubles the number of possible concurrent blocks to 32.

Figure 3 shows that now a lot of time is spent in point-wise operations. There’s no need to do these in separate kernels; fusing them into a single kernel reduces data transfers to and from global memory and significantly reduces kernel launch overhead.

At this point I’m fairly happy with the performance of a single iteration: the majority of the computation is in the GEMMs, and they’re exposing as much parallelism as they can. This implementation is about 5x faster than the baseline implementation, but there are more improvements to come.

for layer in layers: for iteration in iterations: perform sgemm on input from last layer in stream A perform sgemm on input from last iteration in stream B wait for stream A and stream B perform point-wise operations in one kernel

In an RNN the operations for a single iteration are repeated many times. This means it’s important to have those operations running efficiently, even if this comes at an upfront cost.

When performing a GEMM the standard BLAS API allows you to transpose either of the two input matrices. Some of the four combinations of transpose/not-transposed run slightly faster or slower than others. Depending on the way that the equations are mapped to the computation, a slower version of the GEMM may be used. By performing a transpose operation up-front on the weight matrix, each step can be made slightly faster. This comes at the cost of the transpose, but that is fairly cheap, so if the transposed matrix is to be used for more than a few iterations it is often worth it.

In many cases all of the inputs are available at the start of the RNN computation. This means that the matrix operations working on these inputs can be started immediately. It also means that they can be combined into larger GEMMs. While at first this may seem like a good thing (there’s more parallelism in the combined GEMMs), propagation of the recurrent GEMMs depends upon the completion of the input GEMMs. So there’s a tradeoff: combining input GEMMs gives more parallelism in that operation, but also prevents overlap with the recurrent GEMMs. The best strategy here depends a lot on the RNN hyperparameters. Combining two input GEMMs works best in this case.

for layer in layers: transpose weight matrices for iteration in iterations / combination size: perform sgemm on combined input from last layer in stream A for sub-iteration in combination size: perform sgemm on input from last iteration in stream B wait for stream A wait for stream B for sub-iteration in combination size; perform pointwise operations in one kernel

The final step is to consider optimization between layers. Once again, there’s quite a lot of parallelism to be found here. Figure 4 shows the dependency graph for an RNN. As iteration n of a given layer only depends on iteration n-1 of that layer and iteration n of the previous layer it is possible to start on a layer before you’ve finished on the previous layer. This is really powerful: if there are two layers there is twice as much parallelism available.

Going up from one to four layers, throughput increases by roughly 1.7x: from 2.3 TFLOPs to 3.9 TFLOPs. At this point the gains from exposing more parallelism are starting to become more limited. Compared to the original implementation which only had four blocks running at any given time, this implementation can run up to 128 blocks concurrently. This is enough to make use of all of the M40’s resources, achieving nearly 70% of peak floating point performance and running more than 10x faster than the original implementation.

The following table shows the performance achieved after each of the optimizations I have described and the speedup vs. the baseline code.

Optimization | GFLOPS | Speedup |

Baseline | 349 | (1.0x) |

Combined GEMMs | 724 | 2.1x |

GEMM Streaming | 994 | 2.8x |

Fused point-wise operations | 1942 | 5.5x |

Matrix pre-transposition | 2199 | 6.3x |

Combining Inputs | 2290 | 6.5x |

Four layers | 3898 | 11.1x |

Propagating gradients through the backward pass is very similar to propagating values forward. Once the gradients have been propagated the weights can be updated in a single call spanning all iterations: there are no longer any recurrent dependencies. This results in a very large efficient matrix multiplication.

To get the best performance out of Recurrent Neural Networks you often have to expose much more parallelism than direct implementation of the equations provides. In cuDNN we’ve applied these optimizations to four common RNNs, so I strongly recommend that you use cuDNN 5 if you are using these RNNs in your sequence learning application.

For more information:

- Watch my GTC 2016 talk S6165, either live Thursday 7th at 14:00 in Room 210H, or via the recording available soon after.
- Read our paper describing these methods in more detail on arXiv.
- Download the code used for this blog post on Github.

Other resources:

]]>- Support for the Pascal GPU architecture, including the new Tesla P100 accelerator;
- New Unified Memory capabilities;
- The new nvGRAPH GPU-Accelerated Graph Analytics library;
- Powerful new profiling capabilities; and
- Improved compiler performance and heterogeneous lambda support.

To learn more at GTC 2016, you can check out my talk from GTC 2016, “CUDA 8 and Beyond” (GTC on-demand site registration required).

A crucial goal for CUDA 8 is to provide support for the powerful new Pascal architecture, the first incarnation of which was launched at GTC today: Tesla P100. For full details on P100 and the Pascal GP100 GPU architecture, check out the blog post “Inside Pascal“.

In a nutshell, Tesla P100 provides massive double-, single- and half-precision computational performance, 3x the memory bandwidth of Maxwell GPUs via HBM2 stacked memory, and with its support for NVLink, up to 5x the GPU-GPU communication performance of PCI Express. Pascal also improves support for Unified Memory thanks to a larger virtual address space and a new page migration engine.

CUDA 8 will enable CUDA applications to get high performance on Tesla P100 out of the box. Moreover, improvements in CUDA 8 enable developing efficient code for new Tesla P100 features such as NVLink and improved Unified Memory.

Unified Memory is an important feature of the CUDA programming model that greatly simplifies programming and porting of applications to GPUs by providing a single, unified virtual address space for accessing all CPU and GPU memory in the system. Pascal GP100 features provide a significant advancement for GPU computing by expanding the capabilities and improving the performance of Unified Memory.

CUDA 6 introduced Unified Memory, which creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The CUDA system software automatically migrates data allocated in Unified Memory between GPU and CPU, so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU. For details of how Unified Memory in CUDA 6 and later simplifies porting code to the GPU, see the post “Unified Memory in CUDA 6”.

CUDA 6 Unified Memory was limited by the features of the Kepler and Maxwell GPU architectures: all managed memory touched by the CPU had to be synchronized with the GPU before any kernel launch; the CPU and GPU could not simultaneously access a managed memory allocation; and the Unified Memory address space was limited to the size of the GPU physical memory.

Expanding on the benefits of CUDA 6 Unified Memory, Pascal GP100 adds features to further simplify programming and sharing of memory between CPU and GPU, and allowing easier porting of CPU parallel compute applications to use GPUs for tremendous speedups. Two main hardware features enable these improvements: support for large address spaces and page faulting capability.

GP100 extends GPU addressing capabilities to enable 49-bit virtual addressing. This is large enough to cover the 48-bit virtual address spaces of modern CPUs, as well as the GPU’s own memory. Therefore, GP100 Unified Memory allows programs to access the full address spaces of all CPUs and GPUs in the system as a single virtual address space, unlimited by the physical memory size of any one processor.

Memory page faulting support in GP100 is a crucial new feature that provides more seamless Unified Memory functionality. Combined with the system-wide virtual address space, page faulting provides several benefits. First, page faulting means that the CUDA system software doesn’t need to synchronize all managed memory allocations to the GPU before each kernel launch. If a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to the GPU memory on-demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration). Note that Unified Memory is system-wide: GPUs (and CPUs) can fault on and migrate memory pages either from CPU memory or from the memory of other GPUs in the system.

With the new page fault mechanism, global data coherency is guaranteed with Unified Memory. This means that with GP100, the CPUs and GPUs can access Unified Memory allocations simultaneously. This was illegal on Kepler and Maxwell GPUs, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active. Note, as with any parallel application, developers need to ensure correct synchronization to avoid data hazards between processors.

Finally, on supporting platforms, memory allocated with the default OS allocator (e.g. ‘malloc’ or ‘new’) can be accessed from both GPU code and CPU code using the same pointer. On these systems, Unified Memory is the default: there is no need to use a special allocator or for the creation of a special managed memory pool. Moreover, GP100’s large virtual address space and page faulting capability enable applications to access the entire system virtual memory. This means that applications can oversubscribe the memory system: in other words they can allocate, access, and share arrays larger than the total physical capacity of the system, enabling out-of-core processing of very large datasets.

Certain operating system modifications are required to enable Unified Memory with the system allocator. NVIDIA is collaborating with Red Hat and working within the Linux community to enable this powerful functionality.

To learn more about the Tesla P100 accelerator and the Pascal architecture, see the blog post Inside Pascal.

There are two main ways that programmers benefit from Unified Memory.

**Simpler programming and memory model**

Unified Memory lowers the bar of entry to parallel programming on GPUs, by making explicit device memory management an optimization, rather than a requirement. Unified Memory lets programmers focus on developing parallel code without getting bogged down in the details of allocating and copying device memory. This makes it easier to learn to program GPUs and simpler to port existing code to the GPU. But it’s not just for beginners; Unified Memory also makes complex data structures and C++ classes much easier to use on the GPU. With GP100, applications can operate out-of-core on data sets that are larger than the total memory size of the system. On systems that support Unified Memory with the default system allocator, any hierarchical or nested data structure can automatically be accessed from any processor in the system.

**Performance through data locality**

By migrating data on demand between the CPU and GPU, Unified Memory can offer the performance of local data on the GPU, while providing the ease of use of globally shared data. The complexity of this functionality is kept under the covers of the CUDA driver and runtime, ensuring that application code is simpler to write. The point of migration is to achieve full bandwidth from each processor; the 750 GB/s of HBM2 memory bandwidth is vital to feeding the compute throughput of a GP100 GPU. With page faulting on GP100, locality can be ensured even for programs with sparse data access, where the pages accessed by the CPU or GPU cannot be known ahead of time, and where the CPU and GPU access parts of the same array allocations simultaneously.

An important point is that CUDA programmers still have the tools they need to explicitly optimize data management and CPU-GPU concurrency where necessary: CUDA 8 introduces useful APIs for providing the runtime with memory usage hints (`cudaMemAdvise()`

) and for explicit prefetching (`cudaMemPrefetchAsync()`

). These tools allow the same capabilities as explicit memory copy and pinning APIs without reverting to the limitations of explicit GPU memory allocation.

In addition to Pascal support in CUDA 8, CUDA 8 platform support for Unified Memory expands to Mac OS X. Now developers using Macs with NVIDIA GPUs can take advantage of the benefits and convenience of Unified Memory in their applications.

Graphs are mathematical structures used to model pairwise relations between objects. Graphs can be used to model many types of relations and processes in physical, biological, social and information systems, and their use is becoming increasingly common in the solutions to high-performance data analytics problems.

Corporations, scientists and non-profit groups have access to large graphs representing the social and commercial activities of their customers and users and there is tremendous untapped opportunity for them to use that data to communicate more effectively, create better product, and reduce waste by discovering valuable patterns in the information.

Cyberanalytics is the application of data mining, graph theory and networking technology to understand internet traffic patterns, especially toward detecting attacks on secure systems and identifying their sources. It may also be used in an ‘offensive’ capacity to identify central points of failure or critical paths in a network. Currently this is limited to mostly forensic, after-the-fact analysis, due to the large amounts of data involved and computation required, and the immaturity of tools in this area.

Graph analytics are very important since the internet is best described as a graph of links between computer systems. These links have a time-dependent nature as well, which makes cyberanalytics a dynamic graph problem. The goal is to increase performance at scale to move away from ‘forensic’ analytics and towards real-time, interactive analytics with the capability to respond to and even to prevent attacks. Key algorithms for analysis are Breadth First Search (BFS), Minimal Spanning Tree (MST), connected components (CC), Time dependent and congestion aware routing (shortest path), clustering, pagerank, partitioning, and pattern matching.

Genomics is the study of how genes interact in a cell and of variations in genes both over time and across a population. Humans have about 21000 genes which code for proteins, and a larger number of non-coding RNA strands which regulate the activity of those genes; each gene can typically produce several different proteins. These interactions can be described as graphs, and graph methods are used to find variations (branches) from a ‘standard’ genome (as far as one exists). Since each genome is unique, assembling a full sequence requires trying many options for branches and locations in a nearest-fit search. A human genome contains about 3 billion base pairs, so this search requires solving a very large graph problem. We can reduce it by two orders of magnitude by focusing on genes instead of bases and working with graph methods. Graph partitioning and shortest path search are key algorithms for genomics.

The computational requirements of large-scale graph processing for cyberanalytics, genomics, social network analysis and other fields demand powerful and efficient computing performance that only accelerators can provide. As more and more companies figure out how to capture data from customer interactions and networks of sensors, the need for real-time analysis of commercial-scale graph data will be ubiquitous. nvGRAPH is a new library of GPU-accelerated graph algortithms that aims to make real-time graph analytics possible without the need to spend time sampling or breaking up the data into smaller graphs.

The first release of nvGRAPH included in CUDA 8 supports some of the key algorithms mentioned above, and we intend to make nvGRAPH the fastest implementation on GPUs for all of the algorithms mentioned. Specifically, nvRank 1.0 supports PageRank, Single-Source Shortest Path, and Single-Source Widest Path. PageRank is an important algorithm useful in Internet and other search applications, recommendation engines, and social ad placement, among others. Single-Source Shortest Path is useful for path planning in robotics and autonomous vehicles and power network, logistics, and supply chain planning. Single-source widest path is useful for IP routing, chip design and EDA, and traffic sensitive routing.

nvGRAPH 1.0 provides single-GPU implementations of these algorithms. nvGraph achieves a 4x speedup running PageRank on an 84-million-edge Wikipedia graph on a single K40 GPU, compared to a CPU implementation using MKL on a 48-core Xeon E5-2697 as the chart above shows.

In heterogeneous applications that do significant computation on both CPUs and GPUs, it can be a challenge to locate the best place to spend your optimization effort. Ideally, when optimizing your code, you would like to target the locations in the application that will provide the highest speedup for the least effort. To this end, we are continuously improving the NVIDIA profiling tools: NSight, NVIDIA Visual Profiler (nvvp). CUDA 7.5 introduced PC sampling, providing instruction-level profiling so that you could pinpoint specific lines of code that are taking the most time in your application.

But the longest-running kernel in your application is not always the most critical optimization target. As the Figure 6 shows, sometimes a kernel with a shorter run time may be holding up the CPU from proceeding. In the image, Kernel X is the longest running, but speeding up Kernel Y will reduce the time the CPU spends waiting, so it is the best optimization target.

In CUDA 8, the Visual Profiler provides dependency analysis between GPU kernels and CPU CUDA API calls, enabling critical path analysis in your application to help you more profitably target your optimization effort. Figure 7 shows critical path analysis in the CUDA 8 Visual Profiler. You can see that with the critical path focused, GPU kernels, copies, and API calls that are not on the critical path are greyed out.

Figure 8 shows what happens when we highlight execution dependencies in the Visual Profiler.

In addition to critical path analysis, CUDA 8 also provides the ability to profile both CPU and GPU code in the same application, to provide a list of CPU hotspots and call hierarchy, as well as visualizing the CPU source code in the profiler along with the GPU code.

With CUDA 8 you can also profile OpenACC code, just as you can with CUDA C++ code. And the Visual Profiler adds support for two important Pascal features: NVLink and Unified Memory. The profiler can show the topology of NVLink connections in your system and profile the bandwidth achieved across the links. For Unified Memory, it shows page faults and migrations on the timeline, and allows introspection into the sources of page faults.

The NVCC compiler in CUDA 8 has been optimized for compilation time, so that you spend less time waiting for the CUDA files in your application to compile. Compilation time is 2x or more faster for a range of codes, especially those that heavily use C++ templates, such as Thrust and Eigen 3. Figure 9 shows some characteristic speedups.

Lambda expressions are a powerful C++11 feature that enable anonymous functions (and closures) that can be defined in line with their use, can be passed as arguments, and can capture variables. I wrote at length about the C++11 features in CUDA 7 in my blog post The Power of C++11 Programming in CUDA 7. CUDA 7.5 extended this with experimental support for GPU lambdas: anonymous device function objects that you can define in host code, by annotating them with a `__device__`

specifier. (I wrote about CUDA 7.5 here.)

The GPU lambda support in CUDA 8 expands to support heterogeneous lambdas: lambda expressions annotated with a `__host__ __device__`

specifier, which allows them to be called on either the CPU or GPU, just like other `__host__ __device__`

functions.

Heterogeneous lambdas allow us to modify the Thrust SAXPY example from my CUDA 7.5 blog post so that it can run on either the CPU or GPU, as the following code shows.

void saxpy(float *x, float *y, float a, int N) { using namespace thrust; auto r = counting_iterator(0); auto lambda = [=] __host__ __device__ (int i) { y[i] = a * x[i] + y[i]; }; if(N > gpuThreshold) for_each(device, r, r+N, lambda); else for_each(host, r, r+N, lambda); }

GPU lambda support in CUDA 8 is experimental, and must be enabled by passing the flag `--expt-extended-lambda`

to NVCC at compilation time.

CUDA 8 is the most feature-packed and powerful release of CUDA yet. CUDA 8 will be available in August 2016 and there will be a release candidate available around June. To get started with CUDA, download the latest CUDA Toolkit.

To learn more about the Tesla P100 accelerator and the Pascal architecture, see the blog post Inside Pascal.

]]>Did you see the White House’s recent initiative on Precision Medicine and how it is transforming the ways we can treat cancer? Have you avoided clicking on a malicious website based on OpenDNS’s SecureRank predictive analytics? Are you using the Wikidata Query Service to gather data to use in your machine learning or deep learning application? If so, you have seen the power of graph applications. Graphs are not just nodes and links, they are powerful data structures you can use to represent complex dependencies in your data. Graph applications are everywhere, ranging from cancer research to large-scale cyber threat detection to collaborative filtering recommendation systems. Social networks, protein networks, and communication networks are all examples that can benefit from graph-based techniques.

See BlazeGraph talks at GTC 2016 or visit Booth #206. Register now to save 20% with code “S7B2XS”!

Unless you’ve been heads down performance tuning your CUDA code for the last year, you’ve seen the buzz around GPU-accelerated big data analytics. We foresee the trend of increasing memory bandwidth and the arrival of NVLink and Pascal enabling faster and easier multi-GPU computing. So how do you apply all that bandwidth for graph analytics?

At BlazeGraph, our answer is to build software that exploits the bandwidth advantages of GPUs to accelerate graph and predictive analytics. In this blog post, I will describe our journey so far and show you where we’re heading in the future.

In 2013, we started working on the DARPA XDATA program. This lead to the development of MapGraph, a high-level API for GPU-accelerated graph analytics, in 2014. We first started using libraries like moderngpu, cub, and others in our software, which we still use today. Building on prior success in scalable graph traversal on GPUs, which showed the potential for graphs on GPUs and with DARPA funding, we collaborated with the University of Utah SCI Institute to extend MapGraph to run on multi-GPU clusters. At the IEEE Bigdata conference later in 2014 we demonstrated a breadth-first search (BFS) on a Graph 500 scale 27 (4.3 billion edges) graph at a rate of 29.1 billion traversed edges per second (GTEPS) running on a cluster of 64 NVIDIA K40 on the PSG Cluster.

At the time, we had the world’s fastest BFS result for 64 GPUs. If you look at GTEPS in terms of performance over hardware cost, the multi-GPU solution provides as much as a 10x advantage over a comparable supercomputer. In 2015, we continued to develop this technology, consolidating it with our open source graph database platform. We announced our new GPU-accelerated product, now called Blazegraph, in December 2015.

Given the great rise in interest in and uptake of Apache Spark for large scale data processing in 2015, we were curious about how GPUs compared to Spark performance. We did a comparison of our multi-GPU implementation of BFS to a Spark implementation in Scala using GraphX, the Spark graph processing framework. We used the ICWSM 2010 Twitter dataset, which has 1.9 billion edges. We compared to several different CPU-based configurations. The first was a single “fat” node with dual K-80s on which we processed a 700 million-edge subset of the data. The second and third were a 16-node cluster of Amazon EC2 instances of type r3.2xlarge (488 GB of RAM) and a 16-node cluster of r3.4xlarge (1,952 GB of RAM), respectively. We compared these to a cluster of 16 K40s (192 GB of RAM) using the full 1.9B edge graph. Note the relative differences in the amount of RAM between the CPU and GPU clusters. Figure 3 shows the GPU speedups we achieved.

In all cases, the multi-GPU solution was 700-1800X faster than the CPU-based Spark implementation. The comparisons are taken only for the execution of the BFS graph traversals. We took extra care to make sure that Spark RDDs were forced into memory (`MEMORY_ONLY_SER`

) for the comparison. We first presented these results at Super Computing 2015. While this result is impressive, it was not unexpected. Spark and GraphX have a stated design tradeoff to “achieve a balance between expressiveness, performance, and ease of use”. When we compare to optimized, multi-core CPU implementations our own results show 3.9X to 100X speedups. These results are true even at a very large scale where others have shown 7.8X speedup on clusters of 4096 GPUs.

However, what really excites us is where Spark succeeds wildly—in making it easy for you to load large data sets and quickly develop complex analytics. You can be up and running graph analytics on a Spark cluster in only a few minutes. It’s a bit more complicated to set up a GPU cluster and run PageRank. What if you could get the full speed of CUDA and GPUs through an environment like Spark?

So, GPUs are fast; very fast for graph processing and analytics where memory bandwidth is a bottleneck. *Great; we can get over 100x acceleration by moving to GPUs. Let’s go find some analytics developers and business analysts to develop algorithms for these multi-GPU clusters.* As we discovered, overcoming the technical challenges of processing graphs on multi-GPU systems is just the beginning. We’ve found that individuals with the combination of expertise to understand a business problem and build an algorithm to improve it, often do not have the set of skills required to optimize the implementation on a multi-GPU system; and vice-versa.

It is non-trivial to scale applications onto multi-core, parallel architectures. GPU algorithms development requires significant knowledge of CUDA and the CPU and GPU memory systems. We saw a need to both accelerate existing high-level APIs for graph query and to enable iterative graph analytics at a high level, while delivering the speed of CUDA and multi-GPU systems.

Before we became GPU-junkies, we were graph nerds (still in recovery). The SPARQL language is the W3C standard query language for graph data represented using the Resource Description Framework (RDF). You can think of it much like SQL for graphs. Our Blazegraph DB open-source platform provides a non-GPU implementation of a graph database that supports RDF and SPARQL as well as the Tinkerpop APIs for property graphs. Blazegraph GPU is our new capability that provides GPU-acceleration for SPARQL queries based on our existing platform. We’ve just started shipping this in Q1 2016 to our first customers. If you’re going to be at GTC this year, check out our talk to find out about this capability on the Cirrascale Cloud.

We benchmarked our GPU-accelerated version using the Lehigh University Benchmark (LUBM). LUBM is a well-known graph benchmark using extensional queries over a large data set. It generates synthetic data that models a university structure: students take courses, courses are taught by professors, and so forth. The benchmark provides a set of 14 queries. Query #9 is an example of a conjunctive query that is one of the most challenging in the benchmark:

SELECT ?x ?y ?z WHERE{ ?x a ub:Student . ?y a ub:Faculty . ?z a ub:Course . ?x ub:advisor ?y . ?y ub:teacherOf ?z . ?x ub:takesCourse ?z . }

This is a conjunctive query that asks, “Show me all of the students who take a course taught by a faculty member who is also the student’s advisor.” We ran the benchmark using the LUBM 1000 University synthetic data, which contains 167,697,263 edges. This size is the largest LUBM configuration that will fit on the memory of a single K40 (or half of the memory on a K80 accelerator). We compared the performance of our non-GPU platform (“core”), our GPU-accelerated platform (“GPU”), and a cpu-only version of the operators (“CPU”) that implement SPARQL queries. The following table shows the results for all 14 queries.

average runtime [ms] |
Speedups (pos. numbers indicate gain) |
|||||

U1000 |
blazegraph core | blazegraph CPU | blazegraph GPU | GPU vs. core |
CPU vs. core |
GPU vs. CPU |

query1.txt |
27 | 34 | 45 | -40.74% | -20.79% | -25.19% |

query10.txt |
26 | 31 | 36 | -27.78% | -16.13% | -13.89% |

query11.txt |
33 | 56 | 35 | -5.66% | -40.12% | 57.55% |

query12.txt |
32 | 46 | 43 | -25.58% | -29.93% | 6.20% |

query13.txt |
25 | 33 | 39 | -35.04% | -24.00% | -14.53% |

query14.txt |
21 | 29 | 24 | -11.27% | -27.59% | 22.54% |

query2.txt |
111235 | 59729 | 134 | 83118.20% | 86.23% | 44584.79% |

query3.txt |
29 | 35 | 39 | -25.86% | -17.31% | -10.34% |

query4.txt |
48 | 48 | 66 | -27.27% | 0.00% | -27.27% |

query5.txt |
44 | 48 | 42 | 3.15% | -9.03% | 13.39% |

query6.txt |
27 | 23 | 24 | 12.50% | 19.12% | -5.56% |

query7.txt |
47 | 47 | 58 | -18.86% | 0.00% | -18.86% |

query8.txt |
123 | 233 | 61 | 102.75% | -47.13% | 283.52% |

query9.txt |
25450 | 73077 | 228 | 11078.77% | -65.17% | 31998.39% |

total [ms] |
137167 | 133468 | 873 | 15606.18% | 2.77% | 15182.56% |

total [s] |
137.17 | 133.47 | 0.87 | |||

total [m] |
2.29 | 2.22 | 0.01 |

You’ll see that overall for Query 9, the example above, we have a 110X speedup over our non-GPU core version. Over all of the queries, our GPU platform was on average 156X faster than the non-GPU version. This means that you can realized 150X performance improvement for your existing graph database just by adding a GPU. That’s not too bad, but if you look at the “compute” execution line for Query 9 in the NVIDIA Visual Profiler in Figure 4, you’ll see that there’s lots of room for performance tuning (gaps in the timeline) using different memory allocation approaches such as cnmem.

Graph query is an important capability, but many algorithms such PageRank also need iterative graph traversal capabilities. To solve this challenge, we developed DASL (pronounced “dazzle”). DASL is an acronym for a *Domain specific language for graphs with Accelerated Scala using Linear algebra* (quite a mouthful). With DASL, we provide a binding to the Scala programming language for graphs and machine learning algorithms. Blazegraph DASL is our system that executes DASL programs by automatic translation into program code that is optimized for GPUs. DASL enables you to implement complex applications that efficiently run on GPUs, without the need for parallel programming expertise or low-level device optimization.

There is a significant amount of work with graphs as matrix representations and graph algorithms as operations on those matrices. You can easily represent graphs as adjacency or incidence matrices. A duality with Linear Algebra (LA) operations over the matrices that represent fundamental graph operations enables recasting many graph algorithms—such as Louvain Modularity for community detection or PageRank—as a combination of Linear Algebra operations. For example, you can implement a BFS step as a Vector-Matrix Multiply over an adjacency matrix (without edge weights).

With DASL, you can implement algorithms for graph and big data analytics in terms of the underlying linear algebra for the algorithms, and they will be automatically translated into optimized GPU code. The primary abstractions of DASL are vectors and matrices, where the latter may be used to represent a graph by its adjacency matrix, its incidence matrix, or as an edge-weight matrix. You can specify these abstractions as generic types with a type parameter that captures the type of elements stored in a vector or matrix. To allow for an easy adoption of DASL by analytics experts, we have designed the language so that statements resemble linear algebra formulas. For instance, you can express the statement, given a matrix A and a vector v, using the following DASL statement.

var v2 = A <+>:<*> v

This statement initializes the variable `v2`

with the vector that is the result of the general matrix-vector multiplication of `A`

and `v`

, where the semiring used for this general operation can be an arbitrary semiring that is specified as the default for operations that involve `A`

and `v`

(which may be specified on a per-vector or a per-matrix basis). DASL provides built-in support for a number of typical semirings and basic functions over numeric data types and boolean values. You can also register additional semirings and functions. For instance, the following code snippet registers a unary function over integers and applies it element-wise over an integer vector v.

val myUnaryFct = daslContext.functionRegistry.registerUnaryFunction( (x:Int) => if (x > 100) 100 else x ) val v2 = myUnaryFct.applyElementWise( v )

Blazegraph DASL is in testing now and will be launched at GTC this year. It provides integration with Apache Spark for accessing and managing data so you can easily move data to and from GPUs. We’ll be featuring DASL and the underlying implementation in our GTC 2016 talk.

The memory bandwidth advantages of GPUs provide a great way to accelerate data-intensive analytics and graph analytics in particular. Our work with graph query and iterative graph analytics has demonstrated significant speedups over CPU-based approaches. We’ve also focused strongly on making the speed of GPUs easily available by adopting existing graph standards and providing new ways to author iterative analytics within the Apache Spark ecosystem. I hope you’ve enjoyed this blog post. We’d love to hear from you, to get your thoughts on our work and hear about your experiences with GPUs for data-intensive analytics. If you’re interested, please see one of our talks at GTC 2016 this year or stop by our Booth #206.

In addition to the whole team at Blazegraph, we’d like to thank and recognize our collaborators in this work: Dr. Martin Berzins of the SCI Institute, Dr. Olaf Hartig of the Hasso Plattner Institute, and Dr. Michael Schmidt of Metaphacts, GmbH. We’d also like to thank Dr. Chris White (formerly at DARPA) and Mr. Frank Pound (DARPA) for their support in this research.

To learn even more about GPUs for data intensive analytics, come to the 2016 GPU Technology Conference (April 4-7 in San Jose, CA) and learn from the experts. Register now to save 20% with code “S7B2XS”!

]]>Calculating the economic value of the liabilities and capturing the dependence of the liabilities to different scenarios such as movements of the interest rate or changes of mortality cannot be achieved without detailed models of the underlying contracts and requires a significant computational effort.

The calculations have to be executed for millions of pension and life insurance contracts and have to be performed for thousands of interest rate and mortality scenarios. Because of the level of parallelism, this is an excellent case for the application of GPUs and GPU clusters. In addition, variations in the products have to be captured. While implementing a separate code for many products is possible, a lot can be gained from abstractions at a higher level.

To solve these problem, we use the following technologies.

- The Actulus Modeling Language (AML), a domain specific language for actuarial modeling;
- Alea GPU, QuantAlea’s high performance GPU compiler for .NET C# and F#. See this previous post on Alea GPU;
- The modern functional-first programming language F#.

Armed with these technologies, we can significantly improve the level of abstraction and increase generality. Our system will allow actuaries to be more productive and to harness the power of GPUs without any GPU coding. The performance gain of GPU computing makes it much more practical and attractive to use proper stochastic models and to experiment with a large and diverse set of risk scenarios.

The Actulus Modeling Language (AML) is a domain-specific language for rapid prototyping in which actuaries can describe life-based pension and life insurance products, and computations on them. The idea is to write declarative AML product descriptions and from these automatically generate high-performance calculation kernels to compute reserves and cash flows under given interest rate curves and mortality curves and shocks to these.

AML allows a formalized and declarative description of life insurance and pension products. Its notation is based on actuarial theory and reflects a high-level view of products and reserve calculations. This has multiple benefits:

- The specification of life insurance products and risk models can be handled by actuaries without programming background.
- A uniform language for product description can guarantee coherence across the entire life insurance company.
- Rapid experiments with product designs allow faster and less expensive development of new pension products.
- The same product description can be used for prototyping and subsequent administration, reporting to tax authorities and auditors, solvency computations, etc.
- The DSL facilitates the construction of tools for automated detection of errors and inconsistencies in the design of insurance products.
- Reserve calculations can be optimized automatically for given target hardware, such as GPUs, via code generation.
- Auditors and regulatory bodies such as financial services authorities can benefit from a formalization of products that is

independent of the low-level software concerns of administration, efficient computations, and so on. - Products and risk models are independent of the technology on which computations are executed. Since pensions contracts are extremely long-lived—a contract entered with a 25-year old woman today is very likely to still be in force in 2080—this isolation from technology is very useful.

The AML system is based on continuous-time Markov models for life insurance and pension products. A continuous-time Markov model consists of a finite number of states and transition intensities between these states. The transition intensity from state to state at time , when integrated over a time interval, gives the transition probability from state to state during the time interval. The Markov property states that future transitions depend on the past only through the current state.

Life insurance products are modeled by identifying states in a Markov model and by attaching payment intensities to the states and lump-sum payments to the transitions.

As an example we consider a product that offers disability insurance. The product can be modeled with three states: active labor market participation, disability, and death. There are transitions from active participation to disability and to death, and from disability to death. Another example is a collective spouse annuity product with future expected cashflows represented by a seven-state Markov model as Figure 1 shows.

Additionally, some products may allow for reactivation, where a previously disabled customer begins active labor again. The product pays a temporary life annuity with repeated payments to the policy holder until some expiration date , provided that he or she is alive. The disability sum pays a lump sum when the policy holder is declared unable to work prior to some expiration .

The state-wise reserve is the reserve at time given that the insured is in state at that time. It is the expected net present value at time of future payments of the product, given that the insured is in state at time . The principle of equivalence states that the reserves at the beginning of the product should be zero, or the expected premiums should equal the expected benefits over the lifetime of the contract.

The state-wise reserves can be computed using Thiele’s differential equation

where is the interest rate at time . Note that the parameters can be divided into three categories: those that come from a product ( and ), those that come from a risk model () and the market variables ().

Traditionally, it has often been possible to obtain closed-form solutions to Thiele’s differential equations and then use tabulations of the results. With the more flexible products expressible in AML, closed-form solutions are in general not possible. In particular, by allowing reactivation from disability to active labor market participation mentioned above, one obtains a Markov model with a cycle, and in general this precludes closed-form solutions.

Good numerical solutions of Thiele’s differential equations can be obtained using a 4th-order Runge-Kutta (RK4) solver. A reserve computation typically starts with the boundary condition that the reserve is zero (no payments or benefits) after the insured’s age is 120 years, when he or she is assumed to be dead. Then the differential equations are solved, and the reserves computed, backwards from age 120 to the insured’s current age in fixed time steps.

Here is a code fragment of the inner loops of a simplistic RK4 solver expressed in C#.

for (int y=a; y>b; y--) { double[] v = result[y-b]; v = daxpy(1.0, v, bj_ii(y)); double t = y; for (int s=0; s<steps; s++) { // Integrate backwards over [y,y-1] double[] k1 = dax(h, dV(t, v)); double[] k2 = dax(h, dV(t + h/2, daxpy(0.5, k1, v))); double[] k3 = dax(h, dV(t + h/2, daxpy(0.5, k2, v))); double[] k4 = dax(h, dV(t + h, daxpy(1.0, k3, v))); v = daxpy(1/6.0, k1, daxpy(2/6.0, k2, daxpy(2/6.0, k3, daxpy(1/6.0, k4, v)))); t += h; } Array.Copy(v, result[y-1-b], v.Length); }

It computes and stores the annual reserve backwards from `y`

= `a`

= 120 to `y`

= `b`

= 35, where `dV`

is an array-valued function expressing the right-hand sides of Thiele’s differential equations, and `h`

is the within-year step size, typically between 1 and 0.01.

The computations in the Runge-Kutta code have to be performed sequentially for each contract, consisting of the products relating to a single insured life. However, it is easily parallelized over a portfolio of contracts, of which there are typically hundreds of thousands, one for each customer. Thus, reserve and cash flow computations present an excellent use case for GPUs. Using GPUs for reserve or cash-flow computations is highly relevant in practice, because such computations can take dozens or hundreds of CPU hours for a reasonable portfolio size. Even with cloud computing this results in slow turnaround times; GPU computing could make it much more practical and attractive to use proper stochastic models and to experiment with risk scenarios.

The Runge-Kutta 4 solver fits the GPU architecture very well because it uses fixed step sizes and therefore causes little thread divergence provided the contracts are sorted suitably before the computations are started. By contrast adaptive-step solvers such as the Runge-Kutta-Fehlberg 4/5 or Dormand-Prince are often faster on CPUs. They are more likely to cause thread divergence on GPUs because different input data will lead to iteration counts in the inner loop. Moreover, the adaptive-step solvers deal poorly with the frequent discontinuities in the derivatives that appear in typical pension products, which require repeatedly stopping and then restarting the solver to avoid a deterioration of the convergence rate.

In preliminary experiments, we have obtained very good performance on the GPU over a CPU-based implementation. For instance, we can compute ten thousand reserves for even the most complex insurance products in a few minutes. The hardware we use is an NVIDIA Tesla K40 GPU (Kepler GK110 architecture). The software is a rather straightforward implementation of the Runge-Kutta fixed-step solver, using double-precision (64-bit) floating-point arithmetic. The kernels are written, or in some experiments automatically generated, in the functional language F#, and compiled and run on the GPU using the Alea GPU framework.

The F# language is widely used in the financial industry, along with other functional languages. We use it for several reasons:

- It provides added type safety and convenience in GPU programming compared to C.
- F# is an ideal language for writing program generators, such as generating problem-specific GPU kernels from AML product descriptions.
- The project’s commercial partner uses the .NET platform for all development work, and F# fits well with that ecosystem.

For these reasons the Actulus project selected Quantalea’s Alea GPU platform to develop our GPU code. We find that the Alea GPU platform offers excellent performance and robustness. An additional benefit of Alea GPU is its cross platform capability: the same assemblies can execute on Windows, Linux and Mac OS X.

The chief performance-related problems in GPU programming are the usual ones: How to lay out data (for instance, time-dependent interest rate curves and mortality rates) in GPU memory for coalesced memory access; whether to pre-interpolate or not in such time-indexed tables; how to balance occupancy, thread count and GPU register usage per thread; and so on. Alea GPU is feature complete so that we can implement all the required optimizations to tune the code for maximal

performance.

Figure 2 shows the number of products processed per second as a function of the batch size, i.e., the number of products computed at the same time.

The product in question is a collective spouse annuity product with future expected cashflows calculated for a 30-year-old insured represented by the seven-state Markov model depicted in Figure 1. This product is among the most complex to work with. Depending on the modeling details, the current CPU-based production code, running on a single core at 3.4 GHz, can process between 0.75 and 1.03 collective spouse annuity insurance products per second. If we compare this with the GPU throughput of 30 to 50 insurance products per second we arrive at a speed-up factor in the range of 30 to 65.

The computation kernels are implemented in F# using work flows (also known as computation expressions or monads) and code quotations, a feature-complete and flexible way of using the Alea GPU framework. In our experience the resulting performance is clearly competitive with that of raw CUDA C++ code.

Using F# through Alea GPU permits much higher programmer productivity, both because F#’s concise mathematical notation suits the application area, and because F# has a better compiler-checked type system than C++. For instance, the confusion of device pointers and host pointers that may arise in CUDA C++ is avoided entirely in F#. Hence much less time is spent chasing subtle bugs and mistakes, which is especially important for experimentation and exploration of different implementation strategies. The core Runge-Kutta 4 solver code looks like the following snippet, using code quotations and imperative F# constructs.

let! kernel = <@ fun (input:deviceptr) (output:deviceptr) ... -> ... while iterDir y do bj.Invoke (float(y)) (input+inputIndexing.Invoke i boundaryConditionLength) tmp nV daxpy.Invoke 1.0 v tmp v nV for s = 0 to steps-1 do let t = float(y) + (float(s)/float(steps)) * dir; // k1 = ... dV.Invoke steps ... v k1 dax.Invoke h2 k1 k1 nV daxpy.Invoke 0.5 k1 v tmp nV // k2 = ... dV.Invoke steps ... tmp k2 dax.Invoke h2 k2 k2 nV daxpy.Invoke 0.5 k2 v tmp nV // k3 = ... dV.Invoke steps ... tmp k3 dax.Invoke h2 k3 k3 nV daxpy.Invoke 1.0 k3 v tmp nV // k4 = ... dV.Invoke steps ... tmp k4 dax.Invoke h2 k4 k4 nV // v(n+1) = v + k1/6 + k2/3 + k3/3 + k4/6 daxpy.Invoke (1.0/6.0) k4 v tmp nV daxpy.Invoke (2.0/6.0) k3 tmp tmp nV daxpy.Invoke (2.0/6.0) k2 tmp tmp nV daxpy.Invoke (1.0/6.0) k1 tmp v nV output.[(pos+y+int(dir)-a)]

At the same time, F#’s code quotations, or more precisely the splicing operators, provide a simple and obvious way to inline a function such as `GMMale`

into multiple kernels without source code duplication:

let stocasticMarriageProbabilityODE_technical = <@ fun (x:float) ... (res:deviceptr) -> let GMMale = %GMMale ...

While similar effects can be achieved using C preprocessor macros, F# code quotations and splice operators do this in a much cleaner way, with better type checking and IDE support. What’s more, F# code quotations allow kernels to be parameterized with both “early” (or kernel compile-time) arguments such as map, and late (or kernel run-time) arguments such as `n`

and `isPremium`

:

let Reserve_GF810_dV_Technical (map:Map<Funcs,Expr>) = <@ fun (n:int) ... (isPremium : int) -> let GMMale = %%map.[Funcs.GMFemale] ...

An additional reason for using F# is that in the longer term we want to automatically generate the GPU kernels that solve Thiele’s differential equations. The input to the code generator is a description of the underlying state models (describing life, death, disability and so on) and the functions and tables that express age-dependent mortalities, time-dependent future interest rates, and so on. As a strongly typed functional language with abstract data types, pattern matching, and higher-order functions, the F# language is supremely suited for such code generation processes. The state models and auxiliary functions are described by recursive data structures (so-called abstract syntax), and code generation proceeds by traversing these data structures using recursive functions.

Also, the F# language supports both functional programming, used to express the process of generating code on the host CPU, and imperative programming, used to express the computations that will be performed on the GPU. In other words, high-level functional code generates low-level imperative code, both within the same language, which even supports scripting of the entire generate-compile-load-run cycle:

let compileAndExecute template (inputArray:float[]) ... (b:float[]) = let irm = Compiler.Compile(template).IRModule use program = Worker.Default.LoadProgram(irm) let resultsWithTime = program.Run inputArray ... b resultsWithTime

The code generation approach will help support a wide range of life insurance and pension products. There are of course alternatives to code generation: First, one might hand-write the differential equations for each product, but this is laborious and error-prone and slows down innovation and implementation, or severely limits the range of insurance products supported. Second, one might take an interpretive approach, by letting the (GPU) code analyze the abstract syntax of the product description, but this involves executing many conditional statements, for which the GPU hardware is ill-suited as it may lead to branch divergence. Hence code generation is the best way to support generality while maintaining the highest performance.

This work was done in the context of the Actulus project, a collaboration between Copenhagen University, the company Edlund A/S, and the IT University of Copenhagen, funded in part by the Danish Advanced Technology Foundation contract 017-2010-3. Thanks are due to the many project participants who contributed to AML and in particular due to Christian Gehrs Kuhre and Jonas Kastberg Hinrichsen for their many competent experiments with GPU computations for advanced insurance products. Quantalea graciously provided experimental licenses for Alea GPU and supported us in various GPU related aspects.

Daniel Egloff also contributed to this post. To learn more about Alea GPU and .NET GPU programming, I recommend that you attend Daniel’s tutorial, “Simplified CUDA Development with C#” at GTC 2016 on Monday, April 4 at 10:30. To get a 20% discount on your GTC registration, register using the discount code S7B2XS.

]]>