Now, __launch_bounds__ and maxrregcount limit register usage by two different mechanisms.

**__launch_bounds__**

nvcc chooses the number of registers used by a __global__ function by balancing performance against the generality of the kernel launch setup. Put differently, the chosen register count is meant to remain reasonably effective across different numbers of threads per block and of blocks per multiprocessor.

However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile-time, then this information can be used to optimize the kernel for such launches.

For example, the declaration

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP 2

__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... kernel computation
}

informs the compiler of a likely launch configuration, so that nvcc can select the number of registers for such a launch configuration in an “optimal” way.

The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK, the kernel launch will fail.

The limiting mechanism is described in the Programming Guide as follows:

*If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor. The compiler then optimizes register usage in the following way:*

*– If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at the expense of more local memory usage and/or higher number of instructions;*

Accordingly, __launch_bounds__ can lead to register spill.

**maxrregcount**

maxrregcount is a compiler flag (e.g., nvcc -maxrregcount=32) that simply caps the number of registers per thread at a user-set value, forcing the compiler to rearrange its use of registers; at variance with __launch_bounds__, it applies to the whole compilation unit rather than to a single kernel. When the compiler cannot stay below the imposed limit, it spills the excess to local memory, which is cached in L1/L2 and backed by DRAM.

Many times, applications requiring the SVD deal with large matrices and/or need the SVD within an iterative process.

Fortunately, the SVD can be quickly computed in CUDA using the routines provided in the cuSOLVER library. Below, we provide a representative example relevant in common situations.

Before proceeding further, two points to remember:

- gesvd routines assume Nrows >= Ncols. Of course, the case Nrows < Ncols can be dealt with by matrix transposition, for example by cublas<t>geam();
- column major ordering is assumed.

The full example is contained in the SVD.cu file.

The first step towards the SVD calculation in CUDA is to initialize cuSOLVER, as is required before calling any other routine of the cuSOLVER library:

cusolverDnHandle_t solver_handle;
cusolverDnCreate(&solver_handle);

The second step is to allocate the buffer space required by the SVD calculation routine:

int work_size = 0;
cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));
double *work;
gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));

The buffer space dimension is calculated by cusolverDnDgesvd_bufferSize.

The third step is the final SVD calculation:

cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));
int devInfo_h = 0;
gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
if (devInfo_h != 0) std::cout << "Unsuccessful SVD execution\n\n";

Note that cusolverDnDgesvd returns the matrix V transposed. Also, U is the matrix of the *left* singular vectors, namely, the singular vectors spanning the output space of the matrix, while V is the matrix of the *right* singular vectors, namely, the singular vectors spanning the input space of the matrix.

For the example in SVD.cu, the output is:

Singular values
d_S[0] = 314.891390715956
d_S[1] = 14.4118456776669
d_S[2] = 0.842607258487678
d_S[3] = 0.0277423461350813
d_S[4] = 0.000710344049837318

U[0,0]=-0.026900 U[1,0]=-0.043393 U[2,0]=-0.091636 U[3,0]=-0.181098 U[4,0]=-0.319440 U[5,0]=-0.513289 U[6,0]=-0.768565
U[0,1]=-0.366199 U[1,1]=-0.428964 U[2,1]=-0.477524 U[3,1]=-0.454763 U[4,1]=-0.323752 U[5,1]=-0.055840 U[6,1]=0.372983
U[0,2]=0.747999 U[1,2]=0.316640 U[2,2]=-0.072851 U[3,2]=-0.312084 U[4,2]=-0.353727 U[5,2]=-0.162134 U[6,2]=0.293467
U[0,3]=-0.530903 U[1,3]=0.565260 U[2,3]=0.392143 U[3,3]=-0.039990 U[4,3]=-0.324668 U[5,3]=-0.264268 U[6,3]=0.260770
U[0,4]=0.152553 U[1,4]=-0.590073 U[2,4]=0.467756 U[3,4]=0.350750 U[4,4]=-0.204309 U[5,4]=-0.423010 U[6,4]=0.256984
U[0,5]=0.007732 U[1,5]=-0.054514 U[2,5]=0.051584 U[3,5]=0.281387 U[4,5]=-0.717133 U[5,5]=0.607745 U[6,5]=-0.177468
U[0,6]=-0.021961 U[1,6]=0.207773 U[2,6]=-0.618898 U[3,6]=0.677639 U[4,6]=-0.081169 U[5,6]=-0.298327 U[6,6]=0.136131

V[0,0]=-0.349562 V[0,1]=-0.395802 V[0,2]=-0.442097 V[0,3]=-0.488660 V[0,4]=-0.535638
V[1,0]=0.637619 V[1,1]=0.405149 V[1,2]=0.114078 V[1,3]=-0.224424 V[1,4]=-0.604909
V[2,0]=0.635177 V[2,1]=-0.323686 V[2,2]=-0.505308 V[2,3]=-0.213766 V[2,4]=0.436743
V[3,0]=-0.258432 V[3,1]=0.705292 V[3,2]=-0.279236 V[3,3]=-0.497617 V[3,4]=0.331934
V[4,0]=0.031788 V[4,1]=-0.277461 V[4,2]=0.676925 V[4,3]=-0.646162 V[4,4]=0.215062

These results can be compared with the following simple Matlab code:

clear all
close all
clc

Nrows = 7;
Ncols = 5;
A = zeros(Nrows, Ncols);
for j = 0 : Nrows - 1
    for i = 0 : Ncols - 1
        A(j + 1, i + 1) = (i + j * j) * sqrt(i + j);
    end
end
[U, S, V] = svd(A);

which outputs:

Singular values:
314.8914 14.4118 0.8426 0.0277 0.0007

U =
-0.0269 0.3662 -0.7480 0.5309 0.1526 0.0076 -0.0220
-0.0434 0.4290 -0.3166 -0.5653 -0.5901 -0.0528 0.2082
-0.0916 0.4775 0.0729 -0.3921 0.4678 0.0466 -0.6193
-0.1811 0.4548 0.3121 0.0400 0.3508 0.2868 0.6754
-0.3194 0.3238 0.3537 0.3247 -0.2043 -0.7178 -0.0754
-0.5133 0.0558 0.1621 0.2643 -0.4230 0.6053 -0.3032
-0.7686 -0.3730 -0.2935 -0.2608 0.2570 -0.1764 0.1375

V =
-0.3496 -0.6376 -0.6352 0.2584 0.0318
-0.3958 -0.4051 0.3237 -0.7053 -0.2775
-0.4421 -0.1141 0.5053 0.2792 0.6769
-0.4887 0.2244 0.2138 0.4976 -0.6462
-0.5356 0.6049 -0.4367 -0.3319 0.2151

As can be seen, the CUDA and Matlab results coincide for the singular values, but not for the singular vectors. The singular vectors, however, span the same spaces.

The first thing to do is to recover the pointer to the first element of the real data from the Matlab input array/matrix:

double *h_input = mxGetPr(prhs[0]);

We can also recover the number of elements of the input variable (which can also be a matrix) as:

int numElements = mxGetN(prhs[0]) * mxGetM(prhs[0]);

After that, we have to move the host data to the device and perform the computations on the device.

Finally, we have to create the real output Matlab array/matrix

plhs[0] = mxCreateDoubleMatrix(1, numElements, mxREAL);

recover the pointer to the first element of the real output array/matrix and move the results from the GPU to the CPU.
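The steps above can be assembled into a minimal sketch; the kernel squareKernel and the launch configuration are placeholders of ours, and error checking is omitted for brevity:

```cuda
#include "mex.h"
#include <cuda_runtime.h>

// Placeholder kernel: squares each element in place.
__global__ void squareKernel(double *d_data, int numElements) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) d_data[i] = d_data[i] * d_data[i];
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    // Pointer to the first element of the real input data.
    double *h_input   = mxGetPr(prhs[0]);
    int   numElements = mxGetN(prhs[0]) * mxGetM(prhs[0]);

    // Move the host data to the device.
    double *d_data;
    cudaMalloc(&d_data, numElements * sizeof(double));
    cudaMemcpy(d_data, h_input, numElements * sizeof(double), cudaMemcpyHostToDevice);

    // Perform the computations on the device.
    squareKernel<<<(numElements + 255) / 256, 256>>>(d_data, numElements);

    // Create the real output Matlab array and move the results back.
    plhs[0] = mxCreateDoubleMatrix(1, numElements, mxREAL);
    cudaMemcpy(mxGetPr(plhs[0]), d_data, numElements * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
}
```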

Once compiled, the mex function can be used in the following way:

a = 1 : 10; b = CUDA_mex_host_to_device_real(a);

where a is an input array residing on the host and, similarly, b is a result array residing on the host.

CUDA provides the cudaMallocPitch function to “pad” 2D matrix rows with extra bytes so as to achieve the desired alignment. Please refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.

Assuming that we want to allocate a 2D padded array of Nrows x Ncols floating point (single precision) elements, the syntax for cudaMallocPitch is the following:

cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

where

- devPtr is an output pointer to float (float *devPtr) pointing to the allocated memory space;
- devPitch is a size_t output variable denoting the length, in bytes, of the padded row;
- Nrows and Ncols are size_t input variables representing the matrix size.

Recalling that C/C++ and CUDA store 2D matrices by row, cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid by more efficient memory accesses.

CUDA also provides the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch. Under the above hypotheses (single precision 2D matrix), the syntax is the following:

cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)

where

- devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
- devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
- Nrows and Ncols are size_t input variables representing the matrix size.

Note that cudaMemcpy2D also allows for pitched memory allocation on the host side. If the host memory has no pitch, then hostPitch = Ncols * sizeof(float). Furthermore, cudaMemcpy2D is bidirectional. In the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to

cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)

The access to elements of a 2D matrix allocated by cudaMallocPitch can be performed as in the following example

int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;

if ((tidx < Ncols) && (tidy < Nrows))
{
    float *row_a = (float *)((char *)devPtr + tidy * pitch);
    row_a[tidx] = row_a[tidx] * tidx * tidy;
}

In such an example, tidx and tidy are used as column and row indices, respectively (remember that, in CUDA, x-threads span the columns and y-threads span the rows to favor coalescence).

The pointer to the first element of a row is calculated by offsetting the initial pointer devPtr by tidy * pitch bytes (char * is a pointer to bytes and sizeof(char) is 1 byte); the length of each row is computed by using the pitch information.

On our GitHub site, we provide a fully worked example to show these concepts.

Finally, notice that CUDA provides cudaMallocPitch to allocate padded 2D arrays and cudaMalloc3D to allocate padded 3D arrays. A “cudaMalloc2D” function, however, does not exist, its role being played by cudaMallocPitch.

The final question regards whether the use of pitched memory allocation, i.e., memory allocation done with cudaMallocPitch, really leads to improved performance as compared to non-pitched memory allocation, i.e., done with cudaMalloc.

Actually, the improvements arising from the use of cudaMallocPitch depend on the compute capability and are expected to be more significant for older ones. However, for most recent compute capabilities, pitched memory allocation does not seem to lead to a relevant speedup.

The attached code provides a performance testbench between the uses of non-pitched and pitched memories.

In particular, the code performs the summation of three (non-pitched or pitched) matrices. The reason for dealing with three matrices is the need to emphasize memory transactions over computation, so as to highlight the differences between non-pitched and pitched allocations. Below are the timing results for a GTX 960 and a GT 920M card.

**GTX 960**

Non-pitched – Time = 3.242208; Memory = 65320000 bytes

Pitched – Time = 3.150944; Memory = 65433600 bytes

**GT 920M**

Non-pitched – Time = 20.496799; Memory = 65320000 bytes

Pitched – Time = 20.418560; Memory = 65433600 bytes

As can be seen, there is not much difference between the two implementations on either card. The above results also show the increase in memory occupancy due to the use of pitched memory allocation.

Suppose we want to construct a kernel that computes the number of thread blocks of a thread grid. One possible idea is to let the thread with *threadIdx.x == 0* in each block increase a global counter. To prevent race conditions, all the increments must occur sequentially, so they must be incorporated in a critical section.

This is illustrated in the code on our GitHub web page. That code has two kernel functions: *blockCountingKernelNoLock* and *blockCountingKernelLock*. The former does not use a critical section to increase the counter and, as you will see, returns wrong results. The latter encapsulates the counter increase within a critical section and so produces correct results. But how does the critical section work?

The critical section is governed by a global state d_state. Initially, the state is 0. Furthermore, two *__device__* methods, lock and unlock, can change this state. The lock and unlock methods can be invoked only by a single thread within each block and, in particular, by the thread with local thread index *threadIdx.x == 0*.

Randomly during the execution, one of the threads with local thread index *threadIdx.x == 0* and global thread index, say, *t*, will be the first to invoke the lock method. In particular, it will execute *atomicCAS(d_state, 0, 1)*. Since initially *d_state == 0*, d_state will be updated to 1, *atomicCAS* will return 0 and the thread will exit the lock function, passing to the update instruction. While that thread performs these operations, all the other threads with *threadIdx.x == 0* in the other blocks will execute the lock method. They will, however, find *d_state* equal to 1, so that *atomicCAS(d_state, 0, 1)* will perform no update and will return 1, leaving these threads spinning in the while loop. After thread *t* accomplishes the update, it executes the unlock function, namely *atomicExch(d_state, 0)*, thus restoring *d_state* to 0. At this point, at random, another of the threads with *threadIdx.x == 0* will lock the state again.
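The mechanism just described can be sketched as follows; the names mirror the text, but this is a sketch of ours, not necessarily the exact GitHub code:

```cuda
// Spin until this thread atomically flips the state from 0 (free) to 1 (taken).
__device__ void lock(int *d_state)   { while (atomicCAS(d_state, 0, 1) != 0); }

// Release the critical section by restoring the state to 0.
__device__ void unlock(int *d_state) { atomicExch(d_state, 0); }

__global__ void blockCountingKernelLock(int *d_state, int *d_counter) {
    if (threadIdx.x == 0) {          // one representative thread per block
        lock(d_state);               // enter the critical section
        *d_counter = *d_counter + 1; // sequential, race-free update
        unlock(d_state);             // let the next block in
    }
}
```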

The code on our GitHub web page also contains a third kernel function, namely *blockCountingKernelDeadlock*. This is another wrong implementation of the critical section, one leading to deadlocks. Indeed, recall that warps operate in lockstep and synchronize after every instruction. So, when we execute *blockCountingKernelDeadlock*, there is the possibility that one of the threads in a warp, say a thread with local thread index *t* ≠ 0, locks the state. Under this circumstance, the other threads of the same warp, including the one with *threadIdx.x == 0*, execute the same while-loop statement as thread *t*, since the execution of threads in the same warp is performed in lockstep. Accordingly, all the threads wait for someone to unlock the state, but no other thread is able to do so, and the code is stuck in a deadlock.


First step: configure the Windows system

Download Putty.

Install Xming. A simple google search for “sourceforge xming x server windows” will find it. When asked for the fonts to be installed in the “Custom installation” stage, we recommend installing all the fonts.

Second step: configure the Linux system.

Install the openssh server: sudo apt-get install openssh-server.

Open the file /etc/ssh/ssh_config, uncomment the ForwardX11 line and change the option from “no” to “yes”.

After that, use Putty to connect to the linux machine via ssh. In particular:

Session -> Host Name -> put the address of the linux system;

Connection -> SSH -> X11 -> X11 forwarding -> Enable X11 forwarding.

As an example, in the terminal, type

*xclock &* (do not forget the “&”).

Accordingly, even if an overall number of threads larger than 1 is set by omp_set_num_threads(), any call to omp_get_num_threads() will return 1, unless we are in a parallel section.

The example on our GitHub website tries to clarify this point.
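The behavior can also be checked with a few lines of standard OpenMP code (compile with OpenMP support enabled, e.g., -fopenmp or /openmp):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    omp_set_num_threads(4);

    // Outside any parallel section the team consists of one thread.
    printf("%d\n", omp_get_num_threads());     // prints 1

    #pragma omp parallel
    {
        #pragma omp single
        // Inside the parallel section: the full team (typically 4 here).
        printf("%d\n", omp_get_num_threads());
    }
    return 0;
}
```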

In Visual Studio do the following:

1) File -> New Project; Select location and name; in the project type, select Templates -> Visual C++ -> Win32 -> Win32 Console Application -> OK;

2) In the Win32 Application Wizard, click Next, in the Application Type choose DLL, then click Finish.

3) Project -> Properties -> Configuration Manager -> Active Solution Platform -> New -> Type or Select the New Platform choose x64 -> OK -> OK;

4) Project -> Properties -> Configuration Manager -> Active Solution Platform -> x64 -> Ok;

5) Project -> Properties -> Configuration Manager -> Active Solution Configuration -> Release -> OK;

6) Project -> Properties -> Configuration Properties -> General -> Target Extension -> change .dll to .mexw64;

7) Project -> Properties -> Configuration Properties -> VC++ Directories -> Include Directories -> Add: C:\Program Files\MATLAB\R2015b\extern\include;

7a – optional – only if needed) Project -> Properties -> Configuration Properties -> C/C++ -> Additional Include Directories -> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include;

7b – optional – only if needed) Project -> Properties -> Configuration Properties -> C/C++ -> Additional Include Directories -> FFTW directory;

8) Project -> Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories -> C:\Program Files\MATLAB\R2015b\extern\lib\win64\microsoft;

8a – optional – only if needed) Project -> Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories -> D:\FFTW64;

9) Project -> Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies -> libmx.lib;libmex.lib;libmat.lib;

9a – optional – only if needed) Project -> Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies -> libfftw3-3.lib; libfftw3f-3.lib;

10) Project -> Properties -> Configuration Properties -> Linker -> Command Line -> Additional Options -> Add /export:mexFunction;

11) Add the C/C++ source code for the mex file; remember that the .cpp files need #include “stdafx.h”.

12) After the compilation, the mex file projectTitle.mexw64 will be available at …\projectRootDirectory\x64\Release. Remember to add this directory in the Matlab script file; after that, you can call the C/C++ mex routine as a function whose name is the project name. Remember to terminate the Matlab script with “clear mex”, so that the project can be newly compiled after any modification without Matlab keeping the .mexw64 file busy.
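A minimal source file matching the project setup above could look as follows; it merely echoes its input, and is only a sketch (note the "stdafx.h" include required by step 11):

```cpp
#include "stdafx.h"   // required by the Win32 project template (step 11)
#include "mex.h"

// Minimal mex entry point: returns a copy of the first input argument.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    plhs[0] = mxDuplicateArray(prhs[0]);
}
```

From Matlab, after adding the Release directory to the path, calling projectTitle(x) would then simply return x.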

We expect that this procedure will also work for Visual Studio 2010 and Visual Studio 2012, and for other Matlab versions.

Consider this example with the uncommented destructor, and do not pay too much attention to what the code actually does. If you run that code, you will receive the following output:

Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37

As you can see, there are two calls to the destructor, once at the kernel exit and once at the main exit.

The error message is related to the fact that, if the memory locations pointed to by d_state are freed at the kernel exit, they cannot be freed anymore at the main exit.

Accordingly, the destructor must be different for host and device executions. This is accomplished by the commented destructor in the above code.
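The idea behind such a host/device-aware destructor can be sketched as follows; the class and member names are placeholders of ours, not the exact code of the example:

```cuda
#include <cuda_runtime.h>

class Lock {
    int *d_state;   // device pointer owned by the object
public:
    __host__ __device__ ~Lock() {
#if !defined(__CUDA_ARCH__)
        // Host compilation path only: free the device memory once, at main exit.
        cudaFree(d_state);
#endif
        // Device compilation path: do nothing, so the kernel exit
        // does not free d_state a first time.
    }
};
```

The `__CUDA_ARCH__` macro is defined only when nvcc compiles the device code path, which makes it a convenient switch to differentiate the two destructor behaviors.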

The two approaches use CUDA Thrust:

- Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
- Using thrust::unique_copy and thrust::upper_bound.
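The first approach can be sketched as follows, along the lines of the Thrust histogram example; the function and parameter names are our assumptions, and the input is assumed already sorted:

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>

// d_data: sorted input values in [0, num_bins); d_counts: output, sized num_bins.
void histogram(const thrust::device_vector<int> &d_data,
               thrust::device_vector<int> &d_counts,
               int num_bins) {
    thrust::counting_iterator<int> search_begin(0);

    // Cumulative histogram: for each bin b, the number of elements <= b.
    thrust::upper_bound(d_data.begin(), d_data.end(),
                        search_begin, search_begin + num_bins,
                        d_counts.begin());

    // Per-bin counts, obtained by differencing the cumulative histogram.
    thrust::adjacent_difference(d_counts.begin(), d_counts.end(),
                                d_counts.begin());
}
```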

A fully worked example is available on our GitHub page.

The first approach has proven to be the fastest. On an NVIDIA GTX 960 card, we have obtained the following timings for N = 1048576 array elements:

First approach: 2.35 ms
First approach without thrust::adjacent_difference: 1.52 ms
Second approach: 4.67 ms

Please note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be done manually inside a kernel, if needed.
