Accordingly, even if omp_set_num_threads() is used to request more than one thread, any call to omp_get_num_threads() will return 1 unless it is made inside a parallel region.

The example on our GitHub website tries to clarify this point.

In Visual Studio, do the following:

1) File -> New Project; Select location and name; in the project type, select Templates -> Visual C++ -> Win32 -> Win32 Console Application -> OK;

2) In the Win32 Application Wizard, click Next; in the Application Type, choose DLL; then click Finish.

3) Project -> Properties -> Configuration Manager -> Active Solution Platform -> New -> Type or Select the New Platform choose x64 -> OK -> OK;

4) Project -> Properties -> Configuration Manager -> Active Solution Platform -> x64 -> OK;

5) Project -> Properties -> Configuration Manager -> Active Solution Configuration -> Release -> OK;

6) Project -> Properties -> Configuration Properties -> General -> Target Extension -> change .dll to .mexw64;

7) Project -> Properties -> Configuration Properties -> VC++ Directories -> Include Directories -> Add: C:\Program Files\MATLAB\R2015b\extern\include;

7a – optional – only if needed) Project -> Properties -> Configuration Properties -> C/C++ -> Additional Include Directories -> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include;

7b – optional – only if needed) Project -> Properties -> Configuration Properties -> C/C++ -> Additional Include Directories -> FFTW directory;

8) Project -> Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories -> C:\Program Files\MATLAB\R2015b\extern\lib\win64\microsoft;

8a – optional – only if needed) Project -> Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories -> D:\FFTW64;

9) Project -> Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies -> libmx.lib;libmex.lib;libmat.lib;

9a – optional – only if needed) Project -> Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies -> libfftw3-3.lib; libfftw3f-3.lib;

10) Project -> Properties -> Configuration Properties -> Linker -> Command Line -> Additional Options -> Add /export:mexFunction;

11) Add the C/C++ source code for the mex file; remember that the .cpp files need #include "stdafx.h".

12) After compilation, the mex file projectTitle.mexw64 will be available at …\projectRootDirectory\x64\Release. Remember to add this directory to the path in the Matlab script; after that, you can call the C/C++ mex routine as a function whose name is the project name. Remember also to terminate the Matlab script with "clear mex", so that the project can be recompiled after any modification without Matlab keeping the .mexw64 file locked.
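As an illustration of step 11, a mex source file has roughly the following shape (this is a hypothetical minimal example, not part of the procedure above; it doubles its input array). It cannot be compiled standalone, since it needs the MATLAB headers configured in steps 7–9:

```cpp
#include "stdafx.h"
#include "mex.h"

// Gateway routine: MATLAB calls this entry point,
// which is exported via /export:mexFunction (step 10).
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1) mexErrMsgTxt("One input required.");

    // Read the input array and create an output of the same length.
    const double *in = mxGetPr(prhs[0]);
    const mwSize  n  = mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
    double *out = mxGetPr(plhs[0]);

    for (mwSize i = 0; i < n; i++) out[i] = 2.0 * in[i]; // example operation
}
```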

We expect that this procedure will also work for Visual Studio 2010 and Visual Studio 2012 and for other Matlab versions.

Consider this example with the uncommented destructor, and do not pay too much attention to what the code actually does. If you run that code, you will receive the following output:

Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37

As you can see, there are two calls to the destructor, once at the kernel exit and once at the main exit.

The error message is related to the fact that, if the memory locations pointed to by d_state are freed at the kernel exit, they cannot be freed anymore at the main exit.

Accordingly, the destructor must be different for host and device executions. This is accomplished by the commented destructor in the above code.
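The idea can be sketched as follows (the class and member names here are illustrative, not taken from the code above): device-side compilation defines the macro __CUDA_ARCH__, so the destructor can restrict the cudaFree to the host path only, and the object copy living in the kernel no longer frees the buffer at kernel exit:

```cuda
struct DataHolder {
    float *d_state;   // device buffer, illustrative name

    __host__ __device__ ~DataHolder() {
#if !defined(__CUDA_ARCH__)
        // Host path only: release the device allocation exactly once,
        // at main exit. In device code this branch is compiled out,
        // so the kernel-side copy of the object does not free d_state.
        cudaFree(d_state);
#endif
    }
};
```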

The two approaches use CUDA Thrust:

- Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
- Using thrust::unique_copy and thrust::upper_bound.

A fully worked example is available on our GitHub page.

The first approach has proven to be the faster of the two. On an NVIDIA GTX 960 card, we obtained the following timings for N = 1048576 array elements:

First approach: 2.35 ms
First approach without thrust::adjacent_difference: 1.52 ms
Second approach: 4.67 ms

Note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be done manually inside a kernel, if needed.
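The first approach can be sketched as follows, following Thrust's histogram example (function and variable names are illustrative; the full worked example is on our GitHub page):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Dense histogram of integer keys in [0, num_bins).
void dense_histogram(thrust::device_vector<int> &data, int num_bins,
                     thrust::device_vector<int> &histogram)
{
    thrust::sort(data.begin(), data.end());

    // Cumulative histogram: for each bin b, the count of elements <= b,
    // obtained by searching the sorted data with a counting iterator.
    thrust::device_vector<int> cumulative(num_bins);
    thrust::upper_bound(data.begin(), data.end(),
                        thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>(num_bins),
                        cumulative.begin());

    // Per-bin counts are the adjacent differences of the cumulative counts.
    histogram.resize(num_bins);
    thrust::adjacent_difference(cumulative.begin(), cumulative.end(),
                                histogram.begin());
}
```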

In the code, we have also provided an overall operations count in terms of complex multiplications and additions.

It can indeed be shown that each radix-4 butterfly involves 3 complex multiplications and 8 complex additions.

Since there are log_{4}N = log_{2}N / 2 stages and each stage involves N / 4 butterflies, the operations count is

complex multiplications = (3 / 8) * N * log_{2}(N)
complex additions = N * log_{2}(N)

In particular, the two figures below illustrate the Matlab implementation that you may find on our GitHub website: the first one refers to the pushing-the-stack phase, while the second one illustrates the popping-the-stack phase.

The above recursive implementation is the counterpart of the iterative implementation of the radix-2 Decimation In Frequency. Note that the present recursive one does not require any bit-reversal ordering, neither of the input nor of the output sequence.

A recursive approach is also possible.

The implementation also calculates the number of performed multiplications and additions and compares it with the theoretical counts reported in "Number of operation counts for radix-2 FFTs".

The code is obviously much slower than the highly optimized FFTW exploited by Matlab.

Note also that the twiddle factors omega^(2^(p - 1) * n) can be calculated off-line and then loaded from a lookup table, but this point is skipped in the mentioned code.


A recursive approach is also possible.

The implementation also calculates the number of performed multiplications and additions and compares it with the theoretical counts reported in "Number of operation counts for radix-2 FFTs".

The code is obviously much slower than the highly optimized FFTW exploited by Matlab.

Note also that the twiddle factors omega^(interButterflyIndex * 2^(numStages - p)) can be calculated off-line and then loaded from a lookup table, but this point is skipped in the mentioned code.

To compute the number of floating point operations for a *Decimation In Time (DIT) radix-2 approach*, refer to the following figure:

Let N be the length of the sequence to be transformed. There is an overall number of log_{2}N stages and each stage contains N/2 butterflies.

Let us then consider the generic butterfly:

Let us rewrite the output of the generic butterfly as

E(i + 1) = E(i) + W * O(i)
O(i + 1) = E(i) - W * O(i)

A butterfly thus involves one complex multiplication and two complex additions. On rewriting the above equations in terms of real and imaginary parts, we have

real(E(i + 1)) = real(E(i)) + (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(E(i + 1)) = imag(E(i)) + (real(W) * imag(O(i)) + imag(W) * real(O(i)))
real(O(i + 1)) = real(E(i)) - (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(O(i + 1)) = imag(E(i)) - (real(W) * imag(O(i)) + imag(W) * real(O(i)))

Accordingly, we have

**4 multiplications**

real(W) * real(O(i)), imag(W) * imag(O(i)), real(W) * imag(O(i)), imag(W) * real(O(i)).

**6 sums**

real(W) * real(O(i)) - imag(W) * imag(O(i)) (1)
real(W) * imag(O(i)) + imag(W) * real(O(i)) (2)
real(E(i)) + eqn. (1)
imag(E(i)) + eqn. (2)
real(E(i)) - eqn. (1)
imag(E(i)) - eqn. (2)

Therefore, the number of operations for the Decimation In Time radix 2 approach are

2N * log_{2}(N) multiplications
3N * log_{2}(N) additions

These operation counts may change if the multiplications are differently arranged; see "Complex numbers product using only three multiplications".

The same results apply to the case of Decimation in Frequency radix 2, see figure

x = a + i * b;
y = c + i * d;
real(x * y) = a * c - b * d;
imag(x * y) = (a + b) * (c + d) - a * c - b * d;

Of course, the question is: *can we perform the product between two complex numbers with less than three real multiplications*?

The answer is **NO** and is provided by Winograd’s theorem in: Winograd, “On the number of multiplications required to compute certain functions”, *Commun. Pure Appl. Math.* 23 (1970), 165-179.

**NB. The minimum number of multiplications required in the computation of the product between two complex numbers is three.**