Igor Ostrovsky Blogging

How GPU came to be used for general computation

Fast and slow if statements: branch prediction in modern processors

And if you like my blog, subscribe!

Graphs, trees, and origins of humanity

Igor Ostrovsky — Fri, 02 Jul 2010 05:44:18 +0000

Imagine that 90,000 years ago, every man alive at the time picked a different last name. Assuming that last names are inherited from father to son, how many different last names do you think there would be today?

It turns out that there would be only one last name!

Similarly, imagine that 200,000 years ago, every woman alive picked a different secret word, and told the secret word to her daughters. And, the female descendants would follow this tradition – a mother would always pass her secret word to her daughters.

As you might guess, today there would be only one secret word in circulation.

The man whose last name all men would carry is called “Y-chromosomal Adam” and the woman whose secret word all women would know is “Mitochondrial Eve”. The names come from the fact that Y chromosome is a piece of DNA inherited from father to son (just like last names), and mitochondrial DNA is inherited from mother to children (just like the hypothetical secret words).

Why the convergence?

So, how likely is it that one last name and one secret word will eventually come to dominate? Given enough time, it is virtually guaranteed, under some assumptions (e.g., the population does not become separated).

Here is a simulation of how last names of five men could flow through six generations:

After six generations, the last name shown as green is the only last name around, purely by chance! In biology, this random effect is called genetic drift.

And, the convergence does not only happen for small populations. Here are the numbers that I got by simulating different population sizes:

Population	Generations to convergence
10	23
50	73
100	239
500	891
1000	1395
5000	7312
10000	13491

Here is the implementation of this simulation:

static int MitochondrialEve(int populationSize)
{
    Random random = new Random();

    int generations = 0;
    int[] cur = new int[populationSize];
    for (int i = 0; i < populationSize; i++) cur[i] = i;

    for ( ; cur.Max() != cur.Min(); generations++)
    {
        int[] next = new int[populationSize];
        for (int i = 0; i < next.Length; i++)
        {
            next[i] = cur[random.Next(populationSize)];
        }
        cur = next;
    }

    return generations;
}

This simulation is not a sophisticated model of a human population, but it is sufficient for the purposes of illustrating genetic drift. It assumes that every man has on average one son, with a standard deviation of roughly one, and a binomial probability distribution.

By the way, the Y-chromosomal Adam lived roughly 60,000–90,000 years ago, while Eve lived roughly 200,000 years ago. The reason why genetic drift acted faster for men is that men have a larger variation in the number of offspring – one man can have many more children than one woman.

UPDATE: Based on comments at Hacker News and Reddit, some readers are dissatisfied with the assumption of a fixed population size. Of course, human population grew over the ages, but there were also periods when it shrank, sometimes a lot. For example, roughly 70,000 years ago, the human population may have dropped down to thousands of individuals (1, 2). So, a fixed population size is a reasonable simplification for my example.

The roles of Mitochondrial Eve and Y-chromosomal Adam

One noteworthy fact about Mitochondrial Eve and Y-chromosomal Adam is that their positions in the history are not as special as they may first appear. Let’s take a look at this in more depth.

Tracing from me, I can follow a path of paternal ancestry all the way to Y-chromosomal Adam:

If I only trace the male ancestry, there is exactly one path that starts at me, and that path leads to Y-chromosomal Adam. You could start the diagram at any man alive today and you’d get a similar picture, with the lineage finally reaching the Y-chromosomal Adam.

However, if I trace ancestry via both parents, the number of ancestors explodes. I have two parents, four grandparents, eight great grandparents, and so forth. The number of ancestors in each generation grows exponentially, although that cannot continue for long. If all ancestors in each generations were distinct, I would have more than 1 billion ancestors just 30 generations back, so the tree certainly has to start collapsing into a directed acyclic graph by then.

Tracing ancestry through both parents, there are many paths to follow, and each generation of ancestors contains a lot of people. Some of those paths will reach Y-chromosomal Adam, but other paths will reach other men in his generation. Similarly, some paths will reach Mitochondrial Eve, but other paths will reach other women in her generation.

Most recent common ancestor

So, Mitochondrial Eve only has a special position with respect to the mother-daughter relationships, and Y-chromosomal Adam only with respect to the father-son relationships.

What if you consider all types of ancestry, father-son, father-daughter, mother-son and mother-daughter? In the resulting directed acyclic graph, neither the Y-chomosomal Adam nor the Mitochondrial Eve appear in a special position. In fact, in the combined graph, the most recent ancestor of all today’s people lived much later than Y-chromosomal Adam. The most recent ancestor is estimated to have lived roughly 15,000 to 5,000 years ago.

One way to visualize the relationship between Mitochondrial Eve, Y-chromosomal Adam, and the Most Recent Common Ancestor (MRCA) is to look at a small genealogy diagram with just a few people:

For the last generation consisting of just four people, this graph shows the Mitochondrial Eve, the Y-chromosomal Adam, and the most recent common ancestors (a couple in this example, but could also be one man or one woman). Adam is at the root of the blue tree, Eve is at the root of the red tree, and the most recent common ancestors are much lower in the graph.

The dating of Y-Adam and M-Eve

Finally, I’ll briefly give you an idea on how biologists calculate when Mitochondrial Eve and Y-chromosomal Adam lived.

The dating is based on DNA analysis. Changes in DNA accumulate at a certain rate that depends on various factors – region of the DNA, the species, population size, etc. To date Mitochondrial Eve, biologists calculate an estimate of the mitochondrial DNA (mtDNA) mutation rate. Then, they look at how much mtDNA varies between today’s women, and then calculate how long it would take to achieve that degree of variation.

Another interesting fact is that the titles of Mitochondrial Eve and Y-chromosomal Adam are not permanent, but instead are reassigned over time. For example, the woman who we call “Mitochondrial Eve” today did not hold that title during her lifetime. Instead, there was another unknown woman who was the most recent common matrilineal ancestor of all women alive at Eve’s time.

Final words

I hope you enjoyed the article. I originally learned about Y-chromosomal Adam and Mitochondrial Eve from reading Before the Dawn, and immediately knew I had to blog about them from a programmer’s perspective. If you want to read more on the topic, the Wikipedia page on Mitochondrial Eve is a good start.

Read more of my articles:

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?

And if you like my blog, subscribe!

Fast and slow if-statements: branch prediction in modern processors

Igor Ostrovsky — Sat, 15 May 2010 21:11:32 +0000

Did you know that the performance of an if-statement depends on whether its condition has a predictable pattern? If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive. In this article, I’ll explain why today’s processors behave this way.

Let’s measure the performance of this loop with different conditions:

for (int i = 0; i < max; i++) if () sum++;

Here are the timings of the loop with different True-False patterns:

Condition	Pattern	Time (ms)
(i & 0x80000000) == 0	T repeated	322
(i & 0xffffffff) == 0	F repeated	276
(i & 1) == 0	TF alternating	760
(i & 3) == 0	TFFFTFFF…	513
(i & 2) == 0	TTFFTTFF…	1675
(i & 4) == 0	TTTTFFFFTTTTFFFF…	1275
(i & 8) == 0	8T 8F 8T 8F …	752
(i & 16) == 0	16T 16F 16T 16F …	490

A “bad” true-false pattern can make an if-statement up to six times slower than a “good” pattern! Of course, which pattern is good and which is bad depends on the exact instructions generated by the compiler and on the specific processor.

Let’s look at the processor counters

One way to understand how the processor used its time is to look at the hardware counters. To help with performance tuning, modern processors track various counters as they execute code: the number of instructions executed, the number of various types of memory accesses, the number of branches encountered, and so forth. To read the counters, you’ll need a tool such as the profiler in Visual Studio 2010 Premium or Ultimate, AMD Code Analyst or Intel VTune.

To verify that the slowdowns we observed were really due to the if-statement performance, we can look at the Mispredicted Branches counter:

The worst pattern (TTFFTTFF…) results in 774 branch mispredictions, while the good patterns only get around 10. No wonder that the bad case took the longest 1.67 seconds, while the good patterns only took around 300ms!

Let’s take a look at what “branch prediction” does, and why it has a major impact on the processor performance.

What’s the role of branch prediction?

To explain what branch prediction is and why it impacts the performance numbers, we first need to take a look at how modern processors work. To complete each instruction, the CPU goes through these (and more) stages:

1. Fetch: Read the next instruction.

2. Decode: Determine the meaning of the instruction.

3. Execute: Perform the real ‘work’ of the instruction.

4. Write-back: Store results into memory.

An important optimization is that the stages of the pipeline can process different instructions at the same time. So, as one instruction is getting fetched, a second one is being decoded, a third is executing and the results of fourth are getting written back. Modern processors have pipelines with 10 – 31 stages (e.g., Pentium 4 Prescott has 31 stages), and for optimum performance, it is very important to keep all stages as busy as possible.

Image from http://commons.wikimedia.org/wiki/File:Pipeline,_4_stage.svg

Branches (i.e. conditional jumps) present a difficulty for the processor pipeline. After fetching a branch instruction, the processor needs to fetch the next instruction. But, there are two possible “next” instructions! The processor won’t be sure which instruction is the next one until the branching instruction makes it to the end of the pipeline.

Instead of stalling the pipeline until the branching instruction is fully executed, modern processors attempt to predict whether the jump will or will not be taken. Then, the processor can fetch the instruction that it thinks is the next one. If the prediction turns out wrong, the processor will simply discard the partially executed instructions that are in the pipeline. See the Wikipedia page on branch predictor implementation for some typical techniques used by processors to collect and interpret branch statistics.

Modern branch predictors are good at predicting simple patterns: all true, all false, true-false alternating, and so on. But if the pattern happens to be something that throws off the branch predictor, the performance hit will be significant. Thankfully, most branches have easily predictable patterns, like the two examples highlighted below:

int SumArray(int[] array) {
    if (array == null) throw new ArgumentNullException("array");

    int sum=0;
    for(int i=0; i; i++) sum += array[i];
    return sum;
}

The first highlighted condition validates the input, and so the branch will be taken only very rarely. The second highlighted condition is a loop termination condition. This will also almost always go one way, unless the arrays processed are extremely short. So, in these cases – as in most cases – the processor branch prediction logic will be effective at preventing stalls in the processor pipeline.

Updates and Clarifications

This article got picked up by reddit, and got a fair bit of attention in the reddit comments. I’ll respond to the questions, comments and criticisms below.

First, regarding the comments that optimizing for branch prediction is generally a bad idea: I agree. I do not argue anywhere in the article that you should try to write your code to optimize for branch prediction. For the vast majority of high-level code, I can’t even imagine how you’d do that.

Second, there was a concern whether the executed instructions for different cases differ in something else other than the constant value. They don’t – I looked at the JIT-ted assembly. If you’d like to see the JIT-ted assembly code or the C# source code, send me an email and I’ll send them back. (I am not posting the code here because I don’t want to blow up this update.)

Third, another question was on the surprisingly poor performance of the TTFF* pattern. The TTFF* pattern has a short period, and as such should be an easy case for the branch prediction algorithms.

However, the problem is that modern processors don’t track history for each branching instruction separately. Instead, they either track global history of all branches, or they have several history slots, each potentially shared by multiple branching instructions. Or, they can use some combination of these tricks, together with other techniques.

So, the TTFF pattern in the if-statement may not be TTFF by the time it gets to the branch predictor. It may get interleaved with other branches (there are 2 branching instructions in the body of a for-loop), and possibly approximated in other ways too. But, I don’t claim to be an expert on what precisely each processor does, and if someone reading this has an authoritative reference to how different processors behave (esp. Intel Core2 that I tested on), please let me know in comments.

Read more of my articles:

How GPU came to be used for general computation

And if you like my blog, subscribe!

Use C# dynamic typing to conveniently access internals of an object

Igor Ostrovsky — Fri, 02 Apr 2010 08:47:28 +0000

Sometimes you need to access private fields and call private methods on an object – for testing, experimentation, or to work around issues in third-party libraries.

.NET has long provided a solution to this problem: reflection. Reflection allows you to call private methods and read or write private fields from outside of the class, but is verbose and messy to write. In C# 4, this problem can be solved in a neat way using dynamic typing.

As a simple example, I’ll show you how to access internals of the List class from the BCL standard library. Let’s create an instance of the List type:

List<int> realList = new List<int>();

To access the internals of List, we’ll wrap it with ExposedObject – a type I’ll show you how to implement:

dynamic exposedList = ExposedObject.From(realList);

And now, via the exposedList object, we can access the private fields and methods of the List<> class:

// Read a private field - prints 0
Console.WriteLine(exposedList._size);

// Modify a private field
exposedList._items = new int[] { 5, 4, 3, 2, 1 };

// Modify another private field
exposedList._size = 5;

// Call a private method
exposedList.EnsureCapacity(20);

Of course, _size, _items and EnsureCapacity() are all private members of the List class! In addition to the private members, you can still access the public members of the exposed list:

// Add a value to the list
exposedList.Add(0);

// Enumerate the list. Prints "5 4 3 2 1 0"
foreach (var x in exposedList) Console.WriteLine(x);

Pretty cool, isn’t it?

How does ExposedObject work?

The example I showed uses ExposedObject to access private fields and methods of the List class. ExposedObject is a type I implemented using the dynamic typing feature in C# 4.

To create a dynamically-typed object in C#, you need to implement a class derived from DynamicObject. The derived class will implement a couple of methods whose role is to decide what to do whenever a method gets called or a property gets accessed on your dynamic object.

Here is a dynamic type that will print the name of any method you call on it:

class PrintingObject : DynamicObject 
{
    public override bool TryInvokeMember(
            InvokeMemberBinder binder, object[] args, out object result)
    {
        Console.WriteLine(binder.Name); // Print the name of the called method
        result = null;
        return true;
    }
}

Let’s test it out. This program prints “HelloWorld” to the console:

class Program 
{
    static void Main(string[] args)
    {
        dynamic printing = new PrintingObject();
        printing.HelloWorld();
        return;
    }
}

There is only a small step from PrintingObject to ExposedObject – instead of printing the name of the method, ExposedObject will find the appropriate method and invoke it via reflection. A simple version of the code looks like this:

class ExposedObjectSimple : DynamicObject 
{
    private object m_object;

    public ExposedObjectSimple(object obj)
    {
        m_object = obj;
    }

    public override bool TryInvokeMember(
            InvokeMemberBinder binder, object[] args, out object result)
    {
        // Find the called method using reflection
        var methodInfo = m_object.GetType().GetMethod(
            binder.Name,
            BindingFlags.NonPublic | BindingFlags.Public | BindingFlags.Instance);

        // Call the method
        result = methodInfo.Invoke(m_object, args);
        return true;
    }
}

This is all you need in order to implement a simple version of ExposedObject.

What else can you do with ExposedObject?

The ExposedObjectSimple implementation above illustrates the concept, but it is a bit naive – it does not handle field accesses, multiple methods with the same name but different argument lists, generic methods, static methods, and so on. I implemented a more complete version of the idea and published it as a CodePlex project: http://exposedobject.codeplex.com/

You can call static methods by creating an ExposedClass over the class. For example, let’s call the private File.InternalExists() static method:

dynamic fileType = ExposedClass.From(typeof(System.IO.File));
bool exists = fileType.InternalExists("somefile.txt");

You can call generic methods (static or non-static) too:

dynamic enumerableType = ExposedClass.From(typeof(System.Linq.Enumerable));
Console.WriteLine(
    enumerableType.Max<int>(new[] { 1, 3, 5, 3, 1 }));

Type inference for generics works too. You don’t have to specify the int generic argument in the Max method call:

dynamic enumerableType = ExposedClass.From(typeof(System.Linq.Enumerable));
Console.WriteLine(
    enumerableType.Max(new[] { 1, 3, 5, 3, 1 }));

And to clarify, all of these different cases have to be explicitly handled. Deciding which method should be called is normally done by the compiler. The logic on overload resolution is hidden away in the compiler, and so unavailable at runtime. To duplicate the method binding rules, we have to duplicate the logic of the compiler.

You can also cast a dynamic object either to its own type, or to a base class:

List<int> realList = new List<int>();
dynamic exposedList = ExposedObject.From(realList);

List<int> realList2 = exposedList;
Console.WriteLine(realList == realList2); // Prints "true"

So, after casting the exposed object (or assigning it into an appropriately typed variable), you can get back the original object!

Updates

Based on some of the responses the article is getting, I should clarify: I am definitely not suggesting that you use ExposedObject regularly in your production code! That would be a terrible idea.

As I said in the article, ExposedObject can be useful for testing, experimentation, and rare hacks. Also, it is an interesting example of how dynamic typing works in C#.

Also, apparently Mauricio Scheffer came up with the same idea almost a year before me.

Read more of my articles:

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?

Skip lists are fascinating!

And if you like my blog, subscribe!

How GPU came to be used for general computation

Igor Ostrovsky — Fri, 12 Mar 2010 09:30:56 +0000

The story of how GPU came to be used for high-performance computation is pretty cool. Hardware heavily optimized for graphics turned out to be useful for another use: certain types of high-performance computations. In this article, I will explore how and why this happened, and summarize the state of general computation on GPUs today.

Programmable graphics

The first step towards computation on the GPU was introduction of programmable shaders. Both DirectX and OpenGL added support for programmable shaders roughly a decade ago, giving game designers more freedom to create custom graphics effects. Instead of just composing pre-canned effects, graphic artists can now write little programs that execute directly on the GPU. As of DirectX 8, they can specify two types of shader programs for every object in the scene: a vertex shader and a pixel shader.

A vertex shader is a function invoked on every vertex in the 3D object. The function transforms the vertex and returns its position relative to the camera view. By transforming vertices, vertex shaders help implement realistic skin, clothes, facial expressions, and similar effects.

A pixel shader is a function invoked on every pixel covered by a particular object and returns the color of the pixel. To compute the output color, the pixel shader can use a variety of optional inputs: XY-position on the screen, XYZ-position in the scene, position in the texture, the direction of the surface (i.e., the normal vector), etc. Pixel shader can also read textures, bump maps, and other inputs.

Here is a simple scene, rendered with six different pixel shaders applied to the teapot:

A always returns the same color. B varies the color based on the screen Y-coordinate. C sets the color depending on the XYZ screen coordinates. D sets the color proportionally to the cosine of the angle between the surface normal and the light direction (“diffuse lighting”). E uses a slightly more complex lighting model and a texture, and F also adds a bump map.

If you are curious how lighting shaders are implemented, check out GamaSutra’s Implementing Lighting Models With HLSL.

Realization: shaders can be used for computation!

Let’s take a look at a simple pixel shader that just blurs a texture. This shader is implemented in HLSL (a DirectX shader language):

float4 ps_main( float2 t: TEXCOORD0 ) : COLOR0 
{ 
   float dx = 2/fViewportWidth; 
   float dy = 2/fViewportHeight; 
   return 
      0.2 * tex2D( baseMap, t ) + 
      0.2 * tex2D( baseMap, t + float2(dx, 0) ) + 
      0.2 * tex2D( baseMap, t + float2(-dx, 0) ) +      
      0.2 * tex2D( baseMap, t + float2(0, dy) ) + 
      0.2 * tex2D( baseMap, t + float2(0, -dy) ); 
}

The texture blur has this effect:

This is not exactly a breath-taking effect, but the interesting part is that simulations of car crashes, wind tunnels and weather patterns all follow this basic pattern of computation! All of these simulations are computations on a grid of points. Each point has one or more quantities associated with it: temperature, pressure, velocity, force, air flow, etc. In each iteration of the simulation, the neighboring points interact: temperatures and pressures are equalized, forces are transferred, grid is deformed, and so forth. Mathematically, the programs that run these simulations are partial differential equation (PDE) solvers.

As a trivial example, here is a simple simulation of heat dissipation. In each iteration, the temperature of each grid point is recomputed as an average over its nearest neighbors:

It is hard to overlook the fact that an iteration of this simulation is nearly identical to the blur operation. Hardware highly optimized for running pixel shaders will be able to run this simulation very fast. And, after years of refinement and challenges from latest and greatest games, GPUs became very efficient at using massive parallelism to execute shaders blazingly fast.

One cool example of a PDE solver is a liquid and smoke simulator. The structure of the simulation is similar to my trivial heat dissipation example, but instead of tracking the temperature of each grid point, the smoke simulator tracks pressure and velocity. Just as in the heat dissipation example, a grid point is affected by all of its nearest neighbors in each iteration.

This simulation was developed by Keenan Crane.

For a view into general computation on GPUs in 2004 when hacked-up pixel shaders were the state of the art, see the General-Purpose Computation on GPUs section of GPU Gems 2.

Arrival of GPGPU

Once GPUs have shown themselves to be a good fit for certain types of high-performance computations (like PDEs), GPU manufacturers moved to make GPUs more attractive for general-purpose computing. The idea of General Purpose computation on a GPU (“GPGPU”) became a hot topic in high-performance computing.

GPGPU computing is based around compute kernels. A compute kernel is a generalization of a pixel shader:

Like a pixel shader, a compute kernel is a routine that will be invoked on each point in the input space.
A pixel shader always operates on two-dimensional space. A compute kernel can work on space of any dimensionality.
A pixel shader returns a single color. A compute kernel can write an arbitrary number of outputs.
A pixel shader operates on 32-bit floating-point numbers. A compute kernel also supports 64-bit floating-point numbers and integers.
A pixel shader reads from textures. A compute kernel can read from any place in GPU memory.
Additionally, compute kernels running on the same core can share data via an explicitly managed per-core cache.

Comparison of a GPU and a CPU

The control flow of a modern application is typically very complicated. Just think about all the different tasks that must be completed to show this article in your browser. To display this blog article, the CPU has to communicate with various devices, maintain thousands of data structures, position thousands of UI elements, parse perhaps a hundred file formats, … that does not even begin to scratch the surface. And, not only are all of these tasks different, they also depend on each other in very complex ways.

Compare that with the control flow of a pixel shader. A pixel shader is a single routine that needs to be invoked roughly a million times, each time on a different input. Different shader invocations are pretty much independent and once all are done, the GPU starts over again with a scene where objects have moved a bit.

It shouldn’t come as a surprise that hardware optimized for running a pixel shader will be quite different from hardware optimized for tasks like viewing web pages. A CPU greatly benefits from a sophisticated execution pipeline with multi-level caches, instruction reordering, prefetching, branch prediction, and other clever optimizations. A GPU does not need most of those complex features for its much simpler control flow. Instead, a GPU benefits from lots of Arithmetic Logic Units (ALUs) to add, multiply and divide floating point numbers in parallel.

This table shows the most important differences between a CPU and a GPU today:

CPU	GPU
2-4 cores	16-32 cores
Each core runs 1-2 independent threads in parallel	Each core runs 16-32 threads in parallel. All threads on a core must execute the same instruction at any time.
Automatically managed hierarchy of caches	Each core has 16-64kB of cache, explicitly managed by the programmer
0.1 billion floating-point operations / second (0.1 TFLOP)	1 billion floating-point operations / second (1 TFLOP)
Main memory throughput: 10GB / sec	GPU memory throughput: 100GB / sec

All of this means that if a program can be broken up into many threads all doing the same thing on different data (ideally executing arithmetic operations), a GPU will probably be able to do this an order of magnitude faster than a CPU. On the other hand, on an application with a complex control flow, CPU is going to be the one winning by orders of magnitude. Going back to my earlier example, it should be clear why a CPU will excel at running a browser and a GPU will excel at executing a pixel shader.

This chart illustrates how a CPU and a GPU use up their “silicon budget”. A CPU uses most of its transistors for the L1 cache and for execution control. A GPU dedicates the bulk of its transistors to Arithmetic Logic Units (ALUs).

Adapted from NVidia’s CUDA Programming Guide.

NVidia’s upcoming Fermi chip will slightly change the comparison table. Fermi introduces a per-core automatically-managed L1 cache. It will be very interesting to see what kind of impact the introduction of an L1 cache will have on the types of programs that can run on the GPU. One point is fairly clear – the penalty for register spills into main memory will be greatly reduced (this point may not make sense until you read the next section).

GPGPU Programming

Today, writing efficient GPGPU programs requires in-depth understanding of the hardware. There are three popular programming models:

DirectCompute – Microsoft’s API for defining compute kernels, introduced in DirectX 11
CUDA – NVidia’s C-based language for programming compute kernels
OpenCL – API originally proposed by Apple and now developed by Khronos Group

Conceptually, the models are very similar. The table below summarizes some of the terminology differences between the models:

DirectCompute	CUDA	OpenCL
thread	thread	work item
thread group	thread block	work group
group-shared memory	shared memory	local memory
warp?	warp	wavefront
barrier	barrier	barrier

Writing high-performance GPGPU code is not for the faint at heart (although the same could probably be said about any type of high-performance computing). Here are examples of some issues you need to watch out for when writing compute kernels:

The program has to have plenty of threads (thousands)
Not too many threads, though, or cores will run out of registers and will have to simulate additional registers using main GPU memory.
It is important that threads running on one core access main memory in such a pattern that the hardware will be coalesce the memory accesses from different threads. This optimization alone can make an order of magnitude difference.
… and so on.

Explaining all of these performance topics in detail is well beyond the scope of this article, but hopefully this gives you an idea of what GPGPU programming is about, and what kinds of problems it can be applied to.

Read more of my articles:

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?

Skip lists are fascinating!

And if you like my blog, subscribe!

Volatile keyword in C# – memory model explained

Igor Ostrovsky — Tue, 23 Feb 2010 09:57:22 +0000

The memory model is a fascinating topic – it touches on hardware, concurrency, compiler optimizations, and even math.

The memory model defines what state a thread may see when it reads a memory location modified by other threads. For example, if one thread updates a regular non-volatile field, it is possible that another thread reading the field will never observe the new value. This program never terminates (in a release build):

class Test
{
    private bool _loop = true;

    public static void Main()
    {
        Test test1 = new Test();

        // Set _loop to false on another thread
        new Thread(() => { test1._loop = false;}).Start();

        // Poll the _loop field until it is set to false
        while (test1._loop == true) ;

        // The loop above will never terminate!
    }
}

There are two possible ways to get the while loop to terminate:

Use a lock to protect all accesses (reads and writes) to the _loop field
Mark the _loop field as volatile

There are two reasons why a read of a non-volatile field may observe a stale value: compiler optimizations and processor optimizations.

In concurrent programming, threads can get interleaved in many different ways, resulting in possibly many different outcomes. But as the example with the infinite loop shows, threads do not just get interleaved – they potentially interact in more complex ways, unless you correctly use locks and volatile fields.

Compiler optimizations

The first reason why a non-volatile read may return a stale value has to do with compiler optimizations. In the infinite loop example, the JIT compiler optimizes the while loop from this:

while (test1._loop == true) ;

To this:

if (test1._loop) { while (true); }

This is an entirely reasonable transformation if only one thread accesses the _loop field. But, if another thread changes the value of the field, this optimization can prevent the reading thread from noticing the updated value.

If you mark the _loop field as volatile, the compiler will not hoist the read out of the loop. The compiler will know that other threads may be modifying the field, and so it will be careful to avoid optimizations that would result in a read of a stale value.

The code transformation I showed is a close approximation of the optimization done by the CLR JIT compiler, but not completely exact.

The full story is that the assembly code emitted by the JIT compiler will store the value test1._loop in the EAX register. The loop condition will keep polling the register, and will read test1._loop from memory again. Even when the thread is pre-empted, the CPU registers get saved. Once the thread is again scheduled to run, the same stale EAX register value will be restored, and the loop never terminates.

The assembly code generated by the while loop looks as follows:

00000068  test        eax,eax 
0000006a  jne         00000068

If you make the _loop field volatile, this code is generated instead:

00000064  cmp         byte ptr [eax+4],0 
00000068  jne         00000064

If the _loop field is not volatile, the compiler will store _loop in the EAX register. If _loop is volatile, the compiler will instead keep the test1 variable in EAX, and the value of _loop will be re-fetched from memory on each access (by “ptr [eax+4]”).

From my experience playing around with the current version of the CLR, I get the impression that these kinds of compiler optimizations are not terribly frequent. On x86 and x64, often the same assembly code will be generated regardless of whether a field is volatile or not. On IA64, the situation is a bit different – see the next section.

Processor optimizations

On some processors, not only must the compiler avoid certain optimizations on volatile reads and writes, it also has to use special instructions. On a multi-core machine, different cores have different caches. The processors may not bother to keep those caches coherent by default, and special instructions may be needed to flush and refresh the caches.

The mainstream x86 and x64 processors implement a strong memory model where memory access is effectively volatile. So, a volatile field forces the compiler to avoid some high-level optimizations like hoisting a read out of a loop, but otherwise results in the same assembly code as a non-volatile read.

The Itanium processor implements a weaker memory model. To target Itanium, the JIT compiler has to use special instructions for volatile memory accesses: LD.ACQ and ST.REL, instead of LD and ST. Instruction LD.ACQ effectively says, “refresh my cache and then read a value” and ST.REL says, “write a value to my cache and then flush the cache to main memory”. LD and ST on the other hand may just access the processor’s cache, which is not visible to other processors.

For the reasons explained in this section and the previous sections, marking a field as volatile will often incur zero performance penalty on x86 and x64.

The x86/x64 instruction set actually does contains three fence instructions: LFENCE, SFENCE, and MFENCE. LFENCE and SFENCE are apparently not needed on the current architecture, but MFENCE is useful to go around one particular issue: if a core reads a memory location it previously wrote, the read may be served from the store buffer, even though the write has not yet been written to memory. [Source] I don’t actually know whether the CLR JIT ever inserts MFENCE instructions.

Volatile accesses in more depth

To understand how volatile and non-volatile memory accesses work, you can imagine each thread as having its own cache. Consider a simple example with a non-volatile memory location (i.e. a field) u, and a volatile memory location v.

A non-volatile write could just update the value in the thread’s cache, and not the value in main memory:

However, in C# all writes are volatile (unlike say in Java), regardless of whether you write to a volatile or a non-volatile field. So, the above situation actually never happens in C#.

A volatile write updates the thread’s cache, and then flushes the entire cache to main memory. If we were to now set the volatile field v to 11, both values u and v would get flushed to main memory:

Since all C# writes are volatile, you can think of all writes as going straight to main memory.

A regular, non-volatile read can read the value from the thread’s cache, rather than from main memory. Despite the fact that thread 1 set u to 11, when thread 2 reads u, it will still see value 10:

When you read a non-volatile field in C#, a non-volatile read occurs, and you may see a stale value from the thread’s cache. Or, you may see the updated value. Whether you see the old or the new value depends on your compiler and your processor.

Finally, let’s take a look at an example of a volatile read. Thread 2 will read the volatile field v:

Before the volatile read, thread 2 refreshes its entire cache, and then reads the updated value of v: 11. So, it will observe the value that is really in main memory, and also refresh its cache as a bonus.

Note that the thread caches that I described are imaginary – there really is no such thing as a thread cache. Threads only appear to have these caches as an artifact of compiler and processor optimizations.

One interesting point is that all writes in C# are volatile according to the memory model as documented here and here, and are also presumably implemented as such. The ECMA specification of the C# language actually defines a weaker model where writes are not volatile by default.

You may find it surprising that a volatile read refreshes the entire cache, not just the read value. Similarly, a volatile write (i.e., every C# write) flushes the entire cache, not just the written value. These semantics are sometimes referred to as “strong volatile semantics”.

The original Java memory model designed in 1995 was based on weak volatile semantics, but was changed in 2004 to strong volatile. The weak volatile model is very inconvenient. One example of the problem is that the “safe publication” pattern is not safe. Consider this example:

volatile string[] _args = null;

public void Write() {
    string[] a = new string[2];
    a[0] = "arg1";
    a[1] = "arg2";
    _args = a;
    ...
}

public void Read() {
    if (_args != null) {
        // Under weak volatile semantics, this assert could fail!
        Debug.Assert(_args[0] != null);
    }
}

Under strong volatile semantics (i.e., the .NET and C# volatile semantics), a non-null value in the _args field guarantees that the elements of _args are also not null. The safe publication pattern is very useful and commonly used in practice.

Memory model and .NET operations

Here is a table of how various .NET operations interact with the imaginary thread cache:

Construct	Refreshes thread cache before?	Flushes thread cache after?	Notes
Ordinary read	No	No	Read of a non-volatile field
Ordinary write	No	Yes	Write of a non-volatile field
Volatile read	Yes	No	Read of volatile field, or Thread.VolatileRead
Volatile write	No	Yes	Write of a volatile field – same as non-volatile
Thread.MemoryBarrier	Yes	Yes	Special memory barrier method
Interlocked operations	Yes	Yes	Increment, Add, Exchange, etc.
Lock acquire	Yes	No	Monitor.Enter or entering a lock {} region
Lock release	No	Yes	Monitor.Exit or exiting a lock {} region

For each operation, the table shows two things:

Is the entire imaginary thread cache refreshed from main memory before the operation?
Is the entire imaginary thread cache flushed to main memory after the operation?

Disclaimer and limitations of the model

This blog post reflects my personal understanding of the .NET memory model, and is based purely on publicly available information.

I find the explanation based on imaginary thread caches more intuitive than the more commonly used explanation based on operation reordering. The thread cache model is also accurate for most intents and purposes.

To be even more accurate, you should assume that the thread caches can form an arbitrary large hierarchy, and so you cannot assume that a read is served only from two possible places – main memory or the thread’s cache. I think that you would have to construct a somewhat of a clever case in order for the cache hierarchy to make a difference, though. If anyone is aware of a case where the hierarchical thread cache model makes a prediction different from the reordering-based model, I would love to hear about it.

If you are interested in the .NET memory model, I encourage you to read Understand the Impact of Low-Lock Techniques in Multithreaded Apps in the MSDN Magazine, and the Memory model blog post from Chris Brumme.

Read more of my articles:

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?

7 tricks to simplify your programs with LINQ

And if you like my blog, subscribe!

What really happens when you navigate to a URL

Igor Ostrovsky — Tue, 09 Feb 2010 08:14:26 +0000

As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.

In this article, we will take a deeper look at the sequence of events that take place when you visit a URL.

1. You enter a URL into the browser

It all starts here:

2. The browser looks up the IP address for the domain name

The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:

Browser cache – The browser caches DNS records for some time. Interestingly, the OS does not tell the browser the time-to-live for each DNS record, and so the browser caches them for a fixed duration (varies between browsers, 2 – 30 minutes).
OS cache – If the browser cache does not contain the desired record, the browser makes a system call (gethostbyname in Windows). The OS has its own cache.
Router cache – The request continues on to your router, which typically has its own DNS cache.
ISP DNS cache – The next place checked is the cache ISP’s DNS server. With a cache, naturally.
Recursive search – Your ISP’s DNS server begins a recursive search, from the root nameserver, through the .com top-level nameserver, to Facebook’s nameserver. Normally, the DNS server will have names of the .com nameservers in cache, and so a hit to the root nameserver will not be necessary.

Here is a diagram of what a recursive DNS search looks like:

One worrying thing about DNS is that the entire domain like wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are ways of mitigating the bottleneck:

Round-robin DNS is a solution where the DNS lookup returns multiple IP addresses, rather than just one. For example, facebook.com actually maps to four IP addresses.
Load-balancer is the piece of hardware that listens on a particular IP address and forwards the requests to other servers. Major sites will typically use expensive high-performance load balancers.
Geographic DNS improves scalability by mapping a domain name to different IP addresses, depending on the client’s geographic location. This is great for hosting static content so that different servers don’t have to update shared state.
Anycast is a routing technique where a single IP address maps to multiple physical servers. Unfortunately, anycast does not fit well with TCP and is rarely used in that scenario.

Most of the DNS servers themselves use anycast to achieve high availability and low latency of the DNS lookups.

3. The browser sends a HTTP request to the web server

You can be pretty sure that Facebook’s homepage will not be served from the browser cache because dynamic pages expire either very quickly or immediately (expiry date set to past).

So, the browser will send this request to the Facebook server:

GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]

The GET request names the URL to fetch: “http://facebook.com/”. The browser identifies itself (User-Agent header), and states what types of responses it will accept (Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for further requests.

The request also contains the cookies that the browser has for this domain. As you probably already know, cookies are key-value pairs that track the state of a web site in between different page requests. And so the cookies store the name of the logged-in user, a secret number that was assigned to the user by the server, some of user’s settings, etc. The cookies will be stored in a text file on the client, and sent to the server with every request.

There is a variety of tools that let you view the raw HTTP requests and corresponding responses. My favorite tool for viewing the raw HTTP traffic is fiddler, but there are many other tools (e.g., FireBug) These tools are a great help when optimizing a site.

In addition to GET requests, another type of requests that you may be familiar with is a POST request, typically used to submit forms. A GET request sends its parameters via the URL (e.g.: http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just under the headers.

The trailing slash in the URL “http://facebook.com/” is important. In this case, the browser can safely add the slash. For URLs of the form http://example.com/folderOrFile, the browser cannot automatically add a slash, because it is not clear whether folderOrFile is a folder or a file. In such cases, the browser will visit the URL without the slash, and the server will respond with a redirect, resulting in an unnecessary roundtrip.

4. The facebook server responds with a permanent redirect

This is the response that the Facebook server sent back to the browser request:

HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
      pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP="DSP LAW"
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT;
      path=/; domain=.facebook.com; httponly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, 12 Feb 2010 05:09:51 GMT
Content-Length: 0

The server responded with a 301 Moved Permanently response to tell the browser to go to “http://www.facebook.com/” instead of “http://facebook.com/”.

There are interesting reasons why the server insists on the redirect instead of immediately responding with the web page that the user wants to see.

One reason has to do with search engine rankings. See, if there are two URLs for the same page, say http://www.igoro.com/ and http://igoro.com/, search engine may consider them to be two different sites, each with fewer incoming links and thus a lower ranking. Search engines understand permanent redirects (301), and will combine the incoming links from both sources into a single ranking.

Also, multiple URLs for the same content are not cache-friendly. When a piece of content has multiple names, it will potentially appear multiple times in caches.

5. The browser follows the redirect

The browser now knows that “http://www.facebook.com/” is the correct URL to go to, and so it sends out another GET request:

GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=XW[...]; c_user=21[...]; x-referer=[...]
Host: www.facebook.com

The meaning of the headers is the same as for the first request.

6. The server ‘handles’ the request

The server will receive the GET request, process it, and send back a response.

This may seem like a straightforward task, but in fact there is a lot of interesting stuff that happens here – even on a simple site like my blog, let alone on a massively scalable site like facebook.

Web server software
The web server software (e.g., IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle this request. A request handler is a program (in ASP.NET, PHP, Ruby, …) that reads the request and generates the HTML for the response.
In the simplest case, the request handlers can be stored in a file hierarchy whose structure mirrors the URL structure, and so for example http://example.com/folder1/page1.aspx URL will map to file /httpdocs/folder1/page1.aspx. The web server software can also be configured so that URLs are manually mapped to request handlers, and so the public URL of page1.aspx could be http://example.com/folder1/page1.
Request handler
The request handler reads the request, its parameters, and cookies. It will read and possibly update some data stored on the server. Then, the request handler will generate a HTML response.

One interesting difficulty that every dynamic website faces is how to store data. Smaller sites will often have a single SQL database to store their data, but sites that store a large amount of data and/or have many visitors have to find a way to split the database across multiple machines. Solutions include sharding (splitting up a table across multiple databases based on the primary key), replication, and usage of simplified databases with weakened consistency semantics.

One technique to keep data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the newsfeed in a timely fashion, but the data backing the “People you may know” feature may only need to be updated nightly (my guess, I don’t actually know how they implement this feature). Batch job updates result in staleness of some less important data, but can make data updates much faster and simpler.

7. The server sends back a HTML response

Here is the response that the server generated and sent back:

HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
    pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
P3P: CP="DSP LAW"
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, 12 Feb 2010 09:05:55 GMT

2b3
��������T�n�@����[...]

The entire response is 36 kB, the bulk of them in the byte blob at the end that I trimmed.

The Content-Encoding header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you’ll see the HTML you’d expect:

...

In addition to compression, headers specify whether and how to cache the page, any cookies to set (none in this response), privacy information, etc.

Notice the header that sets Content-Type to text/html. The header instructs the browser to render the response content as HTML, instead of say downloading it as a file. The browser will use the header to decide how to interpret the response, but will consider other factors as well, such as the extension of the URL.

8. The browser begins rendering the HTML

Even before the browser has received the entire HTML document, it begins rendering the website:

9. The browser sends requests for objects embedded in HTML

As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files.

Here are a few URLs that my visit to facebook.com retrieved:

Images
http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
…
CSS style sheets
http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
…
JavaScript files
http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
…

Each of these URLs will go through process a similar to what the HTML page went through. So, the browser will look up the domain name in DNS, send a request to the URL, follow redirects, etc.

However, static files – unlike dynamic pages – allow the browser to cache them. Some of the files may be served up from cache, without contacting the server at all. The browser knows how long to cache a particular file because the response that returned the file contained an Expires header. Additionally, each response may also contain an ETag header that works like a version number – if the browser sees an ETag for a version of the file it already has, it can stop the transfer immediately.

Can you guess what “fbcdn.net” in the URLs stands for? A safe bet is that it means “Facebook content delivery network”. Facebook uses a content delivery network (CDN) to distribute static content – images, style sheets, and JavaScript files. So, the files will be copied to many machines across the globe.

Static content often represents the bulk of the bandwidth of a site, and can be easily replicated across a CDN. Often, sites will use a third-party CDN provider, instead of operating a CND themselves. For example, Facebook’s static files are hosted by Akamai, the largest CDN provider.

As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Also, interestingly, if you ping the URL a couple of times, may get responses from different servers, which demonstrates the load-balancing that happens behind the scenes.

10. The browser sends further asynchronous (AJAX) requests

In the spirit of Web 2.0, the client continues to communicate with the server even after the page is rendered.

For example, Facebook chat will continue to update the list of your logged in friends as they come and go. To update the list of your logged-in friends, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.

This pattern is sometimes referred to as “AJAX”, which stands for “Asynchronous JavaScript And XML”, even though there is no particular reason why the server has to format the response as XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.

Among other things, the fiddler tool lets you view the asynchronous requests sent by your browser. In fact, not only you can observe the requests passively, but you can also modify and resend them. The fact that it is this easy to “spoof” AJAX requests causes a lot of grief to developers of online games with scoreboards. (Obviously, please don’t cheat that way.)

Facebook chat provides an example of an interesting problem with AJAX: pushing data from server to client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see if any new messages arrived.

Long polling is an interesting technique to decrease the load on the server in these types of scenarios. If the server does not have any new messages when polled, it simply does not send a response back. And, if a message for this client is received within the timeout period, the server will find the outstanding request and return the message with the response.

Conclusion

Hopefully this gives you a better idea of how the different web pieces work together.

Read more of my articles:

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?