<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Igor Ostrovsky Blogging</title>
	<atom:link href="http://igoro.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://igoro.com</link>
	<description>On programming, technology, and random things of interest</description>
	<lastBuildDate>Wed, 08 Oct 2014 20:52:39 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.0.38</generator>
	<item>
		<title>How RAID-6 dual parity calculation works</title>
		<link>http://igoro.com/archive/how-raid-6-dual-parity-calculation-works/</link>
		<comments>http://igoro.com/archive/how-raid-6-dual-parity-calculation-works/#comments</comments>
		<pubDate>Mon, 29 Sep 2014 05:09:18 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Algorithms]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=765</guid>
		<description><![CDATA[Parity protection is a common technique for reliable data storage on mediums that may fail (HDDs, SSDs, storage servers, etc.) Calculation of parity uses tools from algebra in very interesting ways, in particular in the dual parity case. This article is for computer engineers who would like to have a high-level understanding of how the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Parity protection is a common technique for reliable data storage on mediums that may fail (HDDs, SSDs, storage servers, etc.) Calculation of parity uses tools from algebra in very interesting ways, in particular in the dual parity case.</p>
<p>This article is for computer engineers who would like to have a high-level understanding of how the dual parity calculation works, without diving into all of the mathematical details.</p>
<h3>Single parity</h3>
<p>Let&#8217;s start with the simple case &#8211; single parity. As an example, say that you need to reliably store 5 data blocks of a fixed size (e.g., 4kB). You can put the 5 data blocks on 5 different storage devices, and then store a parity block on a 6th device. If any one device fails, you can still recompute the block from the failed device, and so you haven&#8217;t lost any data.</p>
<p>A simple way to calculate the parity <em>P</em> is just to combine the values <em>a, b, c, d, e</em> with the exclusive-or (XOR) operator:</p>
<p>&#160;&#160;&#160; <em>P</em> = <em>a</em> XOR <em>b </em>XOR <em>c </em>XOR <em>d </em>XOR <em>e</em></p>
<p>Effectively, every parity bit protects the corresponding bits of <em>a, b, c, d, e</em>.</p>
<ul>
<li>If the first parity bit is 0, you know that the number of 1 bits among the first bits of <em>a, b, c, d, e</em> is even. </li>
<li>Otherwise the number of 1 bits is odd. </li>
</ul>
<p>Given <em>a, b, c, e </em>and <em>P</em>, it is easy to reconstruct the missing block <em>d</em>:</p>
<p>&#160;&#160;&#160; <em>d</em> = <em>P </em>XOR <em>a </em>XOR <em>b </em>XOR <em>c </em>XOR <em>e</em></p>
<p>Straightforward so far. Exclusive-or parity is commonly used in storage systems as RAID-5 configuration:</p>
<p><a href="http://en.wikipedia.org/wiki/File:RAID_5.svg"><img title="640px-RAID_5.svg" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="640px-RAID_5.svg" src="http://igoro.com/wordpress/wp-content/uploads/2014/09/640px-RAID_5.svg_.png" width="400" height="296" /></a></p>
<p>RAID-5 uses the exclusive-or parity approach, except that the placement of parity is rotated among the storage devices. Parity blocks gets more overwrites than data blocks, so it makes sense to distribute them among the devices.</p>
<h3>What about double parity?</h3>
<p>Single parity protects data against the loss of a single storage device. But, what if you want protection against the loss off two devices? Can we extend the parity idea to protect the data even in that scenario?</p>
<p>It turns out that we can. Consider defining <em>P</em> and <em>Q</em> like this:</p>
<p>&#160;&#160;&#160; <em>P</em> = <em>a</em> + <em>b</em> + <em>c</em> + <em>d</em> + <em>e</em>     <br />&#160;&#160;&#160; <em>Q</em> = 1<em>a</em> + 2<em>b</em> + 3<em>c</em> + 4<em>d</em> + 5<em>e</em></p>
<p>If you lose any two of { <em>a, b, c, d, e, P, Q</em> }, you can recover them by plugging in the known values into the equations above, and solving for the two unknowns. Simple.</p>
<p>For example, say that { <em>a, b, c, d, e</em> } are { 10, 5, 8, 13, 2 }. Then:</p>
<p>&#160;&#160;&#160; <em>P</em> = 10 + 5 + 8 + 13 + 2 = 38     <br />&#160;&#160;&#160; <em>Q</em> = 10 + 2 × 5 + 3 × 8 + 4 × 13 + 5 × 2 = 106</p>
<p>If you lost <em>c</em> and <em>d</em>, you can recover them by solving the original equations.</p>
<h3>How come you can always solve the equations?</h3>
<p>To explain why we can always solve the equations to recover two missing blocks, we need to take a short detour into matrix algebra. We can express the relationship between <em>a, b, c, d, e, P, Q</em> as the following matrix-vector multiplication:</p>
<pre><p><img title="image" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2014/09/image.png" width="261" height="128" /></p></pre>
<p>We have 7 linear equations of 5 variables. In our example, we lose blocks <em>c</em> and <em>d</em> and end up with 5 equations of 5 variables. Effectively, we lost two rows in the matrix and the two matching rows in the result vector: </p>
<p><img title="image" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2014/09/image1.png" width="281" height="95" /> </p>
<p>&#160; <br />This system of equations is solvable because the matrix is invertible: the five rows are all linearly independent.</p>
<p>In fact, if you look at the original matrix, <strong>any</strong> five rows would have been linearly independent, so we could solve the equations regardless of which two devices were missing.</p>
<h3>But, integers are inconvenient&#8230;</h3>
<p>Calculating <em>P</em> and <em>Q</em> from the formulas above will allow you to recover up to two lost data blocks.</p>
<p>But, there is an annoying technical problem: the calculated values of <em>P</em> and <em>Q</em> can be significantly larger than the original data values. For example, if the data block is 1 byte, its value ranges from 0 to 255. But in that case, the <em>Q</em> value can reach 3825!&#160; Ideally, you&#8217;d want <em>P</em> and <em>Q</em> to have the same range as the data values themselves. For example, if you are dealing with 1-byte blocks, it would be ideal if <em>P</em> and <em>Q</em> were 1 byte as well.</p>
<p>We&#8217;d like to redefine the arithmetic operators +, –, ×, / such that our equations still work, but the values of <em>P</em> and <em>Q</em> stay within the ranges of the input values.</p>
<p>Thankfully, this is a well-studied problem in algebra. One simple approach that works is to pick a prime number M, and change the meaning of +, –, and × so that values &quot;wrap around&quot; at M. For example, if we picked M=7, then:</p>
<p>&#160;&#160;&#160; 6 + 1 = 0 (mod 7)<br />
  <br />&#160;&#160;&#160; 4 + 5 = 2 (mod 7) </p>
<p>&#160;&#160;&#160; 3 × 3 = 2 (mod 7) </p>
<p>&#160;&#160;&#160; etc.</p>
<p>Here is how to visualize the field:</p>
<p><img title="field_mod_7" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="field_mod_7" src="http://igoro.com/wordpress/wp-content/uploads/2014/09/field_mod_7.png" width="233" height="226" /></p>
<p>Since 7 is a prime number, this is an example of a finite field. Additions, subtractions, multiplications and even divisions work in a finite field, and so our parity equations will work even modulo 7. In some sense, finite fields are more convenient than integers: dividing two integers does not necessarily give you an integer (e.g., 5 / 2), but in a finite field, division of two values always gives you another value in the field (e.g., 5 / 2 = 6 mod 7). Division modulo a prime is not trivial to calculate &#8211; you need to use the extended Euclidean algorithm &#8211; but it is pretty fast.</p>
<p>But, modular arithmetic forms a finite field only if the modulus is prime. If the modulus is composite, we have a finite ring, but not a finite field, which is not good enough for our equations to work.</p>
<p>Furthermore, in computing we like to deal with bytes or multiples of bytes. And, the range of a byte is 256, which is definitely not a prime number. 257 is a prime number, but that still doesn&#8217;t really work for us&#8230; if the inputs are bytes, our <em>P</em> and <em>Q</em> values will sometimes come out as 256, which doesn&#8217;t fit into a byte.</p>
<h3>Finally, Galois Fields</h3>
<p>The finite fields we described so far are all of size M, where M is a prime number. It is possible to construct finite fields of size <em>M<sup>k</sup></em>, where <em>M</em> is a prime number and <em>k</em> is a positive integer.</p>
<p>The solution is called a Galois field. The particular Galois field of interest is <em>GF</em>(2<sup>8</sup>) where:</p>
<ul>
<li>Addition is XOR. </li>
<li>Subtraction is same as addition; also XOR. </li>
<li>Multiplication can be implemented as <em>gf_ilog</em>[<em>gf_log</em>[<em>a</em>] + <em>gf_log</em>[<em>b</em>]], where <em>gf_log</em> is a logarithm in the Galois field, and <em>gf_ilog</em> is its inverse. <em>gf_log</em> and <em>gf_ilog</em> can be tables computed ahead of time. </li>
<li>Division can then calculated as <em>gf_ilog</em>[<em>gf_log</em>[<em>a</em>] – <em>gf_log</em>[<em>b</em>]], so we can reuse the same tables used by multiplication. </li>
</ul>
<p>The actual calculation of the logarithm and inverse logarithm tables is beyond the scope of the article, but if you would like to learn more about the details ,there are plenty of in-depth articles on Galois fields online.</p>
<p>In the end, a <em>GF</em>(2<sup>8</sup>) field gives us exactly what we wanted. It is a field of size 2<sup>8</sup>, and it gives us efficient +, –, ×, / operators that we can use to calculate <em>P</em> and <em>Q</em>, and if needed recalculate a lost data block from the remaining data blocks and parities. And, each of P and Q perfectly fits into a byte. The resulting coding is called Reed-Solomon error correction, and that’s the method used by RAID 6 configuration:</p>
<p><a href="http://en.wikipedia.org/wiki/File:RAID_6.svg"><img title="640px-RAID_6.svg" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="640px-RAID_6.svg" src="http://igoro.com/wordpress/wp-content/uploads/2014/09/640px-RAID_6.svg_.png" width="400" height="235" /></a></p>
<h3>Final Comments</h3>
<p>The article explains the overall approach to implementing a RAID-6 storage scheme based on GF(2<sup>8</sup>). It is possible to use other Galois fields, such as the 16-bit GF(2<sup>16</sup>), but the 8-bit field is most often used in practice.</p>
<p>Also, it is possible to calculate additional parities to protect against additional device failures:</p>
<p>&#160;&#160;&#160; R = 1<sup>2</sup>a + 2<sup>2</sup>B + 3<sup>2</sup>C + 4<sup>2</sup>D + 5<sup>2</sup>E </p>
<p>&#160;&#160;&#160; S = 1<sup>3</sup>a + 2<sup>3</sup>B + 3<sup>3</sup>C + 4<sup>3</sup>D + 5<sup>3</sup>E </p>
<p>&#160;&#160;&#160; &#8230;</p>
<p>Finally, the examples in this article used 5 data blocks per stripe, but that&#8217;s an arbitrary choice &#8211; any number works, as far as the parity calculation is concerned.</p>
<h3>Related Reading</h3>
<p><a href="https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf">The mathematics of RAID-6</a>, H. Peter Anvin.</p>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/how-raid-6-dual-parity-calculation-works/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Is two to the power of infinity more than infinity?</title>
		<link>http://igoro.com/archive/is-two-to-the-power-of-infinity-more-than-infinity/</link>
		<comments>http://igoro.com/archive/is-two-to-the-power-of-infinity-more-than-infinity/#comments</comments>
		<pubDate>Thu, 21 Apr 2011 15:11:45 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Cool Stuff]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=583</guid>
		<description><![CDATA[Do you know whether this inequality is true? It’s a simple question, but with an intriguing and revealing answer. Infinity #1 One concept of infinity that most people would have encountered in a math class&#160; is the infinity of limits. With limits, we can try to understand 2∞ as follows: The infinity symbol is used [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Do you know whether this inequality is true?</p>
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="clip_image002[3]" border="0" alt="clip_image002[3]" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/clip_image00232.gif" width="64" height="25" /></p>
<p>It’s a simple question, but with an intriguing and revealing answer.</p>
<h3>Infinity #1</h3>
<p>One concept of infinity that most people would have encountered in a math class&#160; is the infinity of limits. With limits, we can try to understand 2<sup>∞</sup> as follows:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="clip_image002" border="0" alt="clip_image002" src="http://igoro.com/wordpress/wp-content/uploads/2010/05/clip_image002.gif" width="95" height="32" /></p>
<p>The infinity symbol is used twice here: first time to represent “as x grows”, and a second to time to represent “2<sup>x</sup> eventually permanently exceeds any specific bound”.</p>
<p>If we use the notation a bit loosely, we could “simplify” the limit above as follows:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="clip_image002[4]" border="0" alt="clip_image002[4]" src="http://igoro.com/wordpress/wp-content/uploads/2010/05/clip_image0024.gif" width="64" height="25" /></p>
<p>This would suggest that the answer to the question in the title is “No”, but as will be apparent shortly, using infinity notation loosely is not a good idea.</p>
<h3>Infinity #2</h3>
<p>In addition to limits, there is another place in mathematics where infinity is important: in set theory.</p>
<p>Set theory recognizes infinities of multiple “sizes”, the smallest of which is the set of positive integers: { 1, 2, 3, … }. A set whose size is equal to the size of positive integer set is called <em>countably infinite</em>.</p>
<ul>
<li><strong>“Countable infinity plus one”        <br /></strong>If we add another element (say 0) to the set of positive integers, is the new set any larger? To see that it cannot be larger, you can look at the problem differently: in set { 0, 1, 2, … } each element is simply smaller by one, compared to the set { 1, 2, 3, … }. So, even though we added an element to the infinite set, we really just “<strong>relabeled</strong>” the elements by decrementing every value.       </p>
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image.png" width="151" height="73" />       </li>
<li><strong>“Two times countable infinity”</strong>       <br />Now, let’s “double” the set of positive integers by adding values 0.5, 1.5, 2.5, … The new set might seem larger, since it contains an infinite number of new values. But again, you can say that the sets are the same size, just each element is half the size:
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image1.png" width="151" height="73" />       </li>
<li><strong>“Countable infinity squared”</strong>       <br />To “square” countable infinity, we can form a set that will contain all integer <em>pairs</em>, such as [1,1], [1,2], [2,2] and so on. By pairing up every integer with every integer, we are effectively squaring the size of the integer set.
<p>Can pairs of integers also be basically just relabeled with integers? Yes, they can, and so the set of integer pairs is no larger than the set of integers. The diagram below shows how integer pairs can be “relabeled” with ordinary integers (e.g., pair [2,2] is labeled as 5):       </p>
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image2.png" width="149" height="154" />       </li>
<li><strong>“Two to the power of countable infinity”</strong>       <br />The set of integers contains a countable infinity of elements, and so the set of all integer <strong>subsets </strong>should – loosely speaking – contain <em>two to the power of countable infinity</em> elements. So, is the number of integer subsets equal to the number of integers? It turns out that the “relabeling” trick we used in the first three examples does not work here, and so it appears that there are <strong>more</strong> integer subsets than there are integers. </li>
</ul>
<p>Let’s look at the fourth example in more detail to understand why it is so fundamentally different from the first three. You can think of an integer subset as a binary number with an infinite sequence of digits: <em>i</em>-th digit is 1 if <em>i</em> is included in the subset and 0 if <em>i</em> is excluded. So, a typical integer subset is a sequence of ones and zeros going forever and ever, with no pattern emerging.</p>
<p>And now we are getting to the key difference. Every integer, half-integer, or integer pair can be described using a <em>finite number of bits</em>. That’s why we can squint at the set of integer pairs and see that it really is just a set of integers. Each integer pair can be easily converted to an integer and back.</p>
<p>However, an integer subset is an <em>infinite</em> sequence of bits. It is impossible to describe a general scheme for converting an infinite sequence of bits into a finite sequence without information loss. That is why it is impossible to squint at the set of integer subsets and argue that it really is just a set of integers.</p>
<p>The diagram below shows examples of infinite sets of three different sizes:</p>
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image7.png" width="613" height="325" /></p>
<p>So, in set theory, there are multiple infinities. The smallest infinity is the “countable” infinity, <img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image4.png" width="10" height="12" /><sub>0</sub>, that matches the number of integers. A larger infinity is <img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image_thumb.png" width="10" height="12" /><sub></sub><sub>1</sub> that matches the number of real numbers or integer subsets. And there are even larger and larger infinite sets.</p>
<p>Since there are more integer subsets than there are integers, it should not be surprising that the mathematical formula below holds (you can find the formula in the Wikipedia article on <a href="http://en.wikipedia.org/wiki/Continuum_hypothesis">Continuum Hypothesis</a>):</p>
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="clip_image002[4]" border="0" alt="clip_image002[4]" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/clip_image0024.png" width="85" height="30" /></p>
<p>And since <img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/image4.png" width="10" height="12" /><sub>0</sub> denotes infinity (the smallest kind), it seems that it would not be much of a stretch to write this:</p>
<p><img style="background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px" title="clip_image0023" border="0" alt="clip_image0023" src="http://igoro.com/wordpress/wp-content/uploads/2011/04/clip_image00231.gif" width="64" height="25" /></p>
<p>… and now it seems that the answer to the question from the title should be “Yes”.</p>
<h3>The answer</h3>
<p>So, is it true that that 2<sup>∞</sup> &gt; ∞? The answer depends on which notion of infinity we use. The infinity of limits has no size concept, and the formula would be false. The infinity of set theory does have a size concept and the formula would be kind of true.</p>
<p>Technically, statement 2<sup>∞</sup> &gt; ∞ is neither true nor false. Due to the ambiguous notation, it is impossible to tell which concept of infinity is being used, and consequently which rules apply.</p>
<h3>Who cares?</h3>
<p>OK… but why would anyone care that there are two different notions of infinity? It is easy to get the impression that the discussion is just an intellectual exercise with no practical implications.</p>
<p>On the contrary, rigorous understanding of the two kinds of infinity has been very important. After properly understanding the first kind of infinity, Isaac Newton was able to develop calculus, followed by the theory of gravity. And, the second kind of infinity was a pre-requisite for Alan Turing to define computability (see my article on <a href="http://igoro.com/archive/numbers-that-cannot-be-computed/">Numbers that cannot be computed</a>) and Kurt Gödel to prove Gödel’s Incompleteness Theorem.</p>
<p>So, understanding both kinds of infinity has lead to important insights and practical advancements.</p>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/is-two-to-the-power-of-infinity-more-than-infinity/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Why computers represent signed integers using two&#8217;s complement</title>
		<link>http://igoro.com/archive/why-computers-represent-signed-integers-using-twos-complement/</link>
		<comments>http://igoro.com/archive/why-computers-represent-signed-integers-using-twos-complement/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 08:29:44 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=682</guid>
		<description><![CDATA[If you had to come up with a way to represent signed integers in 32-bits, how would you do it? One simple solution would be to use one bit to represent the sign, and the remaining 31 bits to represent the absolute value of the number. But as many intuitive solutions, this one is not [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>If you had to come up with a way to represent signed integers in 32-bits, how would you do it?</p>
<p>One simple solution would be to use one bit to represent the sign, and the remaining 31 bits to represent the absolute value of the number. But as many intuitive solutions, this one is not very good. One problem is that adding and multiplying these integers would be somewhat tricky, because there are four cases to handle due to signs of the inputs. Another problem is that zero can be represented in two ways: as positive zero and as negative zero.</p>
<p>Computers use a more elegant representation of signed integers that is obviously “correct” as soon as you understand how it works.</p>
<h3>Clock arithmetic</h3>
<p>Before we get to the signed integer representation, we need to briefly talk about “clock arithmetic”.</p>
<p>Clock arithmetic is a bit different from ordinary arithmetic. On a 12-hour clock, moving 2 hours forward is equivalent to moving 14 hours forward or 10 hours backward:</p>
<p>&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image.png" width="41" height="18" /> + 2 hours = <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; margin-left: 0px; vertical-align: middle; border-left-width: 0px; margin-right: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image1.png" width="43" height="21" /> </p>
<p>&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image.png" width="41" height="18" /> + 14 hours = <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; margin-left: 0px; vertical-align: middle; border-left-width: 0px; margin-right: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image1.png" width="43" height="21" /></p>
<p>&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image.png" width="41" height="18" /> – 10 hours = <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; margin-left: 0px; vertical-align: middle; border-left-width: 0px; margin-right: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image1.png" width="43" height="21" /></p>
<p>In the “clock arithmetic”, 2,&#160; 14 and –10 are just three different ways to write down the same number. </p>
<p>They are interchangeable in multiplications too:</p>
<p>&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image2.png" width="47" height="17" /> + (3 * 2) hours = <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image3.png" width="41" height="17" /></p>
<p>&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image2.png" width="47" height="17" /> + (3 * 14) hours = <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image3.png" width="41" height="17" /></p>
<p>&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image2.png" width="47" height="17" /> + (3 * -10) hours = <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; vertical-align: middle; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image3.png" width="41" height="17" /></p>
<p>A more formal term for “clock arithmetic” is “modular arithmetic”. In modular arithmetic, two numbers are equivalent if they leave the same non-negative remainder when divided by a particular number. Numbers 2, 14 and –10 all leave remainder of 2 when divided by 12, and so they are all “equivalent”. In math terminology, numbers 2, 14 and –10 are <em>congruent modulo</em> <em>12</em>.</p>
<h3>Fixed-width binary arithmetic</h3>
</p>
<p>Internally, processors represent integers using a fixed number of bits: say 32 or 64. And, additions, subtractions and multiplications of these integers are based on modular arithmetic.</p>
<p>As a simplified example, let’s consider 3-bit integers, which can represent integers from 0 to 7. If you add or multiply two of these 3-bit numbers in fixed-width binary arithmetic, you’ll get the “modular arithmetic” answer:</p>
<p>&#160;&#160; 1 + 2 –&gt; 3</p>
<p>&#160;&#160; 4 + 5 -&gt; 1</p>
<p>The calculations wrap around, because any answer larger than 7 cannot be represented with 3 bits. The wrapped-around answer is still meaningful:</p>
<ul>
<li><strong>The answer we got is congruent (i.e., equivalent) to the real answer, modulo 8        <br /></strong>This is the modular arithmetic! The real answer was 9, but we got 1. And, both 9 and 1 leave remainder 1 when divided by 8.       </li>
<li><strong>The answer we got represents the lowest 3 bits of the correct answer        <br /></strong>For 4+5, we got 001, while the correct answer is 1001. </li>
</ul>
<p>One good way to think about this is to imagine the infinite number line:</p>
<p>&#160;<img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image4.png" width="541" height="49" /> </p>
<p>Then, curl the number line into a circle so that 1000 overlaps the 000:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image5.png" width="241" height="241" /> </p>
<p>In 3-bit arithmetic, additions, subtractions and multiplications work just they do in ordinary arithmetic, except that they move along a closed ring of 8 numbers, rather than along the infinite number line. Otherwise, mostly the same rules apply: 0+a=a, 1*a=a, a+b=b+a, a*(b+c)=a*b+a*c, and so on.</p>
<p>Additions and multiplications modulo a power of two are convenient to implement in hardware. An adder that computes c = a + b for 3-bit integers can be implemented like this:</p>
<pre>c0 = (a0 XOR b0)
c1 = (a1 XOR b1) XOR (a0 AND b0)
c2 = (a2 XOR b2) XOR ((a1 AND b1) OR ((a1 XOR b1) AND (a0 AND b0))) </pre>
<h3>Binary arithmetic and signs</h3>
<p>Now comes the cool part: the adder for unsigned integers can be used for signed integers too, exactly as it is! Similarly, a multiplier for unsigned integers works for signed integers too. (Division needs to be handled separately, though.)</p>
<p>Recall that the adder I showed works in modular arithmetic. The adder represents integers as 3-bit values from 000 to 111. You can interpret those eight values as signed or unsigned:</p>
<table border="1" cellspacing="0" cellpadding="2" width="170">
<tbody>
<tr>
<td valign="top" width="51"><strong>Binary</strong></td>
<td valign="top" width="52"><strong>Unsigned value</strong></td>
<td valign="top" width="65"><strong>Signed value</strong></td>
</tr>
<tr>
<td valign="top" width="50">000</td>
<td valign="top" width="52">0</td>
<td valign="top" width="66">0</td>
</tr>
<tr>
<td valign="top" width="50">001</td>
<td valign="top" width="52">1</td>
<td valign="top" width="66">1</td>
</tr>
<tr>
<td valign="top" width="50">010</td>
<td valign="top" width="52">2</td>
<td valign="top" width="66">2</td>
</tr>
<tr>
<td valign="top" width="50">011</td>
<td valign="top" width="52">3</td>
<td valign="top" width="66">3</td>
</tr>
<tr>
<td valign="top" width="50">100</td>
<td valign="top" width="52">4</td>
<td valign="top" width="66">-4</td>
</tr>
<tr>
<td valign="top" width="50">101</td>
<td valign="top" width="52">5</td>
<td valign="top" width="66">-3</td>
</tr>
<tr>
<td valign="top" width="50">110</td>
<td valign="top" width="52">6</td>
<td valign="top" width="66">-2</td>
</tr>
<tr>
<td valign="top" width="50">111</td>
<td valign="top" width="52">7</td>
<td valign="top" width="67">-1</td>
</tr>
</tbody>
</table>
<p>Notice that the signed value and the unsigned value are congruent modulo 8, and so equivalent as far as the adder is concerned. For example, 101 means either 5 or –3. On a ring of size 8, moving 5 numbers forward is equivalent to moving 3 numbers backwards.</p>
<p>The 3-bit adder and multiplier work in the 3-bit binary arithmetic, not in ‘actual’ integers. It is up to you whether you <strong>interpret</strong> their inputs and outputs as unsigned integers (the left ring), or as signed integers (the right ring):</p>
<p><img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/09/image6.png" width="241" height="241" />&#160;&#160;&#160;&#160;&#160; <img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/08/image7.png" width="241" height="241" /></p>
<p>(Of course, you could also decide that the eight values should represent say [0, 1, –6, –5, –4, –3, –2, -1].)</p>
<p>Let’s take a look at an example. In unsigned 3-bit integers, we can compute the following:</p>
<p>&#160;&#160; 6 + 7 –&gt; 5</p>
<p>In signed 3-bit integers, the computation comes out like this:</p>
<p>&#160;&#160; (-2) + (-1) –&gt; -3</p>
<p>In 3-bit arithmetic, the computation is the same in both the signed and the unsigned case:</p>
<p>&#160;&#160; 110 + 111 –&gt; 101</p>
<h3>Two’s complement</h3>
<p>The representation of signed integers that I showed above is the representation used by modern processors. It is called “two’s complement” because to negate an integer, you subtract it from 2<sup>N</sup>. For example, to get the representation of –2 in 3-bit arithmetic, you can compute 8 – 2 = 6, and so –2 is represented in two’s complement as 6 in binary: 110.</p>
<p>This is another way to compute two’s complement, which is easier to imagine implemented in hardware:</p>
<ol>
<li>Start with a binary representation of the number you need to negate </li>
<li>Flip all bits </li>
<li>Add one </li>
</ol>
<p>The reason why this works is that flipping bits is equivalent to subtracting the number from (2<sup>N</sup> – 1). We actually need to subtract the number from 2<sup>N</sup>, and step 3 compensates for that – 1.</p>
<h3>Modern processors and two’s complement</h3>
<p>Today’s processors represent signed integers using two’s complement.</p>
<p>To see that, you can compare the x86 assembly emitted by a compiler for signed and unsigned integers. Here is a C# program that multiplies two signed integers:</p>
<pre class="code">    <span style="color: blue">int </span>a = <span style="color: blue">int</span>.Parse(<span style="color: #2b91af">Console</span>.ReadLine());
    <span style="color: blue">int </span>b = <span style="color: blue">int</span>.Parse(<span style="color: #2b91af">Console</span>.ReadLine());
    <span style="color: #2b91af">Console</span>.WriteLine(a * b);</pre>
<p>The JIT compiler will use the IMUL instruction to compute a * b:</p>
</p>
<pre class="code">    <font color="#808080">0000004f&#160; call&#160;&#160;&#160;&#160;&#160;&#160;&#160; 79084EA0 
    00000054&#160; mov&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; ecx,eax 
    </font><strong>00000056&#160; imul&#160;&#160;&#160;&#160;&#160;&#160;&#160; esi,edi</strong>
    <font color="#808080">00000059&#160; mov&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; edx,esi 
    0000005b&#160; mov&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; eax,dword ptr [ecx]</font> </pre>
<p>For comparison, I can change the program to use unsigned integers (uint):</p>
<pre class="code">    <span style="color: blue">uint </span>a = <span style="color: blue">uint</span>.Parse(<span style="color: #2b91af">Console</span>.ReadLine());
    <span style="color: blue">uint </span>b = <span style="color: blue">uint</span>.Parse(<span style="color: #2b91af">Console</span>.ReadLine());
    <span style="color: #2b91af">Console</span>.WriteLine(a * b);</pre>
</p>
<p>The JIT compiler will <strong>still </strong>use the IMUL instruction to compute a * b:</p>
</p>
<pre class="code">    <font color="#808080">0000004f&#160; call&#160;&#160;&#160;&#160;&#160;&#160;&#160; 79084EA0 
    00000054&#160; mov&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; ecx,eax 
    </font><strong>00000056&#160; imul&#160;&#160;&#160;&#160;&#160;&#160;&#160; esi,edi</strong>
    <font color="#808080">00000059&#160; mov&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; edx,esi 
    0000005b&#160; mov&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; eax,dword ptr [ecx]</font> </pre>
<p>The IMUL instruction does not know whether its arguments are signed or unsigned, and it can still multiply them correctly!</p>
<p>If you do look up IMUL instruction (say in the <a href="http://www3.intel.com/Assets/PDF/manual/253666.pdf">Intel Reference Manual</a>), you’ll find that IMUL is the instruction for signed multiplication. And, there is another instruction for unsigned multiplication, MUL. How can there be separate instructions for signed and unsigned multiplications, now that I spent so much time arguing that the two kinds of multiplications are the same?</p>
<p>It turns out that there is actually a small difference between signed and unsigned multiplication: detection of overflow. In 3-bit arithmetic, the multiplication –1 * –2 -&gt; 2 produces the correct answer. But, the equivalent unsigned multiplication 7 * 6 –&gt; 2 produces an answer that is only correct in modular arithmetic, not “actually” correct.</p>
<p>So, MUL and IMUL behave the same way, except for their effect on the overflow processor flag. If your program does not check for overflow (and C# does not, by default) the instructions can be used interchangeably.</p>
<h3>Wrapping up</h3>
<p>Hopefully this article helps you understand how are signed integers represented inside the computer. With this knowledge under your belt, various behaviors of integers should be easier to understand.</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/">Fast and slow if-statements: branch prediction in modern processors</a></p>
<p><a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">Gallery of processor cache effects</a></p>
<p><a href="http://igoro.com/archive/how-gpu-came-to-be-used-for-general-computation/">How GPU came to be used for general computation</a></p>
<p><a href="http://igoro.com/archive/self-printing-game-of-life-in-c/">Self-printing Game of Life in C#</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/why-computers-represent-signed-integers-using-twos-complement/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Graphs, trees, and origins of humanity</title>
		<link>http://igoro.com/archive/graphs-trees-and-origins-of-humanity/</link>
		<comments>http://igoro.com/archive/graphs-trees-and-origins-of-humanity/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 05:44:18 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Cool Stuff]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=621</guid>
		<description><![CDATA[Imagine that 90,000 years ago, every man alive at the time picked a different last name. Assuming that last names are inherited from father to son, how many different last names do you think there would be today? It turns out that there would be only one last name! Similarly, imagine that 200,000 years ago, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Imagine that 90,000 years ago, every man alive at the time picked a different last name. Assuming that last names are inherited from father to son, how many different last names do you think there would be today?</p>
<p>It turns out that there would be <strong>only one </strong>last name!</p>
<p>Similarly, imagine that 200,000 years ago, every woman alive picked a different secret word, and told the secret word to her daughters. And, the female descendants would follow this tradition – a mother would always pass her secret word to her daughters.</p>
<p>As you might guess, today there would be <strong>only one </strong>secret word in circulation.</p>
<p>The man whose last name all men would carry is called “Y-chromosomal Adam” and the woman whose secret word all women would know is “Mitochondrial Eve”. The names come from the fact that Y chromosome is a piece of DNA inherited from father to son (just like last names), and mitochondrial DNA is inherited from mother to children (just like the hypothetical secret words).</p>
<h3>Why the convergence?</h3>
<p>So, how likely is it that one last name and one secret word will eventually come to dominate? Given enough time, it is virtually guaranteed, under some assumptions (e.g., the population does not become separated).</p>
<p>Here is a simulation of how last names of five men could flow through six generations:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/07/image5.png" width="305" height="265" /></p>
<p>After six generations, the last name shown as green is the only last name around, purely by chance! In biology, this random effect is called <a href="http://en.wikipedia.org/wiki/Genetic_drift">genetic drift</a>.</p>
<p>And, the convergence does not only happen for small populations. Here are the numbers that I got by simulating different population sizes:</p>
<table border="1" cellspacing="0" cellpadding="2" width="210">
<tbody>
<tr>
<td valign="top" width="78"><strong>Population</strong></td>
<td valign="top" width="130"><strong>Generations to convergence</strong></td>
</tr>
<tr>
<td valign="top" width="78">10</td>
<td valign="top" width="130">23</td>
</tr>
<tr>
<td valign="top" width="78">50</td>
<td valign="top" width="130">73</td>
</tr>
<tr>
<td valign="top" width="78">100</td>
<td valign="top" width="130">239</td>
</tr>
<tr>
<td valign="top" width="78">500</td>
<td valign="top" width="130">891</td>
</tr>
<tr>
<td valign="top" width="78">1000</td>
<td valign="top" width="130">1395</td>
</tr>
<tr>
<td valign="top" width="78">5000</td>
<td valign="top" width="130">7312</td>
</tr>
<tr>
<td valign="top" width="78">10000</td>
<td valign="top" width="130">13491</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<p>Here is the implementation of this simulation:</p>
<pre class="code"><span style="color: blue">static int </span>MitochondrialEve(<span style="color: blue">int </span>populationSize)
{
    <span style="color: #2b91af">Random </span>random = <span style="color: blue">new </span><span style="color: #2b91af">Random</span>();

    <span style="color: blue">int </span>generations = 0;
    <span style="color: blue">int</span>[] cur = <span style="color: blue">new int</span>[populationSize];
    <span style="color: blue">for </span>(<span style="color: blue">int </span>i = 0; i &lt; populationSize; i++) cur[i] = i;

    <span style="color: blue">for </span>( ; cur.Max() != cur.Min(); generations++)
    {
        <span style="color: blue">int</span>[] next = <span style="color: blue">new int</span>[populationSize];
        <span style="color: blue">for </span>(<span style="color: blue">int </span>i = 0; i &lt; next.Length; i++)
        {
            next[i] = cur[random.Next(populationSize)];
        }
        cur = next;
    }

    <span style="color: blue">return </span>generations;
}</pre>
<p>This simulation is not a sophisticated model of a human population, but it is sufficient for the purposes of illustrating genetic drift. It assumes that every man has on average one son, with a standard deviation of roughly one, and a binomial probability distribution.</p>
<p>By the way, the Y-chromosomal Adam lived roughly 60,000–90,000 years ago, while Eve lived roughly 200,000 years ago. The reason why genetic drift acted faster for men is that men have a larger variation in the number of offspring – one man can have many more children than one woman.</p>
<p>UPDATE: Based on comments at Hacker News and Reddit, some readers are dissatisfied with the assumption of a fixed population size. Of course, human population grew over the ages, but there were also periods when it shrank, sometimes a lot. For example, roughly 70,000 years ago, the human population may have dropped down to thousands of individuals (<a href="http://en.wikipedia.org/wiki/Toba_catastrophe_theory">1</a>, <a href="https://genographic.nationalgeographic.com/genographic/dawn.html">2</a>). So, a fixed population size is a reasonable simplification for my example.</p>
<h3>The roles of Mitochondrial Eve and Y-chromosomal Adam</h3>
<p>One noteworthy fact about Mitochondrial Eve and Y-chromosomal Adam is that their positions in the history are not as special as they may first appear. Let’s take a look at this in more depth.</p>
<p>Tracing from me, I can follow a path of paternal ancestry all the way to Y-chromosomal Adam:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/07/image6.png" width="703" height="51" />&#160;</p>
<p>If I only trace the male ancestry, there is exactly one path that starts at me, and that path leads to Y-chromosomal Adam. You could start the diagram at any man alive today and you’d get a similar picture, with the lineage finally reaching the Y-chromosomal Adam.</p>
<p>However, if I trace ancestry via both parents, the number of ancestors explodes. I have two parents, four grandparents, eight great grandparents, and so forth. The number of ancestors in each generation grows exponentially, although that cannot continue for long. If all ancestors in each generations were distinct, I would have more than 1 billion ancestors just 30 generations back, so the tree certainly has to start collapsing into a directed acyclic graph by then.</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/07/image7.png" width="361" height="235" /></p>
<p>Tracing ancestry through both parents, there are many paths to follow, and each generation of ancestors contains a lot of people. Some of those paths will reach Y-chromosomal Adam, but other paths will reach <strong>other men</strong> <strong>in his generation</strong>. Similarly, some paths will reach Mitochondrial Eve, but other paths will reach <strong>other women in her generation</strong>.</p>
<h3>Most recent common ancestor</h3>
<p>So, Mitochondrial Eve only has a special position with respect to the mother-daughter relationships, and Y-chromosomal Adam only with respect to the father-son relationships.</p>
<p>What if you consider all types of ancestry, father-son, father-daughter, mother-son and mother-daughter? In the resulting directed acyclic graph, neither the Y-chomosomal Adam nor the Mitochondrial Eve appear in a special position. In fact, in the combined graph, the most recent ancestor of all today’s people lived <strong>much </strong>later than Y-chromosomal Adam. The most recent ancestor is estimated to have lived roughly 15,000 to 5,000 years ago.</p>
<p>One way to visualize the relationship between Mitochondrial Eve, Y-chromosomal Adam, and the Most Recent Common Ancestor (MRCA) is to look at a small genealogy diagram with just a few people:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/07/image8.png" width="328" height="556" /></p>
<p>For the last generation consisting of just four people, this graph shows the Mitochondrial Eve, the Y-chromosomal Adam, and the most recent common ancestors (a couple in this example, but could also be one man or one woman). Adam is at the root of the blue tree, Eve is at the root of the red tree, and the most recent common ancestors are much lower in the graph.</p>
<h3>The dating of Y-Adam and M-Eve</h3>
<p>Finally, I’ll briefly give you an idea on how biologists calculate when Mitochondrial Eve and Y-chromosomal Adam lived.</p>
<p>The dating is based on DNA analysis. Changes in DNA accumulate at a certain rate that depends on various factors – region of the DNA, the species, population size, etc. To date Mitochondrial Eve, biologists calculate an estimate of the mitochondrial DNA (mtDNA) mutation rate. Then, they look at how much mtDNA varies between today’s women, and then calculate how long it would take to achieve that degree of variation.</p>
<p>Another interesting fact is that the titles of Mitochondrial Eve and Y-chromosomal Adam are not permanent, but instead are reassigned over time. For example, the woman who we call “Mitochondrial Eve” today did not hold that title during her lifetime. Instead, there was another unknown woman who was the most recent common matrilineal ancestor of all women alive at Eve’s time.</p>
<h3>Final words</h3>
<p>I hope you enjoyed the article. I originally learned about Y-chromosomal Adam and Mitochondrial Eve from reading <a href="http://www.amazon.com/dp/014303832X/">Before the Dawn</a>, and immediately knew I had to blog about them from a programmer’s perspective. If you want to read more on the topic, the Wikipedia page on <a href="http://en.wikipedia.org/wiki/Mitochondrial_eve">Mitochondrial Eve</a> is a good start.</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/">Fast and slow if statements: branch prediction in modern processors</a></p>
<p><a href="http://igoro.com/archive/human-heart-is-a-turing-machine-research-on-xbox-360-shows-wait-what/">Human heart is a Turing machine, research on XBox 360 shows. Wait, what?</a></p>
<p><a href="http://igoro.com/archive/self-printing-game-of-life-in-c/">Self-printing Game of Life in C#</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/graphs-trees-and-origins-of-humanity/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Fast and slow if-statements: branch prediction in modern processors</title>
		<link>http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/</link>
		<comments>http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/#comments</comments>
		<pubDate>Sat, 15 May 2010 21:11:32 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Cool Stuff]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=579</guid>
		<description><![CDATA[Did you know that the performance of an if-statement depends on whether its condition has a predictable pattern? If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive. [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Did you know that the performance of an if-statement depends on whether its condition has a predictable pattern? If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive. In this article, I’ll explain why today’s processors behave this way.</p>
<p>Let’s measure the performance of this loop with different conditions:</p>
<pre class="code"><span style="color: blue">for </span>(<span style="color: blue">int </span>i = 0; i &lt; max; i++) <span style="color: blue">if </span>(&lt;condition&gt;) sum++;</pre>
<p>Here are the timings of the loop with different True-False patterns:</p>
<table border="1" cellspacing="0" cellpadding="2" width="579">
<tbody>
<tr>
<td valign="top" width="277"><strong>Condition</strong></td>
<td valign="top" width="216"><strong>Pattern</strong></td>
<td valign="top" width="108"><strong>Time (ms)</strong></td>
</tr>
<tr>
<td valign="top" width="286">(i &amp; 0x80000000) == 0</td>
<td valign="top" width="216">T repeated</td>
<td valign="top" width="108">322</td>
</tr>
<tr>
<td valign="top" width="284">(i &amp; 0xffffffff) == 0</td>
<td valign="top" width="216">F repeated</td>
<td valign="top" width="108">276</td>
</tr>
<tr>
<td valign="top" width="281">(i &amp; 1) == 0</td>
<td valign="top" width="216">TF alternating</td>
<td valign="top" width="108">760</td>
</tr>
<tr>
<td valign="top" width="279">(i &amp; 3) == 0</td>
<td valign="top" width="216">TFFFTFFF…</td>
<td valign="top" width="108">513</td>
</tr>
<tr>
<td valign="top" width="277">(i &amp; 2) == 0</td>
<td valign="top" width="216">TTFFTTFF…</td>
<td valign="top" width="108">1675</td>
</tr>
<tr>
<td valign="top" width="276">(i &amp; 4) == 0</td>
<td valign="top" width="216">TTTTFFFFTTTTFFFF…</td>
<td valign="top" width="108">1275</td>
</tr>
<tr>
<td valign="top" width="276">(i &amp; <span>8</span>) == 0</td>
<td valign="top" width="216">8T 8F 8T 8F …</td>
<td valign="top" width="108">752</td>
</tr>
<tr>
<td valign="top" width="275">(i &amp; 16) == 0</td>
<td valign="top" width="216">16T 16F 16T 16F …</td>
<td valign="top" width="108">490</td>
</tr>
</tbody>
</table>
<p>A “bad” true-false pattern can make an if-statement up to <strong>six times slower </strong>than a “good” pattern! Of course, which pattern is good and which is bad depends on the exact instructions generated by the compiler and on the specific processor.</p>
<h3>Let’s look at the processor counters</h3>
<p>One way to understand how the processor used its time is to look at the hardware counters. To help with performance tuning, modern processors track various counters as they execute code: the number of instructions executed, the number of various types of memory accesses, the number of branches encountered, and so forth. To read the counters, you’ll need a tool such as <a href="http://msdn.microsoft.com/en-us/library/bb385772.aspx">the profiler in Visual Studio 2010 Premium or Ultimate</a>, <a href="http://developer.amd.com/cpu/CodeAnalyst/codeanalystwindows/Pages/default.aspx">AMD Code Analyst</a> or <a href="http://software.intel.com/en-us/intel-vtune/">Intel VTune</a>.</p>
<p>To verify that the slowdowns we observed were really due to the if-statement performance, we can look at the Mispredicted Branches counter:</p>
<p><a href="http://igoro.com/wordpress/wp-content/uploads/2010/05/image.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/05/image_thumb.png" width="484" height="563" /></a></p>
<p>The worst pattern (TTFFTTFF…) results in 774 branch mispredictions, while the good patterns only get around 10. No wonder that the bad case took the longest 1.67 seconds, while the good patterns only took around 300ms!</p>
<p>Let’s take a look at what “branch prediction” does, and why it has a major impact on the processor performance.</p>
<h3>What’s the role of branch prediction?</h3>
<p>To explain what branch prediction is and why it impacts the performance numbers, we first need to take a look at how modern processors work. To complete each instruction, the CPU goes through these (and more) stages:</p>
<p>1. <strong>Fetch</strong>: Read the next instruction.</p>
<p>2. <strong>Decode</strong>: Determine the meaning of the instruction.</p>
<p>3. <strong>Execute</strong>: Perform the real ‘work’ of the instruction.</p>
<p>4. <strong>Write-back</strong>: Store results into memory.</p>
<p>An important optimization is that the stages of the pipeline can process different instructions at the same time. So, as one instruction is getting fetched, a second one is being decoded, a third is executing and the results of fourth are getting written back. Modern processors have pipelines with 10 &#8211; 31 stages (e.g., Pentium 4 Prescott has <a href="http://www.tomshardware.com/reviews/intel,751-5.html">31 stages</a>), and for optimum performance, it is very important to keep all stages as busy as possible.</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="500px-Pipeline,_4_stage_svg" border="0" alt="500px-Pipeline,_4_stage_svg" src="http://igoro.com/wordpress/wp-content/uploads/2010/05/500pxPipeline_4_stage_svg.png" width="500" height="479" />&#160; <br /><font size="1">Image from </font><a title="http://commons.wikimedia.org/wiki/File:Pipeline,_4_stage.svg" href="http://commons.wikimedia.org/wiki/File:Pipeline,_4_stage.svg"><font size="1">http://commons.wikimedia.org/wiki/File:Pipeline,_4_stage.svg</font></a></p>
<p>Branches (i.e. conditional jumps) present a difficulty for the processor pipeline. After fetching a branch instruction, the processor needs to fetch the next instruction. But, there are <strong>two possible</strong> “next” instructions! The processor won’t be sure which instruction is the next one until the branching instruction makes it to the end of the pipeline.</p>
<p>Instead of stalling the pipeline until the branching instruction is fully executed, modern processors attempt to <strong>predict</strong> whether the jump will or will not be taken. Then, the processor can fetch the instruction that it thinks is the next one. If the prediction turns out wrong, the processor will simply discard the partially executed instructions that are in the pipeline. See the Wikipedia page on <a href="http://en.wikipedia.org/wiki/Branch_prediction#Implementation">branch predictor implementation</a> for some typical techniques used by processors to collect and interpret branch statistics.</p>
<p>Modern branch predictors are good at predicting simple patterns: all true, all false, true-false alternating, and so on. But if the pattern happens to be something that throws off the branch predictor, the performance hit will be significant. Thankfully, most branches have <strong>easily predictable</strong> patterns, like the two examples highlighted below:</p>
<pre class="code"><span style="color: blue">int </span>SumArray(<span style="color: blue">int</span>[] array) {
    <span style="color: blue">if </span>(<span style="background-color: #e0e0e0">array == <span style="color: blue">null</span></span>) <span style="color: blue">throw new </span><span style="color: #2b91af">ArgumentNullException</span>(<span style="color: #a31515">&quot;array&quot;</span>);

    <span style="color: blue">int </span>sum=0;
    <span style="color: blue">for</span>(<span style="color: blue">int </span>i=0; <span style="background-color: #e0e0e0">i&lt;array.Length</span>; i++) sum += array[i];
    <span style="color: blue">return </span>sum;
}</pre>
<p>The first highlighted condition validates the input, and so the branch will be taken only very rarely. The second highlighted condition is a loop termination condition. This will also almost always go one way, unless the arrays processed are extremely short. So, in these cases – as in most cases – the processor branch prediction logic will be effective at preventing stalls in the processor pipeline.</p>
<h3>Updates and Clarifications</h3>
<p>This article got picked up by <a href="http://reddit.com/">reddit</a>, and got a fair bit of attention in the <a href="http://www.reddit.com/r/programming/comments/c7ues/fast_and_slow_ifstatements_branch_prediction_in/">reddit comments</a>. I’ll respond to the questions, comments and criticisms below.</p>
<p>First, regarding the comments that optimizing for branch prediction is generally a bad idea: I agree. I do not argue anywhere in the article that you should try to write your code to optimize for branch prediction. For the vast majority of high-level code, I can’t even imagine how you’d do that.</p>
<p>Second, there was a concern whether the executed instructions for different cases differ in something else other than the constant value. They don’t – I looked at the JIT-ted assembly. If you’d like to see the JIT-ted assembly code or the C# source code, send me an email and I’ll send them back. (I am not posting the code here because I don’t want to blow up this update.)</p>
<p>Third, another question was on the surprisingly poor performance of the TTFF* pattern. The TTFF* pattern has a short period, and as such should be an easy case for the branch prediction algorithms.</p>
<p>However, the problem is that modern processors don’t track history for each branching instruction separately. Instead, they either track global history of all branches, or they have several history slots, each potentially shared by multiple branching instructions. Or, they can use some combination of these tricks, together with other techniques.</p>
<p>So, the TTFF pattern in the if-statement may not be TTFF by the time it gets to the branch predictor. It may get interleaved with other branches (there are 2 branching instructions in the body of a for-loop), and possibly approximated in other ways too. But, I don’t claim to be an expert on what precisely each processor does, and if someone reading this has an authoritative reference to how different processors behave (esp. Intel Core2 that I tested on), please let me know in comments.</p>
<p>&#160;</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">Gallery of processor cache effects</a></p>
<p><a href="http://igoro.com/archive/how-gpu-came-to-be-used-for-general-computation/">How GPU came to be used for general computation</a></p>
<p><a href="http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/">What really happens when you navigate to a URL</a></p>
<p><a href="http://igoro.com/archive/self-printing-game-of-life-in-c/">Self-printing Game of Life in C#</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/feed/</wfw:commentRss>
		<slash:comments>35</slash:comments>
		</item>
		<item>
		<title>Use C# dynamic typing to conveniently access internals of an object</title>
		<link>http://igoro.com/archive/use-c-dynamic-typing-to-conveniently-access-internals-of-an-object/</link>
		<comments>http://igoro.com/archive/use-c-dynamic-typing-to-conveniently-access-internals-of-an-object/#comments</comments>
		<pubDate>Fri, 02 Apr 2010 08:47:28 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=556</guid>
		<description><![CDATA[Sometimes you need to access private fields and call private methods on an object – for testing, experimentation, or to work around issues in third-party libraries. .NET has long provided a solution to this problem: reflection. Reflection allows you to call private methods and read or write private fields from outside of the class, but [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Sometimes you need to access private fields and call private methods on an object – for testing, experimentation, or to work around issues in third-party libraries.</p>
<p>.NET has long provided a solution to this problem: reflection. Reflection allows you to call private methods and read or write private fields from outside of the class, but is verbose and messy to write. In C# 4, this problem can be solved in a neat way using dynamic typing.</p>
<p>As a simple example, I’ll show you how to access internals of the List&lt;T&gt; class from the BCL standard library. Let’s create an instance of the List&lt;int&gt; type:</p>
<pre class="code"><span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; realList = <span style="color: blue">new </span><span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt;();</pre>
<p>To access the internals of List&lt;int&gt;, we’ll wrap it with ExposedObject – a type I’ll show you how to implement:</p>
<pre class="code"><span style="color: blue">dynamic </span>exposedList = <span style="color: #2b91af">ExposedObject</span>.From(realList);</pre>
<p>And now, via the exposedList object, we can access the private fields and methods of the List&lt;&gt; class:</p>
<pre class="code"><span style="color: green">// Read a private field - prints 0
</span><span style="color: #2b91af">Console</span>.WriteLine(exposedList._size);

<span style="color: green">// Modify a private field
</span>exposedList._items = <span style="color: blue">new int</span>[] { 5, 4, 3, 2, 1 };

<span style="color: green">// Modify another private field
</span>exposedList._size = 5;

<span style="color: green">// Call a private method
</span>exposedList.EnsureCapacity(20);</pre>
<p>Of course, _size, _items and EnsureCapacity() are all private members of the List&lt;T&gt; class! In addition to the private members, you can still access the public members of the exposed list:</p>
<pre class="code"><span style="color: green">// Add a value to the list
</span>exposedList.Add(0);

<span style="color: green">// Enumerate the list. Prints &quot;5 4 3 2 1 0&quot;
</span><span style="color: blue">foreach </span>(<span style="color: blue">var </span>x <span style="color: blue">in </span>exposedList) <span style="color: #2b91af">Console</span>.WriteLine(x);</pre>
<p>Pretty cool, isn’t it?</p>
<h3>How does ExposedObject work?</h3>
<p>The example I showed uses ExposedObject to access private fields and methods of the List&lt;T&gt; class. ExposedObject is a type I implemented using the dynamic typing feature in C# 4.</p>
<p>To create a dynamically-typed object in C#, you need to implement a class derived from DynamicObject. The derived class will implement a couple of methods whose role is to decide what to do whenever a method gets called or a property gets accessed on your dynamic object.</p>
<p>Here is a dynamic type that will print the name of any method you call on it:</p>
<pre class="code"><span style="color: blue">class </span><span style="color: #2b91af">PrintingObject </span>: <span style="color: #2b91af">DynamicObject </span>
{
    <span style="color: blue">public override bool </span>TryInvokeMember(
            <span style="color: #2b91af">InvokeMemberBinder </span>binder, <span style="color: blue">object</span>[] args, <span style="color: blue">out object </span>result)
    {
        <span style="color: #2b91af">Console</span>.WriteLine(binder.Name); <span style="color: green">// Print the name of the called method
        </span>result = <span style="color: blue">null</span>;
        <span style="color: blue">return true</span>;
    }
}</pre>
<p>Let’s test it out. This program prints “HelloWorld” to the console:</p>
<pre class="code"><span style="color: blue">class </span><span style="color: #2b91af">Program </span>
{
    <span style="color: blue">static void </span>Main(<span style="color: blue">string</span>[] args)
    {
        <span style="color: blue">dynamic </span>printing = <span style="color: blue">new </span><span style="color: #2b91af">PrintingObject</span>();
        printing.HelloWorld();<span style="color: green">
        </span><span style="color: blue">return</span>;
    }
}</pre>
<p>There is only a small step from PrintingObject to ExposedObject – instead of printing the name of the method, ExposedObject will find the appropriate method and invoke it via reflection. A simple version of the code looks like this:</p>
<pre class="code"><span style="color: blue">class </span><span style="color: #2b91af">ExposedObjectSimple </span>: <span style="color: #2b91af">DynamicObject </span>
{
    <span style="color: blue">private object </span>m_object;

    <span style="color: blue">public </span>ExposedObjectSimple(<span style="color: blue">object </span>obj)
    {
        m_object = obj;
    }

    <span style="color: blue">public override bool </span>TryInvokeMember(
            <span style="color: #2b91af">InvokeMemberBinder </span>binder, <span style="color: blue">object</span>[] args, <span style="color: blue">out object </span>result)
    {
        <span style="color: green">// Find the called method using reflection
        </span><span style="color: blue">var </span>methodInfo = m_object.GetType().GetMethod(
            binder.Name,
            <span style="color: #2b91af">BindingFlags</span>.NonPublic | <span style="color: #2b91af">BindingFlags</span>.Public | <span style="color: #2b91af">BindingFlags</span>.Instance);

        <span style="color: green">// Call the method
        </span>result = methodInfo.Invoke(m_object, args);
        <span style="color: blue">return true</span>;
    }
}</pre>
<p>This is all you need in order to implement a simple version of ExposedObject.</p>
<h3>What else can you do with ExposedObject?</h3>
<p>The ExposedObjectSimple implementation above illustrates the concept, but it is a bit naive – it does not handle field accesses, multiple methods with the same name but different argument lists, generic methods, static methods, and so on. I implemented a more complete version of the idea and published it as a CodePlex project: <a title="http://exposedobject.codeplex.com/" href="http://exposedobject.codeplex.com/">http://exposedobject.codeplex.com/</a></p>
<p>You can call static methods by creating an ExposedClass over the class. For example, let’s call the private File.InternalExists() static method:</p>
<pre class="code"><span style="color: blue">dynamic </span>fileType = <span style="color: #2b91af">ExposedClass</span>.From(<span style="color: blue">typeof</span>(System.IO.<span style="color: #2b91af">File</span>));
<span style="color: blue">bool </span>exists = fileType.InternalExists(<span style="color: #a31515">&quot;somefile.txt&quot;</span>);</pre>
<p>You can call generic methods (static or non-static) too:</p>
<pre class="code"><span style="color: blue">dynamic </span>enumerableType = <span style="color: #2b91af">ExposedClass</span>.From(<span style="color: blue">typeof</span>(System.Linq.<span style="color: #2b91af">Enumerable</span>));
<span style="color: #2b91af">Console</span>.WriteLine(
    enumerableType.Max&lt;<span style="color: blue">int</span>&gt;(<span style="color: blue">new</span>[] { 1, 3, 5, 3, 1 }));</pre>
<p>Type inference for generics works too. You don’t have to specify the int generic argument in the Max method call:</p>
<pre class="code"><span style="color: blue">dynamic </span>enumerableType = <span style="color: #2b91af">ExposedClass</span>.From(<span style="color: blue">typeof</span>(System.Linq.<span style="color: #2b91af">Enumerable</span>));
<span style="color: #2b91af">Console</span>.WriteLine(
    enumerableType.Max(<span style="color: blue">new</span>[] { 1, 3, 5, 3, 1 }));</pre>
<p>And to clarify, all of these different cases have to be explicitly handled. Deciding which method should be called is normally done by the <strong>compiler</strong>. The logic on overload resolution is hidden away in the compiler, and so unavailable at runtime. To duplicate the method binding rules, we have to duplicate the logic of the compiler.</p>
<p>You can also cast a dynamic object either to its own type, or to a base class:</p>
<pre class="code"><span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; realList = <span style="color: blue">new </span><span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt;();
<span style="color: blue">dynamic </span>exposedList = <span style="color: #2b91af">ExposedObject</span>.From(realList);

<span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; realList2 = exposedList;
<span style="color: #2b91af">Console</span>.WriteLine(realList == realList2); <span style="color: green">// Prints &quot;true&quot;
</span></pre>
<p>So, after casting the exposed object (or assigning it into an appropriately typed variable), you can get back the original object!</p>
<h3>Updates</h3>
<p>Based on some of the responses the article is getting, I should clarify: I am definitely not suggesting that you use ExposedObject regularly in your production code! That would be a terrible idea.</p>
<p>As I said in the article, ExposedObject can be useful for testing, experimentation, and rare hacks. Also, it is an interesting example of how dynamic typing works in C#.</p>
<p>Also, apparently <a href="http://bugsquash.blogspot.com/">Mauricio Scheffer</a> came up with <a href="http://bugsquash.blogspot.com/2009/05/testing-private-methods-with-c-40.html">the same idea</a> almost a year before me.</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">Gallery of processor cache effects</a></p>
<p><a href="http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/">What really happens when you navigate to a URL</a></p>
<p><a href="http://igoro.com/archive/human-heart-is-a-turing-machine-research-on-xbox-360-shows-wait-what/">Human heart is a Turing machine, research on XBox 360 shows. Wait, what?</a></p>
<p><a href="http://igoro.com/archive/skip-lists-are-fascinating/">Skip lists are fascinating!</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/use-c-dynamic-typing-to-conveniently-access-internals-of-an-object/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>How GPU came to be used for general computation</title>
		<link>http://igoro.com/archive/how-gpu-came-to-be-used-for-general-computation/</link>
		<comments>http://igoro.com/archive/how-gpu-came-to-be-used-for-general-computation/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 09:30:56 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Cool Stuff]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=518</guid>
		<description><![CDATA[The story of how GPU came to be used for high-performance computation is pretty cool. Hardware heavily optimized for graphics turned out to be useful for another use: certain types of high-performance computations. In this article, I will explore how and why this happened, and summarize the state of general computation on GPUs today. Programmable [&#8230;]]]></description>
				<content:encoded><![CDATA[</p>
<p>The story of how GPU came to be used for high-performance computation is pretty cool. Hardware heavily optimized for graphics turned out to be useful for another use: certain types of high-performance computations. In this article, I will explore how and why this happened, and summarize the state of general computation on GPUs today.</p>
<h3>Programmable graphics</h3>
<p>The first step towards computation on the GPU was introduction of programmable shaders. Both DirectX and OpenGL added support for programmable shaders roughly a decade ago, giving game designers more freedom to create custom graphics effects. Instead of just composing pre-canned effects, graphic artists can now write little programs that execute <strong>directly on the GPU</strong>. As of DirectX 8, they can specify two types of shader programs for every object in the scene: a vertex shader and a pixel shader.</p>
<p>A <strong>vertex shader</strong> is a function invoked on every vertex in the 3D object. The function transforms the vertex and returns its position relative to the camera view. By transforming vertices, vertex shaders help implement realistic skin, clothes, facial expressions, and similar effects.</p>
<p>A <strong>pixel shader</strong> is a function invoked on every pixel covered by a particular object and returns the color of the pixel. To compute the output color, the pixel shader can use a variety of optional inputs: XY-position on the screen, XYZ-position in the scene, position in the texture, the direction of the surface (i.e., the normal vector), etc. Pixel shader can also read textures, bump maps, and other inputs.</p>
<p>Here is a simple scene, rendered with six different pixel shaders applied to the teapot:</p>
<p>&#160;<a href="http://igoro.com/wordpress/wp-content/uploads/2010/03/image5.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image_thumb.png" width="402" height="279" /></a> </p>
<p><strong>A</strong> always returns the same color. <strong>B</strong> varies the color based on the screen Y-coordinate. <strong>C</strong> sets the color depending on the XYZ screen coordinates. <strong>D</strong> sets the color proportionally to the cosine of the angle between the surface normal and the light direction (“diffuse lighting”). <strong>E</strong> uses a slightly more complex lighting model and a texture, and <strong>F</strong> also adds a bump map.</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>If you are curious how lighting shaders are implemented, check out GamaSutra’s <a href="http://www.gamasutra.com/features/20030418/engel_01.shtml">Implementing Lighting Models With HLSL</a>.</p>
</p></div>
<h3>Realization: shaders can be used for computation!</h3>
<p>Let’s take a look at a simple pixel shader that just blurs a texture. This shader is implemented in HLSL (a DirectX shader language):</p>
<pre class="code">float4 ps_main( float2 t: TEXCOORD0 ) : COLOR0 
{ 
   float dx = 2/fViewportWidth; 
   float dy = 2/fViewportHeight; 
   return 
      0.2 * tex2D( baseMap, t ) + 
      0.2 * tex2D( baseMap, t + float2(dx, 0) ) + 
      0.2 * tex2D( baseMap, t + float2(-dx, 0) ) +      
      0.2 * tex2D( baseMap, t + float2(0, dy) ) + 
      0.2 * tex2D( baseMap, t + float2(0, -dy) ); 
}</pre>
<p>The texture blur has this effect:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="blur" border="0" alt="blur" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/blur.jpg" width="350" height="170" />&#160;</p>
<p>This is not exactly a breath-taking effect, but the interesting part is that simulations of <strong>car crashes</strong>, <strong>wind tunnels</strong> and <strong>weather patterns</strong> all follow this basic pattern of computation! All of these simulations are computations on a grid of points. Each point has one or more quantities associated with it: temperature, pressure, velocity, force, air flow, etc. In each iteration of the simulation, the <strong>neighboring points interact</strong>: temperatures and pressures are equalized, forces are transferred, grid is deformed, and so forth. Mathematically, the programs that run these simulations are partial differential equation (<strong>PDE</strong>) solvers.</p>
<p>As a trivial example, here is a simple simulation of <strong>heat dissipation</strong>. In each iteration, the temperature of each grid point is recomputed as an average over its nearest neighbors:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="temperature" border="0" alt="temperature" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/temperature.png" width="597" height="122" /> </p>
<p>It is hard to overlook the fact that an iteration of this simulation is nearly identical to the blur operation. Hardware highly optimized for running pixel shaders will be able to run this simulation very fast. And, after years of refinement and challenges from latest and greatest games, GPUs became very efficient at using massive parallelism to execute shaders blazingly fast.</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>One cool example of a PDE solver is a <a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch30.html">liquid and smoke simulator</a>. The structure of the simulation is similar to my trivial heat dissipation example, but instead of tracking the temperature of each grid point, the smoke simulator tracks pressure and velocity. Just as in the heat dissipation example, a grid point is affected by all of its nearest neighbors in each iteration.</p>
<p>&#160;</p>
<p><object width="480" height="385"><param name="movie" value="http://www.youtube.com/v/DER5xp29z6w&amp;hl=en_US&amp;fs=1&amp;rel=0"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/DER5xp29z6w&amp;hl=en_US&amp;fs=1&amp;rel=0" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"></embed></object></p>
<p>This simulation was developed by <a href="http://www.cs.caltech.edu/~keenan/">Keenan Crane</a>.</div>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>For a view into general computation on GPUs in 2004 when hacked-up pixel shaders were the state of the art, see the <a href="http://http.developer.nvidia.com/GPUGems2/gpugems2_part04.html">General-Purpose Computation on GPUs</a> section of GPU Gems 2.</p>
</div>
<h3>Arrival of GPGPU</h3>
<p>Once GPUs have shown themselves to be a good fit for certain types of high-performance computations (like PDEs), GPU manufacturers moved to make GPUs more attractive for general-purpose computing. The idea of General Purpose computation on a GPU (“GPGPU”) became a hot topic in high-performance computing.</p>
<p>GPGPU computing is based around <strong>compute kernels</strong>. A compute kernel is a generalization of a pixel shader:</p>
<ul>
<li>Like a pixel shader, a compute kernel is a routine that will be invoked on each point in the <strong>input space</strong>. </li>
<li>A pixel shader always operates on two-dimensional space. A compute kernel can work on space of <strong>any dimensionality</strong>. </li>
<li>A pixel shader returns a single color. A compute kernel can write an <strong>arbitrary number of outputs</strong>. </li>
<li>A pixel shader operates on 32-bit floating-point numbers. A compute kernel also supports <strong>64-bit floating-point</strong> numbers and <strong>integers</strong>. </li>
<li>A pixel shader reads from textures. A compute kernel can read from any place in <strong>GPU memory</strong>. </li>
<li>Additionally, compute kernels running on the same core can share data via an explicitly managed <strong>per-core cache</strong>. </li>
</ul>
<h3>Comparison of a GPU and a CPU</h3>
<p>The control flow of a modern application is typically very complicated. Just think about all the <strong>different tasks</strong> that must be completed to show this article in your browser. To display this blog article, the CPU has to communicate with various devices, maintain thousands of data structures, position thousands of UI elements, parse perhaps a hundred file formats, … that does not even begin to scratch the surface. And, not only are all of these tasks different, they also depend on each other in very complex ways.</p>
<p>Compare that with the control flow of a pixel shader. A pixel shader is a <strong>single routine</strong> that needs to be invoked roughly a million times, each time on a different input. Different shader invocations are pretty much independent and once all are done, the GPU starts over again with a scene where objects have moved a bit.</p>
<p>It shouldn’t come as a surprise that hardware optimized for running a pixel shader will be <strong>quite different</strong> from hardware optimized for tasks like viewing web pages. A CPU greatly benefits from a sophisticated execution pipeline with multi-level caches, instruction reordering, prefetching, branch prediction, and other clever optimizations. A GPU does not need most of those complex features for its much simpler control flow. Instead, a GPU benefits from lots of Arithmetic Logic Units (ALUs) to add, multiply and divide floating point numbers in parallel.</p>
<p>This table shows the most important differences between a CPU and a GPU today:</p>
<table style="text-align: left" border="1" cellspacing="0" cellpadding="2" width="541">
<tbody>
<tr>
<td valign="top" width="257"><strong>CPU</strong></td>
<td valign="top" width="282"><strong>GPU</strong></td>
</tr>
<tr>
<td valign="top" width="257"><strong>2-4</strong> cores</td>
<td valign="top" width="282"><strong>16-32</strong> cores</td>
</tr>
<tr>
<td valign="top" width="257">Each core runs <strong>1-2</strong> independent threads in parallel</td>
<td valign="top" width="282">Each core runs <strong>16-32</strong> threads in parallel. All threads on a core must execute the <strong>same instruction </strong>at any time.</td>
</tr>
<tr>
<td valign="top" width="257"><strong>Automatically managed </strong>hierarchy of caches</td>
<td valign="top" width="282">Each core has 16-64kB of cache, <strong>explicitly managed </strong>by the programmer</td>
</tr>
<tr>
<td valign="top" width="257">0.1 billion floating-point operations / second (<strong>0.1 TFLOP</strong>)</td>
<td valign="top" width="282">1 billion floating-point operations / second (<strong>1 TFLOP</strong>)</td>
</tr>
<tr>
<td valign="top" width="257">Main memory throughput: 10GB / sec</td>
<td valign="top" width="282">GPU memory throughput: 100GB / sec</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<p>All of this means that if a program can be broken up into many threads all doing the same thing on different data (ideally executing arithmetic operations), a GPU will probably be able to do this an order of magnitude faster than a CPU. On the other hand, on an application with a complex control flow, CPU is going to be the one winning by orders of magnitude. Going back to my earlier example, it should be clear why a CPU will excel at running a browser and a GPU will excel at executing a pixel shader.</p>
<p>This chart illustrates how a CPU and a GPU use up their “silicon budget”. A CPU uses most of its transistors for the L1 cache and for execution control. A GPU dedicates the bulk of its transistors to Arithmetic Logic Units (ALUs). </p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image12.png" width="408" height="147" /> </p>
<p><font size="1">Adapted from </font><a href="http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf"><font size="1">NVidia’s CUDA Programming Guide</font></a><font size="1">.</font></p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>NVidia’s upcoming Fermi chip will slightly change the comparison table. Fermi introduces a per-core automatically-managed L1 cache. It will be very interesting to see what kind of impact the introduction of an L1 cache will have on the types of programs that can run on the GPU. One point is fairly clear – the penalty for register spills into main memory will be greatly reduced (this point may not make sense until you read the next section).</p>
</div>
<h3>GPGPU Programming</h3>
<p>Today, writing efficient GPGPU programs requires in-depth understanding of the hardware. There are three popular programming models:</p>
<ul>
<li>DirectCompute – Microsoft’s API for defining compute kernels, introduced in DirectX 11 </li>
<li>CUDA – NVidia’s C-based language for programming compute kernels </li>
<li>OpenCL – API originally proposed by Apple and now developed by Khronos Group </li>
</ul>
<p>Conceptually, the models are very similar. The table below summarizes some of the terminology differences between the models:</p>
<table border="1" cellspacing="0" cellpadding="2" width="379">
<tbody>
<tr>
<td valign="top" width="149"><strong>DirectCompute&#160; </strong></td>
<td valign="top" width="133"><strong>CUDA&#160; </strong></td>
<td valign="top" width="111"><strong>OpenCL</strong></td>
</tr>
<tr>
<td valign="top" width="149">thread</td>
<td valign="top" width="133">thread</td>
<td valign="top" width="111">work item</td>
</tr>
<tr>
<td valign="top" width="149">thread group</td>
<td valign="top" width="133">thread block</td>
<td valign="top" width="111">work group</td>
</tr>
<tr>
<td valign="top" width="149">group-shared memory</td>
<td valign="top" width="133">shared memory</td>
<td valign="top" width="111">local memory</td>
</tr>
<tr>
<td valign="top" width="149">warp?</td>
<td valign="top" width="133">warp</td>
<td valign="top" width="111">wavefront</td>
</tr>
<tr>
<td valign="top" width="149">barrier</td>
<td valign="top" width="133">barrier</td>
<td valign="top" width="111">barrier</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<p>Writing high-performance GPGPU code is not for the faint at heart (although the same could probably be said about any type of high-performance computing). Here are examples of some issues you need to watch out for when writing compute kernels:</p>
<ul>
<li>The program has to have <strong>plenty of threads</strong> (thousands) </li>
<li><strong>Not too many threads</strong>, though, or cores will run out of registers and will have to simulate additional registers using main GPU memory. </li>
<li>It is important that threads running on one core access main memory in such a pattern that the hardware will be <strong>coalesce</strong> <strong>the memory accesses </strong>from different threads. This optimization alone can make an order of magnitude difference. </li>
<li>… and so on. </li>
</ul>
<p>Explaining all of these performance topics in detail is well beyond the scope of this article, but hopefully this gives you an idea of what GPGPU programming is about, and what kinds of problems it can be applied to.</p>
<div style="border-bottom: black 1px solid; border-left: black 1px solid; padding-bottom: 10px; background-color: #f0f0ff; margin-top: 20px; padding-left: 10px; padding-right: 10px; margin-bottom: 20px; margin-left: 10px; border-top: black 1px solid; border-right: black 1px solid; padding-top: 10px">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">Gallery of processor cache effects</a></p>
<p><a href="http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/">What really happens when you navigate to a URL</a></p>
<p><a href="http://igoro.com/archive/human-heart-is-a-turing-machine-research-on-xbox-360-shows-wait-what/">Human heart is a Turing machine, research on XBox 360 shows. Wait, what?</a></p>
<p><a href="http://igoro.com/archive/skip-lists-are-fascinating/">Skip lists are fascinating!</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/how-gpu-came-to-be-used-for-general-computation/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Volatile keyword in C# – memory model explained</title>
		<link>http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/</link>
		<comments>http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 09:57:22 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[C#]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=468</guid>
		<description><![CDATA[The memory model is a fascinating topic – it touches on hardware, concurrency, compiler optimizations, and even math. The memory model defines what state a thread may see when it reads a memory location modified by other threads. For example, if one thread updates a regular non-volatile field, it is possible that another thread reading [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>
div.mynote { border: black 1px solid; padding: 10px; background-color: #f0f0ff; margin-top: 20px; margin-bottom: 20px; margin-left: 10px; ></style>
<p>The <strong>memory model</strong> is a fascinating topic – it touches on hardware, concurrency, compiler optimizations, and even math.</p>
<p>The memory model defines what state a thread may see when it reads a memory location modified by other threads. For example, if one thread updates a regular non-volatile field, it is possible that another thread reading the field will <strong>never </strong>observe the new value. This program never terminates (in a release build):</p>
<p>   <span id="more-468"></span>
<pre class="code"><span style="color: blue">class </span><span style="color: #2b91af">Test
</span>{
    <span style="color: blue">private </span><span style="color: blue">bool </span>_loop = <span style="color: blue">true</span>;

    <span style="color: blue">public static void </span>Main()
    {
        <span style="color: #2b91af">Test </span>test1 = <span style="color: blue">new </span><span style="color: #2b91af">Test</span>();

        <span style="color: green">// Set _loop to false on another thread
        </span><span style="color: blue">new </span><span style="color: #2b91af">Thread</span>(() =&gt; { test1._loop = <span style="color: blue">false</span>;}).Start();

        <span style="color: green">// Poll the _loop field until it is set to false
        </span><span style="color: blue">while </span>(test1._loop == <span style="color: blue">true</span>) ;

        <span style="color: green">// The loop above will never terminate!</span>
    }
}</pre>
<p>There are two possible ways to get the while loop to terminate:</p>
<ol>
<li><strong>Use a lock </strong>to protect all accesses (reads and writes) to the _loop field </li>
<li><strong>Mark the _loop field as volatile</strong> </li>
</ol>
<p>There are two reasons why a read of a non-volatile field may observe a stale value: <strong>compiler optimizations</strong> and <strong>processor optimizations</strong>.</p>
<div class="mynote">
<p>In concurrent programming, threads can get interleaved in many different ways, resulting in possibly many different outcomes. But as the example with the infinite loop shows, threads do not just get interleaved – they potentially interact in more complex ways, unless you correctly use locks and volatile fields.</p>
</div>
<h3>Compiler optimizations</h3>
<p>The first reason why a non-volatile read may return a stale value has to do with compiler optimizations. In the infinite loop example, the JIT compiler optimizes the while loop from this:</p>
<pre class="code"><span style="color: blue">while </span>(test1._loop == <span style="color: blue">true</span>) ;</pre>
<p>To this:</p>
<pre class="code"><span style="color: blue">if </span>(test1._loop) { <span style="color: blue">while </span>(<span style="color: blue">true</span>); }</pre>
<p>This is an entirely reasonable transformation if only one thread accesses the _loop field. But, if another thread changes the value of the field, this optimization can prevent the reading thread from noticing the updated value.</p>
<p>If you mark the _loop field as volatile, the compiler will not hoist the read out of the loop. The compiler will know that other threads may be modifying the field, and so it will be careful to avoid optimizations that would result in a read of a stale value.</p>
<div class="mynote">
<p>The code transformation I showed is a close approximation of the optimization done by the CLR JIT compiler, but not completely exact.</p>
<p>The full story is that the assembly code emitted by the JIT compiler will store the value test1._loop in the EAX register. The loop condition will keep polling the register, and will read test1._loop from memory again. Even when the thread is pre-empted, the CPU registers get saved. Once the thread is again scheduled to run, the same stale EAX register value will be restored, and the loop never terminates.</p>
<p>The assembly code generated by the while loop looks as follows:</p>
<pre>00000068  test        eax,eax 
0000006a  jne         00000068</pre>
<p>If you make the _loop field volatile, this code is generated instead:</p>
<pre>00000064  cmp         byte ptr [eax+4],0 
00000068  jne         00000064</pre>
<p>If the _loop field is not volatile, the compiler will store _loop in the EAX register. If _loop is volatile, the compiler will instead keep the test1 variable in EAX, and the value of _loop will be re-fetched from memory on each access (by “ptr [eax+4]”).</p>
</div>
<div class="mynote">
<p>From my experience playing around with the current version of the CLR, I get the impression that these kinds of compiler optimizations are not terribly frequent. On x86 and x64, often the same assembly code will be generated regardless of whether a field is volatile or not. On IA64, the situation is a bit different &#8211; see the next section.</p>
</div>
<h3>Processor optimizations</h3>
<p>On some processors, not only must the compiler avoid certain optimizations on volatile reads and writes, it also has to use special instructions. On a multi-core machine, different cores have different caches. The processors may not bother to keep those caches coherent by default, and special instructions may be needed to flush and refresh the caches.</p>
<p>The mainstream <strong>x86 </strong>and <strong>x64 </strong>processors implement a strong memory model where memory access is effectively volatile. So, a volatile field forces the compiler to avoid some high-level optimizations like hoisting a read out of a loop, but otherwise results in the same assembly code as a non-volatile read.</p>
<p>The <strong>Itanium </strong>processor implements a weaker memory model. To target Itanium, the JIT compiler has to use special instructions for volatile memory accesses: <strong>LD.ACQ</strong> and <strong>ST.REL</strong>, instead of LD and ST. Instruction LD.ACQ effectively says, “refresh my cache and then read a value” and ST.REL says, “write a value to my cache and then flush the cache to main memory”. LD and ST on the other hand may just access the processor’s cache, which is not visible to other processors.</p>
<div class="mynote">
<p>For the reasons explained in this section and the previous sections, marking a field as volatile will often incur <strong>zero </strong>performance penalty on x86 and x64.</p>
</div>
<div class="mynote">
<p>The x86/x64 instruction set actually does contains three fence instructions: LFENCE, SFENCE, and MFENCE. LFENCE and SFENCE are apparently not needed on the current architecture, but MFENCE is useful to go around one particular issue: if a core reads a memory location it previously wrote, the read may be served from the store buffer, even though the write has not yet been written to memory. <a href="http://software.intel.com/en-us/forums/showthread.php?t=47577">[Source]</a> I don&#8217;t actually know whether the CLR JIT ever inserts MFENCE instructions.</p>
</div>
<h3>Volatile accesses in more depth</h3>
<p>To understand how volatile and non-volatile memory accesses work, you can imagine each thread as having its own cache. Consider a simple example with a <strong>non-volatile </strong>memory location (i.e. a field) <strong>u</strong>, and a <strong>volatile </strong>memory location <strong>v</strong>.</p>
<p>&#160;<img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image.png" width="346" height="295" /></p>
<p>A <strong>non-volatile write</strong> could just update the value in the thread’s cache, and not the value in main memory:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image1.png" width="346" height="295" />&#160; </p>
<p>However, in C# <strong>all writes are volatile</strong> (unlike say in Java), regardless of whether you write to a volatile or a non-volatile field. So, the above situation actually never happens in C#.</p>
<p>A <strong>volatile write </strong>updates the thread’s cache, and then flushes the entire cache to main memory. If we were to now set the volatile field v to 11, both values u and v would get flushed to main memory:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image2.png" width="346" height="295" /></p>
<p>Since all C# writes are volatile, you can think of all writes as going straight to main memory.</p>
<p>A regular, <strong>non-volatile read</strong> can read the value from the thread’s cache, rather than from main memory. Despite the fact that thread 1 set u to 11, when thread 2 reads u, it will still see value 10:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image3.png" width="346" height="295" /></p>
<p>When you read a non-volatile field in C#, a non-volatile read occurs, and you may see a stale value from the thread’s cache. Or, you may see the updated value. Whether you see the old or the new value depends on your compiler and your processor.</p>
<p>Finally, let’s take a look at an example of a <strong>volatile read</strong>. Thread 2 will read the volatile field v:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/03/image4.png" width="346" height="294" />&#160;</p>
<p>Before the volatile read, thread 2 refreshes its <strong>entire cache</strong>, and then reads the updated value of v: 11. So, it will observe the value&#160; that is really in main memory, and also refresh its cache as a bonus.</p>
<p>Note that the thread caches that I described are <strong>imaginary </strong>– there really is no such thing as a thread cache. Threads only appear to have these caches as an artifact of compiler and processor optimizations.</p>
<div class="mynote">
<p>One interesting point is that all writes in C# are volatile according to the memory model as documented <a href="http://msdn.microsoft.com/en-us/magazine/cc163715.aspx">here</a> and <a href="http://www.bluebytesoftware.com/blog/CommentView,guid,efdf6840-c071-4c5b-addf-9c3205de7f1d.aspx">here</a>, and are also presumably implemented as such. The ECMA specification of the C# language actually defines a weaker model where writes are not volatile by default.</p>
</div>
<div class="mynote">
<p>You may find it surprising that a volatile read refreshes the <strong>entire </strong>cache, not just the&#160; read value. Similarly, a volatile write (i.e., every C# write) flushes the <strong>entire </strong>cache, not just the written value. These semantics are sometimes referred to as “strong volatile semantics”.</p>
<p>The original Java memory model designed in 1995 was based on weak volatile semantics, but was changed in 2004 to strong volatile. The weak volatile model is very inconvenient. One example of the problem is that the “safe publication” pattern is not safe. Consider this example:</p>
<pre class="code"><span style="color: blue">volatile string</span>[] _args = <span style="color: blue">null</span>;</pre>
<pre class="code"><span style="color: blue">public void </span>Write() {
    <span style="color: blue">string</span>[] a = <span style="color: blue">new string</span>[2];
    a[0] = <span style="color: #a31515">&quot;arg1&quot;</span>;
    a[1] = <span style="color: #a31515">&quot;arg2&quot;</span>;
    _args = a;
    ...
}

<span style="color: blue">public void </span>Read() {
    <span style="color: blue">if </span>(_args != <span style="color: blue">null</span>) {
        <span style="color: green">// Under weak volatile semantics, this assert could fail!</span>
        <span style="color: #2b91af">Debug</span>.Assert(_args[0] != <span style="color: blue">null</span>);
    }
}</pre>
<p><a href="http://11011.net/software/vspaste"></a><font face="Courier New"></font></p>
<p>Under strong volatile semantics (i.e., the .NET and C# volatile semantics), a non-null value in the _args field guarantees that the elements of _args are also not null. The safe publication pattern is very useful and commonly used in practice.</p>
</div>
<h3>Memory model and .NET operations</h3>
<p>Here is a table of how various .NET operations interact with the imaginary thread cache:</p>
<table border="1" cellspacing="0" cellpadding="2" width="689">
<tbody>
<tr>
<td valign="top" width="160"><strong>Construct</strong></td>
<td valign="top" width="120"><strong>Refreshes thread cache before?</strong></td>
<td valign="top" width="104"><strong>Flushes thread cache after?</strong></td>
<td valign="top" width="303"><strong>Notes</strong></td>
</tr>
<tr>
<td valign="top" width="160">Ordinary read</td>
<td valign="top" width="120">No</td>
<td valign="top" width="104">No</td>
<td valign="top" width="303">Read of a non-volatile field</td>
</tr>
<tr>
<td valign="top" width="160">Ordinary write</td>
<td valign="top" width="120">No</td>
<td valign="top" width="104"><strong>Yes</strong></td>
<td valign="top" width="303">Write of a non-volatile field</td>
</tr>
<tr>
<td valign="top" width="160">Volatile read</td>
<td valign="top" width="120"><strong>Yes</strong></td>
<td valign="top" width="104">No</td>
<td valign="top" width="303">Read of volatile field, or Thread.VolatileRead</td>
</tr>
<tr>
<td valign="top" width="160">Volatile write</td>
<td valign="top" width="120">No</td>
<td valign="top" width="104"><strong>Yes</strong></td>
<td valign="top" width="303">Write of a volatile field – same as non-volatile</td>
</tr>
<tr>
<td valign="top" width="160"><a href="http://msdn.microsoft.com/en-us/library/system.threading.thread.memorybarrier.aspx">Thread.MemoryBarrier</a></td>
<td valign="top" width="120"><strong>Yes</strong></td>
<td valign="top" width="104"><strong>Yes</strong></td>
<td valign="top" width="303">Special memory barrier method</td>
</tr>
<tr>
<td valign="top" width="160"><a href="http://msdn.microsoft.com/en-us/library/system.threading.interlocked.aspx">Interlocked</a> operations</td>
<td valign="top" width="120"><strong>Yes</strong></td>
<td valign="top" width="104"><strong>Yes</strong></td>
<td valign="top" width="303">Increment, Add, Exchange, etc.</td>
</tr>
<tr>
<td valign="top" width="160">Lock acquire</td>
<td valign="top" width="120"><strong>Yes</strong></td>
<td valign="top" width="104">No</td>
<td valign="top" width="303"><a href="http://msdn.microsoft.com/en-us/library/de0542zz.aspx">Monitor.Enter</a> or entering a lock {}&#160; region</td>
</tr>
<tr>
<td valign="top" width="160">Lock release</td>
<td valign="top" width="120">No</td>
<td valign="top" width="104"><strong>Yes</strong></td>
<td valign="top" width="303"><a href="http://msdn.microsoft.com/en-us/library/system.threading.monitor.exit.aspx">Monitor.Exit</a> or exiting a lock {} region</td>
</tr>
</tbody>
</table>
<p>&#160;</p>
<p>For each operation, the table shows two things:</p>
<ul>
<li>Is the entire imaginary thread cache <strong>refreshed </strong>from main memory <strong>before </strong>the operation? </li>
<li>Is the entire imaginary thread cache <strong>flushed </strong>to main memory <strong>after </strong>the operation? </li>
</ul>
<h3>Disclaimer and limitations of the model</h3>
<p>This blog post reflects my personal understanding of the .NET memory model, and is based purely on publicly available information.</p>
<p>I find the explanation based on imaginary thread caches more intuitive than the more commonly used explanation based on operation reordering. The thread cache model is also accurate for most intents and purposes.</p>
<p>To be even more accurate, you should assume that the thread caches can form an arbitrary large hierarchy, and so you cannot assume that a read is served only from two possible places – main memory or the thread’s cache. I think that you would have to construct a somewhat of a clever case in order for the cache hierarchy to make a difference, though. If anyone is aware of a case where the hierarchical thread cache model makes a prediction different from the reordering-based model, I would love to hear about it.</p>
<p>If you are interested in the .NET memory model, I encourage you to read <a href="http://msdn.microsoft.com/en-us/magazine/cc163715.aspx">Understand the Impact of Low-Lock Techniques in Multithreaded Apps</a> in the MSDN Magazine, and the <a href="http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx">Memory model</a> blog post from Chris Brumme.</p>
<div class="mynote">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">Gallery of processor cache effects</a></p>
<p><a href="http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/">What really happens when you navigate to a URL</a></p>
<p><a href="http://igoro.com/archive/human-heart-is-a-turing-machine-research-on-xbox-360-shows-wait-what/">Human heart is a Turing machine, research on XBox 360 shows. Wait, what?</a></p>
<p><a href="http://igoro.com/archive/7-tricks-to-simplify-your-programs-with-linq/">7 tricks to simplify your programs with LINQ</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/feed/</wfw:commentRss>
		<slash:comments>29</slash:comments>
		</item>
		<item>
		<title>What really happens when you navigate to a URL</title>
		<link>http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/</link>
		<comments>http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 08:14:26 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Cool Stuff]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=422</guid>
		<description><![CDATA[As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on. In this article, we will take a deeper look at the sequence of events that take place when you visit a URL. [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>div.mynote { border: black 1px solid; padding: 10px; background-color: #f0f0ff; margin-top: 20px; margin-bottom: 20px; margin-left: 10px; ></style>
<p>As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.</p>
<p>In this article, we will take a deeper look at the sequence of events that take place when you visit a URL.</p>
<h3>1. You enter a URL into the browser</h3>
<p>It all starts here:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image4.png" width="591" height="103" /> </p>
<p><strong></strong></p>
<h3>2. The browser looks up the IP address for the domain name</h3>
<p>&#160;<img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image13.png" width="228" height="96" /> </p>
<p>The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:</p>
<p> <span id="more-422"></span>
<ul>
<li><strong>Browser cache – </strong>The browser caches DNS records for some time. Interestingly, the OS does not tell the browser the time-to-live for each DNS record, and so the browser caches them for a fixed duration (varies between browsers, 2 – 30 minutes). </li>
<li><strong>OS cache</strong> – If the browser cache does not contain the desired record, the browser makes a system call (gethostbyname in Windows). The OS has its own cache. </li>
<li><strong>Router cache</strong> – The request continues on to your router, which typically has its own DNS cache. </li>
<li><strong>ISP DNS cache</strong> – The next place checked is the cache ISP’s DNS server. With a cache, naturally. </li>
<li><strong>Recursive search</strong> – Your ISP’s DNS server begins a recursive search, from the root nameserver, through the .com top-level nameserver, to Facebook’s nameserver. Normally, the DNS server will have names of the .com nameservers in cache, and so a hit to the root nameserver will not be necessary. </li>
</ul>
<p>Here is a diagram of what a recursive DNS search looks like:</p>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="500px-An_example_of_theoretical_DNS_recursion_svg" border="0" alt="500px-An_example_of_theoretical_DNS_recursion_svg" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/500pxAn_example_of_theoretical_DNS_recursion_svg.png" width="500" height="178" />&#160;</p>
<div class="mynote">
<p>One worrying thing about DNS is that the entire domain like wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are ways of mitigating the bottleneck:</p>
<ul>
<li><strong>Round-robin DNS</strong> is a solution where the DNS lookup returns multiple IP addresses, rather than just one. For example, facebook.com actually maps to four IP addresses. </li>
<li><strong>Load-balancer</strong> is the piece of hardware that listens on a particular IP address and forwards the requests to other servers. Major sites will typically use expensive high-performance load balancers. </li>
<li><strong>Geographic DNS </strong>improves scalability by mapping a domain name to different IP addresses, depending on the client’s geographic location. This is great for hosting static content so that different servers don’t have to update shared state. </li>
<li><strong>Anycast</strong> is a routing technique where a single IP address maps to multiple physical servers. Unfortunately, anycast does not fit well with TCP and is rarely used in that scenario. </li>
</ul></div>
<div class="mynote">
<p>Most of the DNS servers themselves use anycast to achieve high availability and low latency <strong>of the DNS lookups</strong>.</p>
</p></div>
<h3>3. The browser sends a HTTP request to the web server</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; margin-left: 0px; border-left-width: 0px; margin-right: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image22.png" width="216" height="95" /> </p>
<p>You can be pretty sure that Facebook’s homepage will not be served from the browser cache because dynamic pages expire either very quickly or immediately (expiry date set to past).</p>
<p>So, the browser will send this request to the Facebook server:</p>
<pre class="code">GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, <font style="color: lightblue">[...]</font>
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; <font style="color: lightblue">[...]</font>
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-<font style="color: lightblue">[...]</font>; locale=en_US; lsd=WW<font style="color: lightblue">[...]</font>; c_user=2101<font style="color: lightblue">[...]</font></pre>
<p>The GET request names the <strong>URL</strong> to fetch<strong>:</strong> “http://facebook.com/”. The browser identifies itself (<strong>User-Agent</strong> header), and states what types of responses it will accept (<strong>Accept</strong> and <strong>Accept-Encoding</strong> headers). The <strong>Connection</strong> header asks the server to keep the TCP connection open for further requests.</p>
<p>The request also contains the <strong>cookies</strong> that the browser has for this domain. As you probably already know, cookies are key-value pairs that track the state of a web site in between different page requests. And so the cookies store the name of the logged-in user, a secret number that was assigned to the user by the server, some of user’s settings, etc. The cookies will be stored in a text file on the client, and sent to the server with every request.</p>
<div class="mynote">
<p>There is a variety of tools that let you view the raw HTTP requests and corresponding responses. My favorite tool for viewing the raw HTTP traffic is <a href="http://www.fiddler2.com/fiddler2/">fiddler</a>, but there are many other tools (e.g., FireBug) These tools are a great help when optimizing a site.</p>
</div>
<div class="mynote">
<p>In addition to GET requests, another type of requests that you may be familiar with is a POST request, typically used to submit forms. A GET request sends its parameters via the URL (e.g.: http://robozzle.com/puzzle.aspx<strong>?id=85</strong>). A POST request sends its parameters in the request body, just under the headers.</p>
</div>
<div class="mynote">
<p>The trailing slash in the URL “http://facebook.com/” is important. In this case, the browser can safely add the slash. For URLs of the form http://example.com/folderOrFile, the browser cannot automatically add a slash, because it is not clear whether folderOrFile is a folder or a file. In such cases, the browser will visit the URL without the slash, and the server will respond with a redirect, resulting in an unnecessary roundtrip.</p>
</div>
<h3>4. The facebook server responds with a permanent redirect</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image8.png" width="214" height="87" /> </p>
<p>This is the response that the Facebook server sent back to the browser request:</p>
<pre class="code">HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
      pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP=&quot;DSP LAW&quot;
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT;
      path=/; domain=.facebook.com; httponly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, 12 Feb 2010 05:09:51 GMT
Content-Length: 0</pre>
<p>The server responded with a 301 Moved Permanently response to tell the browser to go to “http://www.facebook.com/” instead of “http://facebook.com/”. </p>
<div class="mynote">
<p>There are interesting reasons why the server insists on the redirect instead of immediately responding with the web page that the user wants to see.</p>
<p>One reason has to do with <strong>search engine rankings</strong>. See, if there are two URLs for the same page, say <a href="http://www.igoro.com/">http://www.igoro.com/</a> and <a href="http://igoro.com/">http://igoro.com/</a>, search engine may consider them to be two different sites, each with fewer incoming links and thus a lower ranking. Search engines understand permanent redirects (301), and will combine the incoming links from both sources into a single ranking.</p>
<p>Also, multiple URLs for the same content are not <strong>cache-friendly</strong>. When a piece of content has multiple names, it will potentially appear multiple times in caches.</p>
</div>
<h3>5. The browser follows the redirect</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image23.png" width="216" height="95" /></p>
<p>The browser now knows that “http://www.facebook.com/” is the correct URL to go to, and so it sends out another GET request:</p>
<pre class="code">GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, <font style="color: lightblue">[...]</font>
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; <font style="color: lightblue">[...]</font>
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=XW<font style="color: lightblue">[...]</font>; c_user=21<font style="color: lightblue">[...]</font>; x-referer=<font style="color: lightblue">[...]</font>
Host: www.facebook.com</pre>
<p>The meaning of the headers is the same as for the first request.</p>
<h3>6. The server ‘handles’ the request</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image9.png" width="93" height="78" /> </p>
<p>The server will receive the GET request, process it, and send back a response.</p>
<p>This may seem like a straightforward task, but in fact there is a lot of interesting stuff that happens here – even on a simple site like my blog, let alone on a massively scalable site like facebook.</p>
<ul>
<li><strong>Web server software<br /></strong>The web server software (e.g., IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle this request. A request handler is a program (in ASP.NET, PHP, Ruby, …) that reads the request and generates the HTML for the response.
<p>In the simplest case, the request handlers can be stored in a file hierarchy whose structure mirrors the URL structure, and so for example <a href="http://example.com/folder1/page1.aspx">http://example.com/folder1/page1.aspx</a> URL will map to file /httpdocs/folder1/page1.aspx. The web server software can also be configured so that URLs are manually mapped to request handlers, and so the public URL of page1.aspx could be <a href="http://example.com/folder1/page1">http://example.com/folder1/page1</a>. </p>
</li>
<li><strong>Request handler<br /></strong>The request handler reads the request, its parameters, and cookies. It will read and possibly update some data stored on the server. Then, the request handler will generate a HTML response. </li>
</ul>
<div class="mynote">
<p>One interesting difficulty that every dynamic website faces is how to store data. Smaller sites will often have a single SQL database to store their data, but sites that store a large amount of data and/or have many visitors have to find a way to split the database across multiple machines. Solutions include sharding (splitting up a table across multiple databases based on the primary key), replication, and usage of simplified databases with weakened consistency semantics. </p>
</div>
<div class="mynote">
<p>One technique to keep data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the newsfeed in a timely fashion, but the data backing the “People you may know” feature may only need to be updated nightly (my guess, I don’t actually know how they implement this feature). Batch job updates result in staleness of some less important data, but can make data updates much faster and simpler.</p>
</div>
<h3>7. The server sends back a HTML response</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image10.png" width="214" height="87" /> </p>
<p>Here is the response that the server generated and sent back:</p>
<pre class="code">HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
    pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
P3P: CP=&quot;DSP LAW&quot;
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, 12 Feb 2010 09:05:55 GMT

2b3<br />��������T�n�@����<font style="color: lightblue">[...]</font></pre>
<p>The entire response is 36 kB, the bulk of them in the byte blob at the end that I trimmed.</p>
<p>The <strong>Content-Encoding</strong> header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you’ll see the HTML you’d expect:</p>
<pre>&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Strict//EN&quot;&#160;&#160;&#160;
      &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd&quot;&gt;
&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; xml:lang=&quot;en&quot; 
      lang=&quot;en&quot; id=&quot;facebook&quot; class=&quot; no_js&quot;&gt;
&lt;head&gt;
&lt;meta http-equiv=&quot;Content-type&quot; content=&quot;text/html; charset=utf-8&quot; /&gt;
&lt;meta http-equiv=&quot;Content-language&quot; content=&quot;en&quot; /&gt;
...</pre>
<p>In addition to compression, headers specify whether and how to cache the page, any cookies to set (none in this response), privacy information, etc.</p>
<div class="mynote">
<p>Notice the header that sets <strong>Content-Type</strong> to <strong>text/html</strong>. The header instructs the browser to render the response content as HTML, instead of say downloading it as a file. The browser will use the header to decide how to interpret the response, but will consider other factors as well, such as the extension of the URL.</p>
</div>
<h3>8. The browser begins rendering the HTML</h3>
<p>Even before the browser has received the entire HTML document, it begins rendering the website:</p>
<p>&#160;<img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image6.png" width="244" height="137" /> </p>
<h3>9. The browser sends requests for objects embedded in HTML</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image11.png" width="214" height="126" /> </p>
<p>As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files.</p>
<p>Here are a few URLs that my visit to facebook.com retrieved:</p>
<ul>
<li><strong>Images<br /></strong>http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif<br />http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif<br />… </li>
<li><strong>CSS style sheets<br /></strong>http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css<br />http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css<br />… </li>
<li><strong>JavaScript files</strong><br />http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js<br />http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js<br />…</li>
</ul>
<p>Each of these URLs will go through process a similar to what the HTML page went through. So, the browser will look up the domain name in DNS, send a request to the URL, follow redirects, etc.</p>
<p>However, static files &#8211; unlike dynamic pages &#8211; allow the browser to cache them. Some of the files may be served up from cache, without contacting the server at all. The browser knows how long to cache a particular file because the response that returned the file contained an Expires header. Additionally, each response may also contain an ETag header that works like a version number – if the browser sees an ETag for a version of the file it already has, it can stop the transfer immediately.</p>
<div class="mynote">
<p>Can you guess what <strong>“fbcdn.net”</strong> in the URLs stands for? A safe bet is that it means “Facebook content delivery network”. Facebook uses a content delivery network (CDN) to distribute static content – images, style sheets, and JavaScript files. So, the files will be copied to many machines across the globe.</p>
<p>Static content often represents the bulk of the bandwidth of a site, and can be easily replicated across a CDN. Often, sites will use a third-party CDN provider, instead of operating a CND themselves. For example, Facebook’s static files are hosted by Akamai, the largest CDN provider.</p>
<p>As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Also, interestingly, if you ping the URL a couple of times, may get responses from different servers, which demonstrates the load-balancing that happens behind the scenes.</p>
</div>
<h3>10. The browser sends further asynchronous (AJAX) requests</h3>
<p><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="image" border="0" alt="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image12.png" width="214" height="120" /> </p>
<p>In the spirit of Web 2.0, the client continues to communicate with the server even after the page is rendered.</p>
<p>For example, Facebook chat will continue to update the list of your logged in friends as they come and go. To update the list of your logged-in friends, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.</p>
<p>This pattern is sometimes referred to as “AJAX”, which stands for “Asynchronous JavaScript And XML”, even though there is no particular reason why the server has to format the response as XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.</p>
<div class="mynote">
<p>Among other things, the fiddler tool lets you view the asynchronous requests sent by your browser. In fact, not only you can observe the requests passively, but you can also modify and resend them. The fact that it is this easy to “spoof” AJAX requests causes a lot of grief to developers of online games with scoreboards. (Obviously, please don’t cheat that way.)</p>
</div>
<div class="mynote">
<p>Facebook chat provides an example of an interesting problem with AJAX: pushing data from server to client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see if any new messages arrived.</p>
<p><a href="http://en.wikipedia.org/wiki/Push_technology#Long_polling">Long polling</a> is an interesting technique to decrease the load on the server in these types of scenarios. If the server does not have any new messages when polled, it simply does not send a response back. And, if a message for this client is received within the timeout period, the server will find the outstanding request and return the message with the response.</p>
</div>
<h3>Conclusion</h3>
<p>Hopefully this gives you a better idea of how the different web pieces work together.</p>
<div class="mynote">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/gallery-of-processor-cache-effects/">Gallery of processor cache effects</a></p>
<p><a href="http://igoro.com/archive/human-heart-is-a-turing-machine-research-on-xbox-360-shows-wait-what/">Human heart is a Turing machine, research on XBox 360 shows. Wait, what?</a></p>
<p><a href="http://igoro.com/archive/self-printing-game-of-life-in-c/">Self-printing Game of Life in C#</a></p>
<p><a href="http://igoro.com/archive/skip-lists-are-fascinating/">Skip lists are fascinating!</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/what-really-happens-when-you-navigate-to-a-url/feed/</wfw:commentRss>
		<slash:comments>209</slash:comments>
		</item>
		<item>
		<title>Gallery of Processor Cache Effects</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/</link>
		<comments>http://igoro.com/archive/gallery-of-processor-cache-effects/#comments</comments>
		<pubDate>Tue, 19 Jan 2010 10:28:11 +0000</pubDate>
		<dc:creator><![CDATA[Igor Ostrovsky]]></dc:creator>
				<category><![CDATA[Cool Stuff]]></category>

		<guid isPermaLink="false">http://igoro.com/?p=366</guid>
		<description><![CDATA[Most of my readers will understand that cache is a fast but small type of memory that stores recently accessed memory locations.  This description is reasonably accurate, but the &#8220;boring&#8221; details of how processor caches work can help a lot when trying to understand program performance. In this blog post, I will use code samples [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Most of my readers will understand that cache is a fast but small type of memory that stores recently accessed memory locations.  This description is reasonably accurate, but the &#8220;boring&#8221; details of how processor caches work can help a lot when trying to understand program performance.</p>
<p>In this blog post, I will use code samples to illustrate various aspects of how caches work, and what is the impact on the performance of real-world programs.</p>
<p>The examples are in C#, but the language choice has little impact on the performance scores and the conclusions they lead to.</p>
<h3>Example 1: Memory accesses and performance</h3>
<p>How much faster do you expect Loop 2 to run, compared Loop 1?</p>
<pre class="code"><span style="color: blue;">int</span>[] arr = <span style="color: blue;">new int</span>[64 * 1024 * 1024];
<span id="more-366"></span>
<span style="color: green;">// Loop 1</span>
<span style="color: blue;">for </span>(<span style="color: blue;">int </span>i = 0; i &lt; arr.Length; i++) arr[i] *= 3;

<span style="color: green;">// Loop 2</span>
<span style="color: blue;">for </span>(<span style="color: blue;">int </span>i = 0; i &lt; arr.Length; i += 16) arr[i] *= 3;</pre>
<p><!--More--></p>
<p>The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about <strong>6% of the work</strong> of the first loop, but on modern machines, the two for-loops take about the same time<strong>:</strong> <strong>80</strong> and <strong>78</strong> <strong>ms</strong> respectively on my machine.</p>
<p>The reason why the loops take the same amount of time has to do with memory. The running time of these loops is dominated by the memory accesses to the array, not by the integer multiplications. And, as I’ll explain on Example 2, the hardware will perform the same main memory accesses for the two loops.</p>
<h3>Example 2: Impact of cache lines</h3>
<p>Let’s explore this example deeper. We will try other step values, not just 1 and 16:</p>
<pre class="code"><span style="color: blue;">for </span>(<span style="color: blue;">int </span>i = 0; i &lt; arr.Length; i += K) arr[i] *= 3;</pre>
<p>Here are the running times of this loop for different step values (K):</p>
<p><img style="display: inline; border-width: 0px;" title="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/01/image6.png" border="0" alt="image" width="483" height="291" /></p>
<p>Notice that while step is in the range from 1 to 16, the running time of the for-loop hardly changes. But from 16 onwards, the running time is halved each time we double the step.</p>
<p>The reason behind this is that today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called <em>cache lines</em>. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!</p>
<p>Since 16 ints take up 64 bytes (one cache line), for-loops with a step between 1 and 16 have to touch the same number of cache lines: all of the cache lines in the array. But once the step is 32, we’ll only touch roughly every other cache line, and once it is 64, only every fourth.</p>
<p>Understanding of cache lines can be important for certain types of program optimizations. For example, alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.</p>
<h3>Example 3: L1 and L2 cache sizes</h3>
<p>Today’s computers come with two or three levels of caches, usually called L1, L2 and possibly L3. If you want to know the sizes of the different caches, you can use the <a href="http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx">CoreInfo</a> SysInternals tool, or use the <a href="http://msdn.microsoft.com/en-us/library/ms683194(VS.85).aspx">GetLogicalProcessorInfo</a> Windows API call. Both methods will also tell you the cache line sizes, in addition to the cache sizes.</p>
<p>On my machine, CoreInfo reports that I have a 32kB L1 data cache, a 32kB L1 instruction cache, and a 4MB L2 data cache. The L1 caches are per-core, and the L2 caches are shared between pairs of cores:</p>
<pre class="code">Logical Processor to Cache Map:
*---  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*---  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
**--  Unified Cache       0, Level 2,    4 MB, Assoc  16, LineSize  64
--*-  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*-  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
--**  Unified Cache       1, Level 2,    4 MB, Assoc  16, LineSize  64</pre>
<p>Let’s verify these numbers by an experiment. To do that, we’ll step over an array incrementing every 16th integer – a cheap way to modify every cache line. When we reach the last value, we loop back to the beginning. We’ll experiment with different array sizes, and we should see drops in the performance at the array sizes where the array spills out of one cache level.</p>
<p>Here is the program:</p>
<pre class="code"><span style="color: blue;">int </span>steps = 64 * 1024 * 1024; <span style="color: green;">// Arbitrary number of steps</span>
<span style="color: blue;">int </span>lengthMod = arr.Length - 1;
<span style="color: blue;">for </span>(<span style="color: blue;">int </span>i = 0; i &lt; steps; i++)
{
    arr[(i * 16) &amp; lengthMod]++; <span style="color: green;">// (x &amp; lengthMod) is equal to (x % arr.Length)
</span>}</pre>
<p>And here are the timings:</p>
<p> <img style="display: inline; border-width: 0px;" title="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image.png" border="0" alt="image" width="483" height="291" /></p>
<p>You can see distinct drops after 32kB and 4MB – the sizes of L1 and L2 caches on my machine.</p>
<h3>Example 4: Instruction-level parallelism</h3>
<p>Now, let’s take a look at something different. Out of these two loops, which one would you expect to be faster?</p>
<pre class="code"><span style="color: blue;">int </span>steps = 256 * 1024 * 1024;
<span style="color: blue;">int</span>[] a = <span style="color: blue;">new int</span>[2];

<span style="color: green;">// Loop 1
</span><span style="color: blue;">for </span>(<span style="color: blue;">int </span>i=0; i&lt;steps; i++) { a[0]++; a[0]++; }

<span style="color: green;">// Loop 2
</span><span style="color: blue;">for </span>(<span style="color: blue;">int </span>i=0; i&lt;steps; i++) { a[0]++; a[1]++; }</pre>
<p>It turns out that the second loop is about twice faster than the first loop, at least on all of the machines I tested. Why? This has to do with the dependencies between operations in the two loop bodies.</p>
<p>In the body of the first loop, operations depend on each other as follows:</p>
<p><img style="display: inline; border-width: 0px;" title="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/01/image.png" border="0" alt="image" width="613" height="25" /></p>
<p>But in the second example, we only have these dependencies:</p>
<p><img style="display: inline; border-width: 0px;" title="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image2.png" border="0" alt="image" width="289" height="73" /></p>
<p>The modern processor has various parts that have a little bit of parallelism in them: it can access two memory locations in L1 at the same time, or perform two simple arithmetic operations. In the first loop, the processor cannot exploit this instruction-level parallelism, but in the second loop, it can.</p>
<p><strong>[UPDATE]</strong>: Many people on reddit are asking about compiler optimizations, and whether { a[0]++; a[0]++; } would just get optimized to { a[0]+=2; }. In fact, the C# compiler and CLR JIT will not do this optimization – not when array accesses are involved. I built all of the tests in release mode (i.e. with optimizations), but I looked at the JIT-ted assembly to verify that optimizations aren’t skewing the results.</p>
<h3>Example 5: Cache associativity</h3>
<p>One key decision in cache design is whether each chunk of main memory can be stored in any cache slot, or in just some of them.</p>
<p>There are three possible approaches to mapping cache slots to memory chunks:</p>
<ol>
<li><strong>Direct mapped cache</strong>
<p>Each memory chunk can only be stored only in one particular slot in the cache. One simple solution is to map the chunk with index chunk_index to cache slot (chunk_index % cache_slots). Two memory chunks that map to the same slot cannot be stored simultaneously in the cache.</li>
<li><strong>N-way set associative cache</strong>
<p>Each memory chunk can be stored in any one of N particular slots in the cache. As an example, in a 16-way cache, each memory chunk can be stored in 16 different cache slots. Commonly, chunks with indices with the same lowest order bits will all share 16 slots.</li>
<li><strong>Fully associative cache</strong>
<p>Each memory chunk can be stored in any slot in the cache. Effectively, the cache operates like a hash table.</li>
</ol>
<p>Direct mapped caches can suffer from conflicts &#8211; when multiple values compete for the same slot in the cache, they keep evicting each other out, and the hit rate plummets. On the other hand, fully associative caches are complicated and costly to implement in the hardware. N-way set associative caches are the typical solution for processor caches, as they make a good trade off between implementation simplicity and good hit rate.</p>
<p>For example, the 4MB L2 cache on my machine is 16-way associative. All 64-byte memory chunks are partitioned into sets (based on the lowest order bits of the chunk index), and chunks in the same set compete for 16 slots in the L2 cache.</p>
<p>Since the L2 cache has 65,536 slots, and each set will need 16 slots in the cache, we will have 4,096 sets. So, the lowest 12 bits of the chunk index will determine which set the chunk belongs to (2<sup>12</sup> = 4,096). As a result, cache lines at addresses that differ by a multiple of 262,144 bytes (4096 * 64) will compete for the same slot in the cache. The cache on my machine can hold at most 16 such cache lines.</p>
<p>In order for the effects of cache associativity to become apparent, I need to repeatedly access more than  16 elements from the same set. I will demonstrate this using the following method:</p>
<pre class="code"><span style="color: blue;">public static long </span>UpdateEveryKthByte(<span style="color: blue;">byte</span>[] arr, <span style="color: blue;">int </span>K)
{
    <span style="color: #2b91af;">Stopwatch </span>sw = <span style="color: #2b91af;">Stopwatch</span>.StartNew();
    <span style="color: blue;">const int </span>rep = 1024*1024; <span style="color: green;">// Number of iterations – arbitrary</span>

<span style="color: green;">    </span><span style="color: blue;">int </span>p = 0;
    <span style="color: blue;">for </span>(<span style="color: blue;">int </span>i = 0; i &lt; rep; i++)
    {
        arr[p]++;
        p += K;
        <span style="color: blue;">if </span>(p &gt;= arr.Length) p = 0;
    }

    sw.Stop();
    <span style="color: blue;">return </span>sw.ElapsedMilliseconds;
}</pre>
<p>This method increments every K-th value in the array. Once the it reaches the end of the array, it starts again from the beginning. After running sufficiently long (2^20 steps), the loop stops.</p>
<p>I ran UpdateEveryKthByte() with different array sizes (in 1MB increments) and different step sizes. Here is a plot of the results, with blue representing long running time, and white representing short:</p>
<p> <a href="http://igoro.com/wordpress/wp-content/uploads/2010/02/image3.png"><img style="display: inline; border-width: 0px;" title="image" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/image_thumb1_opt.png" border="0" alt="image" width="582" height="299" /></a></p>
<p>The blue areas (long running times) are cases where the updated values <strong>could not be simultaneously held in the cache</strong> as we repeatedly iterated over them. The bright blue areas correspond to running times of ~80 ms, and the nearly white areas to ~10 ms.</p>
<p>Let’s explain the blue parts of the chart:</p>
<ol>
<li><strong>Why the vertical lines?</strong>The vertical lines show the step values that touch too many memory locations (&gt;16) from the same set. For those steps, we cannot simultaneously hold all touched values in the 16-way associative cache on my machine.
<p>Some bad step values are powers of two: 256 and 512. As an example, consider step 512 on an 8MB array. An 8MB cache line contains 32 values that are spaced by 262,144 bytes apart. All of those values will be updated by each pass of our loop, because 512 divides 262,144.</p>
<p>And since 32 &gt; 16, those 32 values will keep competing for the same 16 slots in the cache.</p>
<p>Some values that are not powers of two are simply unfortunate, and will end up visiting disproportionately many values from the same set. Those step values will also show up as as blue lines.</li>
<li><strong>Why do the vertical lines stop at 4MB array length?</strong>On arrays of 4MB or less, a 16-way associative cache is just as good as a fully associative one.
<p>A 16-way associative cache can hold at most 16 cache lines that are a multiple of 262,144 bytes apart. There is <strong>no set</strong> of 17 or more cache lines all aligned on 262,144-byte boundaries within 4MB, because 16 * 262,144 = 4,194,304.</li>
<li><strong>Why the blue triangle in upper left?</strong>In the triangle area, we cannot hold all necessary data in cache simultaneously … not due to the associativity, but simply because of the L2 cache size limit.
<p>For example, consider the array length 16MB with step 128. We are repeatedly updating every 128th byte in the array, which means that we touch every other 64-byte memory chunk. To store every other cache line of a 16MB array, we’d need 8MB cache. But, my machine only has 4MB of cache.</p>
<p>Even if the 4MB cache on my machine was fully associative, it still wouldn’t be able to hold 8MB of data.</li>
<li><strong>Why does the triangle fade out in the left?</strong>Notice that the gradient goes from 0 to 64 bytes – one cache line! As explained in examples 1 and 2, additional accesses to same cache line are nearly free. For example, when stepping by 16 bytes, it will take 4 steps to get to the next cache line. So, we get four memory accesses for the price of one.
<p>Since the number of steps is the same for all cases, a cheaper step results in a shorter running time.</li>
</ol>
<p>These patterns continue to hold as you extend the chart:</p>
<p><a href="http://igoro.com/wordpress/wp-content/uploads/2010/02/assoc_big1.png"><img style="display: inline; border-width: 0px;" title="assoc_big" src="http://igoro.com/wordpress/wp-content/uploads/2010/02/assoc_big_thumb1_opt.png" border="0" alt="assoc_big" width="582" height="299" /></a></p>
<p>Cache associativity is interesting to understand and can certainly be demonstrated, but it tends to be less of a problem compared to the other issues discussed in this article. It is certainly not something that should be at the forefront of your mind as you write programs.</p>
<h3>Example 6: False cache line sharing</h3>
<p>On multi-core machines, caches encounter another problem – consistency. Different cores have fully or partly separate caches. On my machine, L1 caches are separate (as is common), and there are two pairs of processors, each pair sharing an L2 cache. While the details vary, a modern multi-core machine will have a multi-level cache hierarchy, where the faster and smaller caches belong to individual processors.</p>
<p>When one processor modifies a value in its cache, other processors cannot use the old value anymore. That memory location will be invalidated in all of the caches. Furthermore, since caches operate on the granularity of cache lines and not individual bytes, the <strong>entire cache line</strong> will be invalidated in all caches!</p>
<p>To demonstrate this issue, consider this example:</p>
<pre class="code"><span style="color: blue;">private static int</span>[] s_counter = <span style="color: blue;">new int</span>[1024];
<span style="color: blue;">private void </span>UpdateCounter(<span style="color: blue;">int </span>position)
{
    <span style="color: blue;">for </span>(<span style="color: blue;">int </span>j = 0; j &lt; 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}</pre>
<p>On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take <strong>4.3 seconds </strong>until all threads are done.</p>
<p>On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in <strong>0.28 seconds</strong>!</p>
<p>Why? In the first case, all four values are very likely to end up on the same cache line. Each time a core increments the counter, it invalidates the cache line that holds all four counters. All other cores will suffer a cache miss the next time they access their own counters. This kind of thread behavior effectively disables caches, <strong>crippling</strong> the program’s performance.</p>
<h3>Example 7: Hardware complexities</h3>
<p>Even when you know the basics of how caches work, the hardware will still sometimes surprise you. Different processors differ in optimizations, heuristics, and subtle details of how they do things.</p>
<p>On some processors, L1 cache can process two accesses in parallel if they access cache lines from different banks, and serially if they belong to the same bank. Also, processors can surprise you with clever optimizations. For example, the false-sharing example that I’ve used on several machines in the past did not work well on my machine without tweaks – my home machine can optimize the execution in the simplest cases to reduce the cache invalidations.</p>
<p>Here is one odd example of “hardware weirdness”:</p>
<pre class="code"><span style="color: blue;">private static int </span>A, B, C, D, E, F, G;
<span style="color: blue;">private static void </span>Weirdness()
{
    <span style="color: blue;">for </span>(<span style="color: blue;">int </span>i = 0; i &lt; 200000000; i++)
    {
        &lt;something&gt;
    }
}</pre>
<p>When I substitute three different blocks for “&lt;something&gt;”, I get these timings:</p>
<table border="0" cellspacing="0" cellpadding="2" width="362">
<tbody>
<tr>
<td width="199" valign="top"><strong>&lt;something&gt;</strong></td>
<td width="161" valign="top"><strong>Time</strong></td>
</tr>
<tr>
<td width="220" valign="top">A++; B++; C++; D++;</td>
<td width="169" valign="top">719 ms</td>
</tr>
<tr>
<td width="225" valign="top">A++; C++; E++; G++;</td>
<td width="172" valign="top">448 ms</td>
</tr>
<tr>
<td width="225" valign="top">A++; C++;</td>
<td width="174" valign="top">518 ms</td>
</tr>
</tbody>
</table>
<p>Incrementing fields A,B,C,D takes longer than incrementing fields A,C,E,G. And what’s even weirder, incrementing just A and C takes <strong>longer </strong>than increment A and C <strong>and</strong> E and G!</p>
<p>I don’t know for sure what is the reason behind these numbers, but I suspect it is related to memory banks. If someone can explain these numbers, I’d be very curious to hear about it.</p>
<p>The lesson of this example is that can be difficult to fully predict hardware performance. There is a lot that you <strong>can</strong> predict, but ultimately, it is very important to measure and verify your assumptions.</p>
<div style="background-color: #f6f6ff; margin-top: 20px; width: 600px; margin-bottom: 20px; margin-left: 10px; border: black 1px solid; padding: 4px;">
<p>Read more of my articles:</p>
<p><a href="http://igoro.com/archive/human-heart-is-a-turing-machine-research-on-xbox-360-shows-wait-what/">Human heart is a Turing machine, research on XBox 360 shows. Wait, what?</a></p>
<p><a href="http://igoro.com/archive/self-printing-game-of-life-in-c/">Self-printing Game of Life in C#</a></p>
<p><a href="http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/">Efficient auto-complete with a ternary search tree</a></p>
<p><a href="http://igoro.com/archive/numbers-that-cannot-be-computed/">Numbers that cannot be computed</a></p>
<p>And if you like my blog, <a href="http://igoro.com/feed/">subscribe</a>!</p>
</div>
<h3>Conclusion</h3>
<p>Hopefully all of this helps you understand how caches work, and apply that knowledge when tuning your programs.</p>
]]></content:encoded>
			<wfw:commentRss>http://igoro.com/archive/gallery-of-processor-cache-effects/feed/</wfw:commentRss>
		<slash:comments>90</slash:comments>
		</item>
	</channel>
</rss>
