<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><!-- Generated on Sat, 11 Jul 2009 16:46:15 -0700 --><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
  <channel>
    
    <title>Intel Threading for Multi-Core Community</title>
    <link>http://software.intel.com/en-us/articles/multi-core/all</link>
    <description />
    <language>en-us</language>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/ISNMulticore" type="application/rss+xml" /><feedburner:emailServiceId>ISNMulticore</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
      <title>Optimizing Software Applications for NUMA</title>
      <description>&lt;h1 class="sectionHeading"&gt;Download Article&lt;/h1&gt;
&lt;br /&gt; Download &lt;a href="http://software.intel.com/file/21113"&gt;Optimizing Software Applications for NUMA&lt;/a&gt; [PDF 83KB]&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;Introduction&lt;/h1&gt;
&lt;br /&gt; In this brief technical paper, we provide an overview of the NUMA shared memory architecture and describe various techniques for optimizing application memory performance within a NUMA-based system.  In particular, we discuss the role of processor affinity, memory allocation using implicit operating system policies, and the use of system APIs for assigning and migrating memory pages using explicit directives.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;The Basics of NUMA&lt;/h1&gt;
&lt;br /&gt; &lt;b&gt;NUMA&lt;/b&gt;, or &lt;b&gt;Non-Uniform Memory Access&lt;/b&gt;, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system.  Perhaps the best way to understand NUMA is to compare it with its cousin &lt;b&gt;UMA&lt;/b&gt;, or &lt;b&gt;Uniform Memory Access&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt; In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/21114" /&gt;&lt;br /&gt;&lt;br /&gt; UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory.  That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.&lt;br /&gt;&lt;br /&gt; In the NUMA shared memory architecture, each processor has its own &lt;i&gt;local &lt;/i&gt;memory module that it can access directly and with a distinctive performance advantage.  At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/21115" /&gt;&lt;br /&gt;&lt;br /&gt; What gives NUMA its name is that memory access time varies with the location of the data to be accessed.  If data resides in local memory, access is fast.  If data resides in remote memory, access is slower.  
The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve &lt;i&gt;average case access&lt;/i&gt; time through the introduction of fast, local memory.&lt;br /&gt;&lt;br /&gt; Modern multiprocessor systems mix these basic architectures as seen in the following diagram:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/21116" /&gt;&lt;br /&gt;&lt;br /&gt; In this complex hierarchical scheme, processors are grouped by their physical location on one or the other multi-core CPU package or "node".  Processors within a node share access to memory modules as per the UMA shared memory architecture.  At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance as per the NUMA shared memory architecture.&lt;br /&gt;&lt;br /&gt; Server platforms like Intel® Xeon® using the Intel® Core i7 processors provide an example of this complex memory architecture, and for this reason our discussion will center on it henceforth.  Note that such platforms employ a fast interconnect technology known as Intel® QuickPath Interconnect (QPI) to mitigate (but not eliminate) the problem of slower remote memory performance.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;NUMA Advantages and Risks&lt;/h1&gt;
&lt;br /&gt; The advantage of the NUMA shared memory architecture is its &lt;i&gt;potential &lt;/i&gt;to reduce memory access time in the average case.  By providing each node with its own local memory, memory accesses can take place in parallel and avoid the throughput limitations and contention issues associated with a shared memory bus.  In fact, memory-constrained systems can theoretically improve their performance by a factor of up to the number of nodes on the system.  For example, a memory-constrained dual processor system could conceivably double its performance if processors could access memory in a fully parallelized manner.&lt;br /&gt;&lt;br /&gt; The downside of the NUMA architecture, however, is the cost incurred when data is not local to the processor.  The time required to retrieve data from an adjacent node will be significantly higher than that required to access local memory.  Furthermore, the time required to retrieve data from a non-adjacent node may be even higher, complicating memory performance and generating a hierarchy of access time possibilities.  In general, as the distance from a processor increases, the cost of accessing memory increases.&lt;sup&gt;2&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt; The key issue in determining whether the performance benefits of the NUMA architecture can be realized, then, is &lt;b&gt;data placement&lt;/b&gt;.  The more data can effectively be placed in memory local to the processor that needs it, the more overall access time will benefit from the architecture.  Conversely, the more data fails to be local to the node that will access it, the more memory performance will suffer from the architecture.  For this reason, the NUMA architecture can be said to provide the potential to reduce overall memory access times.  To realize this &lt;i&gt;potential&lt;/i&gt;, strategies are needed to ensure smart data placement.  
An application that effectively manages such placement is one that has been "optimized for NUMA", is "NUMA-aware", or is "NUMA-friendly".&lt;br /&gt;&lt;br /&gt;
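 The dependence of average access time on data placement can be made concrete with a small sketch.  The function below is an illustration only: the name average_access_ns and the latency figures in the note that follows are hypothetical and are not measurements from any particular platform.  It assumes a simple two-level model in which every access is either local or remote.&lt;br /&gt;&lt;br /&gt;

```c
/* Expected time per memory access, in nanoseconds, when local_fraction
   of all accesses (a value between 0.0 and 1.0) are served from local
   memory and the remainder from a remote node. */
double average_access_ns(double local_fraction, double local_ns,
                         double remote_ns)
{
    return local_fraction * local_ns
         + (1.0 - local_fraction) * remote_ns;
}
```

 With hypothetical latencies of 60 ns for local and 100 ns for remote memory, average_access_ns(0.5, 60.0, 100.0) evaluates to 80 ns, while a fully local placement, average_access_ns(1.0, 60.0, 100.0), evaluates to 60 ns.&lt;br /&gt;&lt;br /&gt;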
&lt;h1 class="sectionHeading"&gt;Strategies for NUMA Optimization&lt;/h1&gt;
&lt;br /&gt; Two key notions in managing performance within the NUMA shared memory architecture are &lt;i&gt;processor affinity&lt;/i&gt; and &lt;i&gt;data placement&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Processor Affinity&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; Affinity refers to the persistence of association with a particular resource instance, despite the availability of another instance for the same purpose.  Consider the case of processor affinity.  Today's complex operating systems assign application threads to processor cores using a scheduler.  A scheduler will take into account system state and various policy objectives (e.g., "balance load across cores" or "aggregate threads on a few cores and put remaining cores to sleep") and match application threads to physical cores accordingly.  A given thread will execute on its assigned core for some period of time and then wait as other threads are given the chance to execute.  If another core becomes available, the scheduler may choose to migrate the thread to ensure timely execution and meet its policy objectives.&lt;br /&gt;&lt;br /&gt; Thread migration from one core to another poses a problem for the NUMA shared memory architecture because of the way it disassociates a thread from its local memory allocations.  That is, a thread may allocate memory on node 1 at startup as it runs on a core within the node 1 package.  But when the thread is later migrated to a core on node 2, the data stored earlier becomes remote and memory access time significantly increases.&lt;br /&gt;&lt;br /&gt; Enter processor affinity.  Using a system API, or by modifying an OS data structure (e.g., an affinity mask), a specific core or set of cores can be associated with an application thread.  The scheduler will then observe this affinity in its scheduling decisions for the lifetime of the thread.  For example, a thread may be configured to run only on cores 0 through 3, all of which belong to quad-core CPU package 0.  
Henceforth, the scheduler will choose among these alternatives without migrating the thread to another package.&lt;br /&gt;&lt;br /&gt; Exercising processor affinity ensures that memory allocations remain local to the thread(s) that need them.  Several downsides, however, should be noted.  In general, processor affinity may significantly harm system performance by restricting scheduler options and creating resource contention when better resource management could otherwise have been used.  For example, affinity restrictions may prevent the scheduler from assigning waiting threads to unutilized cores during a particular interval.  Or, low-priority threads may adversely impact high-priority threads due to affinity restrictions that prevent adjustments through the use of additional cores.  Processor affinity restrictions may even hurt the application itself when additional execution time on another node would have more than compensated for a slower memory access time.&lt;br /&gt;&lt;br /&gt; Such downsides imply the need to think carefully about whether processor affinity solutions are right for a particular application and shared system context.  Note, finally, that processor affinity APIs offered by some systems support priority "hints" and affinity "suggestions" to the scheduler in addition to explicit directives.  Such suggestions may help ensure optimal performance in the common case yet avoid constraining scheduling options during periods of high resource contention.&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Data Placement Using Implicit Memory Allocation Policies&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; In the simple case, many operating systems transparently provide support for NUMA-friendly data placement.  
When a single-threaded application allocates memory, the operating system will simply assign memory pages to the physical memory associated with the requesting thread's node (CPU package), thus ensuring that it is local to the thread and access performance is optimal.&lt;br /&gt;&lt;br /&gt; Alternatively, some operating systems will wait for the first memory access before committing to a memory page assignment.&lt;sup&gt;2&lt;/sup&gt;  To understand the advantage here, consider a multi-threaded application with a start-up sequence that includes memory allocations by a main control thread, followed by the creation of various worker threads, followed by a long period of application processing or service.  While it may seem reasonable to place memory pages local to the requesting thread, in fact, they are more effectively placed local to the worker threads that will access the data.  As such, the operating system will observe the first access request and commit page assignments based on the requester's node location.&lt;br /&gt;&lt;br /&gt; These two policies together illustrate the importance of an application programmer being aware of the NUMA context of the program's deployment.  If the page placement policy is based on first access, the programmer can exploit this fact by including a carefully designed data access sequence at startup that will generate "hints" to the operating system on optimal memory placement.  If the page placement policy is based on requester location, the programmer should ensure that memory allocations are made by the thread that will subsequently access the data and not by an initialization or control thread designed to act as a provisioning agent.&lt;br /&gt;&lt;br /&gt; Multiple threads accessing the same data are best co-located on the same node so that the memory allocations of one, placed local to the node, can benefit all.  
This arrangement may, for example, be used by prefetching schemes designed to improve application performance by generating data requests in advance of actual need.  Such threads must generate data placement that is local to the actual consumer threads for the NUMA architecture to provide its characteristic performance speedup.&lt;br /&gt;&lt;br /&gt; It should be noted that when an operating system has fully consumed the physical memory resources of one node, memory requests coming from threads on the same node will typically be fulfilled by sub-optimal allocations made on a remote node.  The implication for memory-hungry applications is to correctly size the memory needs of a particular thread and to ensure local placement with respect to the accessing thread.&lt;br /&gt;&lt;br /&gt; For situations where a large number of threads will randomly share the same pool of data from all nodes, the recommendation is to stripe the data evenly across all nodes.  Doing so spreads the memory access load and avoids bottleneck access patterns on a single node within the system.&lt;sup&gt;3&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Data Placement Using Explicit Memory Allocation Directives&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; Another approach to data placement in NUMA-based systems is to make use of system APIs that explicitly configure the location of memory page allocations.  An example of such APIs is the libnuma library for Linux.&lt;sup&gt;1&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt; Using the API, a programmer may be able to associate virtual memory address ranges with particular nodes, or simply to indicate the desired node within the memory allocation system call itself.  With this capability, an application programmer can ensure the placement of a particular data set regardless of which thread allocates it or which thread accesses it first.  This may be useful, for example, in schemes where complex applications make use of a memory management thread acting on behalf of worker threads.  
Or, it may prove useful for applications that create many short-lived threads, each of which has predictable data requirements.  Prefetching schemes are another area that could benefit considerably from such control.&lt;br /&gt;&lt;br /&gt; The downside of this scheme, of course, is the management burden placed on the application in handling memory allocations and data placement.  Misplaced data may cause performance that is significantly worse than default system behavior.  Explicit memory management also presupposes fine-grained control over processor affinity throughout application use.&lt;br /&gt;&lt;br /&gt; Another capability available to the application programmer through NUMA-based memory management APIs is memory page migration.  In general, migration of memory pages from one node to another is an expensive operation and something to be avoided.  Not only is there the cost of migrating the data, but all associated memory references must be discovered and modified to observe the new mapping.  As the remapping is taking place, pages must be removed from operating system page lists and detached from normal swapping mechanisms.&lt;br /&gt;&lt;br /&gt; Having said this, given an application that is both long-lived and memory intensive, migrating memory pages to re-establish a NUMA-friendly configuration may be worth the price.&lt;sup&gt;3&lt;/sup&gt;  Consider, for example, a long-lived application with various threads that have terminated and new threads that have been created but reside on another node.  Data is now no longer local to the threads that need it and sub-optimal access requests now dominate.  Application-specific knowledge of a thread's lifetime and data needs can be used to determine whether an explicit migration is in order.&lt;br /&gt;&lt;br /&gt; Finally, the API may provide functions for obtaining page residency or for examining memory access behavior under the current configuration.  
Such tools may provide the means to implement a monitoring scheme that makes explicit migration adjustments when memory accesses within the NUMA context fall below a defined threshold.&lt;br /&gt;&lt;br /&gt;
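 The striping recommendation made earlier in this section can be sketched as a round-robin mapping of memory pages to nodes.  The code below is a model of the resulting placement rather than an operating system call, and the names stripe_node and pages_on_node are hypothetical.&lt;br /&gt;&lt;br /&gt;

```c
/* Node that holds a given page when pages are striped round-robin
   across num_nodes nodes (num_nodes must be at least 1). */
unsigned stripe_node(unsigned long page_index, unsigned num_nodes)
{
    return (unsigned)(page_index % num_nodes);
}

/* Number of pages that land on one node when total_pages pages,
   numbered from 0, are striped across num_nodes nodes. */
unsigned long pages_on_node(unsigned long total_pages, unsigned num_nodes,
                            unsigned node)
{
    unsigned long share = total_pages / num_nodes;
    /* the first (total_pages % num_nodes) nodes receive one extra page */
    return share + ((total_pages % num_nodes) > node ? 1UL : 0UL);
}
```

 Striping ten pages across four nodes in this model places three pages on each of nodes 0 and 1 and two pages on each of nodes 2 and 3, so no node serves much more than its share of requests.  On Linux, the libnuma function numa_alloc_interleaved() is one way to request interleaved placement of this kind from the operating system.&lt;br /&gt;&lt;br /&gt;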
&lt;h1 class="sectionHeading"&gt;Summary&lt;/h1&gt;
&lt;br /&gt; &lt;b&gt;NUMA&lt;/b&gt;, or &lt;b&gt;Non-Uniform Memory Access&lt;/b&gt;, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system.  The advantage of the NUMA architecture as a &lt;i&gt;hierarchical &lt;/i&gt;shared memory scheme is its potential to improve &lt;i&gt;average case access&lt;/i&gt; time through the introduction of fast, local memory.  To realize the potential of NUMA systems, however, careful &lt;i&gt;data placement&lt;/i&gt; is needed.  The more data can effectively be placed in memory local to the processor that needs it, the more overall access time will benefit from the architecture.&lt;br /&gt;&lt;br /&gt; In this brief technical paper, we have described various strategies and considerations for ensuring optimal data placement within a NUMA-based system.  In particular, we have discussed the role of processor affinity, memory allocation strategies that use implicit operating system page placement policies, and the use of system APIs for assigning and migrating memory pages using explicit directives.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;References&lt;/h1&gt;
&lt;br /&gt; &lt;ol&gt;
&lt;li&gt;Drepper, Ulrich.  "What Every Programmer Should Know About Memory".  November 2007.&lt;/li&gt;
&lt;li&gt;Intel® 64 and IA-32 Architectures Optimization Reference Manual.  See Section 8.8 on "Affinities and Managing Shared Platform Resources".  March 2009.&lt;/li&gt;
&lt;li&gt;Lameter, Christoph.  "Local and Remote Memory: Memory in a Linux/NUMA System".  June 2006.&lt;/li&gt;
&lt;/ol&gt;
&lt;h1 class="sectionHeading"&gt;Author Bio&lt;/h1&gt;
&lt;br /&gt; David E. Ott is a Senior Software Engineer with Intel's Software Solutions Group.  He joined Intel in 2005 as a middleware systems engineer for the Technology and Manufacturing Group.  Currently, David focuses on power and virtualization aspects of enterprise server platforms.  David holds M.S. and Ph.D. degrees in Computer Science from the University of North Carolina at Chapel Hill.</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/RI_PzOwDNjw/optimizing-software-applications-for-numa</link>
      <pubDate>Thu, 09 Jul 2009 14:28:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa</feedburner:origLink></item>
    <item>
      <title>Intel® Advanced Vector Extensions: Pixel Format Conversions</title>
      <description>&lt;h1 class="sectionHeading"&gt;Download Article&lt;/h1&gt;
&lt;br /&gt; Download &lt;a href="http://software.intel.com/file/21089"&gt;Intel® Advanced Vector Extensions: Pixel Format Conversions&lt;/a&gt; [PDF 1.7MB]&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;Introduction&lt;/h1&gt;
&lt;br /&gt; Intel® Advanced Vector Extensions (Intel® AVX) is a 256-bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE) and is designed for applications that are floating point intensive. Intel® AVX extends all 16 XMM registers to 256 bits (the YMM registers), doubling the register width, which leads to improved performance and power efficiency over 128-bit SIMD instructions. Intel® AVX introduces a distinct destination argument that results in fewer register copies, better register use, smaller code size, and other benefits. Intel® AVX also introduces several new instructions for blending and rearranging data in the YMM registers.&lt;br /&gt;&lt;br /&gt; This document describes techniques to optimize pixel format conversion routines (commonly used in image processing applications) using the new Intel® AVX extensions. The two conversions demonstrated here are RGB to RGBA and RGBA to RGB. Though the R, G, B, and A components can be of different data types in different applications, we discuss only single precision floating point (SP FP) components. The Intel® AVX performance is compared against the scalar version of the conversion routines on the same simulator. The Intel® AVX versions are implemented in compiler intrinsics and the code was compiled using the Intel® C Compiler that supports Intel® AVX intrinsics.&lt;br /&gt;&lt;br /&gt; This paper will describe only the Intel® AVX implementation of the format conversions.&lt;br /&gt;&lt;br /&gt; The RGB-to-RGBA and RGBA-to-RGB conversion algorithms make use of the Intel® AVX instructions VPERMILPS, VPERM2F128, and VBLENDPS to rearrange and mask off data when copying from the source to the destination buffers.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;RGB to RGBA&lt;/h1&gt;
&lt;br /&gt; The destination and source pixel buffers are aligned to 32-byte boundaries. The following figure depicts the arrangement of the source and destination buffers in memory, for n pixels. In this figure R0 is at a lower address than G0, and so on. The Intel® AVX implementation uses aligned loads and stores for better performance, so the conversion routines assume that both the destination and source buffers are aligned on a 32-byte boundary.&lt;br /&gt;&lt;br /&gt;
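 One way to obtain buffers that satisfy this alignment assumption is the C11 aligned_alloc() function. The sketch below is not part of the original routines; the helper names alloc_pixels_aligned32 and is_aligned32 are hypothetical.&lt;br /&gt;&lt;br /&gt;

```c
/* Quoted include form: equivalent to the usual form for system headers
   here, and free of angle brackets. */
#include "stdlib.h"
#include "stdint.h"

/* Allocates room for count floats on a 32-byte boundary.
   Returns NULL on failure; the caller releases the buffer with free(). */
float *alloc_pixels_aligned32(size_t count)
{
    size_t bytes = count * sizeof(float);
    /* aligned_alloc expects a size that is a multiple of the alignment,
       so round the request up to the next multiple of 32 bytes */
    size_t padded = (bytes + 31u) / 32u * 32u;
    return (float *)aligned_alloc(32u, padded);
}

/* Non-zero when p lies on a 32-byte boundary. */
int is_aligned32(const void *p)
{
    return (uintptr_t)p % 32u == 0;
}
```

 For eight RGBA pixels, alloc_pixels_aligned32(32) returns storage for thirty two floats that the aligned load and store intrinsics can address directly. A buffer obtained some other way can be checked with is_aligned32() before it is handed to the conversion routines.&lt;br /&gt;&lt;br /&gt;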
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21080" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 1:&lt;/b&gt; Arrangement of source and destination pixels in memory&lt;br /&gt;&lt;br /&gt; Each YMM register is 256 bits wide, which allows us to load and store eight SP FP values at a time. In each iteration of the loop we load multiple source values, rearrange the data, insert the alpha value (in this example, 1.0), and store the result to the destination address.&lt;br /&gt;&lt;br /&gt; Since the conversion is from a 3-channel pixel to a 4-channel pixel, we could have loaded twelve SP FP values from the source (four RGB pixels) and written sixteen SP FP values (four RGBA pixels) per iteration. Doing so would force us to use unaligned loads, since the next iteration would have to load pixels from an offset of twelve from the source address. There are severe performance penalties when unaligned accesses cross cache-line boundaries. Hence we avoid unaligned loads altogether by unrolling the loop twice to load eight RGB pixels, that is, twenty four SP FP values or exactly three aligned 32-byte loads.&lt;br /&gt;&lt;br /&gt; The algorithm is implemented in four steps, computing two destination pixels at each step. We first load eight single precision FP values starting from the source address using the _mm256_load_ps() aligned load intrinsic. The values are then shuffled to a temporary YMM register using the _mm256_permutevar_ps() intrinsic with a control mask of {0,1,2,0,0,0,1,0} (lowest element first) so that R0, G0, B0, G1, and B1 are copied to their corresponding locations in the destination. Next R1 is broadcast using _mm256_broadcast_ss() to a temporary YMM register and the result is blended using a mask of 16 (00 01 00 00) with the output from the shuffle operation. Finally, the alpha value (1.0) is blended with the result from the previous blend operation using a mask of 136 (10 00 10 00) to produce destination pixels zero and one. The result is written to memory starting at the address of the destination using _mm256_store_ps(). 
The following figure illustrates this step &lt;i&gt;(Step1)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21081" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 2:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step1&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The next eight FP values are loaded from an offset of eight from the source address and shuffled with the eight values previously loaded using the intrinsic _mm256_permute2f128_ps() with a control mask of 33 (00 10 00 01) to produce an intermediate result. This intermediate result is shuffled using the _mm256_permutevar_ps() intrinsic with a control mask of {2,3,0,0,1,2,3,0} (lowest element first), then blended with B2 and the alpha value to produce destination pixels two and three. These steps are illustrated below &lt;i&gt;(Step2)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21082" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 3:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step2&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The next eight FP values are loaded from an offset of sixteen from the start of the source address and shuffled with the eight FP values loaded in Step2 using _mm256_permute2f128_ps() with the same control mask of 33. The resulting values are in turn shuffled again and blended with R5 and the alpha value, producing destination pixels four and five as illustrated below &lt;i&gt;(Step3)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21083" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 4:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step3&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The final step needs no further load: the eight FP values loaded in Step3 from an offset of sixteen already hold the last of the twenty four source values for this iteration. These values are shuffled and blended with B6 and the alpha value to produce destination pixels six and seven. This step is illustrated below &lt;i&gt;(Step4)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21084" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 5:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step4&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The source and destination addresses are incremented by twenty four and thirty two respectively. Steps &lt;i&gt;Step1, Step2, Step3,&lt;/i&gt; and &lt;i&gt;Step4&lt;/i&gt; are repeated for the remainder of pixels.&lt;br /&gt;&lt;br /&gt; The figure below shows the source code that demonstrates the above steps.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// 8 RGB ==&amp;gt; 8 RGBA per iteration&lt;br /&gt;// Register contents are shown as [element 7 ... element 0].&lt;br /&gt;// Assumed to be set up before the loop (not shown in this listing):&lt;br /&gt;//   ctrl, ctrl2 : shuffle controls for _mm256_permutevar_ps()&lt;br /&gt;//   alphaOne    : __m256 holding 1.0 in every element&lt;br /&gt;&lt;br /&gt;// Step1: destination pixels 0 and 1&lt;br /&gt;// [G2 R2 B1 G1 , R1 B0 G0 R0]&lt;br /&gt;__m256 pixel23 = _mm256_load_ps((float *)(srcPix));&lt;br /&gt;&lt;br /&gt;// [*  B1 G1 *  , *  B0 G0 R0], ctrl = [0,1,0,0, 0,2,1,0]&lt;br /&gt;__m256 pixel01 = _mm256_permutevar_ps(pixel23, ctrl);&lt;br /&gt;&lt;br /&gt;// [R1 R1 R1 R1 , R1 R1 R1 R1]&lt;br /&gt;__m256 pixelTemp = _mm256_broadcast_ss((float *)(srcPix+3));&lt;br /&gt;&lt;br /&gt;// [*  B1 G1 R1 , *  B0 G0 R0], mask = 00 01 00 00&lt;br /&gt;pixel01 = _mm256_blend_ps(pixel01, pixelTemp, 16);&lt;br /&gt;&lt;br /&gt;// [1. B1 G1 R1 , 1. B0 G0 R0], mask = 10 00 10 00&lt;br /&gt;pixel01 = _mm256_blend_ps(pixel01, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix), pixel01);&lt;br /&gt;&lt;br /&gt;// Step2: destination pixels 2 and 3&lt;br /&gt;// [R5 B4 G4 R4 , B3 G3 R3 B2]&lt;br /&gt;__m256 pixel45 = _mm256_load_ps((float *)(srcPix+8));&lt;br /&gt;&lt;br /&gt;// [B3 G3 R3 B2 , G2 R2 B1 G1], mask = 00 10 00 01&lt;br /&gt;pixel23 = _mm256_permute2f128_ps(pixel23, pixel45, 33);&lt;br /&gt;&lt;br /&gt;// [*  B3 G3 R3 , *  *  G2 R2], ctrl2 = [0,3,2,1, 0,0,3,2]&lt;br /&gt;pixel23 = _mm256_permutevar_ps(pixel23, ctrl2);&lt;br /&gt;&lt;br /&gt;// [B2 B2 B2 B2 , B2 B2 B2 B2]&lt;br /&gt;pixelTemp = _mm256_broadcast_ss((float *)(srcPix+8));&lt;br /&gt;&lt;br /&gt;// [*  B3 G3 R3 , *  B2 G2 R2], mask = 00 00 01 00&lt;br /&gt;pixel23 = _mm256_blend_ps(pixel23, pixelTemp, 4);&lt;br /&gt;pixel23 = _mm256_blend_ps(pixel23, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix+8), pixel23);&lt;br /&gt;&lt;br /&gt;// Step3: destination pixels 4 and 5&lt;br /&gt;// [B7 G7 R7 B6 , G6 R6 B5 G5]&lt;br /&gt;__m256 pixel67 = _mm256_load_ps((float *)(srcPix+16));&lt;br /&gt;&lt;br /&gt;// [G6 R6 B5 G5 , R5 B4 G4 R4], mask = 00 10 00 01&lt;br /&gt;pixel45 = _mm256_permute2f128_ps(pixel45, pixel67, 33);&lt;br /&gt;&lt;br /&gt;// [*  B5 G5 *  , *  B4 G4 R4]&lt;br /&gt;pixel45 = _mm256_permutevar_ps(pixel45, ctrl);&lt;br /&gt;&lt;br /&gt;// [R5 R5 R5 R5 , R5 R5 R5 R5]&lt;br /&gt;pixelTemp = _mm256_broadcast_ss((float *)(srcPix+15));&lt;br /&gt;&lt;br /&gt;// [*  B5 G5 R5 , *  B4 G4 R4], mask = 00 01 00 00&lt;br /&gt;pixel45 = _mm256_blend_ps(pixel45, pixelTemp, 16);&lt;br /&gt;pixel45 = _mm256_blend_ps(pixel45, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix+16), pixel45);&lt;br /&gt;&lt;br /&gt;// Step4: destination pixels 6 and 7 (no new load)&lt;br /&gt;// [*  B7 G7 R7 , *  *  G6 R6]&lt;br /&gt;pixel67 = _mm256_permutevar_ps(pixel67, ctrl2);&lt;br /&gt;&lt;br /&gt;// [B6 B6 B6 B6 , B6 B6 B6 B6]&lt;br /&gt;pixelTemp = _mm256_broadcast_ss((float *)(srcPix+20));&lt;br /&gt;&lt;br /&gt;// [*  B7 G7 R7 , *  B6 G6 R6], mask = 00 00 01 00&lt;br /&gt;pixel67 = _mm256_blend_ps(pixel67, pixelTemp, 4);&lt;br /&gt;pixel67 = _mm256_blend_ps(pixel67, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix+24), pixel67);&lt;br /&gt;&lt;/pre&gt;
&lt;br /&gt; &lt;br /&gt; &lt;b&gt;Figure 6:&lt;/b&gt; Intel® AVX RGB to RGBA conversion code&lt;br /&gt;&lt;br /&gt;
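 The mask values used in Step1 are easier to follow with a scalar model of the two operations involved. The sketch below does not use Intel® AVX and is not the authors' code; permute_in_lane8, blend8, and rgb_to_rgba_step1_model are hypothetical helpers that mimic the element selection of _mm256_permutevar_ps() and _mm256_blend_ps() on plain arrays, with element 0 listed first.&lt;br /&gt;&lt;br /&gt;

```c
/* Scalar model of an in-lane permute: each group of four floats is
   shuffled independently, and ctrl[i] (0..3) picks the source element
   within the same group. src and out must not overlap. */
void permute_in_lane8(const float *src, const int *ctrl, float *out)
{
    int i;
    for (i = 0; i != 8; ++i)
        out[i] = src[(i / 4) * 4 + ctrl[i] % 4];
}

/* Scalar model of an 8-element blend: element i is taken from b when
   bit i of mask is set, otherwise from a. out may be the same as a. */
void blend8(const float *a, const float *b, unsigned mask, float *out)
{
    int i;
    for (i = 0; i != 8; ++i)
        out[i] = ((mask >> i) % 2u) ? b[i] : a[i];
}

/* Step1: eight source floats in, destination pixels zero and one out. */
void rgb_to_rgba_step1_model(const float *src, float *dst)
{
    /* element 0 first, the reverse of the order used in the comments
       of the listing above */
    static const int ctrl[8] = {0, 1, 2, 0, 0, 0, 1, 0};
    float r1[8], alpha[8];
    int i;
    for (i = 0; i != 8; ++i) {
        r1[i] = src[3];      /* broadcast of R1 */
        alpha[i] = 1.0f;     /* alpha value used in this example */
    }
    permute_in_lane8(src, ctrl, dst);
    blend8(dst, r1, 16u, dst);      /* bit 4: slot for R1 */
    blend8(dst, alpha, 136u, dst);  /* bits 3 and 7: alpha slots */
}
```

 Given the source values R0 G0 B0 R1 G1 B1 R2 G2, the model leaves R0 G0 B0 1.0 R1 G1 B1 1.0 in dst, which corresponds to destination pixels zero and one. The same two helpers can be used to trace the masks of the remaining steps.&lt;br /&gt;&lt;br /&gt;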
&lt;h1 class="sectionHeading"&gt;RGBA to RGB&lt;/h1&gt;
&lt;br /&gt; The destination and source pixel buffers are aligned to 32-byte boundaries and the conversion routines expect them to be so. The following figure depicts the arrangement of the source and destination buffers in memory, for &lt;b&gt;n&lt;/b&gt; pixels. In this figure R0 is at a lower address than G0, etc.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21085" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 7:&lt;/b&gt; Arrangement of source and destination pixels in memory&lt;br /&gt;&lt;br /&gt; In each iteration of the loop we load multiple source pixels, rearrange the data, remove the alpha value, and store the result to the destination address.&lt;br /&gt;&lt;br /&gt; Since the conversion is from a 4-channel pixel to a 3-channel pixel, we could load sixteen SP FP values from the source (four RGBA pixels) and write twelve values (four RGB pixels) per iteration. Doing so would force us to use unaligned stores, since the next iteration would have to write its result at an offset of twelve from the destination address. As explained before, we avoid all unaligned accesses by unrolling the loop twice, thus writing twenty four values (eight RGB pixels) at a time.&lt;br /&gt;&lt;br /&gt; We first load sixteen SP FP values starting from the source address by invoking the _mm256_load_ps() aligned load intrinsic twice. The pixels are then rearranged using a combination of the _mm256_permutevar_ps() and _mm256_permute2f128_ps() intrinsics and the intermediate results are blended using an appropriate mask to produce the first set of destination FP values. The following figure illustrates this step &lt;i&gt;(Step1)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21086" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 8:&lt;/b&gt; RGBA to RGB &lt;i&gt;Step1&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The next set of eight FP values is loaded. Using a series of _mm256_permute2f128_ps(), _mm256_permutevar_ps(), _mm256_blend_ps(), and _mm256_broadcast_ss() intrinsics, these values are rearranged and blended with previously loaded values to produce the next set of eight destination values, as illustrated below &lt;i&gt;(Step2)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21087" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 9:&lt;/b&gt; RGBA to RGB &lt;i&gt;Step2&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; In the third step &lt;i&gt;(Step3)&lt;/i&gt;, source RGBA pixels six and seven are loaded from an offset of twenty four from the source address and shuffled and blended with the previously loaded pixels four and five using a series of _mm256_permute2f128_ps(), _mm256_permutevar_ps(), and _mm256_blend_ps() intrinsics to produce the last set of destination values for the current iteration. The following figure depicts this step.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21088" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 10:&lt;/b&gt; RGBA to RGB &lt;i&gt;Step3&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The source and destination addresses are then incremented by thirty two and twenty four SP FP values respectively. &lt;i&gt;Step1, Step2,&lt;/i&gt; and &lt;i&gt;Step3&lt;/i&gt; are repeated for the remaining pixels.&lt;br /&gt;&lt;br /&gt; The figure below shows the source code that demonstrates the above steps.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// 8 RGBA ==&amp;gt; 8 RGB conversion per iteration&lt;br /&gt;&lt;br /&gt;// [A1 B1 G1 R1 , A0 B0 G0 R0]		&lt;br /&gt;__m256 pixel01 = _mm256_load_ps((float *)(srcPix));				&lt;br /&gt;&lt;br /&gt;// [*  *  B1 G1 , *  B0 G0 R0] &lt;br /&gt;__m256 pixelTmp = _mm256_permutevar_ps(pixel01, ctrl1);			&lt;br /&gt;&lt;br /&gt;// [A3 B3 G3 R3 , A2 B2 G2 R2]&lt;br /&gt;__m256 pixel23 = _mm256_load_ps((float *)(srcPix)+8); &lt;br /&gt;&lt;br /&gt;// [A2 B2 G2 R2 , A1 B1 G1 R1], 0x21 = 00 10 00 01&lt;br /&gt;__m256 pixel12 = _mm256_permute2f128_ps(pixel01, pixel23, 0x21); &lt;br /&gt;&lt;br /&gt;// [G2 R2 *  *  , R1 *  *  * ]&lt;br /&gt;pixel12 = _mm256_permutevar_ps(pixel12, ctrl2);					&lt;br /&gt;&lt;br /&gt;// [G2 R2 B1 G1 , R1 B0 G0 R0], 0xC8 = 11 00 10 00&lt;br /&gt;pixel01 = _mm256_blend_ps(pixelTmp, pixel12, 0xC8);&lt;br /&gt;_mm256_store_ps((float *)(dstPix), pixel01); &lt;br /&gt;&lt;br /&gt;// [B2 B2 B2 B2 , B2 B2 B2 B2]&lt;br /&gt;pixelTmp = _mm256_broadcast_ss((float *)(srcPix)+10);		&lt;br /&gt;&lt;br /&gt;// [A5 B5 G5 R5 , A4 B4 G4 R4]&lt;br /&gt;__m256 pixel45 = _mm256_load_ps((float *)(srcPix)+16);		&lt;br /&gt;&lt;br /&gt;// [A4 B4 G4 R4 , A3 B3 G3 R3]&lt;br /&gt;__m256 pixel34 = _mm256_permute2f128_ps(pixel23, pixel45, 0x21); &lt;br /&gt;&lt;br /&gt;// [*  B4 G4 R4 , B3 G3 R3 * ]&lt;br /&gt;pixel23  = _mm256_permutevar_ps(pixel34, ctrl3);				&lt;br /&gt;&lt;br /&gt;// [*  B4 G4 R4 , B3 G3 R3 B2],  0x1 = 00 00 00 01&lt;br /&gt;pixel23  = _mm256_blend_ps(pixel23, pixelTmp, 0x1);			&lt;br /&gt;&lt;br /&gt;// [R5 R5 R5 R5 , R5 R5 R5 R5]&lt;br /&gt;pixelTmp = _mm256_broadcast_ss((float *)(srcPix)+20);			&lt;br /&gt;&lt;br /&gt;// [R5 B4 G4 R4 , B3 G3 R3 B2], 0x80 = 10 00 00 00&lt;br /&gt;pixel23  = _mm256_blend_ps(pixel23, pixelTmp, 0x80);&lt;br /&gt;_mm256_store_ps((float *)(dstPix)+8, pixel23); &lt;br /&gt;&lt;br /&gt;// [A7 B7 G7 R7 , A6 B6 G6 R6]&lt;br /&gt;__m256 pixel67 = _mm256_load_ps((float *)(srcPix)+24);					&lt;br 
/&gt;&lt;br /&gt;// [A6 B6 G6 R6 , A5 B5 G5 R5]&lt;br /&gt;__m256 pixel56 = _mm256_permute2f128_ps(pixel45, pixel67, 0x21); &lt;br /&gt;&lt;br /&gt;// [*  *  *  B6 , *  *  B5 G5]&lt;br /&gt;pixel56 = _mm256_permutevar_ps(pixel56, ctrl4);					&lt;br /&gt;&lt;br /&gt;// [B7 G7 R7 *  , G6 R6 *  * ]&lt;br /&gt;pixel67 = _mm256_permutevar_ps(pixel67, ctrl5);					&lt;br /&gt;&lt;br /&gt;// [B7 G7 R7 B6 , G6 R6 B5 G5], 0xEC = 11 10 11 00&lt;br /&gt;pixel56 = _mm256_blend_ps(pixel56, pixel67, 0xEC);			&lt;br /&gt;_mm256_store_ps((float *)(dstPix)+16, pixel56);&lt;br /&gt;&lt;/pre&gt;
&lt;br /&gt; &lt;br /&gt; &lt;b&gt;Figure 11:&lt;/b&gt; Intel® AVX RGBA to RGB conversion code&lt;br /&gt;&lt;br /&gt;
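&lt;br /&gt; For comparison, the conversion performed by the intrinsics above can also be written as a plain scalar loop. The sketch below is only an illustrative reference and is not the scalar implementation measured in the Results section; the function name rgba_to_rgb_scalar() and its parameters are assumptions introduced here.&lt;br /&gt;&lt;br /&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference (not from the article): copies the R, G and B
// channels of every pixel and drops the alpha channel.
// src holds numPixels * 4 floats (RGBA); dst receives numPixels * 3 floats (RGB).
void rgba_to_rgb_scalar(const float* src, float* dst, std::size_t numPixels)
{
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < numPixels; ++i) {
        dst[3 * i + 0] = src[4 * i + 0]; // R
        dst[3 * i + 1] = src[4 * i + 1]; // G
        dst[3 * i + 2] = src[4 * i + 2]; // B
        // src[4 * i + 3] (alpha) is intentionally skipped
    }
}
```
]]>
&lt;br /&gt; Called with numPixels set to eight, this loop reads thirty two source values and writes twenty four destination values, which matches the amount of data handled by one iteration of the Intel® AVX loop.&lt;br /&gt;&lt;br /&gt;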
&lt;h1 class="sectionHeading"&gt;Results&lt;/h1&gt;
&lt;br /&gt; Two implementations of the conversions - a scalar C++ implementation, and the 256-bit Intel® AVX implementation - were compared for performance on the Intel® AVX simulator. An average of three runs for each implementation is computed and compared for runtime performance. The following table shows the speedup achieved by the 256-bit version.&lt;br /&gt;&lt;br /&gt; 
&lt;table class="tableFormat1" border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conversion&lt;/td&gt;
&lt;td&gt;Speedup vs scalar&lt;/td&gt;
&lt;td&gt;Num. pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RGB to RGBA&lt;/td&gt;
&lt;td&gt;2.73X&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RGBA to RGB&lt;/td&gt;
&lt;td&gt;2.14X&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;References and Resources&lt;/h1&gt;
&lt;br /&gt; 
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://software.intel.com/en-us/avx/"&gt;http://software.intel.com/en-us/avx/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/v6YU4hgk1n4/intel-advanced-vector-extensions-pixel-format-conversions</link>
      <pubDate>Thu, 09 Jul 2009 08:29:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-advanced-vector-extensions-pixel-format-conversions#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/intel-advanced-vector-extensions-pixel-format-conversions</guid>
      <category>Parallel Programming and Multi-Core</category>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/intel-advanced-vector-extensions-pixel-format-conversions</feedburner:origLink></item>
    <item>
      <title>Debugging Threaded Applications</title>
      <description>&lt;b&gt;Parallel programming has a reputation for being difficult due to the complexity imposed by the threading dimension. While this view is accurate in part, diligence in designing thread interactions and knowledge of what to look for when debugging will overcome many of the difficulties.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;By Andrew Binstock&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The goal of parallel programming is to accelerate program performance by running multiple threads of execution in parallel. By definition, this effort requires every thread to have an impact on at least one other thread (generally, at minimum, the thread that launched it). As long as these interactions are orderly, well-designed, and predictable, improved performance ensues. However, if the threads interfere with each other or distort the algorithm incorrectly, defects specific to parallel programming will creep into the program.&lt;br /&gt;&lt;br /&gt;These defects can be difficult to locate and equally hard to resolve once identified. As a result, in parallel programming, it pays to program defensively. That is, know ahead of time where the traps are and make sure you write your code to guard against them. Even then, an unexpected interaction between two threads will cause a bug to show up now and again. In this article, I examine the primary places where threading bugs lurk and explain how to avoid them. I also look at Intel® Thread Checker, which is one of the few threading tools on the market that can automate the discovery of potential trouble spots. I assume that you're already familiar with threading basics and that you have used some form of mutual exclusion (a mutex, semaphore, or critical section) at some point in the recent past. 
My comments apply equally to threads on all platforms: Windows*, Linux*, UNIX*, Java*, etc.&lt;br /&gt;&lt;br /&gt;Threading bugs most often occur where two threads interact or share a common data variable. The most common violations can be grouped into three categories: data races, deadlocks, and a rarely discussed topic, live locks. Scrupulously avoid them and your life as a parallel programmer will be much smoother.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;Data Races&lt;/h1&gt;
Data races occur when two threads share a variable without guarding against simultaneous access and at least one thread is modifying the variable. Suppose, for example, that you've made reservations online for a flight to your first gaming convention. Once the reservation is paid for, you log in to choose your seat. At the same time, someone else on your flight is doing the same thing. You see that there is one aisle seat left and you click on it to reserve it for yourself. The other fellow sees the same seat and clicks on it at the same time you do. Due to poor program design, you both see your screens flash with the information that you now have the seat. So, who really gets the seat?&lt;br /&gt;&lt;br /&gt;This situation is a data race: two threads unaware of each other are both trying to update the same variable at the same time.&lt;br /&gt;&lt;br /&gt;Data races should be suspected anytime you get inconsistent results when repeatedly running the same program against the same data. One solution to a data race is mutual exclusion (a mutex). Using a mutex, one thread puts a lock on the code that updates the variable, so that no other thread can update it until the lock is released. In the airline reservation system, when a click occurs, a lock is placed on the seat-status update code. Now, if someone else clicks on the seat, they will have to wait until the lock is released, at which time the program will discover that the seat is now taken and the user will be told to select another seat.&lt;br /&gt;&lt;br /&gt;Notice the inconsistency of the results: if the two clicks are milliseconds apart and no mutual exclusion is used, the second person will get the seat, as his choice will overwrite his rival's. In the second scenario, where mutual exclusion is employed, the first click locks the other thread out and claims the seat.&lt;br /&gt;&lt;br /&gt;Many data races are straightforward. 
You look where two threads share a data field or resource and you impose sequential access via mutual exclusion. However, some forms of data races can be fairly subtle. Libraries, in particular, can create unexpected race conditions if they have not been thoroughly tested for thread safety. In addition, shared system resources can sometimes be the subject of data races. Consider multiple threads running printf() at the same time. Without mutual exclusion, they can easily step on each other's output and produce a screen or a log filled with nonsensical data.&lt;br /&gt;&lt;br /&gt;So make sure your libraries are truly thread-safe, and be vigilant about putting mutual exclusion around all data items and system resources that are shared between threads.&lt;br /&gt;&lt;br /&gt;
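The seat-reservation scenario can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not code from a real reservation system: the Seat type and the reserve_seat() and count_winners() functions are hypothetical names introduced here, and std::mutex stands in for whatever mutual exclusion primitive your platform provides.&lt;br /&gt;&lt;br /&gt;
<![CDATA[
```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical seat record; all names are illustrative only.
struct Seat {
    std::mutex lock;    // guards 'owner'
    std::string owner;  // empty while the seat is free
};

// Returns true if 'passenger' got the seat, false if it was already taken.
bool reserve_seat(Seat& seat, const std::string& passenger)
{
    assert(!passenger.empty());
    // The check and the update happen under one lock, so they cannot interleave.
    std::lock_guard<std::mutex> guard(seat.lock);
    if (!seat.owner.empty())
        return false;   // someone else clicked first
    seat.owner = passenger;
    return true;
}

// Two passengers click the same seat at the same time; returns how many succeeded.
int count_winners()
{
    Seat aisle;
    bool a = false;
    bool b = false;
    std::thread t1([&] { a = reserve_seat(aisle, "you"); });
    std::thread t2([&] { b = reserve_seat(aisle, "rival"); });
    t1.join();
    t2.join();
    return (a ? 1 : 0) + (b ? 1 : 0);
}
```
]]>
&lt;br /&gt;Whichever thread acquires the lock first claims the seat, and the other thread finds it taken once the lock is released. Without the lock, both threads could pass the availability check before either one records its claim.&lt;br /&gt;&lt;br /&gt;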
&lt;h1 class="sectionHeading"&gt;Deadlock&lt;/h1&gt;
Deadlock is a situation in which two threads are blocked because each of them is waiting on a lock held by the other. For example, you're designing a combat game in which the principal objective of the current stage is mopping up pockets of snipers. You come down a street and, by the rules of the game, you cannot advance because the street has not been cleared of snipers. But by an error in logic, the snipers cannot be cleared until a unit that's stuck behind you clears the street first. You now have deadlock: each thread is waiting on the other (your advance is waiting on the sniper-clearing unit, and that unit is waiting for you to move forward). The result is that your thread waits for an event that will never occur. It appears to you that your thread or the game is hung.&lt;br /&gt;&lt;br /&gt;The symptoms of deadlock often involve two threads suddenly not advancing. If one of those threads is holding a lock that many other threads need, the program may appear to freeze completely.&lt;br /&gt;&lt;br /&gt;There are several ways of avoiding deadlocks. The first is to track every place where a thread uses mutual exclusion and determine whether a lock has any dependence at all on any other thread that could be waiting for it.&lt;br /&gt;&lt;br /&gt;Even this diligent inventory can overlook a subtle deadlock possibility: acquiring locks in the wrong order. Suppose, for example, that your game flashes an announcement on all players' screens whenever someone breaks the all-time record for disabling snipers. The way the logic is currently written, two locks are involved. One lock covers the code that checks your score against the old record and, if you've beaten it, writes your score to the record book. Because of the possibility of a data race (as discussed previously), the code carefully uses mutual exclusion to record the score, so that it can't be updated simultaneously by two different players. 
The next step is to acquire a lock to broadcast the new all-time high to all the players. The code uses a lock to make sure only one announcement can be made at a time. The code maintains the original lock on the score update, so that the point total in the announcement is not suddenly updated by someone else. So, at announcement time, the routine is holding two locks that enable the program to accurately record and broadcast the new all-time high.&lt;br /&gt;&lt;br /&gt;What could possibly go wrong? As long as all threads acquire the locks in the same order, all is well. But a summer intern at your company decides that if a rookie-level player sets the all-time record, this should be broadcast immediately. So he acquires the broadcast lock first, does the broadcast, and only then acquires the lock to record the result. Because the broadcast now happens before the score is protected by the recording lock, this can cause a data race if two players, one of whom is a rookie, break the record at the same time. It can also cause a deadlock. Suppose the rookie procedure grabs the broadcast lock and is waiting on the recording lock, while the other player has the recording lock and is waiting on the broadcast lock. Now both players are waiting on each other in a deadly embrace. So, a crucially important rule is that if multiple locks are required for a transaction, they must always be acquired in the same order.&lt;br /&gt;&lt;br /&gt;One helpful precaution against deadlock is to avoid holding a lock if a second lock cannot be acquired. Almost all thread libraries have some form of call (in Pthreads, for example, it's pthread_mutex_trylock()) that enables a developer to try a lock and, if it can't be acquired, to return right away with an error code rather than wait forever for the lock. This call allows the developer to release a previously held lock and retry the operation later.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;
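Both precautions, a fixed lock order and backing off when the second lock is busy, can be sketched in C++ as follows. This is a hedged illustration rather than the game's actual code: record_lock, broadcast_lock, post_score() and run_two_players() are hypothetical names, and std::try_to_lock is used here as the C++ standard library counterpart of the pthread_mutex_trylock() idea.&lt;br /&gt;&lt;br /&gt;
<![CDATA[
```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared state for the record-score example.
std::mutex record_lock;     // guards record_score
std::mutex broadcast_lock;  // guards the announcement (counted in 'broadcasts')
int record_score = 0;
int broadcasts = 0;

// Every caller takes record_lock first and broadcast_lock second.
// If the second lock is busy, the first is released and the attempt is retried,
// so no thread ever waits on one lock while holding the other.
void post_score(int score)
{
    assert(score >= 0);
    for (;;) {
        std::unique_lock<std::mutex> rec(record_lock);
        if (score <= record_score)
            return;                     // not a new record; rec unlocks on return
        std::unique_lock<std::mutex> ann(broadcast_lock, std::try_to_lock);
        if (!ann.owns_lock()) {
            rec.unlock();               // back off instead of blocking
            std::this_thread::yield();
            continue;
        }
        record_score = score;           // record the new all-time high
        ++broadcasts;                   // stands in for the announcement
        return;                         // both locks are released here
    }
}

// Two players post scores concurrently; returns the final record.
int run_two_players()
{
    std::thread veteran(post_score, 100);
    std::thread rookie(post_score, 200);
    veteran.join();
    rookie.join();
    return record_score;
}
```
]]>
&lt;br /&gt;With a single lock order the back-off is not strictly required to prevent deadlock; it is included to show how a try-lock call lets a thread release what it holds and retry later. The rookie path in the story would be fixed by routing it through the same function instead of taking the broadcast lock first.&lt;br /&gt;&lt;br /&gt;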
&lt;h1 class="sectionHeading"&gt;Live Lock&lt;/h1&gt;
Live lock is a situation involving multiple threads in which no thread can advance even though they are all actively working. The quintessential example is the Dining Philosophers problem. In this classic story, five philosophers are in a room meditating on various ideas. In the room is a table with five plates of noodles and five chopsticks, with one chopstick placed between each pair of plates. To eat, a philosopher must pick up two chopsticks, one on either side of his or her plate. The way the algorithm is written, the philosopher sits down and first attempts to pick up the left chopstick. If he's successful, he attempts to pick up the right chopstick. If that's successful too, he eats. If the right chopstick is unavailable, the philosopher puts down the left chopstick, waits five seconds, and tries again, until both chopsticks are available.&lt;br /&gt;&lt;br /&gt;The live lock problem occurs if all five philosophers decide to eat at the same time. Then, each one picks up the left chopstick successfully, and then notices the right chopstick is not available. In sync, they all lay down the left chopstick, wait five seconds, and repeat the process. Again, they all pick up the left chopstick but discover the right chopstick is not available, so they put down their left chopstick and wait again. As long as they are perfectly in sync, they will repeat the pattern ad infinitum.&lt;br /&gt;&lt;br /&gt;Live locks are nearly always the result of an algorithmic flaw, rather than an implementation error. They are very difficult to identify, because it can be nearly impossible to reproduce them. Consider, for example, that on a quad-core machine, the problem cannot occur, because with only four cores running in parallel, the five philosophers cannot act simultaneously. One will be swapped out briefly to enable the fifth philosopher's action. As a result, one philosopher will always get to eat, and the live lock disappears. 
However, on processors with five or more cores the problem can indeed occur. So, if the QA engineer or the help desk technicians have quad-core machines, they will never be able to reproduce a bug reported by multiple users.&lt;br /&gt;&lt;br /&gt;The best way to prevent live locks is through vigilance and defensive programming. Always consider what would happen if all possible threads performed the same action at the same time.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;
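One common remedy for the lockstep behavior described above is to replace the fixed five-second wait with a randomized one, so that the philosophers drift out of sync. The article does not name this technique; the C++ sketch below is offered as an assumption-based illustration, and the dine() function, kSeats constant and microsecond-scale pauses are hypothetical choices made to keep the example short.&lt;br /&gt;&lt;br /&gt;
<![CDATA[
```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

constexpr int kSeats = 5;  // five philosophers, five chopsticks

// Each philosopher picks up the left chopstick, then tries the right one.
// On failure the left chopstick is put down and the philosopher waits a RANDOM
// time, so the five do not stay in lockstep. Returns the total meals eaten.
int dine(int mealsEach)
{
    assert(mealsEach > 0);
    std::array<std::mutex, kSeats> chopstick;
    std::array<int, kSeats> meals{};          // meals[i] is touched only by thread i
    std::vector<std::thread> table;
    for (int id = 0; id < kSeats; ++id) {
        table.emplace_back([&, id] {
            std::mt19937 rng(id + 1);         // per-thread generator, distinct seeds
            std::uniform_int_distribution<int> pauseUs(1, 200);
            std::mutex& left = chopstick[id];
            std::mutex& right = chopstick[(id + 1) % kSeats];
            while (meals[id] < mealsEach) {
                left.lock();
                if (right.try_lock()) {       // never block while holding 'left'
                    ++meals[id];              // eat
                    right.unlock();
                    left.unlock();
                } else {
                    left.unlock();            // put the left chopstick down
                    std::this_thread::sleep_for(
                        std::chrono::microseconds(pauseUs(rng)));
                }
            }
        });
    }
    for (auto& t : table)
        t.join();
    int total = 0;
    for (int m : meals)
        total += m;
    return total;
}
```
]]>
&lt;br /&gt;Because no philosopher ever blocks while holding a chopstick, the sketch cannot deadlock, and the random pauses make it increasingly unlikely that all five keep retrying at the same instant. Randomization reduces the chance of live lock rather than ruling it out by construction, so the advice about examining the algorithm itself still applies.&lt;br /&gt;&lt;br /&gt;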
&lt;h1 class="sectionHeading"&gt;Locating Problem Code&lt;/h1&gt;
As seen in the previous example, the biggest challenges in debugging parallel code are reproducing and locating the problem. Frequently, though, if you can reproduce the problem, you can locate the error. Most debuggers and IDEs today enable you to break on specific threads, so you can see what each one is doing. Parallel debugging, however, becomes highly complex when dozens of threads are in flight. Not only is the debugging difficult, but the defensive programming work requires patient, careful thought and thorough analysis of design.&lt;br /&gt;&lt;br /&gt;Unfortunately, there are very few tools on the market to help locate sources of threading conflicts and parallel trouble spots. One of the few is Intel® Thread Checker (&lt;a href="http://software.intel.com/en-us/intel-thread-checker/"&gt;http://software.intel.com/en-us/intel-thread-checker/&lt;/a&gt;), which is part of a line of thread-oriented development tools from Intel that includes Intel® Thread Profiler and Intel® Threading Building Blocks (Intel® TBB). In its simplest use case, Intel® Thread Checker runs a threaded program and monitors the various threads, looking for problematic interactions between them. It is capable of identifying race conditions and deadlocks, as well as suspicious threading practices.&lt;br /&gt;&lt;br /&gt;Figure 1 shows a screenshot from Intel® Thread Checker when it discovers a deadlock in a small sample program.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/21033" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Figure 1. Intel® Thread Checker locates a deadlock.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Notice how it is able to trace the deadlock statements to specific lines of code in each of the two threads. This can be an invaluable time saver, especially on massively parallel code such as that in MPGs or other complex games.&lt;br /&gt;&lt;br /&gt;Figure 2 shows the display of a data race found when running a different program. 
The central panel shows three places where data races were detected (marked with red circles). Below them are various points of information (marked with blue) that highlight events of possible interest. If you click on any of these events, you are brought to the line of source code in a panel that looks much like Figure 1. This design makes it easy to identify the unprotected variable and determine which thread is modifying it.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/21034" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Figure 2. Intel® Thread Checker locates a data race.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Intel® Thread Checker can certainly be used as a debugging tool, but this approach tends to underutilize it. A fuller use is to run it during integration tests. The reason for this is that parallel bugs are elusive creatures. A program that has a data race might run fine and give accurate results for weeks before one run suddenly delivers an incorrect result. To avoid this, using Intel® Thread Checker during the testing phase can automate the discovery of latent defects that remain hidden because favorable thread timing has so far masked their presence. When a defect is found, Intel® Thread Checker's debugging facility then provides additional help.&lt;br /&gt;&lt;br /&gt;(It is worth noting that Intel® Thread Profiler is also a very helpful companion tool; its thread performance analysis capabilities can frequently pinpoint places where threads are interacting suboptimally.)&lt;br /&gt;&lt;br /&gt;In conclusion, you want to do all you can to avoid having to debug threaded code. The best approach is to write threading apps defensively, test extensively, and rely on automated defect detection tools such as Intel® Thread Checker. All in all, this path will save you considerable frustration.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;About the Author&lt;/h1&gt;
Andrew Binstock writes technical white papers at Pacific Data Works LLC. He is also a senior contributing editor for &lt;i&gt;InfoWorld&lt;/i&gt; and a columnist for &lt;i&gt;SD Times&lt;/i&gt;. He is the author or co-author of several books on programming, including two available from Intel Press. During his free time, he contributes to the open-source typesetting and page-layout project Platypus (&lt;a target="_blank" href="http://platypus.pz.org"&gt;http://platypus.pz.org&lt;/a&gt;). He can be reached through his blog at &lt;a target="_blank" href="http://binstock.blogspot.com"&gt;http://binstock.blogspot.com&lt;/a&gt;.</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/VMwXpSDCM6s/debugging-threaded-applications</link>
      <pubDate>Tue, 07 Jul 2009 09:12:04 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/debugging-threaded-applications#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/debugging-threaded-applications</guid>
      <category>Parallel Programming and Multi-Core</category>
      <category>ISN General</category>
      <category>Visual Computing</category>
      <category>Game Development</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/debugging-threaded-applications</feedburner:origLink></item>
    <item>
      <title>Intel® Threading Challenge 2009 - First 4 Winners</title>
      <description>&lt;div&gt;Problem #4 - String Matching - Winner Announced:&lt;/div&gt;
&lt;div&gt;Congratulations to "BradleyKuszmaul" who is our 4th problem winner.&lt;/div&gt;
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;Problem #3 - Searching - Winner Announced:&lt;/div&gt;
&lt;div&gt;Congratulations to "denghui0815" who is our 3rd problem winner.&lt;/div&gt;
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;Problem #2 - 3SAT - Winner Announced:&lt;/div&gt;
&lt;div&gt;Congratulations to "haojn" who is our 2nd problem winner.&lt;/div&gt;
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;Problem #1 - Radix Sort - Winner Announced:&lt;/div&gt;
&lt;div&gt;Congratulations to "denghui0815" who is our 1st problem winner.&lt;/div&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/EI1AMcsKYVY/intel-threading-challenge-2009-first-4-winners</link>
      <pubDate>Thu, 02 Jul 2009 12:10:28 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-threading-challenge-2009-first-4-winners#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/intel-threading-challenge-2009-first-4-winners</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/intel-threading-challenge-2009-first-4-winners</feedburner:origLink></item>
    <item>
      <title>Joe Duffy architect of Parallel Extensions to .NET &amp; author of &amp;#34;Concurrent Programming on Windows&amp;#34; on Parallel Programming Talk</title>
      <description>&lt;span style="font-family: verdana, sans-serif; line-height: 16px;"&gt;Aaron &amp;amp; Clay talked with Joe Duffy of Microsoft on the 37th episode of Parallel Programming Talk of Parallel Programming Talk. Joe Duffy is the lead developer and architect for Parallel Extensions to .NET. He is the author of two books: Concurrent Programming on Windows and Professional .NET Framework 2.0.&lt;/span&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMulticore/~4/0ZDIY9tXbc8" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/0ZDIY9tXbc8/joe-duffy-architect-of-parallel-extensions-to-net-author-of-concurrent-programming-on-windows-on-parallel-programming-talk</link>
      <pubDate>Thu, 02 Jul 2009 12:06:39 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/joe-duffy-architect-of-parallel-extensions-to-net-author-of-concurrent-programming-on-windows-on-parallel-programming-talk#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/joe-duffy-architect-of-parallel-extensions-to-net-author-of-concurrent-programming-on-windows-on-parallel-programming-talk</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/joe-duffy-architect-of-parallel-extensions-to-net-author-of-concurrent-programming-on-windows-on-parallel-programming-talk</feedburner:origLink></item>
    <item>
      <title>Major Software Tools Update to Intel Compilers and Libraries</title>
      <description>&lt;span style="font-family: verdana, sans-serif;"&gt;
&lt;p style="font-family: verdana, sans-serif; margin-top: 0px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; line-height: 16px; padding: 0px;"&gt;On June 23, Intel Software Product Division released updates for our C++ and Fortran compilers, Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP) libraries, and Cluster toolkits. Noteworthy additions include performance enhancements, support for Intel® Advanced Vector Extensions (AVX), and the inclusion of some elements that debuted in Intel® Parallel Studio last month.&lt;/p&gt;
&lt;p style="font-family: verdana, sans-serif; margin-top: 0px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; line-height: 16px; padding: 0px;"&gt;Features to note include AVX and AES support in the tools, the adaptation of some of the new features from Parallel Studio to Linux and Mac OS X, and tuning of our performance-leading MPI library.&lt;/p&gt;
&lt;p style="font-family: verdana, sans-serif; margin-top: 0px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; line-height: 16px; padding: 0px;"&gt;The specific new product versions are:&lt;/p&gt;
&lt;p style="font-family: verdana, sans-serif; margin-top: 0px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; line-height: 16px; padding: 0px;"&gt;&lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://software.intel.com/en-us/intel-compilers/"&gt;Intel® Professional Edition Compilers 11.1 (Fortran &amp;amp; C/C++, for Windows, Linux, Mac OS X)&lt;/a&gt;&lt;br style="font-family: verdana, sans-serif; padding: 0px; margin: 0px;" /&gt;&lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://software.intel.com/en-us/intel-ipp/"&gt;Intel® Integrated Performance Primitives (IPP) 6.1 (for Windows, Linux, Mac OS X)&lt;/a&gt;&lt;br style="font-family: verdana, sans-serif; padding: 0px; margin: 0px;" /&gt;&lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://software.intel.com/en-us/intel-mkl/"&gt;Intel® Math Kernel Library (MKL) 10.2 (for Windows, Linux, Mac OS X)&lt;/a&gt;&lt;br style="font-family: verdana, sans-serif; padding: 0px; margin: 0px;" /&gt;&lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://software.intel.com/en-us/intel-cluster-toolkit/"&gt;Intel® Cluster Toolkit, Compiler Edition 3.2.1 (for Windows, Linux)&lt;/a&gt;&lt;br style="font-family: verdana, sans-serif; padding: 0px; margin: 0px;" /&gt;&lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://software.intel.com/en-us/intel-mpi-library/"&gt;Intel® MPI Library 3.2.1 (for Windows, Linux)&lt;/a&gt;&lt;/p&gt;
&lt;/span&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/0yrbRReSCw4/major-software-tools-update-to-intel-compilers-and-libraries</link>
      <pubDate>Thu, 02 Jul 2009 12:05:13 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/major-software-tools-update-to-intel-compilers-and-libraries#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/major-software-tools-update-to-intel-compilers-and-libraries</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/major-software-tools-update-to-intel-compilers-and-libraries</feedburner:origLink></item>
    <item>
      <title>Q&amp;A from Webinar: Find Errors in Windows* C++ Parallel Applications</title>
      <description>&lt;strong&gt;On April 14, 2009 Bernth Andersson presented a technical session and live demo webinar focusing on the Intel(R) Parallel Debugger Extension and it's use for identifying parallel coding issues and run-time problems related to concurrency. Below are questions that came up during this webinar and answers to those questions.  &lt;br /&gt;&lt;br /&gt;Q1. Can you disable a single specific parallel region? &lt;br /&gt;&lt;/strong&gt;
&lt;blockquote&gt;
&lt;p&gt;A: The "Serialize Parallel Region" option of the Intel(R) Parallel Debugger Extensions temporarily sets the OpenMP* thread count to 1, as a call to omp_set_num_threads(1) would, thus forcing single-threaded execution even on a multi-core system. This temporary change applies to the next parallel block or parallel region in your code relative to the current EIP or program counter location.&lt;br /&gt;&lt;br /&gt;Thus, yes - the "Serialize Parallel Region" option can be applied to a specific parallel region of your choice. You can do so by setting a breakpoint just before you enter the parallel region you would like to have executed as a single serial thread and then selecting the serialization option from the Intel(R) Parallel Debugger Extension menu inside Microsoft* Visual Studio. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;Q2. Do the Parallel Debugger Extensions only support OpenMP* or also native Windows* threads or Threading Building Blocks?  &lt;/strong&gt;&lt;br /&gt;
&lt;blockquote&gt;A: Most of the Parallel Debugger Extension features, like Thread Data Sharing Event Detection, Function Reentrancy Detection and Serialize Parallel Region, rely on instrumentation of debug information and on the OpenMP* library. As such, these features are currently only available for OpenMP*-based threading. The one feature that is independent of the threading model used is the set of enhanced and highly configurable SSE register windows.&lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;Q3. Do the Intel(R) Parallel Debugger Extensions use similar instrumentations as the Intel(R) Parallel Inspector?  &lt;br /&gt;&lt;/strong&gt;
&lt;blockquote&gt;A: The debug symbol information instrumentation done by the Intel(R) C++ Compiler when the /debug:parallel option is set along with /Zi is different from the code instrumentations used by the Intel(R) Parallel Inspector for its instrumentation assisted operating mode.&lt;br /&gt;The Intel(R) Parallel Debugger Extensions do not rely on executable code instrumentation, but rather on instrumentation of the symbol information used for debugging. As such, the Parallel Debugger Extensions cannot statically analyze the execution flow, but rely on a detectable event that may be of interest happening in real time during a debug session. At the same time, because only the debug information is instrumented, there should be only minimal performance impact on the executable if run outside the Microsoft* Visual Studio Debugger.&lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;Q4. Is the compiler option /Qopenmp required for use of the Intel(R) Parallel Debugger Extension?&lt;/strong&gt;
&lt;blockquote&gt;A: Yes, all enhanced parallelism features of the Intel(R) Parallel Debugger Extensions rely on OpenMP* based threading, except for the SSE register windows.&lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;Q5.  Do the Intel(R) Parallel Debugger Extensions require the use of the Intel(R) Compiler?&lt;/strong&gt;
&lt;blockquote&gt;A: Yes, the Parallel Debugger Extensions rely on debug info instrumentation added with the Intel(R) Compiler option /debug:parallel in conjunction with /Qopenmp. Therefore the full capabilities of the Intel(R) Parallel Debugger Extensions are only available when used with the Intel(R) Compiler.&lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;Q6. Can I use the Intel(R) Parallel Debugger Extensions to debug parallelism in the Intel(R) Integrated Performance Primitives and to serialize execution in them?&lt;/strong&gt;
&lt;blockquote&gt;A: In principle yes, BUT this would require rebuilding and relinking the Intel(R) IPP with symbol information and /debug:parallel. Since you are most likely taking the primitives from prebuilt libraries that you link into your project, this is not really a supported usage model, although it may work in some cases depending on how the call to the Intel(R) IPP function of your choice is embedded in the rest of your application.&lt;br /&gt;The short answer thus is really no, with some exceptions.&lt;/blockquote&gt;
&lt;strong&gt;&lt;br /&gt;Q7.  Does the OpenMP* library have to be linked in statically? Can 3rd party OpenMP* libraries be used?&lt;/strong&gt;&lt;br /&gt;
&lt;blockquote&gt;A: The OpenMP* library can be linked in statically or dynamically into your application build for the use of the Intel(R) Parallel Debugger Extension. The OpenMP* library used by the Intel(R) C++ Compiler is really a standard OpenMP* library. However, there is the dependency on the debug information instrumentation using /debug:parallel. This instrumentation has only been tested and is only expected to work with the OpenMP* libraries provided with the Intel(R) Compiler.&lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;Q8. What is the main benefit of the SSE Register Window? Does it depend on OpenMP*?&lt;/strong&gt;
&lt;blockquote&gt;A: The SSE Register Window allows you to group the display of the register contents and display said contents in the exact way that you are using them in your parallelized loops, your structured arrays or other parallel structures. By doing this, the highly configurable SSE Register Window provides you with the link between your data as it is used in your application and the way this very same data is actually stored and processed in the SSE registers.&lt;br /&gt;&lt;br /&gt;This can be quite valuable for understanding more complex, heavily parallel multimedia or graphics code, for instance.&lt;br /&gt;&lt;br /&gt;This feature does not rely on any instrumentation or any specific threading implementation. It is independent of OpenMP*. &lt;/blockquote&gt;
&lt;br /&gt;&lt;strong&gt;For additional questions please also refer to the Intel(R) Parallel Debugger Extension article and whitepaper at &lt;a target="_blank" href="http://software.intel.com/en-us/articles/parallel-debugger-extension/" title="Intel(R) Parallel Debugger Extension Article"&gt;http://software.intel.com/en-us/articles/parallel-debugger-extension/&lt;/a&gt;&lt;/strong&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMulticore/~4/7zhTJrhu6vc" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/7zhTJrhu6vc/qa-from-webinar-find-errors-in-parallel-applications</link>
      <pubDate>Tue, 30 Jun 2009 16:56:46 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/qa-from-webinar-find-errors-in-parallel-applications#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/qa-from-webinar-find-errors-in-parallel-applications</guid>
      <category>Parallel Programming and Multi-Core</category>
      <category>Intel® Compilers</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/qa-from-webinar-find-errors-in-parallel-applications</feedburner:origLink></item>
    <item>
      <title>Intel and IBM Collaborate to Boost Performance Lower Power Consumption</title>
      <description>&lt;span class="sectionHeading"&gt;Abstract &lt;/span&gt;&lt;br /&gt; &lt;br /&gt;Timely, trusted information is the currency of success at all levels of business. But increasing data volumes are making it more difficult to deliver that information. And as data volumes go up, so do the costs of data management. For more than a decade, IBM and Intel have collaborated to optimize enterprise solutions: Complete, cost-effective, performance-optimized stacks of IBM® Information Management software running on servers powered by Intel® processors. Our relentless pursuit of performance has led to some impressive results. Compared to what customers could purchase just over a decade ago, they can now benefit from over 600 times the transaction performance from IBM DB2® on Intel-based servers at nearly 99 percent less cost per transaction. &lt;br /&gt;&lt;br /&gt; &lt;span class="sectionHeading"&gt;Download Full PDF &lt;/span&gt;&lt;br /&gt; &lt;a href="http://software.intel.com/file/20463"&gt;&lt;br /&gt;Intel and IBM collaborate to boost performance lower power consumption&lt;/a&gt; [PDF 256kb]&lt;br /&gt;&lt;br /&gt; &lt;span class="sectionHeading"&gt;Customer Testimonials &lt;/span&gt;
&lt;p align="left"&gt;&lt;br /&gt; "Wherescape was founded  about 11 years ago in 1998 entirely for the purpose of helping customers  expedite the process of creating, building, and managing their data warehouses."&lt;br /&gt; &lt;strong&gt;Video Script&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Mark Budzinski&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;VP &amp;amp; GM&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;WhereScape  USA, Inc.&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Data  Warehousing&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;"We had a customer last year  who purchased a bigger server for their data warehouse because they needed more  bandwidth for daily processing and couldn’t put it into their datacenter  because it wouldn’t physically fit into the room."&lt;br /&gt; &lt;strong&gt;Jason Laws&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Chief  Architect&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;WhereScape  Inc.&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;"They say the most expensive  server you’re gonna purchase is the one that causes you to build your next  datacenter.  A lot of people are looking  to take their existing datacenter foot print, try and cost reduce it as much as  possible." &lt;br /&gt; &lt;strong&gt;Shannon  Poulin&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Director,  Xeon Platform Marketing&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Intel Corp.&lt;/strong&gt;&lt;br /&gt; &lt;br /&gt; &lt;span class="style2"&gt;&lt;strong&gt;Intel&lt;/strong&gt;® &lt;strong&gt;Xeon&lt;/strong&gt;® &lt;strong&gt;Processor 5500 Series&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p align="left"&gt;The way that you do that is  you put in these new Xeon 5500 series servers that consume lower energy.&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Energy  Savings&lt;/strong&gt; &lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;or the same energy as the ones they were  replacing but deliver more performance.&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Intelligent  Performance&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;"The intelligent performance  of the chips saves energy by turning off processors when they’re not being  used. Power isn’t being used unnecessarily to run processors that are doing  nothing." &lt;br /&gt; &lt;strong&gt;Jason Laws&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Intel-IBM  Relationship&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;"We have a long history of working together of creating  unique innovation where we can get the best of DB2 working together with the  best of Intel. We have delivered a whole new level of innovation with DB2 9.7  on the Intel Xeon Processor 5500 Series." &lt;br /&gt; &lt;strong&gt;Berni Schiefer&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Distinguished Engineer&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;IBM, Inc.&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;"The Xeon 5500 series, that  provides almost 80% better performance over the previous generation of processors  –producing outstanding results, lowering the cost of operation for our clients."&lt;br /&gt; &lt;strong&gt;Sal Vella&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;VP Development DB2&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;IBM, Inc.&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;"And we’ve been able to do  that at nearly a 50% improvement in performance per watt."&lt;br /&gt; &lt;strong&gt;Shannon Poulin&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;"We’ve shown not only that  you can achieve superb performance results by combining the DB2 product with  the Intel processor but we were able to do that with an absolute minimum amount  of tuning."&lt;br /&gt; &lt;strong&gt;Berni Schiefer&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;"We have a self-tuning memory manager that looks at the  system memory and says, where should I allocate it for best performance  depending on the kind of workload?  So  terrific technology automating things that normally the database administrator  would have to do themselves." &lt;br /&gt; &lt;strong&gt;Sal Vella&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Solution  Migration&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;"This migration is actually  pretty easy. We have a partner in China called NewSoft. The sizing to  build the solution was 25-person years which is pretty significant. We did it  over a weekend with one guy. That is how easy it is to get to DB2 9.7."&lt;br /&gt; &lt;strong&gt;Boris Bialek&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Program  Director, Data Management Solutions&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;IBM, Inc.&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Compression  Technologies&lt;/strong&gt; &lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;"We are industry leading in terms of our compression  technologies and capabilities. In DB2 9.7 we’ve really extended that to a whole  new level.  The rule of thumb for a data  warehouse is that a third of the size of the database is in your data tables;  another third is in your indexes; another third is in temp space. So in 9.7  we’ve addressed the other two-thirds, really, of that triangle. We’ve  introduced compression for indexes and compression for temporary tables and  have gotten fantastic results."&lt;br /&gt; &lt;strong&gt;Richard  Hedges&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Dir. of New  Development – DB2&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;IBM, Inc.&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt; &lt;/strong&gt;&lt;br /&gt; "And when we did some tests with compression we  found the new DB2 9.7 and the new Xeon 5500 series processors were a lot faster  during compression on large tables."&lt;br /&gt; &lt;strong&gt;Jason Laws&lt;/strong&gt;&lt;strong&gt; &lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;&lt;em&gt;&lt;strong&gt;Increased  Performance&lt;/strong&gt; &lt;/em&gt;&lt;/p&gt;
&lt;p align="left"&gt;Every test we did was faster. In fact some of the  tests they ran twice as fast.&lt;/p&gt;
&lt;p align="left"&gt;"It’s a huge cost savings for our clients, in many cases  resulting in millions of dollars of savings."&lt;br /&gt; &lt;strong&gt;Sal Vella&lt;/strong&gt;&lt;/p&gt;
&lt;p align="left"&gt;&lt;span class="style1"&gt;&lt;strong&gt;Intel&lt;/strong&gt;® &lt;strong&gt;Xeon&lt;/strong&gt;® &lt;strong&gt;Processor 5500 Series&lt;br /&gt; IBM DB2&lt;/strong&gt;® &lt;strong&gt;9.7&lt;br /&gt; Energy  Savings&lt;br /&gt; Intelligent  Performance&lt;/strong&gt;&lt;br /&gt; &lt;strong&gt;Compression  Technologies&lt;br /&gt; Ease of  Use&lt;/strong&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p align="left"&gt;"And when you really consider what’s going on now  with Intel’s intelligent performance, you consider that with what IBM is up to with 9.7 DB2, goodness, this is not  business as usual. This is really game changing technology. That when  appropriately applied you can get the performance gains that are truly  remarkable but do it in such a way that you are managing your power  requirements and your other costs as well. "&lt;br /&gt; &lt;strong&gt;Mark Budzinski&lt;/strong&gt;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMulticore/~4/u1sSDDOdKc4" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/u1sSDDOdKc4/intel-and-ibm-collaborate-to-boost-performance-lower-power-consumption</link>
      <pubDate>Thu, 25 Jun 2009 12:55:12 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-and-ibm-collaborate-to-boost-performance-lower-power-consumption#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/intel-and-ibm-collaborate-to-boost-performance-lower-power-consumption</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/intel-and-ibm-collaborate-to-boost-performance-lower-power-consumption</feedburner:origLink></item>
    <item>
      <title>June 30th Parallel Programming Talk - MS Parallel Extensions to .NET and MS Task Parallel Library</title>
      <description>&lt;span style="font-size: small;"&gt;&lt;span style="border-collapse: collapse; font-size: 13px; white-space: pre; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;span style="border-collapse: separate; font-size: medium; white-space: normal; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px;"&gt;
&lt;div style="font-family: Verdana, Arial, Helvetica, sans-serif; padding-top: 0px; padding-right: 15px; padding-bottom: 15px; padding-left: 10px; color: #000000; font-size: 11px; background-image: initial; background-repeat: initial; background-attachment: initial; -webkit-background-clip: initial; -webkit-background-origin: initial; background-color: #ffffff; background-position: initial initial; margin: 8px;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-weight: bold; "&gt;
&lt;h1 style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-size: 12px; "&gt;On the June 30th Parallel Programming Talk Joe Duffy from Microsoft will discuss the MS Parallel Extensions to .NET and MS Task Parallel Library. Joe is a rock star among .Net C# developers. He is the lead developer and architect for Parallel Extensions to .NET. He is the author of two books: Concurrent Programming on Windows and Professional .NET Framework 2.0 We’ll be talking to Joe about his thoughts and experiences with threading applications for the Windows environment, especially with regards to the .NET Framework.&lt;/h1&gt;
&lt;h1 style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font-size: 12px; "&gt;&lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="Intel.com/software/tv"&gt;Tune is LIVE on June 30th at 8:00AM PST&lt;/a&gt;&lt;/h1&gt;
&lt;/span&gt;&lt;/div&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMulticore/~4/eVU2Q--lUM4" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/eVU2Q--lUM4/june-30th-parallel-programming-talk</link>
      <pubDate>Fri, 19 Jun 2009 11:30:44 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/june-30th-parallel-programming-talk#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/june-30th-parallel-programming-talk</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/june-30th-parallel-programming-talk</feedburner:origLink></item>
    <item>
      <title>Research@Intel Day Showcases Parallel Programming</title>
      <description>June 18, 2009 was Research@Intel day. Many great hardware and software ideas were demoed at the event.  I think that you'll be most interested in the Immersive Connected Experience Zone, where our software teams presented demos of two experimental Parallel Programming Tools, both applied to optimizing computer vision. The first was &lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/"&gt;Concurrent Collections&lt;/a&gt; for C++, a new language for describing parallel computations. The second was &lt;a style="font-family: verdana, sans-serif; color: #0860a8; text-decoration: none; padding: 0px; margin: 0px; border: 0px initial initial;" href="http://techresearch.intel.com/articles/Tera-Scale/1514.htm"&gt;Ct Technology&lt;/a&gt;, which extends C/C++ to simplify data parallel programs. &lt;a href="http://software.intel.com/en-us/blogs/2009/06/19/research-at-intel-day-2009-tomorrows-ideas-today/"&gt;Read my blog post to learn more.&lt;/a&gt;
&lt;p&gt; &lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMulticore/~4/6jxjD3ojMMI" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMulticore/~3/6jxjD3ojMMI/researchintel-day-showcases-parallel-programming</link>
      <pubDate>Fri, 19 Jun 2009 11:22:45 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/researchintel-day-showcases-parallel-programming#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/researchintel-day-showcases-parallel-programming</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/researchintel-day-showcases-parallel-programming</feedburner:origLink></item>
  </channel>
</rss>
