<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><!-- Generated on Fri, 03 Sep 2010 13:56:11 -0400 --><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
  <channel>
    
    <title>Intel Software Network - Main Articles Feed</title>
    <link>http://software.intel.com/en-us/articles/all</link>
    <description>Feed of all the articles posted on the main page of Intel Software Network.</description>
    <language>en-us</language>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/ISNMain" /><feedburner:info uri="isnmain" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>ISNMain</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
      <title>How to change the Parallel Studio version integrated into Visual Studio</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Problem: &lt;/b&gt;&lt;br /&gt;Only one version of Intel® Parallel Studio can be integrated with any one version of Microsoft Visual Studio* at a time. Therefore, if you have Intel Parallel Studio installed on your system and then install a different version alongside it, the newly installed version will be integrated into Visual Studio in place of the previously installed version. This means you will see the newly installed Parallel Studio toolbars, menu items, and so on. &lt;/p&gt;
&lt;p&gt;You can control which version of Intel Parallel Studio you use with a particular Visual Studio by performing the steps outlined below.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Environment: &lt;/b&gt;&lt;br /&gt;Windows systems with Microsoft Visual Studio 2005, 2008, and/or 2010 installed along with multiple versions of Intel Parallel Studio.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Root Cause: &lt;/b&gt;&lt;br /&gt;Only one version of Parallel Studio can be integrated with a given version of Visual Studio.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Resolution: &lt;/b&gt;&lt;br /&gt;You will need to change the version of Parallel Studio that is integrated with a particular version of Visual Studio. This must be done for each Parallel Studio component that you have installed. &lt;/p&gt;
&lt;ol type="1"&gt;
&lt;li&gt;Begin by removing the integration from the version that is currently integrated. &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For  &lt;i&gt;Intel Parallel Amplifier&lt;/i&gt; or &lt;i&gt;Intel Parallel Inspector&lt;/i&gt;, start by opening the Command Prompt window for the version of Parallel Studio you wish to disable. For example:&lt;/p&gt;
&lt;table cellpadding="0" cellspacing="0" border="1"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td width="498" valign="top"&gt;
&lt;p&gt;To open an Intel Parallel Studio 2011 command prompt in the Visual Studio 2005 mode:&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Start &amp;gt; All Programs &amp;gt; Intel Parallel Studio 2011 &amp;gt; Command Prompt &amp;gt; IA 32 Visual Studio 2005 mode&lt;/b&gt;. &lt;/p&gt;
&lt;p&gt;To open an Intel Parallel Studio command prompt in the Visual Studio 2008 mode:  &lt;/p&gt;
&lt;p&gt;&lt;b&gt;Start &amp;gt; All Programs &amp;gt; Intel Parallel Studio &amp;gt; Command Prompt &amp;gt; IA 32 Visual Studio 2008 mode&lt;/b&gt;. &lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;Now invoke the appropriate script to disable the integration:&lt;/p&gt;
&lt;table width="559" cellpadding="0" cellspacing="0" border="1"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td width="114" valign="top"&gt;
&lt;p&gt;&lt;b&gt;Tool&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p align="center"&gt;&lt;b&gt;Visual Studio 2005&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p align="center"&gt;&lt;b&gt;Visual Studio 2008&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p align="center"&gt;&lt;b&gt;Visual Studio 2010&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="114" valign="top"&gt;
&lt;p&gt;Intel Parallel Amplifier&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg disable vs2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg disable vs2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg disable vs2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="114" valign="top"&gt;
&lt;p&gt;Intel Parallel Amplifier 2011&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg --disable 2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg --disable 2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg --disable 2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="114" valign="top"&gt;
&lt;p&gt;Intel Parallel Inspector&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg disable vs2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg disable vs2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg disable vs2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="114" valign="top"&gt;
&lt;p&gt;Intel Parallel Inspector 2011&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg --disable 2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg --disable 2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg --disable 2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;For &lt;i&gt;Intel Parallel Composer&lt;/i&gt; use the &lt;b&gt;Control Panel &amp;gt; Add/Remove Programs&lt;/b&gt; entry for the version you want to disable:&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;Select &lt;b&gt;Modify&lt;/b&gt; and disable the following options:&lt;/p&gt;
&lt;p&gt;○ Integrated Documentation&lt;br /&gt;○ Intel Parallel Debugger Extension&lt;br /&gt;○ Integration(s) in Microsoft Visual Studio&lt;/p&gt;
&lt;p&gt;Select &lt;b&gt;Next &amp;gt; Modify&lt;/b&gt;.&lt;/p&gt;
&lt;ol start="2" type="1"&gt;
&lt;li&gt;Enable the Visual Studio integration.   &lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For  &lt;i&gt;Intel Parallel Amplifier&lt;/i&gt; or &lt;i&gt;Intel Parallel Inspector&lt;/i&gt;, start by opening the Command Prompt window for the version of Parallel Studio you wish to enable, then invoke the appropriate script to enable the integration:&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;table width="565" cellpadding="0" cellspacing="0" border="1"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td width="138" valign="top"&gt;
&lt;p&gt;&lt;b&gt;Tool&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="130" valign="top"&gt;
&lt;p align="center"&gt;&lt;b&gt;Visual Studio 2005&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p align="center"&gt;&lt;b&gt;Visual Studio 2008&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p align="center"&gt;&lt;b&gt;Visual Studio 2010&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="138" valign="top"&gt;
&lt;p&gt;Intel Parallel Amplifier&lt;/p&gt;
&lt;/td&gt;
&lt;td width="130" valign="top"&gt;
&lt;p&gt;ampl-vsreg integrate vs2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg integrate vs2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg integrate vs2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="138" valign="top"&gt;
&lt;p&gt;Intel Parallel Amplifier 2011&lt;/p&gt;
&lt;/td&gt;
&lt;td width="130" valign="top"&gt;
&lt;p&gt;ampl-vsreg --integrate 2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg --integrate 2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;ampl-vsreg --integrate 2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="138" valign="top"&gt;
&lt;p&gt;Intel Parallel Inspector&lt;/p&gt;
&lt;/td&gt;
&lt;td width="130" valign="top"&gt;
&lt;p&gt;insp-vsreg integrate vs2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg integrate vs2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg integrate vs2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td width="138" valign="top"&gt;
&lt;p&gt;Intel Parallel Inspector 2011&lt;/p&gt;
&lt;/td&gt;
&lt;td width="130" valign="top"&gt;
&lt;p&gt;insp-vsreg --integrate 2005&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg --integrate 2008&lt;/p&gt;
&lt;/td&gt;
&lt;td width="148" valign="top"&gt;
&lt;p&gt;insp-vsreg --integrate 2010&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
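&lt;p&gt;The two command syntaxes in the tables above can be summarized in a small helper. This is only an illustrative sketch: the function name &lt;i&gt;vsreg_command&lt;/i&gt; and its parameters are hypothetical and not part of Parallel Studio; only the command strings it builds are taken from the tables. Each command must still be run from the command prompt of the Parallel Studio version it applies to.&lt;/p&gt;
&lt;pre name="code" class="python"&gt;
```python
def vsreg_command(tool, action, vs_year, studio_2011):
    """Build a vsreg command line as listed in the tables above.

    tool        -- "ampl" (Parallel Amplifier) or "insp" (Parallel Inspector)
    action      -- "disable" or "integrate"
    vs_year     -- Visual Studio version: 2005, 2008 or 2010
    studio_2011 -- True for the 2011 tools, False for the earlier release
    """
    if studio_2011:
        # 2011 syntax: double-dash option, bare year
        return f"{tool}-vsreg --{action} {vs_year}"
    # Earlier syntax: plain keyword, year prefixed with "vs"
    return f"{tool}-vsreg {action} vs{vs_year}"


# Example: move the Visual Studio 2008 integration of Amplifier
# from the earlier release to the 2011 release.
print(vsreg_command("ampl", "disable", 2008, studio_2011=False))
print(vsreg_command("ampl", "integrate", 2008, studio_2011=True))
```
&lt;/pre&gt;
&lt;p&gt;The helper only formats strings; it does not check that the tool is installed or run anything.&lt;/p&gt;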
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;br /&gt;For &lt;i&gt;Intel Parallel Composer&lt;/i&gt; use the &lt;b&gt;Control Panel &amp;gt; Add/Remove Programs&lt;/b&gt; entry for the version you want to enable:&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;Select &lt;b&gt;Modify&lt;/b&gt; and enable the following options:&lt;/p&gt;
&lt;p&gt;○ Integrated Documentation&lt;br /&gt;○ Intel Parallel Debugger Extension&lt;br /&gt;○ Integration(s) in Microsoft Visual Studio&lt;/p&gt;
Select the Visual Studio versions you would like to enable integration with.&lt;br /&gt;Select &lt;b&gt;Next &amp;gt; Modify&lt;/b&gt;&lt;/div&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/gJEpDBcJssM/how-to-change-the-parallel-studio-version-integrated-into-visual-studio</link>
      <pubDate>Fri, 03 Sep 2010 08:09:46 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-to-change-the-parallel-studio-version-integrated-into-visual-studio#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/how-to-change-the-parallel-studio-version-integrated-into-visual-studio</guid>
      <category>Intel® Parallel Composer</category>
      <category>Intel® Parallel Amplifier</category>
      <category>Intel® Parallel Inspector</category>
      <category>Intel® Software Development Products Home</category>
      <category>Intel® Parallel Studio Home</category>
      <category>Intel® Parallel Advisor</category>
      <category>Intel® Parallel Amplifier Knowledge Base</category>
      <category>Intel® Parallel Composer Knowledge Base</category>
      <category>Intel® Parallel Inspector Knowledge Base</category>
      <category>Intel® Software Development Products Registration Center Knowledge Base</category>
      <category>Intel® Parallel Advisor Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/how-to-change-the-parallel-studio-version-integrated-into-visual-studio</feedburner:origLink></item>
    <item>
      <title>Matrix Multiplication, Performance, and Scalability in OpenMP: Student Challenge</title>
      <description>&lt;img src="http://software.intel.com/file/29011" /&gt; &lt;br /&gt; &lt;b&gt;&lt;br /&gt;&lt;/b&gt;
&lt;p&gt;&lt;br /&gt; High Performance Computing: Model Methods and Means, first  semester 2010, Fa.M.A.F., National University of Córdoba, Argentina.&lt;br /&gt; Lic. Nicolás Wolovick&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Motivation and Objectives &lt;/b&gt;&lt;br /&gt; I teach the &lt;a href="http://cs.famaf.unc.edu.ar/~nicolasw/Docencia/HPCMMM/"&gt;HPCMMM10&lt;/a&gt; course at &lt;a href="http://www.famaf.unc.edu.ar/"&gt;FaMAF&lt;/a&gt;, &lt;a href="http://www.unc.edu.ar/"&gt;National University of Córdoba&lt;/a&gt;; the course is a satellite of Dr. Sterling's &lt;a href="http://www.cct.lsu.edu/csc7600/Home.html"&gt;CSC7600&lt;/a&gt; at LSU. Now in our second year, we have gained momentum in HPC education and training, as well as in consulting within our University.&lt;/p&gt;
&lt;p&gt;At the beginning of this year, a simple, widely known and studied problem was posed to the students: matrix multiplication. We held an internal contest to obtain the fastest serial code. Many versions were submitted, and we finally obtained a 20x improvement over the most naïve implementation. The students learned a lot about compiler optimizations and, above all, the effect of the caches on the performance of the code.&lt;/p&gt;
&lt;p&gt;The objective was to extrapolate this exercise to a massive multicore architecture, like the one provided by four Nehalem EX processors, which Intel introduced in the first quarter of this year. Having 32 cores to perform the matrix multiplication under the QuickPath memory communication architecture provided a scenario complex enough to explore different solutions.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Rules&lt;/b&gt; &lt;br /&gt; The students were introduced to the problem and given kickstart code containing a naïve C implementation of the problem using OpenMP, along with a few rules to follow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the 32 physical cores. HyperThreading (HT) was not allowed, since the compute nodes had HT disabled.&lt;/li&gt;
&lt;li&gt;You could use any language accepting OpenMP (C, C++, Fortran), as long as the datatype used to store each matrix element was a 32-bit single-precision IEEE 754 value (float in C).&lt;/li&gt;
&lt;li&gt;The winner would be the student obtaining the best time for OMP_NUM_THREADS=32 with reasonable scalability. If the 32-processor times differed by less than 5%, the winner would be the one obtaining the lowest sum of all points on the curve for 1, 2, 4, 8, 16, and 32 processors.&lt;/li&gt;
&lt;li&gt;The execution should be reproducible in the teacher’s account.  This implies using the compilers installed in the server, not handcrafting  code, and not using stochastic code with high variance.&lt;/li&gt;
&lt;li&gt;The result should be correct for OMP_NUM_THREADS={1,2,4,8,16,32} with  respect to the trivial implementation given by the teacher.&lt;/li&gt;
&lt;li&gt;Submit all jobs using PBS.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;b&gt;Hardware and Software&lt;/b&gt;&lt;br /&gt; The system was a small cluster of a master node and two compute nodes, each featuring a motherboard with four Intel® Xeon® Processors X7560, with 24MB of global L3 cache, 256KB of per-core L2 cache, and 32KB of split program and data L1 cache. Each of the compute nodes had 64GB of RAM.&lt;/p&gt;
&lt;p&gt;The operating system was RHEL* 5.4, with PBS Pro* version 10.2 as the resource manager. Three compiler suites were available: the Intel® Compiler Suite v11.1 and GNU* Compiler Collection 4.1.1 and 4.4.3.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Student Activity&lt;br /&gt; &lt;/b&gt;Twelve prospective students joined the challenge, some from the current year's HPC course and some from the 2009 HPC course.&lt;/p&gt;
&lt;p&gt;The trial period was not optimal with respect to our educational calendar, since many students were engaged in mid-term exams from May 10 to May 28, when the core of the trial took place. This precluded a deep engagement with the challenge, and only half of the students tried to improve on the one-second mark of the trivial kernel.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Kickstart Code&lt;/b&gt;&lt;br /&gt; Students were provided with kickstart code that fixed some initial parameters. The multiplication C = A*B had to be done with 4096*4096 matrices, with A and B initialized with random elements and C zeroed. The elements were IEEE 754 single-precision floating-point numbers, and the walltime measurement included only the multiplication kernel.&lt;/p&gt;
&lt;p&gt;The matrix multiplication kernel  is given below.&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
// matrix multiplication, naïve
#pragma omp parallel for private(j,k) shared(a,b,c)
for (i=0; i&amp;lt;N; i++)
  for (j=0; j&amp;lt;N; j++)
    for (k=0; k&amp;lt;N; k++)
      c[INDEX(i,j)] += a[INDEX(i,k)]*b[INDEX(k,j)];
&lt;/pre&gt;
&lt;p&gt;For the standard implementation,  the 32 core walltimes were:&lt;/p&gt;
&lt;table class="tableFormat1" border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td valign="top" width="200"&gt;
&lt;p&gt;&lt;b&gt;Compiler, options&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="79"&gt;
&lt;p&gt;&lt;b&gt;Walltime&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="200"&gt;
&lt;p&gt;gcc-4.4 -O3&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="79"&gt;
&lt;p&gt;37.0s&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="200"&gt;
&lt;p&gt;icc-11 -O3&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="79"&gt;
&lt;p&gt;5.17s&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="200"&gt;
&lt;p&gt;icc-11     -fast -xSSE4.2 -O3&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="79"&gt;
&lt;p&gt;1.07s&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The gcc times could be greatly improved if we changed the loop order to i,k,j. This way, the compiler was able to vectorize the loop and take advantage of SSE instructions. For the Intel Compiler, simple optimizations such as loop interchange and transposition of the B matrix did not affect the walltime at all.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Results&lt;/b&gt;&lt;br /&gt; Two students obtained outstanding results: Carlos Bederián and Miguel Montes.&lt;/p&gt;
&lt;p&gt;Miguel's approach was simple but effective: he took the base version and replaced the malloc’ed matrices with globally defined ones, avoiding all parameter passing and letting the compiler optimize the a[i][j] references. This way he obtained 0.41s under ICC with flags “-fast -xSSE4.2” using the 32 cores. Miguel also tried the Strassen algorithm, obtaining 0.45s.&lt;/p&gt;
&lt;p&gt;Carlos explored the program space comprehensively, from reproducing both of Miguel’s results to testing a mix of Strassen and direct multiplication. His best implementation achieved 0.387s by using SSE intrinsics, applying ideas from the &lt;a href="http://www.tacc.utexas.edu/tacc-projects/gotoblas2/"&gt;GotoBLAS2&lt;/a&gt; library for a block that fits into L1 cache, adding an external loop that distributes the B matrix among the processors to improve the efficiency of the L3 cache, and setting KMP_AFFINITY=compact. The scaling table for the best code was:&lt;/p&gt;
&lt;table class="tableFormat1" border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Cores&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;1&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;2&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;4&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;8&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;16&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;32&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Walltime&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;10.603s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;5.404s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;2.843s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;1.598s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;0.731s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;i&gt;0.387s&lt;/i&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Rel.Eff.&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;100%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;98%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;95%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;88%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;i&gt;109%&lt;/i&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;94%&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Abs.Eff.&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;100%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;98%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;94%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;82%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;90%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;i&gt;85%&lt;/i&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We note the high efficiency of the solution (85% of linear speedup) and the strange superlinear scaling from 8 to 16 cores.&lt;/p&gt;
&lt;p&gt;He also implemented an external Strassen algorithm with handmade SSE inner blocks. The results are summarized in the following table.&lt;/p&gt;
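&lt;p&gt;The efficiency rows can be approximately reproduced from the walltimes. The sketch below assumes the usual definitions, which the article does not state explicitly: absolute efficiency is T1/(p·Tp), and relative efficiency compares each run with the previous core count. The function name is illustrative, and the computed values may differ from the table by about a percentage point because of rounding.&lt;/p&gt;
&lt;pre name="code" class="python"&gt;
```python
def efficiencies(cores, walltimes):
    """Return (absolute, relative) parallel efficiencies per core count.

    absolute[i] = T(1) / (p_i * T(p_i))
    relative[i] = T(p_prev) / ((p_i / p_prev) * T(p_i)), with relative[0] = 1
    """
    t1 = walltimes[0]
    absolute = [t1 / (p * t) for p, t in zip(cores, walltimes)]
    relative = [1.0]
    for i in range(1, len(cores)):
        scale = cores[i] / cores[i - 1]  # growth in core count, 2 in this table
        relative.append(walltimes[i - 1] / (scale * walltimes[i]))
    return absolute, relative


# Walltimes in seconds, copied from the scaling table of the best code
cores = [1, 2, 4, 8, 16, 32]
walltimes = [10.603, 5.404, 2.843, 1.598, 0.731, 0.387]

absolute, relative = efficiencies(cores, walltimes)
for p, a, r in zip(cores, absolute, relative):
    print(f"{p:2d} cores: absolute {a:.1%}, relative {r:.1%}")
```
&lt;/pre&gt;
&lt;p&gt;A relative efficiency above 100%, as in the step from 8 to 16 cores, means that doubling the cores more than halved the walltime.&lt;/p&gt;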
&lt;table class="tableFormat1" border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Cores&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;1&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;2&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;4&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;8&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;16&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;32&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Walltime&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;7.095s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;4.198s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;2.281s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;1.290s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;0.722s&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;0.426s&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Rel.Eff.&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;100%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;84%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;92%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;88%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;89%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;84%&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;&lt;b&gt;Abs.Eff.&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;100%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;84%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;77%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;68%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;61%&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top" width="89"&gt;
&lt;p&gt;52%&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Although the single-core time is better than that of the previous version, the scalability is poor, reaching an absolute efficiency of only 52% at 32 cores.&lt;/p&gt;
&lt;p&gt;Two students with a physics background tested trivial Fortran 90 implementations, although not all of their times were reported to the teacher.&lt;/p&gt;
&lt;p&gt;One of them used the Intel® Fortran Compiler with flags -xT -O3 -no-prec-div -opt-malloc-options=2 -openmp -Zp4 -align, set the environment variables KMP_LIBRARY='throughput' and KMP_AFFINITY=granularity=fine,compact,1,0, and obtained a respectable 0.484s for 32 cores.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt; The contest was very interesting since the hardware was new; in fact, it had not been released commercially at the start of our research.&lt;/p&gt;
&lt;p&gt;There were two major drawbacks. First, our students' other commitments prevented them from putting more effort into this challenge. Second, matrix multiplication is one of the flagship applications that compiler optimizations target. This was evidenced by the meager speedup of the handmade SSE intrinsics code (0.387s) over trivial code appropriately tuned with compiler options and OpenMP environment variables (0.41s).&lt;/p&gt;
&lt;p&gt;Nevertheless, the handmade approach is not completely useless: the fastest implementation tested was GotoBLAS2, with a walltime of 0.283s, and this library uses assembly code targeted at each architecture. Additionally, Carlos feels he could have obtained a greater speedup had he been given more time; his breakthrough in performance occurred only a few days before the deadline.&lt;/p&gt;
&lt;p&gt;One of the initial motivations was to explore program patterns that take advantage of Intel® QuickPath Interconnect and the three cache levels. Unfortunately, the compiler provided by Intel is so efficient in this respect that it leaves very little room for maneuver for non-experts in microprocessor instruction sets and architecture.&lt;/p&gt;
&lt;p&gt;All in all, the students felt  this was a good opportunity. For the course it represented an important leap in  quality and the possibility to test HPC code on very recent hardware.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Bibliography&lt;br /&gt; &lt;/b&gt;&lt;a href="http://people.redhat.com/drepper/cpumemory.pdf"&gt;&lt;i&gt;What Every Programmer Should Know About Memory&lt;/i&gt;&lt;/a&gt;, Ulrich Drepper, Red Hat, Inc., 2007.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://code.google.com/p/mm-matrixmultiplicationtool/"&gt;&lt;i&gt;A Case Study on High Performance Matrix Multiplication&lt;/i&gt;&lt;/a&gt;, André Moré, 2008.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.tacc.utexas.edu/tacc-projects/gotoblas2/"&gt;&lt;i&gt;GotoBLAS2 library&lt;/i&gt;&lt;/a&gt;, Texas Advanced Computing Center.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1356053"&gt;&lt;i&gt;Anatomy of High-Performance Matrix Multiplication&lt;/i&gt;&lt;/a&gt;, Kazushige Goto, Robert A. van de Geijn, University of Texas at Austin, ACM TOMS, 2008.&lt;/p&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/N38_xgGM3vg/matrix-multiplication-performance-and-scalability-in-openmp-student-challenge</link>
      <pubDate>Fri, 03 Sep 2010 07:44:36 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/matrix-multiplication-performance-and-scalability-in-openmp-student-challenge#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/matrix-multiplication-performance-and-scalability-in-openmp-student-challenge</guid>
      <category>ISC General</category>
      <category>Academic</category>
      <category>Software College Home</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/matrix-multiplication-performance-and-scalability-in-openmp-student-challenge</feedburner:origLink></item>
    <item>
      <title>GAP Message - remark #30536: (LOOP) Add -Qno-alias-args option for better type-based disambiguation analysis ... Remark #30537 Add the &amp;#34;restrict&amp;#34; keyword to each pointer-typed formal parameter</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message 30536&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
Add %s option for better type-based disambiguation analysis by the compiler, if appropriate (the option will apply for the entire compilation). This will improve optimizations such as vectorization for the loop at line %d. [VERIFY] Make sure that the semantics of this option is obeyed for the entire compilation. [ALTERNATIVE] Another way to get the same effect is to add the "restrict" keyword to each pointer-typed formal parameter of the routine "%s". This allows optimizations such as vectorization to be applied to the loop at line %d. [VERIFY] Make sure that semantics of the "restrict" pointer qualifier is satisfied: in the routine, all data accessed through the pointer must not be accessed through any other pointer. &lt;br&gt;
bucket-level 4&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Message 30537&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
Add the "restrict" keyword to each pointer-typed formal parameter of the routine "%s". This allows optimizations such as parallelization and vectorization to be applied to the loop at line %d. [VERIFY] Make sure that semantics of the "restrict" pointer qualifier is satisfied: in the routine, all data accessed through the pointer must not be accessed through any other pointer.&lt;br&gt;
&lt;br&gt;
Applies to C/C++ only (This message will not be emitted for Fortran code).&lt;br&gt;
bucket-level 2&lt;br&gt;
&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
This message advises the user to apply the -Qno-alias-args
(-fargument-noalias on Linux) option for the specified file. This
option will help the compiler to optimize the loop at the specified
line. The user has to verify that there is no argument-aliasing for
routines in this file before applying this option for the current
file. This option is particularly useful for C++ programs since it
enables type-based disambiguation between pointers that are passed in as
arguments (that in turn enables optimizations such as vectorization
and parallelization). Fortran assumes there is no argument-aliasing by
default, so this is not applicable for Fortran programs.
&lt;/p&gt;

&lt;p&gt;
Help description from the Intel Compiler on /Qalias-args option:
&lt;/p&gt;
&lt;pre name="code" class="plain"&gt;
/Qalias-args[-]
          enable(DEFAULT)/disable C/C++ rule that function arguments may be
          aliased; when disabling the rule, the user asserts that this is safe
&lt;/pre&gt;

&lt;p&gt;
Another way to get a similar effect (other than using the
/Qno-alias-args option that affects the entire file) is to add the
restrict qualifier to the pointer arguments to this routine. This
change is more localized since it affects only the routines where the
keyword is applied. The restrict qualifier is part of the C99 standard.
This qualifier can be applied to a data pointer to indicate that,
during the scope of that pointer declaration, all data accessed
through it will be accessed only through that pointer but not through
any other pointer. The 'restrict' keyword thus enables the compiler to
perform certain optimizations based on the premise that a given object
cannot be changed through another pointer. It is the responsibility of
the programmer to ensure that restrict-qualified pointers are used as
they were intended to be used. Otherwise, undefined behavior may
result. The Intel compiler requires the additional option -Qrestrict
(-restrict on Linux) when compiling non-C99 programs.
&lt;/p&gt;
&lt;p&gt;
How the rules can be violated:
&lt;/p&gt;
&lt;p&gt;
An example that demonstrates a violation of -Qno-alias-args:
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
int n, m;
double A[100], B[100];

void f(double *p, double *q, double *r) {
  int i;
  for (i = 0; i &amp;lt; n; i++)
    p[i] = q[i] + r[i];
}

...
f(&amp;A[n], &amp;A[m], &amp;B[0]);  // p and q both point into A
&lt;/pre&gt;
&lt;p&gt;
Since both pointers p and q will be pointing to the same array A,
there may be overlap depending on the values of n and m.
&lt;/p&gt;
&lt;p&gt;
It is also wrong to use the restrict keyword for parameters p and q in the
function f for this test-case.
&lt;/p&gt;
&lt;p&gt;
The user has to analyze all the callers of the function f and make
sure that such overlap does not exist before applying the -Qno-alias-args
option (or the restrict qualifier). Such call-sites may also occur in files
other than the current file that contains the definition of f.
&lt;/p&gt;
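&lt;p&gt;
The overlap described above can be observed directly. The following is a minimal sketch, not part of the compiler documentation: the helper ranges_overlap and the driver are hypothetical names introduced here for illustration, and f takes the trip count as an explicit parameter n. It shows that when p and q overlap, each iteration reads a value written by the previous one, which is exactly the dependence that -Qno-alias-args or restrict would tell the compiler to ignore.
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

double A[100], B[100];

/* Same loop as f above, with the trip count passed explicitly. */
void f(double *p, double *q, double *r, int n) {
  int i;
  for (i = 0; i &amp;lt; n; i++)
    p[i] = q[i] + r[i];
}

/* Hypothetical helper: returns 1 if [a, a+n) and [b, b+n) share an element. */
int ranges_overlap(const double *a, const double *b, int n) {
  uintptr_t lo_a = (uintptr_t)a, hi_a = (uintptr_t)(a + n);
  uintptr_t lo_b = (uintptr_t)b, hi_b = (uintptr_t)(b + n);
  if (lo_a &amp;gt;= hi_b) return 0;
  if (lo_b &amp;gt;= hi_a) return 0;
  return 1;
}

int main(void) {
  int i;
  A[0] = 1.0;
  for (i = 0; i &amp;lt; 100; i++)
    B[i] = 1.0;

  /* p = A + 1 and q = A overlap: iteration i reads what iteration i-1 wrote. */
  printf("overlap(A + 1, A) = %d\n", ranges_overlap(A + 1, A, 4));
  printf("overlap(A, B)     = %d\n", ranges_overlap(A, B, 100));

  f(A + 1, A, B, 4);
  printf("A[4] = %g\n", A[4]);  /* 5 when the loop runs sequentially */
  return 0;
}
&lt;/pre&gt;
&lt;p&gt;
A check of this kind is only a debugging aid for individual call-sites; it does not replace the analysis of all callers described above.
&lt;/p&gt;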


&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
void matrix_mul_matrix(int  N, float * C,  float  *A,  float  *B) {
  int  i,j,k;

  for (i=0; i&amp;lt;N; i++) {
    for (j=0; j&amp;lt;N; j++) {
      C[i*N+j]=0;
      for(k=0;k&amp;lt;N;k++)
      {
        C[i*N+j]+=A[i*N+k] * B[k*N+j];
      }
    }
  }
}
&lt;/pre&gt;
&lt;p&gt;
For this example, the compiler is unable to apply loop optimizations
such as loop-interchange and vectorization at -O2. Adding the -Qguide=4
option produces the following message: 
&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;
t1.c(9): remark #30536: (LOOP) Add -Qno-alias-args option for better type-based disambiguation analysis by the compiler, if appropriate (the option will apply for the entire compilation). This will improve optimizations such as vectorization for the loop at line 9. [VERIFY] Make sure that the semantics of this option is obeyed for the entire compilation. [ALTERNATIVE] Another way to get the same effect is to add the "restrict" keyword to each pointer-typed formal parameter of the routine "matrix_mul_matrix". This allows optimizations such as vectorization to be applied to the loop at line 9. [VERIFY] Make sure that semantics of the "restrict" pointer qualifier is satisfied: in the routine, all data accessed through the pointer must not be accessed through any other pointer.
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;
Compiling this example with the -Qno-alias-args option added (if the user
decides it is safe to do so) enables loop-interchange (for better data
locality) followed by vectorization of the innermost loop. 
&lt;/p&gt;
&lt;p&gt;
An alternative way to enable the loop optimizations is to use the
restrict qualifier (if the user decides it is safe to do so) as follows:&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
void matrix_mul_matrix(int  N, float * restrict C,  float  * restrict A,  float * restrict B) {
  int  i,j,k;

  for (i=0; i&amp;lt;N; i++) {
    for (j=0; j&amp;lt;N; j++) {
      C[i*N+j]=0;
      for(k=0;k&amp;lt;N;k++)
      {
        C[i*N+j]+=A[i*N+k] * B[k*N+j];
      }
    }
  }
}
&lt;/pre&gt;
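&lt;p&gt;
A call site that satisfies the restrict contract might look like the sketch below. The driver and the 2x2 test data are assumptions added for illustration; the point is only that C, A, and B are three distinct arrays, so no element is reachable through more than one of the pointers. The code assumes a C99 compiler (or the -Qrestrict / -restrict option mentioned above).
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stdio.h&amp;gt;

void matrix_mul_matrix(int N, float * restrict C, float * restrict A,
                       float * restrict B) {
  int i, j, k;

  for (i = 0; i &amp;lt; N; i++) {
    for (j = 0; j &amp;lt; N; j++) {
      C[i*N+j] = 0;
      for (k = 0; k &amp;lt; N; k++)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
    }
  }
}

int main(void) {
  float A[4] = {1, 2, 3, 4};
  float B[4] = {0, 1, 1, 0};   /* swaps the two columns of A */
  float C[4];

  /* Safe: three separate arrays, no overlap. */
  matrix_mul_matrix(2, C, A, B);
  printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  /* prints "2 1 4 3" */

  /* NOT safe with restrict: matrix_mul_matrix(2, A, A, B) would make
     the output alias one of the inputs. */
  return 0;
}
&lt;/pre&gt;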


&lt;/div&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/44MgGbpIyBA/gap-message-diagnostic-id-30536-30537</link>
      <pubDate>Fri, 03 Sep 2010 01:27:37 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30536-30537#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30536-30537</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30536-30537</feedburner:origLink></item>
    <item>
      <title>GAP Message - remark #30538: (LOOP) Moving the block of code that consists of a function-call ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Moving the block of code that consists of a function-call (line %d), if-condition (line %d), and an early return (line %d) to outside the loop may enable parallelization of the loop at line %d. [VERIFY] Make sure that the function-call does not rely on any computation inside the loop and that restructuring the code as suggested above retains the original program semantics.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;This message advises the user to move a function-call and an associated return out of a loop (possibly placing them before the loop) to help the compiler parallelize the loop. A function-call that leads to a return inside a loop typically handles some error-condition. If this error-check can be done before the loop starts executing (without changing the program semantics), the compiler may be able to parallelize the loop, thus improving performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;extern int num_nodes;
typedef struct TEST_STRUCT {
    // Coordinates of city1
    float latitude1;
    float longitude1;

    // Coordinates of city2
    float latitude2;
    float longitude2;
} test_struct;

extern int *mark_larger;
extern float *distances, **matrix;
extern test_struct** nodes;
extern test_struct ***files;

extern void init_node(test_struct *node, int i);
extern void process_nodes(void);
float compute_max_distance(void);

extern int check_error_condition(int width);

#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;


void process_nodes(int width)
{
  float const R = 3964.0;
  float temp, lat1, lat2, long1, long2, result, pat2;
  int m, j, temp1 = num_nodes;

      nodes = files[0];
      m = 1;

#pragma loop count min(4)
#pragma parallel
      for (int k=0; k &amp;lt; temp1; k++) {

	  if (check_error_condition(width)) {
	      return;
	  }

	  lat1 = nodes[k]-&amp;gt;latitude1;
	  lat2 = nodes[k]-&amp;gt;latitude2;

	  long1 = nodes[k]-&amp;gt;longitude1;
	  long2 = nodes[k]-&amp;gt;longitude2;

	  // Compute the distance between the two cities
	  temp = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * 
	                                              cos(long1-long2);
	  result = 2.0 * R * atan(sqrt((1.0-temp)/(1.0+temp)));
	  
	  pat2 = 0;
	  for(j=0; j&amp;lt;width; j++) {
	    pat2 += distances[j];
	    matrix[k][j] = distances[k]+j;
	  }
	  // Store the distance computed in the distances array
	  if (result &amp;gt; distances[k]) {
	      distances[k] = result + pat2;
	  }
      }
}
&lt;/pre&gt;
&lt;p&gt;For this example, the compiler is unable to parallelize the loop at line 38. Adding the -Qguide option produces the following message:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;exit_path.cpp(38): remark #30538: (LOOP) Moving the block of code that consists of a function-call (line 40), if-condition (line 40), and an early return (line 41) to outside the loop may enable parallelization of the loop at line 38. [VERIFY] Make sure that the function-call does not rely on any computation inside the loop and that restructuring the code as suggested above retains the original program semantics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If the user decides it is safe to do so, restructuring the code as shown below allows the compiler to parallelize the loop.&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;extern int num_nodes;
typedef struct TEST_STRUCT {
    // Coordinates of city1
    float latitude1;
    float longitude1;

    // Coordinates of city2
    float latitude2;
    float longitude2;
} test_struct;

extern int *mark_larger;
extern float *distances, **matrix;
extern test_struct** nodes;
extern test_struct ***files;

extern void init_node(test_struct *node, int i);
extern void process_nodes(void);
float compute_max_distance(void);

extern int check_error_condition(int width);

#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;


void process_nodes(int width)
{
  float const R = 3964.0;
  float temp, lat1, lat2, long1, long2, result, pat2;
  int m, j, temp1 = num_nodes;

      nodes = files[0];
      m = 1;

      if (check_error_condition(width)) {
	  return;
      }

#pragma loop count min(4)
#pragma parallel
      for (int k=0; k &amp;lt; temp1; k++) {

	  lat1 = nodes[k]-&amp;gt;latitude1;
	  lat2 = nodes[k]-&amp;gt;latitude2;

	  long1 = nodes[k]-&amp;gt;longitude1;
	  long2 = nodes[k]-&amp;gt;longitude2;

	  // Compute the distance between the two cities
	  temp = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * 
	                                              cos(long1-long2);
	  result = 2.0 * R * atan(sqrt((1.0-temp)/(1.0+temp)));
	  
	  pat2 = 0;
	  for(j=0; j&amp;lt;width; j++) {
	    pat2 += distances[j];
	    matrix[k][j] = distances[k]+j;
	  }
	  // Store the distance computed in the distances array
	  if (result &amp;gt; distances[k]) {
	      distances[k] = result + pat2;
	  }
      }
}

&lt;/pre&gt;
&lt;/div&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/2loF8Ig2H0U/gap-message-diagnostic-id-30538</link>
      <pubDate>Fri, 03 Sep 2010 01:26:17 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30538#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30538</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30538</feedburner:origLink></item>
    <item>
      <title>GAP Message - remark #30753: (DTRANS) Convert array of struct ... into a new struct ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
Convert array of struct "%s" into a new struct whose fields are arrays of the corresponding fields in the original struct. This improves performance due to better data locality. [VERIFY] Make sure that the restructured code satisfies the original program semantics.&lt;br&gt;
Applies to C/C++ only &lt;br&gt;

&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
This message advises the user to apply full peeling to a class or structure,
that is, to split the class or structure into separate fields, each stored
in its own array. This is expected to improve performance by better utilizing
the processor cache.
This message is generated only when the entire application is built with IPO.
The transformation requires changing every access to the peeled structure and its fields throughout the application. In some cases, it may not be easy to change the source code to apply full peeling. 
&lt;/p&gt;
&lt;p&gt;
Let us assume we want to apply full peeling for the following structure and its references:
&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
struct S {
  int a;
  double b;
} *sp;

struct S s_arr[N];
...
sp = calloc(N, sizeof(struct S)); // Allocate memory
 
 ...sp[i].a ...              // Access "a" field from "sp"
 ...sp[i].b ...              // Access "b" field from "sp"
 ...s_arr[i].a ...              // Access "a" field from "s_arr"
 ...s_arr[i].b ...              // Access "b" field from "s_arr"
&lt;/pre&gt;
&lt;p&gt;
They can be transformed as follows to apply full peeling to the structure:
&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
struct S {
 int a;
} *sp;

struct new_b {
 double b;
} *sp_b;

struct S s_arr[N];
struct new_b s_arr_b[N];
...
sp = calloc(N, sizeof(struct S));  // Allocate memory for all peeled fields
sp_b = calloc(N, sizeof(struct new_b)); 
...
s_arr[i].a                      // Access "a" field from "s_arr"
s_arr_b[i].b                    // Access "b" field from "s_arr_b"
sp[i].a                         // Access "a" field from original pointer "sp"
sp_b[i].b                       // Access "b" field from new pointer "sp_b"
.. 
&lt;/pre&gt;
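&lt;p&gt;
The locality benefit can be estimated from the distance in memory between consecutive values of the same field. The sketch below is an illustration only: the names S_orig, S_a, and S_b are hypothetical, and the printed sizes depend on the platform and compiler.
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stdio.h&amp;gt;

/* Original layout: "a" and "b" are interleaved in memory. */
struct S_orig {
  int a;
  double b;
};

/* Peeled layout: each field lives in its own array. */
struct S_a { int a; };
struct S_b { double b; };

int main(void) {
  /* Distance in bytes between a field of element i and of element i+1. */
  printf("stride of a, original: %lu\n", (unsigned long)sizeof(struct S_orig));
  printf("stride of a, peeled:   %lu\n", (unsigned long)sizeof(struct S_a));
  printf("stride of b, peeled:   %lu\n", (unsigned long)sizeof(struct S_b));
  return 0;
}
&lt;/pre&gt;
&lt;p&gt;
With a 4-byte int and an 8-byte double, the original stride is typically 12 or 16 bytes, while the peeled stride of "a" is 4 bytes, so a loop that touches only "a" brings far less unused data into the cache.
&lt;/p&gt;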

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;

// peel.c
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;

#define N 100000
int a[N];
double b[N];
struct S3 {
    int *pi;
    double d;
    int j;
};


struct S3 *sp;

void init_hot_s3_i()
{

    int ii = 0;

    for (ii = 0; ii &amp;lt; N; ii++) {
        sp[ii].pi = &amp;a[ii];
    }
}
void init_hot_s3_d()
{
    int ii = 0;

    for (ii = 0; ii &amp;lt; N; ii++) {
        sp[ii].d = b[ii];
    }
}
void init_hot_s3_j()
{
    int ii = 0;

    for (ii = 0; ii &amp;lt; N; ii++) {
        sp[ii].j = 0;
    }
}
void dump_s3()
{
    int ii;

    for (ii = 0; ii &amp;lt; N; ii++) {
        printf("i= %d ", *(sp[ii].pi));
        printf("d= %g \n", sp[ii].d);
        printf("j= %d \n", sp[ii].j);
    }
}

int main(void)
{

   sp = (struct S3 *)calloc(N, sizeof(struct S3));
   init_hot_s3_i();
   init_hot_s3_d();
   init_hot_s3_j();
   dump_s3();
}

&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice on x86 Windows with &lt;br&gt;
'icl -Qguide=4 -Qipo peel.c'&lt;br&gt;
&lt;/p&gt;
&lt;p&gt;&lt;blockquote&gt;
peel.c(7): remark #30753: (DTRANS) Convert array of struct "S3" into a new struct whose fields are arrays of the corresponding fields in the original struct. This improves performance due to better data locality. [VERIFY] Make sure that the restructured code satisfies the original program semantics.
&lt;/blockquote&gt;&lt;/p&gt;
&lt;p&gt;
For the above example, sources can be modified as follows to apply full peeling as suggested.
&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;

#define N 100000
int a[N];
double b[N];
struct S3 {
    int *pi;
};

struct new_d {
    double d;
};

struct new_j {
    int j;
};


struct S3 *sp;
struct new_d *sp_d;
struct new_j *sp_j;

void init_hot_s3_i()
{

    int ii = 0;

    for (ii = 0; ii &amp;lt; N; ii++) {
        sp[ii].pi = &amp;a[ii];
    }
}
void init_hot_s3_d()
{
    int ii = 0;

    for (ii = 0; ii &amp;lt; N; ii++) {
        sp_d[ii].d = b[ii];
    }
}
void init_hot_s3_j()
{
    int ii = 0;

    for (ii = 0; ii &amp;lt; N; ii++) {
        sp_j[ii].j = 0;
    }
}
void dump_s3()
{
    int ii;

    for (ii = 0; ii &amp;lt; N; ii++) {
        printf("i= %d ", *(sp[ii].pi));
        printf("d= %g \n", sp_d[ii].d);
        printf("j= %d \n", sp_j[ii].j);
    }
}

int main(void)
{

   sp = (struct S3 *)calloc(N, sizeof(struct S3));
   sp_d = (struct new_d *)calloc(N, sizeof(struct new_d));
   sp_j = (struct new_j *)calloc(N, sizeof(struct new_j));
   init_hot_s3_i();
   init_hot_s3_d();
   init_hot_s3_j();
   dump_s3();
}

&lt;/pre&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/Oj_2-RxDK8E/gap-message-diagnostic-id-30753</link>
      <pubDate>Fri, 03 Sep 2010 01:25:12 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30753#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30753</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30753</feedburner:origLink></item>
    <item>
      <title>GAP Message -  remark #30754: (DTRANS) Aligning the fields ... in the structure ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
Aligning the fields '%s' in the structure '%s' on an 8-byte boundary may improve performance. Default alignment of double precision floating point data is 4-byte on the Linux IA32 platform. [ALTERNATIVE] Reordering fields of the structure may help to align double precision floating point data on an 8-byte boundary. [VERIFY] Make sure that the restructured code satisfies the original program semantics. [ALTERNATIVE] Another way is to use __attribute__((aligned(8))) for the fields '%s' in the structure '%s' to allocate the fields on an 8-byte boundary. [VERIFY] Note that size of the structure '%s' may change due to the alignment changes. Make sure that the change in the structure '%s' layout satisfies the original program semantics.&lt;br&gt;
Applies to C/C++ only &lt;br&gt;
Applicable only for Linux (not for Composer or Composer XE on Windows) &lt;br&gt;
&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
This message advises the user to reorder the fields of a class or structure 
type so that "double" fields are 8-byte aligned. "double" fields are not
required to be 8-byte aligned on Linux IA32. Aligning them is expected to help 
optimizations such as vectorization generate better code. The user has to verify 
that the application code does not rely on the structure fields being laid 
out in a specific order.
&lt;/p&gt;
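&lt;p&gt;
Whether a field ends up on an 8-byte boundary can be checked with the standard offsetof macro. The following is a minimal sketch under stated assumptions: S_before and S_after are hypothetical names for the layout before and after reordering, and the printed values depend on the target ABI.
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Layout with the int first, followed by three doubles. */
struct S_before {
  int i;
  double d1;
  double d2;
  double d3;
};

/* Reordered layout: doubles first. */
struct S_after {
  double d1;
  double d2;
  double d3;
  int i;
};

int main(void) {
  printf("before: d1 at offset %lu, sizeof %lu\n",
         (unsigned long)offsetof(struct S_before, d1),
         (unsigned long)sizeof(struct S_before));
  printf("after:  d1 at offset %lu, sizeof %lu\n",
         (unsigned long)offsetof(struct S_after, d1),
         (unsigned long)sizeof(struct S_after));
  return 0;
}
&lt;/pre&gt;
&lt;p&gt;
On a target where "double" has 4-byte alignment inside structures, d1 in the first layout is expected at offset 4. The sizeof values are printed because, in an array of structures, every element keeps its "double" fields on an 8-byte boundary only if the structure size is also a multiple of 8.
&lt;/p&gt;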

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//alignment.c
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define N 1000

struct S {
    int i;
    double d1;
    double d2;
    double d3;
};

struct S *sp;


static struct S*
alloc_s(int num)
{
    struct S * temp;

    temp = calloc(num, sizeof(struct S));
    return temp;
}

struct S temp;

static void
swap_s(int i, int j)
{

    memcpy(&amp;temp, sp + i, sizeof(struct S));
    memcpy(sp + i, sp + j, sizeof(struct S));
    memcpy(sp+ j, &amp;temp, sizeof(struct S));
}


static void
init_s(int num)
{
    int ii;

    for (ii = 0; ii &amp;lt; num; ii++) {
        sp[ii].i = ii;
        sp[ii].d1 = (double) ii + 1;
        sp[ii].d2 = (double) ii + 2;
        sp[ii].d3 = (double) ii + 3;
    }
}

int main(void)
{
    int ii;
    double d = 0.0;

    sp = alloc_s(N);

    for(ii = 0; ii &amp;lt; N -1; ii += 2) {
        swap_s(ii, ii+1);
    }

    for (ii = 0; ii &amp;lt; N ; ii++) {
        sp[ii].d1 = sp[ii].d1 * sp[ii].d2 * sp[ii].d3;
        d += sp[ii].d1;
    }

    for (ii = 0; ii &amp;lt; N ; ii++) {
        printf(" %d:  %g   %g  %g  \n", sp[ii].i, sp[ii].d1, sp[ii].d2, sp[ii].d3);
    }
}

&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice on x86 Linux with &lt;br&gt;
'icc -guide alignment.c'
&lt;/p&gt;
&lt;p&gt;&lt;blockquote&gt;
alignment.c(12): remark #30754: (DTRANS) Aligning the fields 'd1, d2, d3' in the structure 'S' on an 8-byte boundary may improve performance. Default alignment of double precision floating point data is 4-byte on the Linux IA32 platform. [ALTERNATIVE] Reordering fields of the structure may help to align double precision floating point data on an 8-byte boundary. [VERIFY] Make sure that the restructured code satisfies the original program semantics. [ALTERNATIVE] Another
 way is to use __attribute__((aligned(8))) for the fields 'd1, d2, d3' in the structure 'S' to allocate the fields on an 8-byte boundary. [VERIFY] Note that size of the structure 'S' may change due to the alignment changes. Make sure that the change in the structure 'S' layout satisfies the original program semantics.
&lt;/blockquote&gt;&lt;/p&gt;
&lt;p&gt;
For the above example, the fields of structure 'S' can be reordered as suggested:
&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
struct S {
    double d1;
    double d2;
    double d3;
    int i;
};

&lt;/pre&gt;         
&lt;p&gt;
Alternatively, '__attribute__((aligned(8)))' can be used to align 'd1, d2, d3' on an 8-byte boundary. One possible way is shown below:
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
struct S {
    int i;
    __attribute__((aligned(8)))  double d1;
    double d2;
    double d3;
};
&lt;/pre&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/Cnr-zMfyn1M/gap-message-diagnostic-id-30754</link>
      <pubDate>Fri, 03 Sep 2010 01:23:54 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30754#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30754</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30754</feedburner:origLink></item>
    <item>
      <title>GAP Message - remark #30756: (DTRANS) Split the structure ... into two parts to improve data locality ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
Split the structure '%s' into two parts to improve data locality. Frequently accessed fields are '%s'; performance may improve by putting these fields into one structure and the remaining fields into another structure. Alternatively, performance may also improve by reordering the fields of the structure. Suggested field order: '%s'. [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that the restructuring is applied to field references in all source files of the application, and that the restructured code satisfies the original program semantics.&lt;br&gt;
Applies to C/C++ only &lt;br&gt;
&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
This message is issued when both the structure splitting and the field reordering
transformations are applicable. Structure splitting is 
expected to lead to higher performance gains if the transformation can be
successfully applied. Field reordering, on the other hand, is
usually simpler to apply; the downside is that the performance gain
may not be as high.&lt;br&gt;

The user has to verify that the structure meets the requirements for 
applying the splitting or reordering transformation. Some of these 
requirements are described in the description of these individual 
transformations.
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//str_split_reord.c
struct str {
    int a1, b1, carr[100], c1, e1;
};

#define N 1000000

struct str *sp;

void allocate_str_mem()
{
    sp = malloc(N * sizeof(struct str));
}

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].c1;
    }
    sp-&amp;gt;carr[0] = ret;
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 100000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].e1;
    }
    return ret;
}

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].b1;
    }
    return ret;
}

&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice with &lt;br&gt;
'icl -Qguide=4 -c str_split_reord.c'
&lt;/p&gt;

&lt;p&gt;&lt;blockquote&gt;
str_split_reord.c(2): remark #30756: (DTRANS) Split the structure 'str' into two parts to improve data locality. Frequently accessed fields are 'a1, b1, c1'; performance may improve by putting these fields into one structure and the remaining fields into another structure. Alternatively, performance may also improve by reordering the fields of the structure. Suggested field order: 'a1, c1, e1, b1, carr'. [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that the restructuring is applied to field references in all source files of the application, and that the restructured code satisfies the original program semantics.
&lt;/blockquote&gt;&lt;/p&gt;

&lt;p&gt;
The above example can be modified as below to split the structure 'str' as suggested. Other references, which are not in the current module, to structure 'str' should also be modified similarly. 
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
struct str_cold {
    int carr[100], e1;
};

struct str {
    int a1, b1, c1; struct str_cold *cold_ptr;
};

#define N 1000000

struct str *sp;

void allocate_str_mem()
{
    struct str *temp;
    struct str_cold *cold_begin;
    int index;

    temp = malloc(N * sizeof(struct str) + N * sizeof(struct str_cold));
    sp = temp;
    cold_begin = (struct str_cold *) (temp + N);
    for(index = 0; index &amp;lt; N; index++) {
       temp[index].cold_ptr = cold_begin + index;
    }
}

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].c1;
    }
    sp-&amp;gt;cold_ptr-&amp;gt;carr[0] = ret;
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 100000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].cold_ptr-&amp;gt;e1;
    }
    return ret;
}

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].b1;
    }
    return ret;
}
&lt;/pre&gt;
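&lt;p&gt;
To see why splitting helps, it is useful to compare how many bytes each loop iteration strides over before and after the change. The sketch below is an illustration only; str_orig and str_hot are hypothetical names used so that both layouts can appear in one file, and the printed sizes depend on the platform.
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stdio.h&amp;gt;

/* Original structure from str_split_reord.c (renamed for this sketch). */
struct str_orig {
  int a1, b1, carr[100], c1, e1;
};

/* Cold part: fields that the hot loops rarely touch. */
struct str_cold {
  int carr[100], e1;
};

/* Hot part: frequently accessed fields plus a link to the cold part. */
struct str_hot {
  int a1, b1, c1;
  struct str_cold *cold_ptr;
};

int main(void) {
  printf("bytes per element, original: %lu\n",
         (unsigned long)sizeof(struct str_orig));
  printf("bytes per element, hot part: %lu\n",
         (unsigned long)sizeof(struct str_hot));
  return 0;
}
&lt;/pre&gt;
&lt;p&gt;
With a 4-byte int, the original element occupies 416 bytes, whereas the hot part is typically 16 to 24 bytes, so the loops over a1, b1, and c1 touch far fewer cache lines.
&lt;/p&gt;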
&lt;p&gt;
For the above example, the only source change required to reorder the fields in structure 'str', as alternatively suggested, is the following:
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//str_split_reord.c
struct str {
    int a1, c1, e1, b1, carr[100];
};

...
...
&lt;/pre&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/MtF5n3N2jiw/gap-message-diagnostic-id-30756</link>
      <pubDate>Fri, 03 Sep 2010 01:22:15 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30756#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30756</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30756</feedburner:origLink></item>
    <item>
      <title>GAP Message - remark #30757/30758: (DTRANS) Remove unused field ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
30757 (Bucket-level 2):&lt;br&gt;
Message emitted only with whole-program recognition&lt;br&gt;
&lt;br&gt;
Remove unused field(s) '%s' from the struct '%s'. [VERIFY] Make sure that the restructured code satisfies the original program semantics.&lt;br&gt;
&lt;br&gt;
30758 (Bucket-level 4):&lt;br&gt;
Message emitted even without whole-program recognition in advanced mode&lt;br&gt;
&lt;br&gt;
Remove unused field(s) "%s" from the struct "%s". [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that there are no references to these fields across the entire application.&lt;br&gt;
&lt;br&gt;
Applies to C/C++ only &lt;br&gt;
&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
This message advises the user that some unused fields were seen in a class or 
structure type. If the unused fields can be removed from the structure 
definition, it will lead to reduced memory usage and better cache utilization
as we no longer fill the cache with unused data. The advice is based on the
analysis of the source code that is seen. The user has to verify that the 
fields that are reported as unused are not accessed elsewhere in the 
application. The user also needs to be careful when removing unused fields if
the code relies on the structure fields to be laid out in a specific order. 
As an example, if the application code uses the address of a field to access 
other fields, it may stop working once unused fields are removed. Note that 
such code is not considered valid in the first place.&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
          struct inventory {
                  int quantity;
                  int unused_fld;
                  int price;
           } s1;

           int *ip = &amp;(s1.quantity);
           printf("Price: %d\n", *((ip + 2)));
&lt;/pre&gt;
&lt;p&gt;
Removing unused_fld will cause the code to stop working, because ip + 2 will no longer point to price. 
&lt;/p&gt;
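&lt;p&gt;
A more robust version of the snippet above accesses the field by name, so that the layout of the structure can change freely. The following is a minimal sketch; the helper get_price and the values 5 and 30 are made up for illustration.
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
#include &amp;lt;stdio.h&amp;gt;

struct inventory {
  int quantity;
  int price;      /* unused_fld has been removed */
};

/* Access the field by name instead of by pointer arithmetic. */
int get_price(struct inventory item) {
  return item.price;
}

int main(void) {
  struct inventory s1 = {5, 30};   /* illustrative values */
  printf("Price: %d\n", get_price(s1));  /* prints "Price: 30" */
  return 0;
}
&lt;/pre&gt;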

&lt;p&gt;&lt;strong&gt;Example 1&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;

//unused_field_1.c
struct str {
    int a1, b1, c1, d1, e1;
};

struct str sp[1000000];

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].b1;
    }
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].c1;
    }
    return ret;
}

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].d1;
    }
    return ret;
}

int main(void)
{
    hot_func1();
    hot_func2();
    hot_func3();
}
&lt;/pre&gt;

&lt;p&gt;
For the above example, the compiler generates the following advice with &lt;br&gt;
'icl -Qguide -c unused_field_1.c'
&lt;/p&gt;
&lt;p&gt;&lt;blockquote&gt;
unused_field_1.c(1): remark #30757: (DTRANS) Remove unused field(s) 'e1' from the struct 'str'. [VERIFY] Make sure that the restructured code satisfies the original program semantics.
&lt;/blockquote&gt;&lt;/p&gt;
&lt;p&gt;
For the above example, if the unused fields can be removed, the only source change needed would be the following:
&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
//unused_field_1.c
struct str {
    int a1, b1, c1, d1;
};

...
...
&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Example 2&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//unused_field_2.c 
struct str {
    int a1, b1, c1, d1, e1;
};

extern struct str sp[];

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].b1;
    }
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].c1;
    }
    return ret;
}

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].d1;
    }
    return ret;
} 
&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice with &lt;br&gt;
'icl -Qguide=4 -c unused_field_2.c'
&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;
unused_field_2.c(1): remark #30758: (DTRANS) Remove unused field(s) "e1" from the struct "str". [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that there are no references to these fields across the entire application.
&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;
For the above example, if the unused fields can be removed, the only source change needed would be the following:
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//unused_field_2.c
struct str {
    int a1, b1, c1, d1;
};

...
...
&lt;/pre&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/XKtX8UOW8ZQ/gap-message-diagnostic-id-30757-30758</link>
      <pubDate>Fri, 03 Sep 2010 01:20:55 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30757-30758#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30757-30758</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30757-30758</feedburner:origLink></item>
    <item>
      <title>GAP Message - remark #30759/30760: (DTRANS) Remove unused field ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
30759 (Bucket-level 2):&lt;br&gt;
Message emitted only with whole-program recognition&lt;br&gt;
&lt;br&gt;
Remove unused field(s) '%s' from the struct '%s'. The fields: '%s' were conservatively assumed by the compiler as referenced since their address is taken. [VERIFY] Make sure that the restructured code satisfies the original program semantics.&lt;br&gt;
&lt;br&gt;
30760 (Bucket-level 4):&lt;br&gt;
Message emitted even without whole-program recognition in advanced mode&lt;br&gt;
&lt;br&gt;
Remove unused field(s) '%s' from the struct '%s'. The fields: '%s' were conservatively assumed by the compiler as referenced since their address is taken. [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that there are no references to these fields across the entire application.&lt;br&gt;
&lt;br&gt;
Note:&lt;br&gt;
The unused-field analysis treats address-taken fields as used. Whenever it
reports unused fields, it also lists the address-taken fields.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 1&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//unused_field_3.c
struct str {
    int a1, b1, c1, d1, e1, f1;
};

struct str sp[1000000];

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].b1;
    }
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].c1;
    }
    return ret;
}

int *gip;

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].d1;
    }

    gip = &amp;(sp-&amp;gt;f1);
    return ret;
}

int main()
{
    hot_func1();
    hot_func2();
    hot_func3();
}
&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice with &lt;br&gt;
'icl -Qguide -c unused_field_3.c'
&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;
unused_field_3.c(2): remark #30759: (DTRANS) Remove unused field(s) 'e1' from the struct 'str'. The fields: 'f1' were conservatively assumed by the compiler as referenced since their address is taken. [VERIFY] Make sure that the restructured code satisfies the original program semantics.
&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;
For the above example, if the unused fields can be removed, the only source change needed would be the following:
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//unused_field_3.c
struct str {
    int a1, b1, c1, d1, f1;
};

...
...
&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Example 2&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//unused_field_4.c
struct str {
    int a1, b1, c1, d1, e1, f1;
};

extern struct str sp[];

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].b1;
    }
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].c1;
    }
    return ret;
}

int *gip;

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].d1;
    }

    gip = &amp;(sp-&amp;gt;f1);
    return ret;
}
&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice with &lt;br&gt;
'icl -Qguide=4 -c unused_field_4.c'. This advice is emitted only in advanced mode because the whole program is not detected.
&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;
unused_field_4.c(2): remark #30760: (DTRANS) Remove unused field(s) 'e1' from the struct 'str'. The fields: 'f1' were conservatively assumed by the compiler as referenced since their address is taken. [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that there are no references to these fields across the entire application.
&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;
For the above example, if the unused fields can be removed, the only source change needed would be the following:
&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//unused_field_4.c
struct str {
    int a1, b1, c1, d1, f1;
};

...
...
&lt;/pre&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/JIi86zmWjHk/gap-message-diagnostic-id-30759-30760</link>
      <pubDate>Fri, 03 Sep 2010 01:19:09 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30759-30760#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30759-30760</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30759-30760</feedburner:origLink></item>
    <item>
      <title>GAP Message -  remark #30755: (DTRANS) Reordering the fields of the structure ...</title>
      <description>&lt;div id="art_pre_template"&gt;
&lt;p&gt;&lt;b&gt;Message&lt;br&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;
Reordering the fields of the structure '%s' will improve data locality. Suggested field order: '%s'. [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that the restructured code satisfies the original program semantics.&lt;br&gt;
Applies to C/C++ only &lt;br&gt;
&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;
This message advises the user to reorder the fields of a class or structure 
type in the specified order. This is expected to improve performance by 
making better use of the processor cache. The user must verify that the 
application code does not rely on the structure fields being laid out in
a specific order. For example, if the application code uses the address 
of one field to access other fields, it may stop working once the fields 
are reordered. Note that such code is not considered valid in the 
first place.
&lt;/p&gt;
        
&lt;pre name="code" class="cpp"&gt;
struct inventory {
    int quantity;
    int price;
} s1;

/* Invalid: assumes price is laid out immediately after quantity. */
int *ip = &amp;(s1.quantity);
printf("Price: %d\n", *(ip + 1));
&lt;/pre&gt;
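&lt;p&gt;
Positional initializers are another place where code can silently depend on field order, since a list such as { 3, 42 } assigns values in declaration order. The following minimal sketch assumes a C99 compilation mode, where designated initializers are available; make_item is a hypothetical helper used only for illustration.
&lt;/p&gt;

```c
/* Sketch: initialize by field name so the result survives field reordering.
   make_item is a hypothetical helper; designated initializers require C99. */
struct inventory {
    int price;      /* fields reordered: price now comes first */
    int quantity;
};

struct inventory make_item(int quantity, int price)
{
    /* A positional { quantity, price } would now store quantity in price. */
    struct inventory s = { .quantity = quantity, .price = price };
    return s;
}
```

&lt;p&gt;
Checking initializers like this is part of the verification step that the [VERIFY] note in the message asks for.
&lt;/p&gt;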


&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;
&lt;pre name="code" class="cpp"&gt;
//field_reord.c
struct str {
    int a1, b1, carr[100], c1, d1, e1;
};

extern struct str sp[];

int hot_func1() {
    int i, ret = 0;

    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].a1;
        ret += sp[i].c1;
    }
    return ret;
}

int hot_func2() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 100000; i++) {
        ret += sp[i].a1;
        ret -= sp[i].e1;
    }
    return ret;
}

int hot_func3() {
    int ret = 0, i;
    for (i = 0; i &amp;lt; 1000000; i++) {
        ret += sp[i].carr[10];
    }
    return ret + sp[0].b1 + sp[0].d1;
}

&lt;/pre&gt;
&lt;p&gt;
For the above example, the compiler generates the following advice with &lt;br&gt;
'icl -Qguide -c field_reord.c'
&lt;/p&gt;
&lt;p&gt;&lt;blockquote&gt;
field_reord.c(2): remark #30755: (DTRANS) Reordering the fields of the structure 'str' will improve data locality. Suggested field order: 'a1, c1, e1, carr, b1, d1'. [VERIFY] The suggestion is based on the field references in the current compilation. Please make sure that the restructured code satisfies the original program semantics.
&lt;/blockquote&gt;&lt;/p&gt;
&lt;p&gt;
For the above example, the only changes in field_reord.c to reorder fields of the structure 'str' as advised are the following:
&lt;/p&gt;

&lt;pre name="code" class="cpp"&gt;
//field_reord.c
struct str {
    int a1, c1, e1, carr[100], b1, d1;
};
...
...
&lt;/pre&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/KMfwx58gZJI/gap-message-diagnostic-id-30755</link>
      <pubDate>Fri, 03 Sep 2010 01:17:44 -0400</pubDate>
      <comments>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30755#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30755</guid>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/gap-message-diagnostic-id-30755</feedburner:origLink></item>
  </channel>
</rss>
