<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><!-- Generated on Thu, 09 Jul 2009 23:57:46 -0700 --><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
  <channel>
    
    <title>Intel Software Network - Main Articles Feed</title>
    <link>http://software.intel.com/en-us/articles/all</link>
    <description>Feed of all the articles posted on the main page of Intel Software Network.</description>
    <language>en-us</language>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/ISNMain" type="application/rss+xml" /><feedburner:emailServiceId>ISNMain</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><item>
      <title>Setting H.264 encoding parameters in Intel IPP media processing samples</title>
      <description>&lt;br /&gt;&lt;br /&gt;Intel® IPP media processing samples provide both encoding parameter files and the H264EncoderParams class to allow users to set encoding parameters for H.264 encoders. &lt;br /&gt;&lt;br /&gt;The UMC::H264EncoderParams class has a method "ReadParamFile". Using this function, users can read the H.264 parameters from a configuration text file. Please check \audio-video-codecs\codec\h264_enc\readme.htm to understand each field of the H.264 parameter files. &lt;br /&gt;&lt;br /&gt;Users can also modify the members of the UMC::H264EncoderParams class to change H.264 encoding parameters. To learn about the UMC::H264EncoderParams class members, please check the UMC manual (\audio-video-codecs\doc\umc-manual.pdf), Chapter 4, "Derived Classes", "UMC::H264VideoEncoder" part. &lt;br /&gt;&lt;br /&gt;Depending on the video content, some encoding parameters (e.g., number of B frames, num_ref_frames, subblock split, CABAC setting) will impact the performance and quality at the targeted bit rate. Users need to balance speed and video quality according to their application requirements. &lt;br /&gt;&lt;br /&gt;The following are two example configuration files. One targets encoding performance, and the other targets video quality.&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/qtO9tLT1h2Q" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/qtO9tLT1h2Q/setting-h264-encoding-parameters-in-intel-ipp-media-processing-samples</link>
      <pubDate>Thu, 09 Jul 2009 20:13:29 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/setting-h264-encoding-parameters-in-intel-ipp-media-processing-samples#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/setting-h264-encoding-parameters-in-intel-ipp-media-processing-samples</guid>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/setting-h264-encoding-parameters-in-intel-ipp-media-processing-samples</feedburner:origLink></item>
    <item>
      <title>Teach Parallel! Join the Discussion.</title>
      <description>&lt;br /&gt;&lt;b&gt;Should we Teach Parallelism to undergraduates? Yes? &lt;a href="http://software.intel.com/en-us/blogs/author/paul-steinberg/"&gt;Come to our online discussions to debate the how, when and where.&lt;/a&gt;&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; 
&lt;table border="0" cellpadding="10" cellspacing="2"&gt;
&lt;tbody&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td width="3" bgcolor="#f0f8ff"&gt;&lt;/td&gt;
&lt;td width="194"&gt;&lt;b&gt;&lt;span class="sectionHeading"&gt;Title&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/td&gt;
&lt;td width="616"&gt;&lt;b&gt;Description &lt;br /&gt;&lt;/b&gt;&lt;/td&gt;
&lt;td width="83"&gt;&lt;b&gt;Date&lt;br /&gt;&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td bgcolor="#f0f8ff" height="88"&gt;&lt;img src="http://software.intel.com/file/20312" /&gt;&lt;/td&gt;
&lt;td valign="top"&gt;Passion, Beauty, Joy, Awe and Computer Science.
&lt;h5&gt;Dan Garcia, Lecturer SOE in the Computer Science division of the EECS department at the University of California, Berkeley.&lt;/h5&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;We in academia and industry are at least a generation behind in preparing the next generation of computer scientists and engineers for parallel and many-core computing. &lt;br /&gt;We must fundamentally reevaluate what and how we teach: data structures, algorithms, testing and more all need to be rethought in terms of parallelism. &lt;br /&gt;At the very least, students should take one full quarter or semester of parallelism as undergraduates. Even better, the undergraduate curriculum should be infused with parallelism throughout.&lt;br /&gt; &lt;a href="http://blip.tv/file/2101347/"&gt;See this episode here&lt;/a&gt;&lt;/td&gt;
&lt;td valign="top"&gt;5/12/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f8f8ff"&gt;
&lt;td align="left"&gt;&lt;img src="http://software.intel.com/file/20289" alt="DanErnst.jpg" title="DanErnst.jpg" width="150" height="150" /&gt;&lt;/td&gt;
&lt;td valign="top" align="left"&gt;Preparing Students for Ubiquitous Parallelism.
&lt;h5&gt;Professor Daniel Ernst, University of Wisconsin, Eau Claire.&lt;/h5&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;Professor Ernst has successfully introduced parallelism throughout the undergraduate curriculum at UWEC. His approach is to give students practice with the concepts behind parallel programming early and often by integrating them into existing course work. Join the discussion on this topic &lt;br /&gt; &lt;a href="http://blip.tv/file/2101404/"&gt;See this episode here&lt;/a&gt;&lt;/td&gt;
&lt;td valign="top"&gt;5/5/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td bgcolor="#f0f8ff" height="88"&gt;&lt;img src="http://software.intel.com/file/20299" /&gt;&lt;/td&gt;
&lt;td valign="top"&gt;Re-envisioning the Computer Science Curriculum.
&lt;h5&gt;Dr. Dan Reed, Microsoft, Director of Multicore Research.&lt;/h5&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;Dan Reed is Microsoft's Scalable and Multicore Computing Strategist. Join the conversation as Dan talks about how industry and academia must change to cope with the coming multiplicity of heterogeneous compute cores.&lt;/td&gt;
&lt;td valign="top"&gt;5/12/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f8f8ff"&gt;
&lt;td align="left"&gt;&lt;img src="http://software.intel.com/file/20304" /&gt;&lt;/td&gt;
&lt;td valign="top"&gt;The View from Intel Research.
&lt;h5&gt;Dr. Tim Mattson, Intel Principal Engineer.&lt;/h5&gt;
&lt;p&gt; &lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;Dr. Tim Mattson, Intel Principal Engineer, has been an early (and vocal) proponent of thinking parallel both in industry and academia. His past work as a creator of OpenMP, as well as his present research on abstractions that bridge parallel system design, parallel programming environments, and application software, gives him a unique perspective on the topic of teaching parallelism. &lt;br /&gt; &lt;a href="http://blip.tv/file/2101404/"&gt;See this episode here&lt;/a&gt;&lt;/td&gt;
&lt;td valign="top"&gt;5/19/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/20305" /&gt;&lt;/td&gt;
&lt;td valign="top"&gt;Curriculum for Multi-core.
&lt;h5&gt;Professor Matt Wolf, Research Scientist CERCS Center for Experimental Research in Computer Systems.&lt;/h5&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;Multicore breaks a fundamental link in how we prepare our current and future developers – teach them to break a problem down into pieces and find a nice logical progression to solve each individual piece sequentially. A normal CS curriculum gets around to telling people about the idea of concurrent execution only as they have one foot out the door. At Georgia Tech, we've been trying to tackle this by trying to integrate bits of multi-core throughout the curriculum – introduced gently into the entry classes, and getting increasingly more focused as time goes on. This admittedly means we have to forgo teaching some things to make space in the curriculum, but so far it has been surprisingly little. &lt;br /&gt; &lt;a href="http://blip.tv/file/2197958/"&gt;See this episode here&lt;/a&gt;&lt;/td&gt;
&lt;td valign="top"&gt;6/2/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f8f8ff"&gt;
&lt;td align="left"&gt;&lt;img src="http://software.intel.com/file/20306" /&gt;&lt;/td&gt;
&lt;td valign="top"&gt;
&lt;h5&gt;&lt;span style="color: #000000; background-color: #f8f8ff;"&gt;Common Strategies for Parallelism.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Professor Wen-mei Hwu, Walter J. ("Jerry") Sanders III-Advanced Micro Devices Endowed Chair in Electrical and Computer Engineering in the Coordinated Science Laboratory of the University of Illinois at Urbana-Champaign.&lt;/h5&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;We seek models that impose structure on parallel control flow and on synchronization. Current language specifications already discourage the use of data races, but do not aid the programmer in achieving this goal. A stronger guarantee is determinism, which guarantees that for a given input, the program will always produce the same output. This output is the result of an equivalent sequential execution, providing a simple semantic model. This model facilitates code development and debugging, while still exposing to the programmer a parallel performance model. Effectively, deterministic languages can ride on the advances in sequential programming, including safety, modularity, and composability. Many programs, especially a large class of transformative programs, are deterministic; however, current languages do not aid in expressing them in provably deterministic terms. We wish to explore the extent to which language support can be used to guarantee data-race-freedom, determinism, and other higher level coordination structures, in the context of modern sequential programming practices and client applications.  &lt;br /&gt; &lt;a href="http://blip.tv/file/2246337/"&gt;See this episode here&lt;/a&gt;&lt;/td&gt;
&lt;td valign="top"&gt;6/9/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td align="left"&gt;&lt;img src="http://software.intel.com/file/20307" /&gt;&lt;/td&gt;
&lt;td valign="top" align="left"&gt;Teaching Parallelism to Students. Teaching Parallelism to Faculty.
&lt;h5&gt;Associate Professor Charlie Peck, Earlham College, Richmond, IN.&lt;/h5&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;
&lt;p&gt;As a member of the SuperComputing Conference's Education Program Steering Committee (2007-2011) he is one of a group of people developing and delivering curriculum for teaching high performance computing and computational science to undergraduate faculty and students. Charlie's student/faculty research covers how 3D Internet technology such as metaverses can be used to support science education, parallelism in the undergraduate computer science curriculum, and scaling scientific kernels to the next generation of petascale computational resources. Working with colleagues from the Education Program, Charlie is co-PI of the LittleFe project. LittleFe is a low-cost, portable computational cluster primarily used for high performance computing and computational science education, outreach, and training.&lt;br /&gt; &lt;a href="http://blip.tv/file/2257485/"&gt;See this episode here&lt;/a&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;td valign="top"&gt;6/16/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f8f8ff"&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/20308" /&gt;&lt;/td&gt;
&lt;td&gt;HPC Centers can help support curricular change
&lt;h5&gt;Scott Lathrop, Blue Waters Technical Program Manager for Education &amp;amp; TeraGrid Area Director for Education, Outreach and Training.&lt;/h5&gt;
&lt;/td&gt;
&lt;td&gt;Scott Lathrop splits his time between being the TeraGrid Director of Education, Outreach and Training (EOT) at the University of Chicago/Argonne National Laboratory, and being the Blue Waters Technical Program Manager for Education for NCSA. Lathrop has been involved in high performance computing and communications activities since 1986. Lathrop coordinates education, outreach and training activities among the eleven Resource Providers involved in the TeraGrid project. He coordinates undergraduate and graduate education activities for the Blue Waters project. &lt;br /&gt; &lt;a href="http://blip.tv/file/2313527/"&gt;See this episode here&lt;/a&gt;&lt;/td&gt;
&lt;td valign="top"&gt;6/30/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/21128" /&gt;&lt;/td&gt;
&lt;td&gt;Computational Sciences and Parallelism
&lt;h5&gt;Professor Rubin Landau, Professor Emeritus of Physics, Oregon State University.&lt;/h5&gt;
&lt;/td&gt;
&lt;td&gt;From 1974-2000 Landau directed a basic research program in computational and theoretical particle physics and nuclear physics (over 80 publications) funded by the DOE and NSF. While at Oregon State University, he introduced five new undergraduate courses in Computational Physics and Science, a graduate course in Nonlinear Dynamics, and a new curriculum for graduate-level Advanced Quantum Mechanics. After gaining departmental, college, university and state approval, in 2001 Landau founded, and now directs, the B.S. Degree program in Computational Physics (CPUG). The program combines the new courses with those in the Math and CS departments to provide a multidisciplinary, research-rich approach to modern physics education. This program has received interest as a model for future physics education, and Landau regularly consults with other schools, reviews their programs, and contributes to CP development in South Africa, Colombia, Korea, Ireland and India.&lt;/td&gt;
&lt;td valign="top"&gt;6/30/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/20310" /&gt;&lt;/td&gt;
&lt;td&gt;Education Outreach.
&lt;h5&gt;Diane Baxter, Director of Education at the San Diego Supercomputer Center at UCSD.&lt;/h5&gt;
&lt;/td&gt;
&lt;td&gt;We share a national challenge to address the lack of full participation by women and minorities in science, math, engineering, and technology graduate programs and careers. In all of our programs, we consciously and conscientiously strive to create opportunities for broadening participation in cyberinfrastructure. That takes careful listening to others' needs and creative collaboration to meet those needs.&lt;/td&gt;
&lt;td valign="top"&gt;July 21 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f8f8ff"&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/21129" /&gt;&lt;/td&gt;
&lt;td&gt;Special Event. Live from New York!&lt;br /&gt;&lt;b&gt;Are High School Whiz Kids Ready to "Think Parallel?"&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Jeff Birnbaum, CTO Merrill Lynch, noted earlier this year that there are few programmers in the hiring pipeline who understand parallel programming for multicore, a critical skill for the financial services sector. As many industries have clearly made the shift to multi- and many-core processing for the toughest computing challenges, education needs to follow suit and teach parallel programming throughout the computer science curriculum, starting at the high school level. &lt;br /&gt;With Mr. Birnbaum and the Brooklyn Technical High School, the Intel Academic Community has brought together 21 brilliant technical high school students in New York to teach them the basics of parallel programming and see what they are capable of.&lt;/td&gt;
&lt;td valign="top"&gt;July 21 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f0f8ff"&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/20309" /&gt;&lt;/td&gt;
&lt;td&gt;Intel Parallel Studio as a tool for teaching Parallelism in the classroom.
&lt;h5&gt;James Reinders, Intel Chief Software Evangelist.&lt;/h5&gt;
&lt;/td&gt;
&lt;td&gt;Intel brings simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers with Intel Parallel Studio. Intel Parallel Studio eases implementation at every stage in the development cycle for designing, coding, debugging, and tuning applications. Dr. Reinders will discuss how Parallel Studio can be used by those teaching Computer and Computational Sciences.&lt;/td&gt;
&lt;td valign="top"&gt;7/28/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr bgcolor="#f8f8ff"&gt;
&lt;td&gt;&lt;img src="http://www.eecs.berkeley.edu/Faculty/Photos/Homepages/patterson.jpg" /&gt;&lt;/td&gt;
&lt;td&gt;Parallelism in the Text: The new edition of &lt;i&gt;Computer Organization and Design.&lt;/i&gt;
&lt;h5&gt;David Patterson, Pardee Professor of Computer Science at the University of California at Berkeley.&lt;/h5&gt;
&lt;/td&gt;
&lt;td&gt;Professor Patterson's book, "Computer Organization and Design", is arguably the most used text for the computer architecture course taught in every CS curriculum. The new edition, with its increased focus on parallelism, as well as the content on GPUs and multithreaded multiprocessors for visual computing and other uses, will bring important changes to teaching and the introduction of parallelism. Join us for a discussion on this topic.&lt;/td&gt;
&lt;td valign="top"&gt;8/11/2009 10:00 AM 30 Minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;p&gt;&lt;b&gt;&lt;br /&gt;Listen Live or access past episodes. &lt;/b&gt;&lt;br /&gt; 
&lt;object height="300" width="400" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000"&gt;
&lt;param name="src" value="http://www.blogtalkradio.com/BTRPlayer.swf?file=http%3A%2F%2Fwww%2Eblogtalkradio%2Ecom%2FTeachParallel%2Fplay%5Flist%2Exml%3Fitemcount%3D4&amp;amp;autostart=false&amp;amp;shuffle=false&amp;amp;bordercolor=#999999&amp;amp;backgroundcolor=#FFFFFF&amp;amp;dashboardcolor=#0098CBplaylistcolor=#999999&amp;amp;playlisthovercolor=#333333&amp;amp;cornerradius=10&amp;amp;callback=http://www.blogtalkradio.com/FlashPlayerCallback.aspx?referrer_url=/Profile.aspx&amp;amp;width=400&amp;amp;height=300&amp;amp;volume=80&amp;amp;corner=rounded" /&gt;
&lt;param name="wmode" value="transparent" /&gt;
&lt;param name="quality" value="high" /&gt;&lt;embed height="300" width="400" quality="high" wmode="transparent" src="http://www.blogtalkradio.com/BTRPlayer.swf?file=http%3A%2F%2Fwww%2Eblogtalkradio%2Ecom%2FTeachParallel%2Fplay%5Flist%2Exml%3Fitemcount%3D4&amp;amp;autostart=false&amp;amp;shuffle=false&amp;amp;bordercolor=#999999&amp;amp;backgroundcolor=#FFFFFF&amp;amp;dashboardcolor=#0098CBplaylistcolor=#999999&amp;amp;playlisthovercolor=#333333&amp;amp;cornerradius=10&amp;amp;callback=http://www.blogtalkradio.com/FlashPlayerCallback.aspx?referrer_url=/Profile.aspx&amp;amp;width=400&amp;amp;height=300&amp;amp;volume=80&amp;amp;corner=rounded" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;
&lt;/object&gt;
&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Join Paul Steinberg, Intel Academic Community Manager, and Tom Murphy, Professor of Computer Science at Contra Costa College, as they discuss this issue with experts in the academy and industry.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Our show runs live every second Tuesday at 10AM PDT. &lt;a href="http://software.intel.com/en-us/blogs/author/paul-steinberg/"&gt;Find more here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join us live in the Science SIM virtual world - &lt;a href="http://www.sciencesim.com/wiki/doku.php/gettingstarted"&gt;http://www.sciencesim.com/wiki/doku.php/gettingstarted&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;Visit our archive for past shows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt; &lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;br /&gt;All major manufacturers of CPUs, GPUs and ASICs have moved to a many-core design, yet universities and colleges are not training engineers in the parallel and concurrent disciplines needed to program such systems efficiently. Today's computer science curricula rarely include parallelism, and when they do, many university and college teachers, lecturers and professors are themselves only just coming up to speed on how to teach this subject effectively.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Undergraduates need to be exposed to parallel programming techniques starting in CS1 and then need to build on the skill in every (relevant) course. This is not the case at most institutions; when they teach parallel computing at all, they often relegate it to advanced topics or elective courses. That said, a number of colleges and universities have found that it is not all that difficult to incorporate it into their existing curriculum.&lt;br /&gt;&lt;br /&gt;Your hosts:&lt;/p&gt;
&lt;table border="0" cellpadding="2" cellspacing="2"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/18513" width="183" border="0" height="257" /&gt;&lt;/td&gt;
&lt;td&gt;Paul Steinberg is Academic Community Manager for the Intel Software Network. &lt;br /&gt;Since joining Intel in 1999, Paul has worked as Intel Senior Technical Marketing Engineer for Java and Web Services as well as a Course Developer for Intel Software College. &lt;br /&gt;Before joining Intel, Paul was project lead and writer for New Media Magazine Labs in Cambridge, Massachusetts, as well as a Solutions Developer for Progress Software. &lt;br /&gt;&lt;br /&gt;Paul's other interests include Middle Eastern history and culture. Paul spent five years as a Research Fellow at the Harry S Truman Institute for the Advancement of Peace at the Hebrew University of Jerusalem, and six years as Visiting Scholar and Research Associate at the Center for Middle Eastern Studies at Harvard University.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="http://software.intel.com/en-us/blogs/author/wolfmurphy/"&gt;&lt;img src="http://software.intel.com/file/18514" width="183" border="0" height="183" /&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;span class="sectionBodyText"&gt;Professor Tom Murphy is the CS Program Chair and Director of Contra Costa College HPC Regional Education Training Center&lt;br /&gt;&lt;br /&gt;Tom is teaching and advancing Computational Science Education. He helps lead weeklong Parallel and Distributed Programming workshops across the US through the &lt;/span&gt;
&lt;p&gt;&lt;a target="_blank" href="http://sc-education.org/media/2009WorkshopFlyer.pdf" class="sectionBodyText"&gt;SC and National Computational Science Institute&lt;/a&gt;&lt;span class="sectionBodyText"&gt;. He is a member of the SC07-11 Education Program steering committee the SC07-09 Education Program, making it a year-round effort, complete with a student programming contest. Through this process he has designed, built, and refined an inexpensive, portable computational cluster (http://LittleFe.net).&lt;br /&gt;&lt;b&gt;Follow Tom's &lt;/b&gt;&lt;a href="http://software.intel.com/en-us/blogs/author/wolfmurphy/"&gt;&lt;b&gt;Blog here&lt;/b&gt;&lt;/a&gt;&lt;b&gt;.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt; &lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt; &lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/k-hrx0Gfas8" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/k-hrx0Gfas8/teach-parallel-online-discussions</link>
      <pubDate>Thu, 09 Jul 2009 18:42:14 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/teach-parallel-online-discussions#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/teach-parallel-online-discussions</guid>
      <category>Academic</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/teach-parallel-online-discussions</feedburner:origLink></item>
    <item>
      <title>Intel(R) Parallel Inspector Comparison with Intel(R) Thread Checker</title>
      <description>&lt;span style="font-size: small;"&gt;
&lt;p&gt;The following table can help you decide which tool to use: &lt;br /&gt; &lt;br /&gt; 
&lt;table id="table1" class="sectionHeadingText" border="1" width="100%"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#0099ff"&gt;&lt;b&gt; &lt;span style="font-family: Verdana; color: #ffffff; font-size: x-small;"&gt;Intel® Parallel Inspector&lt;/span&gt;&lt;/b&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#0099ff"&gt;&lt;b&gt; &lt;span style="font-family: Verdana; color: #ffffff; font-size: x-small;"&gt;Intel® Thread Checker&lt;/span&gt;&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Threading errors - Data races and Deadlocks&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Does not require special build or source code&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Memory errors&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#99ccff" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Easier to learn and reuse&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#99ccff" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Low overhead analysis&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#99ccff" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Improved scalable analysis without serializing the app&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#99ccff" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Windows* standalone&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;br /&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#99ccff" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Linux* support&lt;/span&gt;&lt;/td&gt;
&lt;td align="center" valign="bottom"&gt;&lt;br /&gt;&lt;/td&gt;
&lt;td align="center" bgcolor="#99ccff" valign="bottom"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;√&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Licensing&lt;/span&gt;&lt;/td&gt;
&lt;td align="center"&gt;&lt;span style="font-family: Verdana; font-size: xx-small;"&gt;Single User&lt;/span&gt;&lt;/td&gt;
&lt;td align="center"&gt;&lt;span style="font-family: Verdana; font-size: xx-small;"&gt;Single User &amp;amp; Floating&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align="right"&gt;&lt;span style="font-family: Verdana; font-size: x-small;"&gt;Support&lt;/span&gt;&lt;/td&gt;
&lt;td align="center"&gt;&lt;span style="font-family: Verdana; font-size: xx-small;"&gt;forum support&lt;br /&gt; premier support option&lt;/span&gt;&lt;/td&gt;
&lt;td align="center"&gt;&lt;span style="font-family: Verdana; font-size: xx-small;"&gt;unlimited premier support &amp;amp;&lt;br /&gt; 1 year product updates&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;span style="font-size: small;"&gt;
&lt;p&gt;Intel Thread Checker is still the right choice for developers who need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analysis outside of Visual Studio (standalone version)&lt;/li&gt;
&lt;li&gt;Floating licenses&lt;/li&gt;
&lt;li&gt;Unlimited, secure, formal support&lt;/li&gt;
&lt;li&gt;Ability to check Linux applications&lt;/li&gt;
&lt;/ul&gt;
&lt;/span&gt;
&lt;p&gt; &lt;/p&gt;
&lt;/span&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/8l_0cHlGT2k" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/8l_0cHlGT2k/intelr-parallel-inspector-comparison-with-intelr-thread-checker</link>
      <pubDate>Thu, 09 Jul 2009 16:08:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intelr-parallel-inspector-comparison-with-intelr-thread-checker#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/intelr-parallel-inspector-comparison-with-intelr-thread-checker</guid>
      <category>Intel® Parallel Inspector Knowledge Base</category>
      <category>Intel® Thread Checker for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/intelr-parallel-inspector-comparison-with-intelr-thread-checker</feedburner:origLink></item>
    <item>
      <title>Optimizing Software Applications for NUMA</title>
      <description>&lt;h1 class="sectionHeading"&gt;Download Article&lt;/h1&gt;
&lt;br /&gt; Download &lt;a href="http://software.intel.com/file/21113"&gt;Optimizing Software Applications for NUMA&lt;/a&gt; [PDF 83KB]&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;Introduction&lt;/h1&gt;
&lt;br /&gt; In this brief technical paper, we provide an overview of the NUMA shared memory architecture and describe various techniques for optimizing application memory performance within a NUMA-based system.  In particular, we discuss the role of processor affinity, memory allocation using implicit operating system policies, and the use of the system APIs for assigning and migrating memory pages using explicit directives.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;The Basics of NUMA&lt;/h1&gt;
&lt;br /&gt; &lt;b&gt;NUMA&lt;/b&gt;, or &lt;b&gt;Non-Uniform Memory Access&lt;/b&gt;, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system.  Perhaps the best way to understand NUMA is to compare it with its cousin &lt;b&gt;UMA&lt;/b&gt;, or &lt;b&gt;Uniform Memory Access&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt; In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/21114" /&gt;&lt;br /&gt;&lt;br /&gt; UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory.  That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.&lt;br /&gt;&lt;br /&gt; In the NUMA shared memory architecture, each processor has its own &lt;i&gt;local &lt;/i&gt;memory module that it can access directly and with a distinctive performance advantage.  At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/21115" /&gt;&lt;br /&gt;&lt;br /&gt; What gives NUMA its name is that memory access time varies with the location of the data to be accessed.  If data resides in local memory, access is fast.  If data resides in remote memory, access is slower.  
The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve &lt;i&gt;average case access&lt;/i&gt; time through the introduction of fast, local memory.&lt;br /&gt;&lt;br /&gt; Modern multiprocessor systems mix these basic architectures as seen in the following diagram:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/21116" /&gt;&lt;br /&gt;&lt;br /&gt; In this complex hierarchical scheme, processors are grouped by their physical location on one or the other multi-core CPU package or "node".  Processors within a node share access to memory modules as per the UMA shared memory architecture.  At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance as per the NUMA shared memory architecture.&lt;br /&gt;&lt;br /&gt; Server platforms like Intel® Xeon® using the Intel® Core i7 processors provide an example of this complex memory architecture, and for this reason our discussion will center on it henceforth.  Note that such platforms employ a fast interconnect technology known as Intel® QuickPath Interconnect (QPI) to mitigate (but not eliminate) the problem of slower remote memory performance.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;NUMA Advantages and Risks&lt;/h1&gt;
&lt;br /&gt; The advantage of the NUMA shared memory architecture is its &lt;i&gt;potential &lt;/i&gt;to reduce memory access time in the average case.  By providing each node with its own local memory, memory accesses can take place in parallel and avoid the throughput limitations and contention issues associated with a shared memory bus.  In fact, memory-constrained systems can theoretically improve their performance by a factor of up to the number of nodes in the system.  For example, a memory-constrained dual processor system could conceivably double its performance if processors could access memory in a fully parallelized manner.&lt;br /&gt;&lt;br /&gt; The downside of the NUMA architecture, however, is the cost incurred when data is not local to the processor.  In the NUMA model, the time required to retrieve data from an adjacent node is significantly higher than that required to access local memory.  Furthermore, the time required to retrieve data from a non-adjacent node may be even higher, complicating memory performance and generating a hierarchy of access time possibilities.  In general, as the distance from a processor increases, the cost of accessing memory increases.&lt;sup&gt;2&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt; The key issue in determining whether the performance benefits of the NUMA architecture can be realized, then, is &lt;b&gt;data placement&lt;/b&gt;.  The more data can effectively be placed in memory local to the processor that needs it, the more overall access time will benefit from the architecture.  Conversely, the more data fails to be local to the node that will access it, the more memory performance will suffer from the architecture.  For this reason, the NUMA architecture can be said to provide the potential to reduce overall memory access times.  To realize this &lt;i&gt;potential&lt;/i&gt;, strategies are needed to ensure smart data placement.  
An application that effectively manages such placement is one that has been "optimized for NUMA", is "NUMA-aware", or is "NUMA-friendly".&lt;br /&gt;&lt;br /&gt;
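 The dependence of average access time on data placement can be made concrete with a simple weighted-average model.  The sketch below is an illustration only: the function name averageAccessTime is hypothetical, and the latencies passed to it are assumed inputs, not measurements of any platform.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// Illustrative model only: the latencies passed in are hypothetical inputs,
// not measurements of any particular platform.
// Returns the expected access time when the fraction localFraction of
// memory accesses is served from local memory and the rest from a remote node.
double averageAccessTime(double localFraction, double localNs, double remoteNs)
{
    return localFraction * localNs + (1.0 - localFraction) * remoteNs;
}
&lt;/pre&gt;
&lt;br /&gt; For instance, with assumed latencies of 60 ns for local memory and 100 ns for remote memory, raising the share of local accesses from 50% to 90% lowers the modeled average from 80 ns to 64 ns.&lt;br /&gt;&lt;br /&gt;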
&lt;h1 class="sectionHeading"&gt;Strategies for NUMA Optimization&lt;/h1&gt;
&lt;br /&gt; Two key notions in managing performance within the NUMA shared memory architecture are &lt;i&gt;processor affinity&lt;/i&gt; and &lt;i&gt;data placement.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Processor Affinity&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; Affinity refers to the persistence of association with a particular resource instance, despite the availability of another instance for the same purpose.  Consider the case of processor affinity.  Today's complex operating systems assign application threads to processor cores using a scheduler.  A scheduler will take into account system state and various policy objectives (e.g., "balance load across cores" or "aggregate threads on a few cores and put remaining cores to sleep") and match application threads to physical cores accordingly.  A given thread will execute on its assigned core for some period of time and then wait as other threads are given the chance to execute.  If another core becomes available, the scheduler may choose to migrate the thread to ensure timely execution and meet its policy objectives.&lt;br /&gt;&lt;br /&gt; Thread migration from one core to another poses a problem for the NUMA shared memory architecture because of the way it disassociates a thread from its local memory allocations.  That is, a thread may allocate memory on node 1 at startup as it runs on a core within the node 1 package.  But when the thread is later migrated to a core on node 2, the data stored earlier becomes remote and memory access time significantly increases.&lt;br /&gt;&lt;br /&gt; Enter processor affinity.  Using a system API, or by modifying an OS data structure (e.g., affinity mask), a specific core or set of cores can be associated with an application thread.  The scheduler will then observe this affinity in its scheduling decisions for the lifetime of the thread.  For example, a thread may be configured to run only on cores 0 through 3, all of which belong to quad-core CPU package 0.  
Henceforth, the scheduler will choose among these alternatives without migrating the thread to another package.&lt;br /&gt;&lt;br /&gt; Exercising processor affinity ensures that memory allocations remain local to the thread(s) that need them.  Several downsides, however, should be noted.  In general, processor affinity may significantly harm system performance by restricting scheduler options and creating resource contention when better resource management could otherwise have been used.  For example, affinity restrictions may prevent the scheduler from assigning waiting threads to unutilized cores during a particular interval.  Or, low priority threads may adversely impact high priority threads due to affinity restrictions that prevent adjustments through the use of additional cores.  Processor affinity restrictions may even hurt the application itself when additional execution time on another node would have more than compensated for a slower memory access time.&lt;br /&gt;&lt;br /&gt; Such downsides imply the need to think carefully about whether processor affinity solutions are right for a particular application and shared system context.  Note, finally, that processor affinity APIs offered by some systems support priority "hints" and affinity "suggestions" to the scheduler in addition to explicit directives.  Such suggestions may ensure optimal performance in the common case yet avoid constraining scheduling options during periods of high resource contention.&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Data Placement Using Implicit Memory Allocation Policies&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; In the simple case, many operating systems transparently provide support for NUMA-friendly data placement.  
When a single-threaded application allocates memory, the operating system will simply assign memory pages to the physical memory associated with the requesting thread's node (CPU package), thus ensuring that it is local to the thread and access performance is optimal.&lt;br /&gt;&lt;br /&gt; Alternatively, some operating systems will wait for the first memory access before committing to a memory page assignment.&lt;sup&gt;2&lt;/sup&gt;  To understand the advantage here, consider a multi-threaded application with a start-up sequence that includes memory allocations by a main control thread, followed by the creation of various worker threads, followed by a long period of application processing or service.  While it may seem reasonable to place memory pages local to the requesting thread, in fact, they are more effectively placed local to the worker threads that will access the data.  As such, the operating system will observe the first access request and commit page assignments based on the requester's node location.&lt;br /&gt;&lt;br /&gt; These two policies together illustrate the importance of an application programmer being aware of the NUMA context of the program's deployment.  If the page placement policy is based on first access, the programmer can exploit this fact by including a carefully designed data access sequence at startup that will generate "hints" to the operating system on optimal memory placement.  If the page placement policy is based on requester location, the programmer should ensure that memory allocations are made by the thread that will subsequently access the data and not by an initialization or control thread designed to act as a provisioning agent.&lt;br /&gt;&lt;br /&gt; Multiple threads accessing the same data are best co-located on the same node so that the memory allocations of one, placed local to the node, can benefit all.  
This may, for example, be used by prefetching schemes designed to improve application performance by generating data requests in advance of actual need.  Such threads must generate data placement that is local to the actual consumer threads for the NUMA architecture to provide its characteristic performance speedup.&lt;br /&gt;&lt;br /&gt; It should be noted that when an operating system has fully consumed the physical memory resources of one node, memory requests coming from threads on the same node will typically be fulfilled by sub-optimal allocations made on a remote node.  The implication for memory-hungry applications is to correctly size the memory needs of a particular thread and to ensure local placement with respect to the accessing thread.&lt;br /&gt;&lt;br /&gt; For situations where a large number of threads will randomly share the same pool of data from all nodes, the recommendation is to stripe the data evenly across all nodes.  Doing so spreads the memory access load and avoids bottleneck access patterns on a single node within the system.&lt;sup&gt;3&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Data Placement Using Explicit Memory Allocation Directives&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; Another approach to data placement in NUMA-based systems is to make use of system APIs that explicitly configure the location of memory page allocations.  An example of such APIs is the libnuma library for Linux.&lt;sup&gt;1&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt; Using the API, a programmer may be able to associate virtual memory address ranges with particular nodes, or simply to indicate the desired node within the memory allocation system call itself.  With this capability, an application programmer can ensure the placement of a particular data set regardless of which thread allocates it or which thread accesses it first.  This may be useful, for example, in schemes where complex applications make use of a memory management thread acting on behalf of worker threads.  
Or, it may prove useful for applications that create many short-lived threads, each of which has predictable data requirements.  Pre-fetching schemes are another area that could benefit considerably from such control.&lt;br /&gt;&lt;br /&gt; The downside of this scheme, of course, is the management burden placed on the application in handling memory allocations and data placement.  Misplaced data may cause performance that is significantly worse than default system behavior.  Explicit memory management also presupposes fine-grained control over processor affinity throughout application use.&lt;br /&gt;&lt;br /&gt; Another capability available to the application programmer through NUMA-based memory management APIs is memory page migration.  In general, migration of memory pages from one node to another is an expensive operation and something to be avoided.  Not only is there the cost of migrating the data, but all associated memory references must be discovered and modified to observe the new mapping.  As the remapping is taking place, pages must be removed from operating system page lists and detached from normal swapping mechanisms.&lt;br /&gt;&lt;br /&gt; Having said this, given an application that is both long-lived and memory intensive, migrating memory pages to re-establish a NUMA-friendly configuration may be worth the price.&lt;sup&gt;3&lt;/sup&gt;  Consider, for example, a long-lived application with various threads that have terminated and new threads that have been created but reside on another node.  Data is now no longer local to the threads that need it and sub-optimal access requests now dominate.  Application-specific knowledge of a thread's lifetime and data needs can be used to determine whether an explicit migration is in order.&lt;br /&gt;&lt;br /&gt; Finally, the API may provide functions for obtaining page residency or for examining memory access behavior under the current configuration.  
Such tools may provide the means to implement a monitoring scheme that makes explicit migration adjustments when the share of memory accesses served from local memory falls below a defined threshold.&lt;br /&gt;&lt;br /&gt;
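 The processor affinity and first-access ideas discussed in this section can be sketched for Linux as follows.  This is a minimal illustration under stated assumptions, not a recommended implementation: it relies on the glibc calls sched_getaffinity() and sched_setaffinity(), it assumes the operating system commits pages on first access, and the helper names pinCallingThreadToFirstAllowedCore and allocateAndTouch are hypothetical.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// Linux-specific sketch (glibc, compiled with g++).
#include &amp;lt;sched.h&amp;gt;    // sched_getaffinity, sched_setaffinity, CPU_* macros
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;vector&amp;gt;

// Restricts the calling thread to one core taken from the set it is
// currently allowed to run on. Returns that core, or -1 on failure.
int pinCallingThreadToFirstAllowedCore()
{
    cpu_set_t allowed;
    CPU_ZERO(&amp;amp;allowed);
    if (sched_getaffinity(0, sizeof(allowed), &amp;amp;allowed) != 0) return -1;
    for (int core = 0; core != CPU_SETSIZE; ++core) {
        if (!CPU_ISSET(core, &amp;amp;allowed)) continue;
        cpu_set_t mask;
        CPU_ZERO(&amp;amp;mask);
        CPU_SET(core, &amp;amp;mask);
        return sched_setaffinity(0, sizeof(mask), &amp;amp;mask) == 0 ? core : -1;
    }
    return -1;
}

// Allocates a buffer and writes every element from the calling thread.
// Under a first-access placement policy (an assumption here), the pages
// are then committed on the node of the thread that performs the writes.
std::vector&amp;lt;float&amp;gt; allocateAndTouch(std::size_t count)
{
    std::vector&amp;lt;float&amp;gt; data(count, 0.0f);
    return data;
}
&lt;/pre&gt;
&lt;br /&gt; A worker thread would call pinCallingThreadToFirstAllowedCore() before allocateAndTouch(), so that it is already bound to a core when its pages are first written.  A real application would choose the core deliberately rather than taking the first one allowed.&lt;br /&gt;&lt;br /&gt;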
&lt;h1 class="sectionHeading"&gt;Summary&lt;/h1&gt;
&lt;br /&gt; &lt;b&gt;NUMA&lt;/b&gt;, or &lt;b&gt;Non-Uniform Memory Access&lt;/b&gt;, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system.  The advantage of the NUMA architecture as a &lt;i&gt;hierarchical &lt;/i&gt;shared memory scheme is its potential to improve &lt;i&gt;average case access time &lt;/i&gt;through the introduction of fast, local memory.  To realize the potential of NUMA systems, however, careful &lt;i&gt;data placement&lt;/i&gt; is needed.  The more data can effectively be placed in memory local to the processor that needs it, the more overall access time will benefit from the architecture.&lt;br /&gt;&lt;br /&gt; In this brief technical paper, we have described various strategies and considerations for ensuring optimal data placement within a NUMA-based system.  In particular, we have discussed the role of processor affinity, memory allocation strategies that use implicit operating system page placement policies, and the use of system APIs for assigning and migrating memory pages using explicit directives.&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;References&lt;/h1&gt;
&lt;br /&gt; &lt;ol&gt;
&lt;li&gt;Drepper, Ulrich.  "What Every Programmer Should Know About Memory".  November 2007.&lt;/li&gt;
&lt;li&gt;Intel® 64 and IA-32 Architectures Optimization Reference Manual.  See Section 8.8 on "Affinities and Managing Shared Platform Resources".  March 2009.&lt;/li&gt;
&lt;li&gt;Lameter, Christoph.  "Local and Remote Memory: Memory in a Linux/NUMA System".  June 2006.&lt;/li&gt;
&lt;/ol&gt;
&lt;h1 class="sectionHeading"&gt;Author Bio&lt;/h1&gt;
&lt;br /&gt; David E. Ott is a Senior Software Engineer with Intel's Software Solutions Group.  He joined Intel in 2005 as a middleware systems engineer for the Technology and Manufacturing Group.  Currently, David focuses on power and virtualization aspects of enterprise server platforms.  David holds M.S. and Ph.D. degrees in Computer Science from the University of North Carolina at Chapel Hill.</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/RI_PzOwDNjw/optimizing-software-applications-for-numa</link>
      <pubDate>Thu, 09 Jul 2009 14:28:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa</guid>
      <category>Parallel Programming and Multi-Core</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/optimizing-software-applications-for-numa</feedburner:origLink></item>
    <item>
      <title>Intel® Advanced Vector Extensions: Pixel Format Conversions</title>
      <description>&lt;h1 class="sectionHeading"&gt;Download Article&lt;/h1&gt;
&lt;br /&gt; Download &lt;a href="http://software.intel.com/file/21089"&gt;Intel® Advanced Vector Extensions: Pixel Format Conversions&lt;/a&gt; [PDF 1.7MB]&lt;br /&gt;&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;Introduction&lt;/h1&gt;
&lt;br /&gt; Intel® Advanced Vector Extensions (Intel® AVX) is a 256-bit instruction set extension to Intel® Streaming SIMD Extensions (Intel® SSE) and is designed for applications that are floating point intensive. Intel® AVX extends all 16 XMM registers to 256 bits (YMM registers), doubling their width, which leads to improved performance and power efficiency over 128-bit SIMD instructions. Intel® AVX introduces a distinct destination argument that results in fewer register copies, better register use, smaller code size, and other benefits. Intel® AVX also introduces several new instructions for blending and rearranging data in the YMM registers.&lt;br /&gt;&lt;br /&gt; This document describes techniques to optimize pixel format conversion routines (commonly used in image processing applications) using the new Intel® AVX extensions. The two conversions demonstrated here are RGB to RGBA and RGBA to RGB. Although the R, G, B, and A components can have different data types in different applications, we discuss only single precision floating point (SP FP) components. The Intel® AVX performance is compared against the scalar version of the conversion routines on the same simulator. The Intel® AVX versions are implemented in compiler intrinsics and the code was compiled using the Intel® C Compiler that supports Intel® AVX intrinsics.&lt;br /&gt;&lt;br /&gt; This paper will describe only the Intel® AVX implementation of the format conversions.&lt;br /&gt;&lt;br /&gt; The RGB-to-RGBA and RGBA-to-RGB conversion algorithms make use of the Intel® AVX instructions VPERMILPS, VPERM2F128, and VBLENDPS to rearrange and mask off data when copying from the source to the destination buffers.&lt;br /&gt;&lt;br /&gt;
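 As a rough illustration of how a blend selects data, the following scalar sketch models the per-element selection performed by an 8-bit blend mask, as used with _mm256_blend_ps() later in this article. The helper name blend8 is hypothetical, and the code models the selection rule only; it is not a replacement for the intrinsic.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// Scalar model of an eight-element blend driven by an 8-bit mask.
// If bit i of mask is set, element i is taken from b; otherwise from a.
void blend8(const float* a, const float* b, unsigned mask, float* out)
{
    for (int i = 0; i != 8; ++i) {
        bool takeB = ((mask &amp;gt;&amp;gt; i) &amp;amp; 1u) != 0;
        out[i] = takeB ? b[i] : a[i];
    }
}
&lt;/pre&gt;
&lt;br /&gt; With a mask of 136 (binary 1000 1000), elements 3 and 7 are taken from the second operand, which is how an alpha value can be placed in the fourth slot of each of two RGBA pixels.&lt;br /&gt;&lt;br /&gt;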
&lt;h1 class="sectionHeading"&gt;RGB to RGBA&lt;/h1&gt;
&lt;br /&gt; To use the faster aligned load and store instructions, the Intel® AVX conversion routines assume that both the source and destination pixel buffers are aligned on 32-byte boundaries in memory. The following figure depicts the arrangement of the source and destination buffers in memory for n pixels. In this figure, R0 is at a lower address than G0, and so on.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21080" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 1:&lt;/b&gt; Arrangement of source and destination pixels in memory&lt;br /&gt;&lt;br /&gt; Each YMM register is 256 bits wide, which allows us to load and store eight SP FP values at a time. In each iteration of the loop we load multiple source values, rearrange the data, insert the alpha value (in this example, 1.0), and store the result to the destination address.&lt;br /&gt;&lt;br /&gt; Since the conversion is from a 3-channel pixel to a 4-channel pixel, we could have loaded twelve SP FP values from the source (four RGB pixels) and written sixteen SP FP values (four RGBA pixels) per iteration. Doing so would force us to use unaligned loads, since in the next iteration we would have to load pixels from an offset of twelve from the source address. Unaligned accesses that cross cache-line boundaries incur severe performance penalties. Hence we avoid unaligned loads altogether by unrolling the loop twice to load eight RGB pixels.&lt;br /&gt;&lt;br /&gt; The algorithm is implemented in four steps, computing two destination pixels at each step. We first load eight single precision FP values starting from the source address using the _mm256_load_ps() aligned load intrinsic. The values are then shuffled to a temporary YMM register using the _mm256_permutevar_ps() intrinsic with a control mask of {0,1,2,0,0,0,1,0} (listed from the lowest element to the highest) so that R0, G0, B0, G1, and B1 are copied to their corresponding locations in the destination. Next, R1 is broadcast using _mm256_broadcast_ss() to a temporary YMM register and the result is blended using a mask of 16 (00 01 00 00) with the output from the shuffle operation. Finally, the alpha value (1.0) is blended with the result from the previous blend operation using a mask of 136 (10 00 10 00) to produce destination pixels zero and one. The result is written to memory starting at the destination address using _mm256_store_ps(). 
The following figure illustrates this step &lt;i&gt;(Step1)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21081" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 2:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step1&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The next eight FP values are loaded and shuffled with the eight values previously loaded using the intrinsic _mm256_permute2f128_ps() with a control mask of 33 (00 10 00 01) to produce an intermediate result. This intermediate result is shuffled using the _mm256_permutevar_ps() intrinsic with a control mask of {2,3,0,0,1,2,3,0} (listed from the lowest element to the highest), then blended with B2 and the alpha value to produce destination pixels two and three. These steps are illustrated below &lt;i&gt;(Step2)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21082" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 3:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step2&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The next eight FP values are loaded from an offset of sixteen from the start of the source address and shuffled with the eight FP values loaded in Step2 using an appropriate control mask. These resulting values are in turn shuffled again and blended with R5 and the alpha values, producing destination pixels four and five as illustrated below &lt;i&gt;(Step3)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21083" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 4:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step3&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The final step requires no new load. The eight FP values loaded in Step3 are shuffled again and blended with B6 and the alpha value to produce destination pixels six and seven, which are written at an offset of twenty four from the destination address. These steps are illustrated below &lt;i&gt;(Step4)&lt;/i&gt;.
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21084" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 5:&lt;/b&gt; RGB to RGBA &lt;i&gt;Step4&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The source and destination addresses are incremented by twenty four and thirty two respectively. Steps &lt;i&gt;Step1, Step2, Step3,&lt;/i&gt; and &lt;i&gt;Step4&lt;/i&gt; are repeated for the remainder of pixels.&lt;br /&gt;&lt;br /&gt; The figure below shows the source code that demonstrates the above steps.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// 8 RGB ==&amp;gt; 8 RGBA pixels per iteration&lt;br /&gt;&lt;br /&gt;// Loop-invariant constants (assumed definitions, set once before the loop).&lt;br /&gt;// _mm256_set_epi32() takes its arguments from the highest element to the lowest.&lt;br /&gt;__m256i ctrl  = _mm256_set_epi32(0,1,0,0, 0,2,1,0);&lt;br /&gt;__m256i ctrl2 = _mm256_set_epi32(0,3,2,1, 0,0,3,2);&lt;br /&gt;__m256 alphaOne = _mm256_set1_ps(1.0f);&lt;br /&gt;&lt;br /&gt;// [G2 R2 B1 G1 , R1 B0 G0 R0]&lt;br /&gt;__m256 pixel23 = _mm256_load_ps((float *)(srcPix));&lt;br /&gt;&lt;br /&gt;// [* B1 G1 *, * B0 G0 R0], ctrl = [0,1,0,0, 0,2,1,0]&lt;br /&gt;__m256 pixel01 = _mm256_permutevar_ps(pixel23, ctrl);&lt;br /&gt;&lt;br /&gt;// [R1 R1 R1 R1 , R1 R1 R1 R1]&lt;br /&gt;__m256 pixelTemp = _mm256_broadcast_ss((float *)(srcPix+3));&lt;br /&gt;&lt;br /&gt;// [*  B1 G1 R1 , *  B0 G0 R0], mask = 00 01 00 00&lt;br /&gt;pixel01 = _mm256_blend_ps(pixel01, pixelTemp, 16);&lt;br /&gt;&lt;br /&gt;// [1. B1 G1 R1 , 1. B0 G0 R0], mask = 10 00 10 00&lt;br /&gt;pixel01 = _mm256_blend_ps(pixel01, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix), pixel01);&lt;br /&gt;&lt;br /&gt;// [R5 B4 G4 R4 , B3 G3 R3 B2]&lt;br /&gt;__m256 pixel45  = _mm256_load_ps((float *)(srcPix+8));&lt;br /&gt;&lt;br /&gt;// [B3 G3 R3 B2 , G2 R2 B1 G1]  mask = 00 10 00 01&lt;br /&gt;pixel23 = _mm256_permute2f128_ps(pixel23, pixel45, 33);&lt;br /&gt;&lt;br /&gt;// [* B3 G3 R3, * * G2 R2], ctrl2 = [0,3,2,1, 0,0,3,2]&lt;br /&gt;pixel23 = _mm256_permutevar_ps(pixel23, ctrl2);&lt;br /&gt;&lt;br /&gt;// [B2 B2 B2 B2 , B2 B2 B2 B2]&lt;br /&gt;pixelTemp = _mm256_broadcast_ss((float *)(srcPix+8));&lt;br /&gt;&lt;br /&gt;// [*  B3 G3 R3 , *  B2 G2 R2], mask = 00 00 01 00&lt;br /&gt;pixel23 = _mm256_blend_ps(pixel23, pixelTemp, 4);&lt;br /&gt;pixel23 = _mm256_blend_ps(pixel23, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix+8), pixel23);&lt;br /&gt;&lt;br /&gt;// [B7 G7 R7 B6, G6 R6 B5 G5]&lt;br /&gt;__m256 pixel67  = _mm256_load_ps((float *)(srcPix+16));&lt;br /&gt;&lt;br /&gt;// [G6 R6 B5 G5, R5 B4 G4 R4]  mask = 00 10 00 01&lt;br /&gt;pixel45 = _mm256_permute2f128_ps(pixel45, pixel67, 33);&lt;br /&gt;&lt;br /&gt;// [*  B5 G5 *, * B4 G4 R4]&lt;br /&gt;pixel45 = _mm256_permutevar_ps(pixel45, ctrl);&lt;br /&gt;&lt;br /&gt;// [R5 R5 R5 R5 , R5 R5 R5 R5]&lt;br /&gt;pixelTemp = _mm256_broadcast_ss((float *)(srcPix+15));&lt;br /&gt;&lt;br /&gt;// [* B5 G5 R5, * B4 G4 R4], mask = 00 01 00 00&lt;br /&gt;pixel45 = _mm256_blend_ps(pixel45, pixelTemp, 16);&lt;br /&gt;pixel45 = _mm256_blend_ps(pixel45, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix+16), pixel45);&lt;br /&gt;&lt;br /&gt;// [* B7 G7 R7, * * G6 R6]&lt;br /&gt;pixel67 = _mm256_permutevar_ps(pixel67, ctrl2);&lt;br /&gt;&lt;br /&gt;// [B6 B6 B6 B6 , B6 B6 B6 B6]&lt;br /&gt;pixelTemp = _mm256_broadcast_ss((float *)(srcPix+20));&lt;br /&gt;&lt;br /&gt;// [* B7 G7 R7, * B6 G6 R6], mask = 00 00 01 00&lt;br /&gt;pixel67 = _mm256_blend_ps(pixel67, pixelTemp, 4);&lt;br /&gt;pixel67 = _mm256_blend_ps(pixel67, alphaOne, 136);&lt;br /&gt;_mm256_store_ps((float *)(dstPix+24), pixel67);&lt;br /&gt;&lt;/pre&gt;
&lt;br /&gt; &lt;br /&gt; &lt;b&gt;Figure 6:&lt;/b&gt; Intel® AVX RGB to RGBA conversion code&lt;br /&gt;&lt;br /&gt;
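 When experimenting with a conversion routine like the one above, a plain scalar version is useful for checking its output element by element. The following is a minimal sketch of such a reference; it is not the scalar baseline measured in this article, and the function name rgbToRgbaScalar is hypothetical.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;#include &amp;lt;cstddef&amp;gt;

// Scalar reference: expands n packed RGB pixels (3 floats each) into
// n RGBA pixels (4 floats each), writing alpha into every fourth slot.
void rgbToRgbaScalar(const float* src, float* dst, std::size_t n, float alpha)
{
    for (std::size_t i = 0; i != n; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];  // R
        dst[4 * i + 1] = src[3 * i + 1];  // G
        dst[4 * i + 2] = src[3 * i + 2];  // B
        dst[4 * i + 3] = alpha;           // A
    }
}
&lt;/pre&gt;
&lt;br /&gt; Unlike the Intel® AVX routine, this sketch places no alignment requirement on its buffers and accepts any pixel count, so it can also handle a tail of pixels that does not fill a complete group of eight.&lt;br /&gt;&lt;br /&gt;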
&lt;h1 class="sectionHeading"&gt;RGBA to RGB&lt;/h1&gt;
&lt;br /&gt; The destination and source pixel buffers are aligned to 32-byte boundaries and the conversion routines expect them to be so. The following figure depicts the arrangement of the source and destination buffers in memory, for &lt;b&gt;n&lt;/b&gt; pixels. In this figure R0 is at a lower address than G0, etc.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21085" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 7:&lt;/b&gt; Arrangement of source and destination pixels in memory&lt;br /&gt;&lt;br /&gt; In each iteration of the loop we load multiple source pixels, rearrange the data, remove the alpha value, and store the result to the destination address.&lt;br /&gt;&lt;br /&gt; Since the conversion is from a 4-channel pixel to a 3-channel pixel, we could load sixteen SP FP values from the source (four RGBA pixels) and write twelve values (four RGB pixels) per iteration. Doing so would force us to use unaligned stores, since in the next iteration we would have to write the result at an offset of twelve from the destination address. As explained before, we avoid all unaligned accesses by unrolling the loop twice, thus writing twenty four values (eight RGB pixels) per iteration.&lt;br /&gt;&lt;br /&gt; We first load sixteen SP FP values starting from the source address by invoking the _mm256_load_ps() aligned load intrinsic twice. The pixels are then rearranged using a combination of the _mm256_permutevar_ps() and _mm256_permute2f128_ps() intrinsics, and the intermediate results are blended using an appropriate mask to produce the first set of destination FP values. The following figure illustrates this step &lt;i&gt;(Step1)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21086" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 8:&lt;/b&gt; RGBA to RGB &lt;i&gt;Step1&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The next set of eight FP values is loaded and combined with the previously loaded values using a series of _mm256_permute2f128_ps(), _mm256_permutevar_ps(), _mm256_broadcast_ss(), and _mm256_blend_ps() intrinsics to produce the next set of eight destination values, as illustrated below &lt;i&gt;(Step2)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21087" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 9:&lt;/b&gt; RGBA to RGB &lt;i&gt;Step2&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; In the third step &lt;i&gt;(Step3)&lt;/i&gt;, source RGBA pixels six and seven are loaded from an offset of twenty four from the source address and shuffled and blended with the previously loaded pixels four and five using a series of _mm256_permute2f128_ps(), _mm256_permutevar_ps(), and _mm256_blend_ps() intrinsics to produce the last set of destination values for the current iteration. The following figure depicts this step.&lt;br /&gt;&lt;br /&gt;
&lt;p style="text-align: center;"&gt;&lt;img src="http://software.intel.com/file/21088" /&gt;&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Figure 10:&lt;/b&gt; RGBA to RGB &lt;i&gt;Step3&lt;/i&gt;&lt;br /&gt;&lt;br /&gt; The source and destination addresses are incremented by thirty two and twenty four respectively. Steps &lt;i&gt;Step1, Step2,&lt;/i&gt; and &lt;i&gt;Step3&lt;/i&gt; are repeated for the remainder of pixels.&lt;br /&gt;&lt;br /&gt; The figure below shows the source code that demonstrates the above steps.&lt;br /&gt;&lt;br /&gt;
&lt;pre name="code" class="cpp"&gt;// 8 RGBA ==&amp;gt; 8 RGB conversion per iteration&lt;br /&gt;&lt;br /&gt;// [A1 B1 G1 R1 , A0 B0 G0 R0]		&lt;br /&gt;__m256 pixel01 = _mm256_load_ps((float *)(srcPix));				&lt;br /&gt;&lt;br /&gt;// [*  *  B1 G1 , *  B0 G0 R0] &lt;br /&gt;__m256 pixelTmp = _mm256_permutevar_ps(pixel01, ctrl1);			&lt;br /&gt;&lt;br /&gt;// [A3 B3 G3 R3 , A2 B2 G2 R2]&lt;br /&gt;__m256 pixel23 = _mm256_load_ps((float *)(srcPix)+8); &lt;br /&gt;&lt;br /&gt;// [A2 B2 G2 R2 , A1 B1 G1 R1], 0x21 = 00 10 00 01&lt;br /&gt;__m256 pixel12 = _mm256_permute2f128_ps(pixel01, pixel23, 0x21); &lt;br /&gt;&lt;br /&gt;// [G2 R2 *  *  , R1 *  *  * ]&lt;br /&gt;pixel12 = _mm256_permutevar_ps(pixel12, ctrl2);					&lt;br /&gt;&lt;br /&gt;// [G2 R2 B1 G1 , R1 B0 G0 R0], 0xC8 = 11 00 10 00&lt;br /&gt;pixel01 = _mm256_blend_ps(pixelTmp, pixel12, 0xC8);&lt;br /&gt;_mm256_store_ps((float *)(dstPix), pixel01); &lt;br /&gt;&lt;br /&gt;// [B2 B2 B2 B2 , B2 B2 B2 B2]&lt;br /&gt;pixelTmp = _mm256_broadcast_ss((float *)(srcPix)+10);		&lt;br /&gt;&lt;br /&gt;// [A5 B5 G5 R5 , A4 B4 G4 R4]&lt;br /&gt;__m256 pixel45 = _mm256_load_ps((float *)(srcPix)+16);		&lt;br /&gt;&lt;br /&gt;// [A4 B4 G4 R4 , A3 B3 G3 R3]&lt;br /&gt;__m256 pixel34 = _mm256_permute2f128_ps(pixel23, pixel45, 0x21); &lt;br /&gt;&lt;br /&gt;// [*  B4 G4 R4 , B3 G3 R3 * ]&lt;br /&gt;pixel23  = _mm256_permutevar_ps(pixel34, ctrl3);				&lt;br /&gt;&lt;br /&gt;// [*  B4 G4 R4 , B3 G3 R3 B2],  0x1 = 00 00 00 01&lt;br /&gt;pixel23  = _mm256_blend_ps(pixel23, pixelTmp, 0x1);			&lt;br /&gt;&lt;br /&gt;// [R5 R5 R5 R5 , R5 R5 R5 R5]&lt;br /&gt;pixelTmp = _mm256_broadcast_ss((float *)(srcPix)+20);			&lt;br /&gt;&lt;br /&gt;// [R5 B4 G4 R4 , B3 G3 R3 B2], 0x80 = 10 00 00 00&lt;br /&gt;pixel23  = _mm256_blend_ps(pixel23, pixelTmp, 0x80);&lt;br /&gt;_mm256_store_ps((float *)(dstPix)+8, pixel23); &lt;br /&gt;&lt;br /&gt;// [A7 B7 G7 R7 , A6 B6 G6 R6]&lt;br /&gt;__m256 pixel67 = _mm256_load_ps((float *)(srcPix)+24);					&lt;br 
/&gt;&lt;br /&gt;// [A6 B6 G6 R6 , A5 B5 G5 R5]&lt;br /&gt;__m256 pixel56 = _mm256_permute2f128_ps(pixel45, pixel67, 0x21); &lt;br /&gt;&lt;br /&gt;// [*  *  *  B6 , *  *  B5 G5]&lt;br /&gt;pixel56 = _mm256_permutevar_ps(pixel56, ctrl4);					&lt;br /&gt;&lt;br /&gt;// [B7 G7 R7 *  , G6 R6 *  * ]&lt;br /&gt;pixel67 = _mm256_permutevar_ps(pixel67, ctrl5);					&lt;br /&gt;&lt;br /&gt;// [B7 G7 R7 B6 , G6 R6 B5 G5], 0xEC = 11 10 11 00&lt;br /&gt;pixel56 = _mm256_blend_ps(pixel56, pixel67, 0xEC);			&lt;br /&gt;_mm256_store_ps((float *)(dstPix)+16, pixel56);&lt;br /&gt;&lt;/pre&gt;
&lt;br /&gt; &lt;br /&gt; &lt;b&gt;Figure 11:&lt;/b&gt; Intel® AVX RGBA to RGB conversion code&lt;br /&gt;&lt;br /&gt;
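As a cross-check for the lane comments in Figure 11, the same conversion can be written as a scalar reference. This is a minimal Python sketch, not the scalar C++ implementation measured in the Results section; the function name rgba_to_rgb and the sample values are assumptions made for illustration. It treats the image as a flat sequence of channel values in R, G, B, A order, one value per 32-bit lane as the lane comments suggest, and drops every fourth value.&lt;br /&gt;&lt;br /&gt;

```python
def rgba_to_rgb(src):
    """Drop the alpha channel from a flat [R, G, B, A, R, G, B, A, ...] sequence."""
    if len(src) % 4 != 0:
        raise ValueError("expected a whole number of RGBA pixels")
    dst = []
    for i in range(0, len(src), 4):
        # Unpack one pixel and keep only its colour channels.
        r, g, b, _alpha = src[i:i + 4]
        dst.extend((r, g, b))
    return dst


# Eight pixels, the amount one iteration of the AVX code consumes.
# The values are made up so each channel is easy to trace: pixel p has
# R = 10*p, G = 10*p + 1, B = 10*p + 2, A = 10*p + 3.
pixels = [10 * p + c for p in range(8) for c in range(4)]
rgb = rgba_to_rgb(pixels)
print(len(pixels), len(rgb))
print(rgb[:9])
```

&lt;br /&gt;Each group of four input values yields three output values, so the 32 inputs become 24 outputs, matching the four 256-bit loads and three 256-bit stores in Figure 11.&lt;br /&gt;&lt;br /&gt;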
&lt;h1 class="sectionHeading"&gt;Results&lt;/h1&gt;
&lt;br /&gt; Two implementations of each conversion, a scalar C++ implementation and the 256-bit Intel® AVX implementation, were compared for performance on the Intel® AVX simulator. The runtime of each implementation was averaged over three runs. The following table shows the speedup achieved by the 256-bit version.&lt;br /&gt;&lt;br /&gt; 
&lt;table class="tableFormat1" border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Conversion&lt;/td&gt;
&lt;td&gt;Speedup vs scalar&lt;/td&gt;
&lt;td&gt;Num. pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RGB to RGBA&lt;/td&gt;
&lt;td&gt;2.73X&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RGBA to RGB&lt;/td&gt;
&lt;td&gt;2.14X&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;br /&gt;
&lt;h1 class="sectionHeading"&gt;References and Resources&lt;/h1&gt;
&lt;br /&gt; 
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://software.intel.com/en-us/avx/"&gt;http://software.intel.com/en-us/avx/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/v6YU4hgk1n4" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/v6YU4hgk1n4/intel-advanced-vector-extensions-pixel-format-conversions</link>
      <pubDate>Thu, 09 Jul 2009 08:29:18 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-advanced-vector-extensions-pixel-format-conversions#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/intel-advanced-vector-extensions-pixel-format-conversions</guid>
      <category>Parallel Programming and Multi-Core</category>
      <category>Visual Computing</category>
      <category>Intel® AVX</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/intel-advanced-vector-extensions-pixel-format-conversions</feedburner:origLink></item>
    <item>
      <title>Accessing an IA-32 DLL from a native Itanium® Architecture Process</title>
      <description>&lt;div class="sectionBody"&gt;&lt;strong&gt;By Clayne Robison, Intel Software Solutions Group, Server ISV Enabling&lt;/strong&gt;&lt;br /&gt;Porting to Itanium architecture: Avoiding dependencies on 3&lt;sup style="FONT-SIZE: 11px"&gt;rd&lt;/sup&gt;-party IA32 libraries by using COM or other out-of-process marshalling mechanisms.&lt;br /&gt;&lt;/div&gt;
&lt;div class="sectionHeading"&gt;Abstract/Overview&lt;/div&gt;
&lt;div class="sectionBody"&gt;Porting an application to the Intel Itanium architecture has many benefits, including the ability to take advantage of 64-bit address space and the architecture's explicit parallelism. However, complex applications often depend on components developed 3rd-party software sources, a dependency that puts the porting project at risk.&lt;br /&gt;&lt;br /&gt;An analysis of these dependencies may lead to the question: "Is it possible to use an IA32 DLL from within a native Itanium architecture process?" Unfortunately, none of the operating systems currently ported to the Intel Itanium architecture allow this. However, IA32 libraries can be accessed from native Itanium architecture processes provided the IA32 libraries are loaded into a separate process space, and some marshaling mechanism is in place.&lt;br /&gt;&lt;br /&gt;This source code demonstrates how an IA32 DLL can be used by a native Itanium architecture process using COM.&lt;br /&gt;&lt;/div&gt;
&lt;div class="sectionHeading"&gt;Target Audience&lt;/div&gt;
&lt;div class="sectionBody"&gt;Enterprise application developers&lt;br /&gt;&lt;/div&gt;
&lt;div class="sectionHeading"&gt;Sample Category:&lt;/div&gt;
&lt;div class="sectionBody"&gt;Operational Example Code. A complete Microsoft Visual Studio* workspace with three projects.&lt;br /&gt;&lt;/div&gt;
&lt;div class="sectionHeading"&gt;Implementation Language:&lt;/div&gt;
&lt;div class="sectionBody"&gt;C++/COM&lt;br /&gt;&lt;/div&gt;
&lt;div class="sectionHeading"&gt;Target Hardware &amp;amp; Software Platforms&lt;/div&gt;
&lt;div class="sectionBody"&gt;&lt;strong&gt;Hardware Systems:&lt;/strong&gt; Itanium Architecture&lt;br /&gt;&lt;strong&gt;Operating Systems:&lt;/strong&gt; Windows*, NET 64.&lt;br /&gt;&lt;strong&gt;Compilers:&lt;/strong&gt; Microsoft Visual C++ 6.0 SP5 with Microsoft Platform SDK (November 2001)&lt;br /&gt;&lt;/div&gt;
&lt;div class="sectionHeading"&gt;Additional Information&lt;/div&gt;
&lt;div class="sectionBody"&gt;The project includes instructions on how to build all binaries on a Win32 system using Microsoft Visual C++ 6.0 (SP5) with the November 2001 Platform SDK. Once all binaries are built, they can be run on a Win64 system. All binaries can also be run on a Win32 system (with the exception of the native Itanium architecture test application.)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://software.intel.com/en-us/articles/code-samples-license?target=http%3A%2F%2Fwww.intel.com%2Fcd%2Fids%2Fdeveloper%2Fasmo-na%2Feng%2F40522.htm"&gt;Download Code Sample&lt;/a&gt;&lt;/div&gt;
&lt;hr /&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/uZasSPOQrA8" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/uZasSPOQrA8/accessing-an-ia-32-dll-from-a-native-itaniumr-architecture-process</link>
      <pubDate>Wed, 08 Jul 2009 17:25:19 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/accessing-an-ia-32-dll-from-a-native-itaniumr-architecture-process#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/accessing-an-ia-32-dll-from-a-native-itaniumr-architecture-process</guid>
      <category>ISN General</category>
      <category>Itanium</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/accessing-an-ia-32-dll-from-a-native-itaniumr-architecture-process</feedburner:origLink></item>
    <item>
      <title>Announcing Version 2.1 of Intel® Graphics Performance Analyzers </title>
      <description>&lt;!--  --&gt;&lt;span style="text-decoration: underline;"&gt;&lt;b&gt;New features added to Intel® Graphics Performance Analyzers (Intel® GPA)&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt; Intel is pleased to announce Intel® Graphics Performance Analyzers 2.1.&lt;br /&gt;&lt;br /&gt; This version includes a number of significant new product features that have been added at the request of developers and publishers worldwide. These new features will allow you to more quickly and easily identify and resolve performance bottlenecks in your game or graphics application. If you have used GPA in the past or are just using it for the first time, it is recommended that you download this updated version, and start using it for your game optimization tasks. &lt;br /&gt;&lt;br /&gt;&lt;span style="text-decoration: underline;"&gt;&lt;b&gt;Key GPA Links:&lt;/b&gt;&lt;/span&gt;&lt;br /&gt; &lt;a target="_blank" title="GPA Home Page" href="http://www.intel.com/software/gpa/"&gt;&lt;b&gt; &lt;/b&gt;&lt;/a&gt; 
&lt;ul&gt;
&lt;li&gt;&lt;a target="_blank" title="GPA Home Page" href="http://www.intel.com/software/gpa/"&gt;Intel® Graphics Performance Analyzers Home Page (includes instructions for downloading the 2.1 release)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a target="_blank" title="GPA Help Forums" href="http://software.intel.com/en-us/forums/intel-graphics-performance-analyzers/"&gt;Help Forums&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a target="_blank" title="GPA FAQ" href="http://software.intel.com/en-us/articles/gpa-faq/"&gt;GPA FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;b&gt;&lt;span style="text-decoration: underline;"&gt;Summary of New features in Intel® GPA 2.1:&lt;/span&gt;&lt;br /&gt; &lt;br /&gt; &lt;/b&gt;&lt;b&gt;System Analyzer&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; 
&lt;ul&gt;
&lt;li&gt; Additional DX metrics&lt;/li&gt;
&lt;li&gt; Single-step frames&lt;/li&gt;
&lt;li&gt;Hot-Key Frame Capture&lt;/li&gt;
&lt;/ul&gt;
&lt;b&gt;Frame Analyzer&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; 
&lt;ul&gt;
&lt;li&gt;New Metrics per draw call                      
&lt;ul&gt;
&lt;li&gt; Vertex Shader Duration&lt;/li&gt;
&lt;li&gt;Pixel Shader Duration &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; Stacked bar chart (Vertex Shader and Pixel Shader time within GPU duration) &lt;/li&gt;
&lt;li&gt; Configuration of bar chart for both X and Y axis &lt;/li&gt;
&lt;li&gt;Pixel history -- Select any pixel in any render target&lt;/li&gt;
&lt;li&gt; Per render target overdraw visualization&lt;/li&gt;
&lt;li&gt; Buffer viewing options (render target, texture)                      
&lt;ul&gt;
&lt;li&gt; Mouse wheel zoom&lt;/li&gt;
&lt;li&gt; View single R,G, or B channel&lt;/li&gt;
&lt;li&gt; View alpha channel&lt;/li&gt;
&lt;li&gt; Histogram + clamping &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; Export metrics per selected draw calls to .csv file&lt;/li&gt;
&lt;li&gt; Disable Draw Call Experiment&lt;/li&gt;
&lt;/ul&gt;
&lt;b&gt;&lt;span style="text-decoration: underline;"&gt;Intel® Graphics Performance Analyzers, Version 2.1 Feature Overview&lt;br /&gt;&lt;br /&gt;GPA 2.1 System Analyzer:&lt;/span&gt;&lt;/b&gt;&lt;b&gt;&lt;span style="text-decoration: underline;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt; Additional DX Metrics &lt;/b&gt;(outlined in green below) &lt;b&gt;-&lt;/b&gt; on a per-frame basis, allows for users to understand in greater detail what is happening in each frame with a complete set of DX metrics:&lt;br /&gt; 
&lt;ul&gt;
&lt;li&gt;DX State Block Applies&lt;/li&gt;
&lt;li&gt; DX State Block Captures&lt;/li&gt;
&lt;li&gt; DX Render Target Changes&lt;/li&gt;
&lt;li&gt; DX Render Target Clears &lt;/li&gt;
&lt;li&gt; DX Color Fills &lt;/li&gt;
&lt;li&gt; DX Surface Updates &lt;/li&gt;
&lt;li&gt; DX Stretch Rects &lt;/li&gt;
&lt;li&gt; DX Surface Locks &lt;/li&gt;
&lt;li&gt; DX Volume Locks &lt;/li&gt;
&lt;li&gt; DX Surface Lock Time&lt;/li&gt;
&lt;li&gt; DX Volume Lock Time&lt;/li&gt;
&lt;/ul&gt;
&lt;img src="http://software.intel.com/file/20531" title="image002.jpg" alt="image002.jpg" /&gt;&lt;br /&gt; &lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Single Step Frame Capture&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/20533" title="image004.jpg" alt="image004.jpg" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/20534" title="image005.jpg" alt="image005.jpg" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Pause, then single step to the exact frame you want to capture.  Then press the capture button -- allows you to capture the frame you see.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;b&gt;Capture in-game via Hot-Key&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Overview:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; 
&lt;ul&gt;
&lt;li&gt; &lt;b&gt;Simply hit the hot-key at any point during game execution on the game machine to capture a DX frame.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;E&lt;/b&gt;&lt;b&gt;nd-user configurable Hot-Key &lt;/b&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt; &lt;b&gt;Example shots captured through Hot-key&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/20536" title="image007.jpg" alt="image007.jpg" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/20538" title="image009.jpg" alt="image009.jpg" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt; &lt;/b&gt;&lt;b&gt;&lt;span style="text-decoration: underline;"&gt;GPA 2.1 Frame Analyzer&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt; Pixel History - &lt;/b&gt;allows you to select any pixel in any render target in the render target viewer and to see all draw calls that have affected that pixel in a new tab (the following screen shot shows usage of pixel selection from an overdraw visualization so that you can select "hot" pixels).&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/20539" title="image010.jpg" alt="image010.jpg" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;&lt;br /&gt; Vertex &amp;amp; Pixel Shader Duration metric - &lt;/b&gt;Vertex and pixel shader durations have been added as metrics for all devices.  
The screenshot below shows both metrics at the same time in the erg bar chart with vertex shader duration selected as the x-axis metric and pixel shader duration selected as the y-axis metric.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/20540" title="image011.jpg" alt="image011.jpg" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt;Overdraw per Render Target - &lt;/b&gt;This feature lets you find areas of the scene that are expensive from an overdraw perspective and then determine exactly which draw calls contribute to that overdraw and which of them cause most of it.&lt;br /&gt; &lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/20542" title="image013.jpg" alt="image013.jpg" /&gt;&lt;img src="http://software.intel.com/file/20544" title="image015.jpg" alt="image015.jpg" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;b&gt;Buffer Histograms with Clamping - &lt;/b&gt;Buffer histograms are now a slide-out option for any buffer displayed (render target or texture).  This enables you to see the color channels as well as the alpha channel of any buffer and allows you to clamp the highlights or shadows either manually or automatically with a single button press.  Clamping expands the dynamic range of the buffer as you view it, so you can see portions of the buffer that might be hard to differentiate due to similar values.&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/20545" title="image016.jpg" alt="image016.jpg" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; &lt;b&gt; &lt;/b&gt;&lt;b&gt;Export Draw Call Subset to CSV - &lt;/b&gt;This feature exports draw-call-level metrics to a CSV file that can be processed in an external application, such as Microsoft Excel*. 
This is a selection based feature as well, meaning that if you select one draw call you will export one draw call; or if you select the entire frame you will export the entire frame.&lt;br /&gt;&lt;br /&gt; &lt;b&gt;&lt;img src="http://software.intel.com/file/20547" title="image018.jpg" alt="image018.jpg" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt; Exported to CSV file:&lt;br /&gt;&lt;br /&gt; &lt;img src="http://software.intel.com/file/20549" title="image020.png" alt="image020.png" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;Disabled Draw Call Experiment -&lt;/b&gt; Allows you to experiment with various draw calls and see which ones are causing the bottleneck and how you can improve performance&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/20550" title="image021.jpg" alt="image021.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;br /&gt; Disabled Draw Calls:  &lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/20551" title="image022.jpg" alt="image022.jpg" /&gt;&lt;br /&gt;&lt;br /&gt; &lt;br /&gt; The Experiments tab showing Disabled Draw calls (ergs):&lt;br /&gt;&lt;br /&gt;&lt;img src="http://software.intel.com/file/20552" title="image023.jpg" alt="image023.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;br /&gt; &lt;b&gt;&lt;span style="text-decoration: underline;"&gt;Additional information:&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt; For additional information please check out the forums and knowledge base articles on the GPA website (links are located at the top of this document).  For more detailed information on all the features of GPA 2.1, please refer to the user's guides and FAQ, also linked above.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; &lt;i&gt;* Other names and brands may be claimed as the property of others&lt;/i&gt;&lt;i&gt;.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/QJFluoFTWZc" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/QJFluoFTWZc/GPA-version2dot1</link>
      <pubDate>Wed, 08 Jul 2009 14:29:33 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/GPA-version2dot1#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/GPA-version2dot1</guid>
      <category>Intel® Graphics Performance Analyzers Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/GPA-version2dot1</feedburner:origLink></item>
    <item>
      <title>Prana Studios leverages Intel® Xeon® Processor 5500 Series to get better 3D animation rendering</title>
      <description>&lt;p&gt;&lt;b&gt;Introduction:&lt;/b&gt;  Prana Studios is a leading animation house based in Mumbai and Los Angeles. Prana's core business is focused on four main areas: long-form CG content, location-based entertainment, game cinematics, and feature film effects. They began collaborating with Intel to resolve a technical challenge they were facing while working on an ongoing animated movie co-produced with a leading Bollywood studio.&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;b&gt;The Challenge: &lt;/b&gt;The technical challenge was to improve the renderer performance of displacements and 3D motion blur in the SitexGraphics* Air* product.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;br /&gt;The Solution: &lt;/b&gt;To meet this challenge, the Prana* team explored Air* multi-core software optimizations and also evaluated the latest hardware platform powered by the Intel® Xeon® processors 5500 series with its Intel® Hyper-Threading Technology (Intel® HT Technology).&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;b&gt;The Impact: &lt;/b&gt;The Air* renderer performed significantly better on the Intel® Xeon® processor 5500 series-based machine than on the earlier-generation Intel® Xeon® processor 5355 series-based platform, with average gains of 1.8X with Intel® HT Technology ON and 1.45X with Intel® HT Technology OFF.&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;b&gt;Application Optimizations in Air* Software: &lt;/b&gt;During evaluation of the Intel® Xeon® processor 5500 series-based platform, the Prana* team found some scalability issues with the Air* renderer when moving from 8-thread to 16-thread execution. This feedback was given to SitexGraphics*, who investigated the threading issues in the Air* renderer and released a fix that enabled 16-thread execution on the Intel® Xeon® processor 5500 series-based platform (with the Intel® HT Technology feature ON).&lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;b&gt;Deploying the Intel® Xeon® Processor 55xx Series-based Platform:&lt;/b&gt; Performance of the Air* renderer was evaluated using workloads consisting of foliage, fur and texture scenes. Performance measurements on the Intel® Xeon® processor 5500 series-based platform were done with Intel® HT Technology both ON and OFF. To render the three workloads, the Intel® Xeon® processor 5355 series-based platform took 199, 871, and 728 seconds respectively, while the Intel® Xeon® processor 5500 series-based platform took 137, 616, and 527 seconds with Intel® HT Technology OFF (8-thread execution) and 93, 495, and 419 seconds with Intel® HT Technology ON (16-thread execution). The best rendering performance was therefore achieved on the Intel® Xeon® processor 5500 series-based platform with Intel® HT Technology ON (see Fig. 1). &lt;/p&gt;
&lt;p&gt;The average performance gains on the Intel® Xeon® processor 5500 series-based platform with Intel® HT Technology OFF and ON were 1.45X and 1.8X respectively, when compared to the Intel® Xeon® processor 5355 series-based platform.&lt;/p&gt;
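&lt;p&gt;The per-workload ratios behind these averages can be recomputed from the render times quoted above. The following is a minimal Python sketch; the variable names and the helper speedups are illustrative assumptions, and because the article does not state how the 1.45X and 1.8X averages were derived, the sketch only reports the individual ratios.&lt;/p&gt;

```python
# Render times in seconds for the three workloads, as quoted in the text.
xeon_5355 = [199, 871, 728]
xeon_5500_ht_off = [137, 616, 527]   # 8-thread execution
xeon_5500_ht_on = [93, 495, 419]     # 16-thread execution


def speedups(baseline_s, candidate_s):
    """Return baseline/candidate time ratios, one per workload."""
    return [b / c for b, c in zip(baseline_s, candidate_s)]


ht_off = speedups(xeon_5355, xeon_5500_ht_off)
ht_on = speedups(xeon_5355, xeon_5500_ht_on)

for n, (off, on) in enumerate(zip(ht_off, ht_on), start=1):
    print(f"workload {n}: HT OFF {off:.2f}X, HT ON {on:.2f}X")
```

&lt;p&gt;A ratio above 1 means the newer platform rendered that workload faster than the Intel® Xeon® processor 5355 series-based platform.&lt;/p&gt;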
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;
&lt;table width="100%" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;p&gt;&lt;b&gt;Fig. 1: Lower is better&lt;br /&gt;&lt;img src="http://software.intel.com/file/20092" alt="Prana_fig+1.jpg" title="Prana_fig+1.jpg" /&gt;&lt;/b&gt;&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table width="100%" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;p&gt; &lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;a&gt;&lt;/a&gt; &lt;/p&gt;
&lt;p&gt; &lt;/p&gt;
&lt;p&gt;&lt;b&gt; &lt;/b&gt;&lt;b&gt;"Good things" about Intel® Xeon® processor 5500 series: &lt;/b&gt;Intel® Xeon® processor-based servers provide reliable, efficient, and proven performance, designed from ground up to meet data-demanding enterprise requirements. The Intel® Xeon® processors are the ideal choice for business-critical computing.  &lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;br /&gt;Configuration of the machines tested:&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="text-decoration: underline;"&gt;Intel® Xeon® processor 5355 series-based Platform&lt;br /&gt;&lt;/span&gt;•  &lt;b&gt;Hardware: &lt;/b&gt;Dual Processors Intel® Xeon® CPU 5355 Series @ 2.66GHz with 8GB FBD2 800MHz RAM&lt;br /&gt;•  &lt;b&gt;OS:&lt;/b&gt; Windows* XP* Professional x64 Edition v5.2.3790 Service Pack 2 Build 3790&lt;br /&gt;•  &lt;b&gt;Software Stack: &lt;/b&gt;Maya*2008 Ext2 (32-bit), MayaMan* 2.0.15 (32-bit), SitexGraphics* Air* 8.09 (32-bit)&lt;/p&gt;
&lt;p&gt;&lt;span style="text-decoration: underline;"&gt;Intel® Xeon® processor 5500 series-based Platform&lt;br /&gt;&lt;/span&gt;•  &lt;b&gt;Hardware: &lt;/b&gt;Dual Processors Intel® Xeon® CPU 5560 Series @ 2.8GHz with 8 GB DDR3 1066MHz RAM&lt;br /&gt;•  &lt;b&gt;OS:&lt;/b&gt; Windows XP Professional x64 Edition v5.2.3790 Service Pack 2 Build 3790&lt;br /&gt;•  &lt;b&gt;Software Stack: &lt;/b&gt;Maya*2008 Ext2 (32-bit), MayaMan* 2.0.15 (32-bit), SitexGraphics* Air* 8.09 (32-bit)&lt;/p&gt;
&lt;br /&gt;&lt;br /&gt;
&lt;hr size="1" width="33%" align="left" /&gt;
*Other names and brands may be claimed as the property of others.&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/zGhj79UGXmU" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/zGhj79UGXmU/prana-studios-leverages-intel-xeon</link>
      <pubDate>Wed, 08 Jul 2009 01:40:23 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/prana-studios-leverages-intel-xeon#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/prana-studios-leverages-intel-xeon</guid>
      <category>Xeon</category>
      <category>Visual Computing</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/prana-studios-leverages-intel-xeon</feedburner:origLink></item>
    <item>
      <title>OMP abort: Initializing libguide.lib, but found libguide40.lib already</title>
      <description>
&lt;table border="0" cellpadding="0" cellspacing="15"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class="bodycopy"&gt;
&lt;p&gt;&lt;b&gt;Symptom(s):&lt;/b&gt;&lt;br /&gt;&lt;b&gt;OMP abort: Initializing libguide.lib, but found libguide40.lib already initialized.&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;This can cause performance degradation.&lt;/p&gt;
Set environment variable KMP_DUPLICATE_LIB_OK=TRUE if you want your program to continue in this case.
&lt;p&gt;&lt;b&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Cause:&lt;/b&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;b&gt;mkl_c.lib&lt;/b&gt; is one of mkl static library interface. It defines &lt;b&gt;libguide.lib&lt;/b&gt; as a default library for resolution of threading library calls. But other Intel® software such as Intel® IPP, Intel® C++ Compiler and Intel® OpenCV defined &lt;b&gt;libguid40.lib&lt;/b&gt; as default. So the error arises because of the duplicate initialization of &lt;b&gt;libguide.lib&lt;/b&gt; when using static MKL library and other Intel software at the same time.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Solution:&lt;/b&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;You could use linker switch &lt;b&gt;/nodefaultlib:libguide.lib&lt;/b&gt; and link with &lt;b&gt;libguide40.lib&lt;/b&gt; by adding the option to the link line or in project option.&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;For example, link &lt;b&gt;users.obj mkl_c.lib /nodefaultlib:libguide.lib libguide40.lib&lt;/b&gt;. This will stop &lt;b&gt;mkl_c.lib&lt;/b&gt; from defining &lt;b&gt;libguide.lib&lt;/b&gt; as a default library for resolution of threading library calls.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Operating System:&lt;/b&gt;&lt;/p&gt;
&lt;table border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class="xs"&gt;Windows* XP Professional, Windows* XP Home Edition, Windows* XP Tablet PC Edition, Windows* XP Media Center Edition&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table border="0" cellpadding="0" cellspacing="0"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="http://software.intel.com/file/6324" height="5" width="388" /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td height="10"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/Slhw7jPGnNE" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/Slhw7jPGnNE/performance-tools-for-software-developers-omp-abort-initializing-libguidelib-but-found-libguide40lib-already</link>
      <pubDate>Tue, 07 Jul 2009 22:20:25 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/performance-tools-for-software-developers-omp-abort-initializing-libguidelib-but-found-libguide40lib-already#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/performance-tools-for-software-developers-omp-abort-initializing-libguidelib-but-found-libguide40lib-already</guid>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/performance-tools-for-software-developers-omp-abort-initializing-libguidelib-but-found-libguide40lib-already</feedburner:origLink></item>
    <item>
      <title>Intel® Visual Fortran Compiler Documentation not available from Help menu in Visual Studio 2008 Shell*</title>
      <description>&lt;br /&gt;
&lt;div id="art_pre_template"&gt;&lt;b&gt;Reference Number : DPD200137640&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Version : 11.1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Operating System : Windows&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Problem Description : &lt;/strong&gt;When using Intel® Visual Fortran with Microsoft Visual Studio 2008 Shell*, the compiler documentation is not available from the help menu.  There is a menu option, Help &amp;gt; Intel Visual Fortran Compiler Pro &amp;gt; Intel Visual Fortran Compiler Help which opens the Microsoft Document Explorer but there is no Intel® Visual Fortran Compiler documentation displayed.  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Resolution Status : &lt;/strong&gt;This is a known issue that may be resolved in a future product release.  This issue occurs when Intel Visual Fortran with Visual Studio 2008 Shell is installed on a system that has had Visual Studio 2005 Premier Partner Edition (VSPPE) previously installed.  The workaround is to uninstall the Intel Compilers, Visual Studio 2008 Shell, and VSPPE and then reinstall Visual Studio 2008 Shell and the Intel Compilers.  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;[DISCLAIMER: The information on this web site is intended for hardware system manufacturers and software developers. Intel does not warrant the accuracy, completeness or utility of any information on this site. Intel may make changes to the information or the site at any time without notice. Intel makes no commitment to update the information at this site. ALL INFORMATION PROVIDED ON THIS WEBSITE IS PROVIDED "as is" without any express, implied, or statutory warranty of any kind including but not limited to warranties of merchantability, non-infringement of intellectual property, or fitness for any particular purpose. Independent companies manufacture the third-party products that are mentioned on this site. 
Intel is not responsible for the quality or performance of third-party products and makes no representation or warranty regarding such products. The third-party supplier remains solely responsible for the design, manufacture, sale and functionality of its products. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.]&lt;/i&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/ISNMain/~4/HFvJfpdRXrk" height="1" width="1"/&gt;</description>
      <link>http://feedproxy.google.com/~r/ISNMain/~3/HFvJfpdRXrk/fortran-documentation-not-available-from-help-menu-in-visual-studio-2008-shell</link>
      <pubDate>Tue, 07 Jul 2009 17:11:34 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/fortran-documentation-not-available-from-help-menu-in-visual-studio-2008-shell#comments</comments>
      <guid isPermaLink="false">http://software.intel.com/en-us/articles/fortran-documentation-not-available-from-help-menu-in-visual-studio-2008-shell</guid>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
    <feedburner:origLink>http://software.intel.com/en-us/articles/fortran-documentation-not-available-from-help-menu-in-visual-studio-2008-shell</feedburner:origLink></item>
  </channel>
</rss>
