<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" gd:etag="W/&quot;CUQBRXk7fip7ImA9WhdQEEo.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631</id><updated>2011-08-11T16:35:54.706+02:00</updated><category term="particle" /><category term="GPU" /><category term="tonemapping" /><category term="deferred lighting" /><category term="directx" /><category term="premultiplied alpha" /><category term="simulator" /><category term="PS3" /><category term="blending" /><category term="video driver" /><category term="shader" /><category term="Software Rendering" /><category term="light" /><category term="2010" /><category term="HDR" /><category term="fedora" /><category term="VC++" /><category term="cell" /><category term="post processing" /><category term="geometry" /><category term="C++" /><category term="color grading" /><category term="compression" /><category term="virtual memory" /><category term="GCC" /><category term="multiple inheritance" /><category term="ibm" /><category term="CPU" /><category term="circular buffer" /><category term="rasterization" /><category term="COJ" /><category term="particle data structure" /><category term="Linux" /><category term="kernel" /><category term="ALU" /><category term="optimization" /><category term="windows" /><category term="LUT" /><category term="SPU" /><category term="Aggregated" /><category term="Siggraph" /><category term="papers" /><category term="compiler" /><category term="occlusion culling" /><title>KriScg</title><subtitle type="html">Another blog of another graphics 
programmer</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://kriscg.blogspot.com/" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>13</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/Kriscg" /><feedburner:info uri="kriscg" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry gd:etag="W/&quot;D0AFSHkzeSp7ImA9Wx9UFk0.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-3629024966129121286</id><published>2011-02-13T01:47:00.002+01:00</published><updated>2011-02-13T15:01:59.781+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-02-13T15:01:59.781+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="windows" /><category scheme="http://www.blogger.com/atom/ns#" term="kernel" /><category scheme="http://www.blogger.com/atom/ns#" term="video driver" /><category scheme="http://www.blogger.com/atom/ns#" term="virtual memory" /><category scheme="http://www.blogger.com/atom/ns#" term="directx" /><title>Virtual memory on PC</title><content type="html">There is an &lt;a href="http://solid-angle.blogspot.com/2011/02/virtual-addressing-101.html"&gt;excelent post&lt;/a&gt; about virtual memory. It's written mainly from a perspective of console developer. 
On consoles, most memory issues come down to TLB misses and the physical memory limit. I'll try to describe how (bad) things look on PC (Windows) for 32-bit programs, especially nowadays, when games require more and more data.&lt;br /&gt;
&lt;br /&gt;
First, half of a program's virtual address space is taken by the kernel. This means that the top bit of every user-mode pointer is unused, so it can be used for some evil trickery :). Moreover, the first and last 64 KB are reserved by the kernel.&lt;br /&gt;
&lt;br /&gt;
A program's code and heap have to be loaded somewhere. When compiling with VC++, the default image base is 0x00400000. Then a bunch of DLLs are loaded at seemingly arbitrary virtual addresses. You can check which DLLs are loaded, at what addresses, and how large they are using &lt;a href="http://www.dependencywalker.com/"&gt;Dependency Walker&lt;/a&gt;. Use its start profiling feature to see the real virtual address of a given DLL. The DLLs and the program usually aren't loaded into one contiguous address range, so before we have called new/malloc even once, virtual memory is already fragmented.&lt;br /&gt;
&lt;br /&gt;
Then comes the video driver. It uses precious virtual memory for managed resources, the command buffer, and temporary storage for locking non-managed resources. Creating or locking non-managed resources is especially misleading, because DirectX returns "out of video memory" instead of "out of virtual memory". It's very tempting to put all static level geometry into one 100 MB non-managed vertex buffer. When creating or filling this VB, the video driver will try to allocate a contiguous 100 MB chunk of virtual memory. This will likely result in a program crash after some time.&lt;br /&gt;
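&lt;br /&gt;
To make the fragmentation problem concrete, here is a minimal sketch that scans a list of reserved ranges and reports the largest contiguous hole. The layout (a 2 GB user range with a 16 MB image every 64 MB) is purely hypothetical and the function name is mine; real layouts are messier.&lt;br /&gt;

```python
def largest_free_block(used, space_start, space_end):
    """Return (largest contiguous free size, total free size) in bytes.

    `used` is a list of (start, size) reservations inside the
    half-open range [space_start, space_end).
    """
    largest = 0
    total = 0
    cursor = space_start
    for start, size in sorted(used):
        gap = max(0, start - cursor)      # hole before this reservation
        largest = max(largest, gap)
        total += gap
        cursor = max(cursor, start + size)
    tail = max(0, space_end - cursor)     # hole after the last reservation
    return max(largest, tail), total + tail


MB = 1024 * 1024
# Hypothetical layout: one 16 MB image every 64 MB in a 2 GB user range.
used = [(i * 64 * MB, 16 * MB) for i in range(1, 32)]
largest, total = largest_free_block(used, 0, 2048 * MB)
print(largest // MB, "MB largest hole,", total // MB, "MB free in total")
```

With this made-up layout there are 1552 MB free in total, yet the largest hole is only 64 MB, so a contiguous 100 MB request would fail.&lt;br /&gt;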
&lt;br /&gt;
Windows uses 4 KB pages, so making smaller allocations leads to internal fragmentation. I guess everyone already uses some kind of custom memory allocator, so this isn't a problem.&lt;br /&gt;
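&lt;br /&gt;
A quick sketch of the arithmetic behind that internal fragmentation, assuming every request is rounded up to whole 4 KB pages (the function names are mine, for illustration only):&lt;br /&gt;

```python
PAGE_SIZE = 4096  # 4 KB pages, as stated above


def committed_bytes(request):
    """Round a request up to a whole number of pages."""
    pages = -(-request // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE


def wasted_bytes(request):
    """Bytes lost to internal fragmentation for one page-granular request."""
    return committed_bytes(request) - request


for size in (100, 4096, 4097):
    print(size, "bytes requested,", wasted_bytes(size), "bytes wasted")
```

This is why a custom allocator grabs large blocks from the OS and carves small allocations out of them.&lt;br /&gt;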
&lt;br /&gt;
There is the /LARGEADDRESSAWARE linker flag, which allows a program to use an additional 1 GB of virtual memory. It requires the user to change boot parameters and usually doesn't work well in practice (system stability issues etc.). It's also possible to compile as a 64-bit program, but according to the &lt;a href="http://store.steampowered.com/hwsurvey?platform=pc"&gt;Steam HW survey&lt;/a&gt; half of gamers use a 32-bit OS. It's really annoying that MS is still making 32-bit systems, because current min-spec PC game CPUs are Core 2 or similar, with 64-bit support.&lt;br /&gt;
&lt;br /&gt;
Summarizing in theory memory shouldn't be a problem on PC, but in practice it's a precious and fragile resource.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-3629024966129121286?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/y6bZzLiwufc" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/3629024966129121286/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2011/02/virtual-memory-on-pc.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/3629024966129121286?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/3629024966129121286?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/y6bZzLiwufc/virtual-memory-on-pc.html" title="Virtual memory on PC" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2011/02/virtual-memory-on-pc.html</feedburner:origLink></entry><entry gd:etag="W/&quot;A0UMQnYyfSp7ImA9Wx9XEE0.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-4253730048326426927</id><published>2010-10-27T00:19:00.002+02:00</published><updated>2011-01-03T00:01:23.895+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-01-03T00:01:23.895+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="ALU" /><category scheme="http://www.blogger.com/atom/ns#" 
term="shader" /><category scheme="http://www.blogger.com/atom/ns#" term="optimization" /><category scheme="http://www.blogger.com/atom/ns#" term="GPU" /><title>Shader optimizations</title><content type="html">A small list of basic and sometimes overlooked shader optimization possibilities. These are very small gains, but they can add up, and maybe there will be some free time for an additional point light or a better shadow filter?&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Full screen quad vs full screen triangle&lt;/b&gt;&lt;br /&gt;
Post-processing effects and color/z downsampling are usually rendered using full screen quads. Hardware works on groups of at least 2x2 pixels. Pixel group size goes up with time (just like expected game resolution). For example, NVIDIA Fermi has 4x2 pixel groups and the older NVIDIA G80-G92 use 2x2 quads. This means that rendering a full screen quad as two triangles creates some overlapping pixel quads along the diagonal. At 1000x1000 resolution with 2x2 pixel quads, 500 quads will be shaded twice. If we factor out cache misses, that is 0.2% of additional work. Besides, with a single full screen triangle there is one vertex less to push to the GPU :). In my synthetic test (on a crappy GeForce 240) the difference was around 0.2% - 0.3%.&lt;br /&gt;
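&lt;br /&gt;
The 0.2% figure can be reproduced with a few lines of arithmetic. This is only a sketch for a square render target and square pixel groups, assuming one doubly shaded group per column along the diagonal; the helper name is made up.&lt;br /&gt;

```python
def diagonal_overhead(size, quad):
    """Return (groups shaded twice, fraction of extra work) for a
    size x size target drawn as two triangles with quad x quad groups."""
    quads_per_side = size // quad
    total_quads = quads_per_side * quads_per_side
    diagonal_quads = quads_per_side  # one group per column lies on the diagonal
    return diagonal_quads, diagonal_quads / total_quads


count, fraction = diagonal_overhead(1000, 2)
print(count, "groups shaded twice, extra work:", fraction * 100.0, "%")
```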
&lt;br /&gt;
&lt;b&gt;Direct3D shader compiler (FXC) on PC&lt;/b&gt;&lt;br /&gt;
Instruction counts displayed by FXC on PC don't mean much nowadays. FXC just translates HLSL to asm, which the driver later translates to a hardware-specific IL. It's quite possible to decrease the instruction count displayed by FXC and slow down the shader at the same time. Instead of relying on FXC numbers, it's better to check real performance (FPS/ms) and/or the numbers generated by vendor tools (&lt;a href="http://developer.amd.com/gpu/shader/Pages/default.aspx"&gt;ShaderAnalyzer&lt;/a&gt; and &lt;a href="http://developer.nvidia.com/object/nvshaderperf_home.html"&gt;ShaderPerf&lt;/a&gt;). They also display the GPR count, which is quite important as it shows how much work can run in parallel.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Hardware instructions&lt;/b&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;MADD is a hardware instruction on GPUs. Convert code like "( x - a ) * b" to "x * c + d", where c = b and d = -a * b are precomputed constants. This can save 1 ALU instruction.&lt;/li&gt;
&lt;li&gt;Saturate, negation and abs are instruction modifiers and are free. Yes, there is a free dinner :). Sometimes equations can be rearranged to use saturate instead of clamp/min/max. Negation and abs can help to decrease the number of constant registers used.&lt;/li&gt;
&lt;li&gt;Some instructions are executed on the transcendental units. Transcendental units compute everything as scalars, and there is roughly one transcendental unit per 2-8 ALUs. It's a good idea to avoid excessive use of instructions like sin, cos, log, sqrt and pow (very bad - calculated using three instructions).&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
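The MADD rewrite from the first bullet is plain algebra: ( x - a ) * b = x * b + ( -a * b ). Below is a small sketch (in Python rather than HLSL, just to check the numbers) that folds the constants once on the CPU side; the function names are hypothetical.&lt;br /&gt;

```python
def fold_to_mad(a, b):
    """Precompute c and d so that (x - a) * b == x * c + d."""
    return b, -a * b


def remap_naive(x, a, b):
    return (x - a) * b  # subtract, then multiply: two ALU instructions


def remap_mad(x, c, d):
    return x * c + d    # maps to a single multiply-add


a, b = 2.0, 0.5
c, d = fold_to_mad(a, b)
print(remap_naive(10.0, a, b), remap_mad(10.0, c, d))
```

In a shader, c and d would be uploaded as constants instead of a and b. Floating point rounding can differ slightly between the two forms.&lt;br /&gt;
&lt;br /&gt;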
&lt;b&gt;Vectorize with care&lt;/b&gt;&lt;br /&gt;
Some GPUs have vector ALUs (AMD/ATI cards and NVIDIA cards older than G80) and some have scalar ones (NVIDIA G80, G92, Fermi). With a vector ALU, a scalar instruction takes the same time as a vector one that computes 4 components at once. People usually try to vectorize everything in shaders, which can add extra computations and actually result in a slower shader on scalar ALU hardware. It's a good idea to mask vector computations. For example, in a blur shader there is no need to calculate the alpha channel, so just use float3 for accumulation. We could go further and even write two shader versions: one for vector ALUs and one for scalar ones. There is no point in vectorizing instructions computed on transcendental units (sin, cos, log, pow...) - they are always scalar.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Clip/texkill&lt;/b&gt;&lt;br /&gt;
Consider adding clip/texkill (or alpha test, which can be faster on old hardware) when alpha blending is enabled. Think deferred lights, particles, volumetric light shafts. This can remove some work from the ROP units if you don't have uber tight geometry.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Interpolators&lt;/b&gt;&lt;br /&gt;
Shader bottlenecks aren't only about ALU, GPR and texture fetches. On rare occasions (or on some hardware) interpolators can become a bottleneck. Sometimes, with short pixel shaders, it's better to move computations from the vertex shader to the pixel shader if that helps to decrease the interpolator count.&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border-bottom-color: rgb(153, 153, 153); border-bottom-style: dashed; border-bottom-width: 1px; border-left-color: rgb(153, 153, 153); border-left-style: dashed; border-left-width: 1px; border-right-color: rgb(153, 153, 153); border-right-style: dashed; border-right-width: 1px; border-top-color: rgb(153, 153, 153); border-top-style: dashed; border-top-width: 1px; color: black; font-family: 'Andale Mono', 'Lucida Console', Monaco, fixed, monospace; font-size: 12px; line-height: 14px; margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; overflow-x: auto; overflow-y: auto; padding-bottom: 5px; padding-left: 5px; padding-right: 5px; padding-top: 5px; width: 95%;"&gt;&lt;code&gt;// 8 interpolators and minimal ALU in pixel shader
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i &amp;lt; 8; ++i )
    {
        color += In.m_uv[ i ];
    }
    return color;
}

// one interpolator and some ALU in pixel shader
float4 gSomeValMul[ 8 ];
float4 gSomeValAdd[ 8 ];
float4 psMain( SVshOut In ) : COLOR0
{
    float4 color = 0.;
    for ( int i = 0; i &amp;lt; 8; ++i )
    {
        color += In.m_uv * gSomeValMul[ i ] + gSomeValAdd[ i ];
    }
    return color;
}
&lt;/code&gt;&lt;/pre&gt;8 interpolator version runs at 28.17ms (100 runs on geforce 240). 1 interpolator + some ALU version runs at 21.44ms (just as empty pixel shader). This is of course a very specific case. Still it's a good idea to watch out and pack interpolators.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-4253730048326426927?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/TZAXevgLRF8" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/4253730048326426927/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/10/shader-optimizations.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/4253730048326426927?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/4253730048326426927?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/TZAXevgLRF8/shader-optimizations.html" title="Shader optimizations" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/10/shader-optimizations.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUQDQ3syfyp7ImA9Wx5WFk0.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-2303401931387806492</id><published>2010-09-27T19:09:00.000+02:00</published><updated>2010-09-27T19:09:32.597+02:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2010-09-27T19:09:32.597+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="occlusion culling" /><category scheme="http://www.blogger.com/atom/ns#" term="rasterization" /><category scheme="http://www.blogger.com/atom/ns#" term="CPU" /><title>Software occlusion culling</title><content type="html">&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_PbTMzfd4SSg/TJZOMRNJPfI/AAAAAAAAAec/3uUMq2ApM-E/s1600/rast.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/_PbTMzfd4SSg/TJZOMRNJPfI/AAAAAAAAAec/3uUMq2ApM-E/s400/rast.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
Today's CPUs are quite fast, so why not use them to draw some triangles? Especially when all the cool kids use them for software occlusion culling. Time to take back some CPU time from gameplay programmers and use it to draw pretty pictures.&lt;br /&gt;
&lt;br /&gt;
Software occlusion culling using rasterization isn't a new idea (&lt;a href="http://www.cs.unc.edu/~zhangh/hom.html"&gt;HOM&lt;/a&gt;). Basically, it fills a software z-buffer and tests some objects against it (usually screen space bounding boxes). Rasterization is usually done at a low resolution (&lt;a href="http://www.slideshare.net/repii/parallel-graphics-in-frostbite-current-future-siggraph-2009-1860503"&gt;DICE uses 256x114&lt;/a&gt;). Testing can also be done using a hierarchical z-buffer (min depth or min/max depth hierarchy).&lt;br /&gt;
&lt;br /&gt;
How to write one? Step one - the transformation &lt;a href="http://www.cbloom.com/3d/techdocs/pipeline.txt"&gt;pipeline&lt;/a&gt;. It can become a bottleneck if it isn't done properly. Step two - the &lt;a href="http://www.flipcode.com/archives/Real-time_3D_Clipping_Sutherland-Hodgeman.shtml"&gt;clipper&lt;/a&gt;. Clipper code quality isn't that important. Just remember to clamp or clip the x and y coordinates after the projection divide. Step three - a &lt;a href="http://chrishecker.com/Miscellaneous_Technical_Articles"&gt;scanline&lt;/a&gt; or &lt;a href="http://www.devmaster.net/forums/showthread.php?t=1884"&gt;half-space&lt;/a&gt; rasterizer. Half-spaces map very nicely to vector instructions and many threads, and play well with the cache. The half-space approach was a win over scanlines when I wrote a software renderer on SPU with many threads and interpolants. This time I prototyped software occlusion culling for a "min-spec" PC (1-2 core CPU), so there is only one thread, one interpolant and the resolution is quite small. In this case scanlines were about 2-3 times faster than half-spaces.&lt;br /&gt;
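&lt;br /&gt;
For readers who haven't seen the half-space approach, here is a minimal coverage-only sketch (not my actual rasterizer). It assumes counter-clockwise triangles, samples pixel centers, has no fill rule and no depth, and evaluates the edge functions from scratch for every pixel. A real implementation steps them incrementally and tests several pixels per SIMD instruction.&lt;br /&gt;

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: non-negative when p lies on the inner side of the
    edge from a to b of a counter-clockwise triangle."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Return the set of (x, y) pixels whose centers the triangle covers."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    # Bounding box of the triangle, clamped to the render target.
    min_x = max(0, int(min(x0, x1, x2)))
    max_x = min(width - 1, int(max(x0, x1, x2)))
    min_y = max(0, int(min(y0, y1, y2)))
    max_y = min(height - 1, int(max(y0, y1, y2)))
    covered = set()
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            if (edge(x0, y0, x1, y1, px, py) >= 0
                    and edge(x1, y1, x2, y2, px, py) >= 0
                    and edge(x2, y2, x0, y0, px, py) >= 0):
                covered.add((x, y))
    return covered


pixels = rasterize(((0.0, 0.0), (4.0, 0.0), (0.0, 4.0)), 4, 4)
print(len(pixels), "pixels covered")
```

The three tests per pixel are independent of each other and of neighbouring pixels, which is why this maps so well to vector instructions and threads.&lt;br /&gt;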
&lt;br /&gt;
Rasterization for software occlusion culling can be quite fast. The resolution is small, so int32 gives plenty of precision (no need to use floats for positions). For depth-only rendering, perspective interpolation is very easy - it's enough to interpolate 1/z' (z' = z/w) and store it in the software z-buffer. This means no division or multiplication in the inner loop. Moreover, when doing visibility for directional shadows there is no perspective, so there is no need to calculate the reciprocal of z'. There are some differences between the high-res and the low-res z-buffer. To fix that, the pixel center should be shifted using dzdx and dzdy. In practice it's enough to add some epsilon when testing objects.&lt;br /&gt;
&lt;br /&gt;
Some rasterization performance results.&amp;nbsp;Rasterization with full transformation pipeline and clipping. Optimized with some SSE intrinsics. Randomly placed 500 quads (each consists of 2 triangles). No special optimizations for quads and all are fully visible. 256x128 resolution and 1 thread.&amp;nbsp;CPU / quad pixel screen size:&lt;br /&gt;
&lt;br /&gt;
&lt;table border="1" bordercolor="AAAAAAAA" cellpadding="2" cellspacing="0"&gt;&lt;tbody&gt;
&lt;tr&gt; &lt;td&gt;&lt;/td&gt; &lt;td&gt;256x128&lt;/td&gt; &lt;td&gt;61x61&lt;/td&gt; &lt;td&gt;21x21&lt;/td&gt; &lt;td&gt;fillrate&lt;/td&gt; &lt;td&gt;vertex rate&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;i7 860 (2.8ghz)&lt;/td&gt; &lt;td&gt;6.56 ms&lt;/td&gt; &lt;td&gt;1.75 ms&lt;/td&gt; &lt;td&gt;0.53 ms&lt;/td&gt; &lt;td&gt;2.50 GPix/s&lt;/td&gt;&lt;td&gt;0.025 GV/s&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;core2 quad Q8200 (2.33ghz)&lt;/td&gt;&lt;td&gt;9.20 ms&lt;/td&gt;&lt;td&gt;2.30 ms&lt;/td&gt;&lt;td&gt;0.67 ms&lt;/td&gt;&lt;td&gt;1.76&amp;nbsp;GPix/s&lt;/td&gt;&lt;td&gt;0.019&amp;nbsp;GV/s&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;
This shows the true power of the i7 - almost 1 pixel filled per cycle :). In a real test case there should be around 10 fullscreen triangles, 100 big ones and a lot of small ones (around 20 pixels), so it looks like 1-2 ms is enough to fill the software z-buffer. It could be optimized for big triangles by adding quick rejection of empty tiles and fast filling of fully covered tiles (just like &lt;a href="http://software.intel.com/en-us/articles/rasterization-on-larrabee/"&gt;Larrabee does&lt;/a&gt;). This dramatically increases performance for large triangles.&lt;br /&gt;
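&lt;br /&gt;
The "almost 1 pixel per cycle" claim can be checked from the table. The sketch below assumes the fillrate column comes from the 256x128 case, i.e. 500 quads each covering the whole 256x128 target in 6.56 ms on the 2.8 GHz i7. That derivation is my assumption here, although it does land on the 2.50 GPix/s in the table.&lt;br /&gt;

```python
def fillrate(quads, pixels_per_quad, seconds):
    """Pixels written per second."""
    return quads * pixels_per_quad / seconds


i7_fill = fillrate(500, 256 * 128, 6.56e-3)  # i7 860, 256x128 column
i7_pixels_per_cycle = i7_fill / 2.8e9        # 2.8 GHz clock
print(round(i7_fill / 1e9, 2), "GPix/s,",
      round(i7_pixels_per_cycle, 2), "pixels per cycle")
```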
&lt;br /&gt;
Some object testing performance results. Transformation time not included - should be already done for frustum culling and it's quite small (0.33ms for i7 and 0.48 for core2 quad). Clipping. Optimized with some SSE intrinsics.&amp;nbsp;Randomly placed 3k quads (each fully visible). Worst case - no early out (cleared z-buffer).&amp;nbsp;256x128 resolution. 1 thread.&amp;nbsp;CPU / quad pixel screen size:&lt;br /&gt;
&lt;br /&gt;
&lt;table border="1" bordercolor="AAAAAAAA" cellpadding="2" cellspacing="0"&gt;&lt;tbody&gt;
&lt;tr&gt; &lt;td&gt;&lt;/td&gt; &lt;td&gt;120x120&lt;/td&gt; &lt;td&gt;30x30&lt;/td&gt; &lt;td&gt;10x10&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;i7 860&amp;nbsp;(2.8ghz)&lt;/td&gt; &lt;td&gt;2.26 ms&lt;/td&gt; &lt;td&gt;0.07 ms&lt;/td&gt; &lt;td&gt;0.02 ms&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;core2&amp;nbsp;quad&amp;nbsp;Q8200&amp;nbsp;(2.33ghz)&lt;/td&gt;&lt;td&gt;3.30ms&lt;/td&gt;&lt;td&gt;0.09 ms&lt;/td&gt;&lt;td&gt;0.03 ms&lt;/td&gt; &lt;/tr&gt;
&lt;/tbody&gt; &lt;/table&gt;&lt;br /&gt;
This also looks reasonably fast, and in a real test case the numbers should be around 1-2 ms. It could be further optimized by using some kind of depth hierarchy (downscaling the z-buffer is very fast - something like 0.05 ms for a full mip-map chain).&lt;br /&gt;
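&lt;br /&gt;
A depth hierarchy of that kind could look roughly like the sketch below. Assumptions: larger depth means farther away, the buffer is square with a power-of-two size (the tests above used 256x128, this is just shorter to write), and each coarser level keeps the farthest depth of its 2x2 block so that a coarse test stays conservative.&lt;br /&gt;

```python
def downsample_max(depth):
    """Halve a square depth buffer, keeping the farthest depth per 2x2 block."""
    half = len(depth) // 2
    return [[max(depth[2 * y][2 * x], depth[2 * y][2 * x + 1],
                 depth[2 * y + 1][2 * x], depth[2 * y + 1][2 * x + 1])
             for x in range(half)]
            for y in range(half)]


def build_hierarchy(depth):
    """Return the mip chain, from full resolution down to 1x1."""
    levels = [depth]
    while len(levels[-1]) > 1:
        levels.append(downsample_max(levels[-1]))
    return levels


depth = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 1, 2, 3],
         [4, 5, 6, 7]]
levels = build_hierarchy(depth)
print(levels[1], levels[2])
```

An object whose nearest depth is greater than the value stored in a coarse texel covering it is hidden for sure, so most objects can be rejected without touching the full resolution buffer.&lt;br /&gt;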
&lt;br /&gt;
Software&amp;nbsp;occlusion&amp;nbsp;culling is quite cool - You can have skinned occluders :). It's easy to write, easy for artists to grasp. There is no precomputation, no frame lag etc. On x86 and single thread software occlusion culling rather won't be faster than&amp;nbsp;&lt;a href="http://www.flipcode.com/harmless/issue01.htm#beamtrees"&gt;beamtrees&lt;/a&gt;, but IMHO on consoles&amp;nbsp;it can be faster (no tree data structure traversal) and for sure it's easier to parallelize. Maybe one day I'll try to add it to our engine at work and see how does it handle real test cases.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-2303401931387806492?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/9m1YRqFP0AU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/2303401931387806492/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/09/software-occlusion-culling.html#comment-form" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/2303401931387806492?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/2303401931387806492?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/9m1YRqFP0AU/software-occlusion-culling.html" title="Software occlusion culling" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" 
url="http://3.bp.blogspot.com/_PbTMzfd4SSg/TJZOMRNJPfI/AAAAAAAAAec/3uUMq2ApM-E/s72-c/rast.jpg" height="72" width="72" /><thr:total>4</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/09/software-occlusion-culling.html</feedburner:origLink></entry><entry gd:etag="W/&quot;A0QAQnY8cSp7ImA9Wx9XEE0.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-6454382393018540049</id><published>2010-08-14T10:16:00.009+02:00</published><updated>2011-01-03T00:02:23.879+01:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2011-01-03T00:02:23.879+01:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="deferred lighting" /><category scheme="http://www.blogger.com/atom/ns#" term="Aggregated" /><title>Aggregated deferred lighting</title><content type="html">Random idea about a new way to do deferred lighting. The idea is to decouple lighting from geometry normals. In order to do that, lighting information is stored as aggregated lights ( direction + color ).&lt;br /&gt;
&lt;br /&gt;
1st pass - z-prepass ( just render depth )&lt;br /&gt;
2nd pass - render lighting geometry / quads / tiles.... Output aggregated virtual directional lights for every pixel. This means weighted average of light directions and weighted sum of light colors for every pixel.&lt;br /&gt;
3rd pass - render geometry and shade using buffer with aggregated directional lights (and maybe add standard forward directional light)&lt;br /&gt;
&lt;br /&gt;
2nd pass render target layout:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 90%;"&gt;&lt;code&gt;RT0: aggregated light color RGB
RT1: aggregated light direction XYZ&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;
We want to achieve this:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 90%;"&gt;&lt;code&gt;AggregatedLightColor = 0.
AggregatedLightDir   = 0.

for every light
    AggregatedLightColor += LightColor * LightAttenuation
    AggregatedLightDir   += LightDir * intensity(LightColor * LightAttenuation)
&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;
In order to do this, we need:&lt;br /&gt;
1. Init RT0 and RT1 with 0x00000000&lt;br /&gt;
2. Setup additive blending states&lt;br /&gt;
3. Output from light pixel shader:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 90%;"&gt;&lt;code&gt;ColorRT0 = LightColor * LightAttenuation
ColorRT1 = LightDirection * dot( ColorRT0, ToGrayscaleVec )
&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;
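To see what ends up in the two render targets, here is a small CPU-side sketch of the same math for a single pixel. Additive blending is simulated with +=, and ToGrayscaleVec is assumed to be the usual luma weights (my choice, any grayscale vector would do).&lt;br /&gt;

```python
import math

LUMA = (0.299, 0.587, 0.114)  # assumed value of ToGrayscaleVec


def light_output(color, attenuation, direction):
    """What one light's pixel shader writes to RT0 and RT1."""
    rt0 = tuple(c * attenuation for c in color)
    weight = sum(c * w for c, w in zip(rt0, LUMA))
    rt1 = tuple(d * weight for d in direction)
    return rt0, rt1


def aggregate(lights):
    """Accumulate all lights, as additive blending would for one pixel."""
    rt0 = [0.0, 0.0, 0.0]
    rt1 = [0.0, 0.0, 0.0]
    for color, attenuation, direction in lights:
        out0, out1 = light_output(color, attenuation, direction)
        for i in range(3):
            rt0[i] += out0[i]
            rt1[i] += out1[i]
    return rt0, rt1


def normalized(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v] if length > 0.0 else [0.0, 0.0, 0.0]


white = (1.0, 1.0, 1.0)
color, direction = aggregate([(white, 1.0, (1.0, 0.0, 0.0)),
                              (white, 1.0, (0.0, 1.0, 0.0))])
print(color, normalized(direction))

# Two equal lights shining from opposite sides cancel each other out.
_, opposing = aggregate([(white, 1.0, (1.0, 0.0, 0.0)),
                         (white, 1.0, (-1.0, 0.0, 0.0))])
print(opposing)
```

The second case shows the weak spot: the colors still add up, but the aggregated direction collapses to zero.&lt;br /&gt;
&lt;br /&gt;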
Cons?&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Light aggregation as virtual directional lights per pixel is an approximation. Moreover we can't properly blend normals by using their arithmetic averages. It means that with many lights per pixel (with opposing directions) it won't be too accurate (but it shouldn't be too visible).&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
&lt;br /&gt;
Benefits?&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;Flexibility. You can use almost any lighting model&lt;/li&gt;
&lt;li&gt;You can render lighting at a lower resolution, as high frequency normal map details are added later. There will be artifacts at depth discontinuities, but maybe for some types of content (think desaturated and gray like Gears of War or Killzone 2 :)) they won't be too visible&lt;/li&gt;
&lt;li&gt;Less bandwidth and memory usage (if we compare it to deferred lighting and shading, which store the full specular color, not just its intensity).&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Z prepass is faster than rendering GBuffer or normals + exponent&lt;/li&gt;
&lt;li&gt;A bit simpler calculations. No need for encoding / decoding material properties (normal, exponent,...).&lt;br /&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;br /&gt;
Now it's time to find some free time and code a demo in order to compare it to deferred lighting/shading in real application :).&lt;br /&gt;
&lt;br /&gt;
P.S. Decoupling can also be done by storing lighting as spherical harmonics or cubemaps: &lt;a href="http://deadvoxels.blogspot.com/2009/08/has-someone-tried-this-before.html"&gt;link1&lt;/a&gt; &lt;a href="http://solid-angle.blogspot.com/2009/12/screen-space-spherical-harmonic.html"&gt;link2&lt;/a&gt; &lt;a href="http://www.gamedev.net/community/forums/topic.asp?topic_id=571695"&gt;link3&lt;/a&gt; ( thanks Hogdman from gd.net forums ). The downside of that method is the lack of proper specular, because the lighting data is low frequency, and it will also be slower.&lt;br /&gt;
&lt;br /&gt;
P.S. 2 It looks like it would be better to store normals as angles (RT1.xy - weighted 2 angles, RT1.z - sum of weights). It would ensure proper aggregated light direction interpolation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
UPDATE: I prototyped this method and it doesn't work too well :). Here is a comparison screenshot of a hard case for the idea - two point lights with very different colors influencing the same area. Left - normal lighting, right - aggregated to direction and color:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_PbTMzfd4SSg/TIa9yMiDukI/AAAAAAAAAdk/CtEW5F_9aBw/s1600/aggr.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="200" src="http://4.bp.blogspot.com/_PbTMzfd4SSg/TIa9yMiDukI/AAAAAAAAAdk/CtEW5F_9aBw/s400/aggr.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-6454382393018540049?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/O_SNWYpLCxk" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/6454382393018540049/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/08/aggregated-deferred-lighting.html#comment-form" title="7 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/6454382393018540049?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/6454382393018540049?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/O_SNWYpLCxk/aggregated-deferred-lighting.html" title="Aggregated deferred lighting" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_PbTMzfd4SSg/TIa9yMiDukI/AAAAAAAAAdk/CtEW5F_9aBw/s72-c/aggr.png" height="72" width="72" /><thr:total>7</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/08/aggregated-deferred-lighting.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEMMQXg_cSp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-2104155741869385332</id><published>2010-08-13T11:07:00.007+02:00</published><updated>2010-09-08T16:34:40.649+02:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:34:40.649+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="COJ" /><category scheme="http://www.blogger.com/atom/ns#" term="deferred lighting" /><category scheme="http://www.blogger.com/atom/ns#" term="light" /><category scheme="http://www.blogger.com/atom/ns#" term="geometry" /><title>Rendering light geometry in deferred shading/lighting</title><content type="html">Interesting idea from Call Of Juarez 2 about rendering deferred light geometry. When deferred light geometry intersects with camera you need to switch culling and turn off zbuffer. In COJ2, instead of testing intersection on CPU and switching states, they just push out light geometry vertices:&lt;br /&gt;
&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 90%;"&gt;&lt;code&gt;// vertex shader
float3 posCS = mul( in.pos, worldToCamera ).xyz;
posCS.z = max( posCS.z, nearPlaneZ + offset );
out.pos = mul( float4( posCS, 1. ), cameraToScreen );
&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;
Could be a win if You are CPU bound.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-2104155741869385332?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/iw-7krjJyGY" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/2104155741869385332/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/08/rendering-light-geometry-in-deferred.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/2104155741869385332?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/2104155741869385332?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/iw-7krjJyGY/rendering-light-geometry-in-deferred.html" title="Rendering light geometry in deferred shading/lighting" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>3</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/08/rendering-light-geometry-in-deferred.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEUDQHg8fCp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-3959720384274904990</id><published>2010-07-28T19:01:00.017+02:00</published><updated>2010-09-08T16:31:11.674+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:31:11.674+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="2010" /><category scheme="http://www.blogger.com/atom/ns#" 
term="papers" /><category scheme="http://www.blogger.com/atom/ns#" term="Siggraph" /><title>Siggraph 2010 papers</title><content type="html">&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;A small list of Siggraph 2010 papers (I'll try to keep it up to date):&lt;/span&gt;&lt;br /&gt;
&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;
&lt;/span&gt;&lt;br /&gt;
&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Some &lt;/span&gt;&lt;a href="http://kesen.realtimerendering.com/sig2010.html"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;academic stuff&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&amp;nbsp;( &amp;lt; 1 fps :) )&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://renderwonk.com/publications/s2010-shading-course/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Physically based shading&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&amp;nbsp;course&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://renderwonk.com/publications/s2010-color-course/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Color course&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;. Includes very interesting stuff about tone mapping from &lt;/span&gt;&lt;a href="http://research.tri-ace.com/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;tri-Ace R&amp;amp;D&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Proceedings at &lt;/span&gt;&lt;a href="http://portal.acm.org/toc.cfm?id=1837026&amp;amp;type=proceeding&amp;amp;coll=GUIDE&amp;amp;dl=GUIDE&amp;amp;CFID=96305533&amp;amp;CFTOKEN=55150628"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;acm.org&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;. Requires an account to read papers, but abstracts can be read without it. Includes an&amp;nbsp;interesting idea about &lt;/span&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1837047"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;frame interpolation&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; in order to render at 30fps, but update screen at 60 fps&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developer.nvidia.com/object/siggraph-2010-home.html"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;NVIDIA @ Siggraph 2010&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;. OpenGL 4, CUDA and new NVIDIA tools&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://bps10.idav.ucdavis.edu/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Beyond programmable shading&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;. Presentations about the future of the rendering pipeline (including micropolygon rendering) and a great presentation explaining how GPU shader cores work&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.khronos.org/library/detail/2010-siggraph-opengl-bof/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;OpenGL&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&amp;nbsp;and &lt;/span&gt;&lt;a href="http://www.khronos.org/library/detail/2010-siggraph-opencl-bof/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;OpenCL&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&amp;nbsp;from Khronos group&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://repi.blogspot.com/2010/08/siggraph-2010-talks-by-dice.html"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;DICE Siggraph papers&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://graphics.cs.williams.edu/courses/SRG10/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Stylized rendering in Games&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://graphicsrunner.blogspot.com/2010/08/water-using-flow-maps.html"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Blog post&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; about Valve presentation "&lt;/span&gt;&lt;span class="Apple-style-span" style="color: #333333; line-height: 16px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Water Flow in Portal 2"&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="Apple-style-span" style="color: #333333; line-height: 16px;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;a href="http://www.yakiimo3d.com/2010/07/31/siggraph-2010-3-new-intel-directx11-demos/"&gt;Intel DX11 papers and demos&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="Apple-style-span" style="color: #333333; line-height: 16px;"&gt;&lt;a href="http://www.graphics.cornell.edu/~jaroslav/gicourse2010/index.htm"&gt;Global illumination across industries&lt;/a&gt; Siggraph 2010 course&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="Apple-style-span" style="color: #333333; line-height: 16px;"&gt;&lt;a href="http://advances.realtimerendering.com/s2010/#materials"&gt;Advances in Real-Time Rendering&lt;/a&gt;&amp;nbsp;Siggraph 2010&amp;nbsp;course (&lt;a href="http://alex.vlachos.com/resume/"&gt;"Water Flow in Portal 2"&lt;/a&gt; from that course)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="Apple-style-span" style="line-height: 16px;"&gt;Siggraph links at &lt;a href="http://www.realtimerendering.com/sig2010.html"&gt;Real Time Rendering&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-3959720384274904990?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/tuQ0pjXDy80" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/3959720384274904990/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/07/siggraph-2010-papers.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/3959720384274904990?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/3959720384274904990?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/tuQ0pjXDy80/siggraph-2010-papers.html" title="Siggraph 2010 papers" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/07/siggraph-2010-papers.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEUNQXcyeip7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-1379037857535321774</id><published>2010-07-07T22:17:00.003+02:00</published><updated>2010-09-08T16:31:30.992+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:31:30.992+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="multiple inheritance" /><category scheme="http://www.blogger.com/atom/ns#" term="GCC" /><category scheme="http://www.blogger.com/atom/ns#" term="C++" /><category 
scheme="http://www.blogger.com/atom/ns#" term="VC++" /><title>VC++ and multiple inheritance</title><content type="html">Today at work we were optimizing memory usage. At some point we found out that the size (on the stack) of our basic data structures was x bytes bigger than the summed size of their members. Every basic data structure was written following &lt;a href="http://www.cc.gatech.edu/classes/AY2008/cs6330_summer/slides/policies.pdf"&gt;Alexandrescu's policy-based design&lt;/a&gt;, using inheritance from some templated empty classes. Let's see a simple example:&lt;br /&gt;
&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 90%;"&gt;&lt;code&gt;#include &amp;lt;stdio.h&amp;gt;

class A { };
class B { };
class C : public A, B 
{ 
    int test; 
};

int main()
{
    printf( "%d\n", (int)sizeof( C ) ); // sizeof yields size_t, so cast it to match %d
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;
The compiler uses 4-byte alignment. Will this program print 4? That depends. Compiled by GCC it will print 4, but compiled by VC++ (2005-2010) it will print 8.&lt;br /&gt;
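One possible workaround, sketched here under the assumption that the empty policy classes can be chained linearly instead of sitting side by side: with single inheritance both GCC and VC++ normally apply the empty base optimization. ChainedB, ChainedC and chainedSize are hypothetical names used only for this illustration.&lt;br /&gt;

```cpp
class A { };
class ChainedB : public A { };     // hypothetical: B derives from A instead of standing beside it
class ChainedC : public ChainedB   // single inheritance chain, so the empty bases need no storage of their own
{
    int test;
};

// Size as seen by the program; expected to equal sizeof( int )
// when the empty bases are optimized away.
int chainedSize()
{
    return (int)sizeof( ChainedC );
}
```

In policy-based code this usually means turning each policy into a template that takes the next policy as its base class; that part is left out of the sketch.&lt;br /&gt;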
&lt;br /&gt;
Every class in C++ &lt;a href="http://www2.research.att.com/~bs/bs_faq2.html#sizeof-empty"&gt;has to be at least 1 byte in size&lt;/a&gt; in order to have a valid memory address. With multiple inheritance sizeof(C) = sizeof(A) + sizeof(B) + some alignment. So VC++ behavior is correct, but not optimal. It's strange that it was &lt;a href="http://connect.microsoft.com/VisualStudio/feedback/details/101525/multiple-inheritance-wrong-sizeof#details"&gt;reported to MS in 2005&lt;/a&gt; and they still haven't fixed it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-1379037857535321774?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/19QrFs0GRH8" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/1379037857535321774/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/07/vc-and-multiple-inheritance.html#comment-form" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/1379037857535321774?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/1379037857535321774?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/19QrFs0GRH8/vc-and-multiple-inheritance.html" title="VC++ and multiple inheritance" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>4</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/07/vc-and-multiple-inheritance.html</feedburner:origLink></entry><entry 
gd:etag="W/&quot;DEQBQn49eip7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-2025893285943218690</id><published>2010-06-05T14:39:00.004+02:00</published><updated>2010-09-08T16:32:33.062+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:32:33.062+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="simulator" /><category scheme="http://www.blogger.com/atom/ns#" term="fedora" /><category scheme="http://www.blogger.com/atom/ns#" term="ibm" /><category scheme="http://www.blogger.com/atom/ns#" term="cell" /><title>CELL SDK installation on Fedora 13 x86_64</title><content type="html">I just installed the IBM CELL SDK, the CELL simulator and CELLIDE on my PC box with the newest Fedora. It was quite a painful process and I couldn't find full installation instructions anywhere. It's a pity that IBM releases such interesting technology without proper support. So here is the full installation guide.&lt;br /&gt;
&lt;br /&gt;
First get all the needed RPMs and ISOs from &lt;a href="http://www.ibm.com/developerworks/power/cell/index.html"&gt;IBM&lt;/a&gt; or another &lt;a href="http://www.bsc.es/plantillaH.php?cat_id=583"&gt;website&lt;/a&gt;:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;systemsim-cell-3.1-25.f9.x86_64.rpm
sysroot_image-3.1-1.noarch.rpm
cell-install-3.1.0-0.0.noarch.rpm
CellSDK-Extras-Fedora_3.1.0.0.0.iso
CellSDK-Devel-Fedora_3.1.0.0.0.iso
&lt;/code&gt;&lt;/pre&gt;Now let's install the SDK and the simulator.&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;yum install tk rsync sed tcl wget
rpm -ivh cell-install-3.1.0-0.0.noarch.rpm
cd /opt/cell
./cellsdk --iso /root/cell/cellsdk/ install
./cellsdk_sync_simulator install
&lt;/code&gt;&lt;/pre&gt;Open ~/.bashrc with Your favourite text editor and modify PATH there:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;export PATH=$PATH:/opt/ibm/systemsim-cell/bin:/opt/cell/toolchain/bin
&lt;/code&gt;&lt;/pre&gt;To run simulator from console use:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;systemsim -g
&lt;/code&gt;&lt;/pre&gt;Remember to use the fast simulator mode, it's very useful even on the newest i7 :). Now let's set up cellide (You don't need to install Fedora Eclipse).&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;yum install cellide cell-spu-timing alf-ide-template fdpr-launcher ibm-java2-i386-jre
&lt;/code&gt;&lt;/pre&gt;Time to download fix pack &lt;a href="http://www-933.ibm.com/support/fixcentral/"&gt;3.1-SDKMA-Linux-x86_64-IF01&lt;/a&gt; "intended only for RHEL" :), so Eclipse will detect local cell simulator. Install it.&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;rpm -Uvh cellide-3.1.0-7.i386.rpm
&lt;/code&gt;&lt;/pre&gt;Finally run eclipse with:&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;./eclipse -vm /opt/ibm/java2-i386-50/jre/bin
&lt;/code&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-2025893285943218690?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/bDigh1wyoHI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/2025893285943218690/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/06/cell-sdk-installation-on-fedora-13.html#comment-form" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/2025893285943218690?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/2025893285943218690?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/bDigh1wyoHI/cell-sdk-installation-on-fedora-13.html" title="CELL SDK installation on Fedora 13 x86_64" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>4</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/06/cell-sdk-installation-on-fedora-13.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEMEQ38_eyp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-5508148402461398246</id><published>2010-05-28T18:28:00.001+02:00</published><updated>2010-09-08T16:33:22.143+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:33:22.143+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="tonemapping" /><category scheme="http://www.blogger.com/atom/ns#" term="post processing" 
/><category scheme="http://www.blogger.com/atom/ns#" term="HDR" /><category scheme="http://www.blogger.com/atom/ns#" term="color grading" /><category scheme="http://www.blogger.com/atom/ns#" term="LUT" /><title>Color grading and tonemapping using photoshop</title><content type="html">There is a very simple and nice idea in the &lt;a href="http://udn.epicgames.com/Three/ColorGrading.html"&gt;UDK documentation&lt;/a&gt; about adding color grading to an engine in an easy way, without writing any special tools. Color grading in UDK uses a LUT (a 3D texture mapping RGB to RGB). Pretty standard stuff. The interesting part is that this texture is authored in Photoshop. LUT slices are added to Photoshop layers and a game screenshot is set as the main layer. When You tweak the screenshot, the changes are automatically propagated to the layers with the LUT slices. After tweaking the screenshot You just need to import the authored LUT slices and use them in game. So there is no need to duplicate Photoshop functionality and force game artists to work with unfamiliar custom tools.&lt;br /&gt;
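To make the RGB to RGB mapping concrete, here is a minimal CPU-side sketch of such a lookup. The 16x16x16 size, the Color struct and all function names are assumptions for illustration, not UDK's actual code; a real engine would sample a 3D texture with hardware filtering instead of picking the nearest entry.&lt;br /&gt;

```cpp
const int kLutSize = 16;   // assumed LUT resolution per axis

struct Color { float r, g, b; };

// Maps a [0,1] channel value to the nearest LUT entry along one axis.
int lutIndex( float v )
{
    if ( v > 1.0f ) v = 1.0f;
    if ( 0.0f > v ) v = 0.0f;
    return (int)( v * ( kLutSize - 1 ) + 0.5f );
}

// Nearest-neighbour lookup: the input RGB is the address,
// the stored RGB is the graded color.
Color applyLut( const Color *lut, Color in )
{
    int r = lutIndex( in.r );
    int g = lutIndex( in.g );
    int b = lutIndex( in.b );
    return lut[ ( b * kLutSize + g ) * kLutSize + r ];
}

// Fills a neutral LUT (output equals input). Slices like these are what
// would be placed in the image editor before any tweaking.
void fillNeutralLut( Color *lut )
{
    for ( int b = 0; b != kLutSize; ++b )
        for ( int g = 0; g != kLutSize; ++g )
            for ( int r = 0; r != kLutSize; ++r )
            {
                Color c = { r / float( kLutSize - 1 ),
                            g / float( kLutSize - 1 ),
                            b / float( kLutSize - 1 ) };
                lut[ ( b * kLutSize + g ) * kLutSize + r ] = c;
            }
}
```

With the neutral table applyLut returns (up to the table resolution) the color it was given; after the artist's tweaks are imported, the same lookup returns the graded color.&lt;br /&gt;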
&lt;br /&gt;
Moreover we could go further and use HDR source screenshot and HDR LUT table. This way we could add tonemapping with color grading, contrast, saturation... in 30 minutes, without writing any tool code.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-5508148402461398246?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/jZvKNnxMC-k" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/5508148402461398246/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/05/color-grading-and-tonemapping-using.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/5508148402461398246?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/5508148402461398246?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/jZvKNnxMC-k/color-grading-and-tonemapping-using.html" title="Color grading and tonemapping using photoshop" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/05/color-grading-and-tonemapping-using.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEMGRns6fCp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-7796343611906664086</id><published>2010-05-04T00:58:00.003+02:00</published><updated>2010-09-08T16:33:47.514+02:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:33:47.514+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="CPU" /><category scheme="http://www.blogger.com/atom/ns#" term="optimization" /><category scheme="http://www.blogger.com/atom/ns#" term="compiler" /><title>C++ compiler VS human</title><content type="html">There is a very nice article about writing a fast math library on &lt;a href="http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php"&gt;gamasutra&lt;/a&gt;. It shows, for example, that using operator overloading generates slower code than using functions.&amp;nbsp;A lot of people believe that the compiler will generate better code than a programmer can. They just happily write code and&amp;nbsp;don't check what their compiler generates.&lt;br /&gt;
&lt;br /&gt;
Let's test how good VC++ is against simple algebra and OOP crap :).&amp;nbsp;I decided not to be too harsh, so I omitted topics like SIMD, FPU or &lt;a href="http://realtimecollisiondetection.net/pubs/GDC03_Ericson_Memory_Optimization.ppt"&gt;&lt;span id="goog_800920421"&gt;&lt;/span&gt;aliasing&lt;span id="goog_800920422"&gt;&lt;/span&gt;&lt;/a&gt;. &amp;nbsp;I have chosen VC++ 2008, because it's the most popular compiler in the gamedev industry (and I think it will maintain its position until the VC++ 2010 SP1 release :) ).&lt;br /&gt;
&lt;br /&gt;
Default release build settings - /O2 etc. + very simple test cases, just int main(), scanf and printf.&lt;br /&gt;
&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;// x/x = 1
int y = x / x;

00401018  mov         ecx,dword ptr [esp+8] 
0040101C  mov         eax,ecx 
0040101E  cdq              
0040101F  idiv        eax,ecx
&lt;/code&gt;&lt;/pre&gt;mov and idiv? FAIL.&lt;br /&gt;
&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;// 0/x = 0
int y = 0 / x;

00401018  xor         eax,eax 
0040101A  cdq              
0040101B  idiv        eax,dword ptr [esp+8] 
&lt;/code&gt;&lt;/pre&gt;Another idiv, but at least compiler uses xor instead of loading 0 from memory.&lt;br /&gt;
&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;// z = x * x
// w = z * z
// y = w * w
int y = x * x * x * x * x * x * x * x;

00401018  mov         eax,dword ptr [esp+8] 
0040101C  mov         ecx,eax 
0040101E  imul        ecx,eax 
00401021  imul        ecx,eax 
00401024  imul        ecx,eax 
00401027  imul        ecx,eax 
0040102A  imul        ecx,eax 
0040102D  imul        ecx,eax 
00401030  imul        ecx,eax
&lt;/code&gt;&lt;/pre&gt;It was also too difficult for the compiler.&lt;br /&gt;
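The three-step version from the comment above, written out by hand as a sketch (pow8 is a hypothetical helper name). It needs three multiplications instead of the seven imul instructions in the listing:&lt;br /&gt;

```cpp
// Computes x^8 by repeated squaring.
int pow8( int x )
{
    int z = x * x;   // x^2
    int w = z * z;   // x^4
    return w * w;    // x^8
}
```

For inputs that do not overflow int, the result is the same as x * x * x * x * x * x * x * x.&lt;br /&gt;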
&lt;br /&gt;
Ok, so maybe let's try some OO code?&lt;br /&gt;
&lt;pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 95%;"&gt;&lt;code&gt;#include &amp;lt;stdio.h&amp;gt;

class Object0
{
public:
    void virtual Print() { printf( "a" ); }
};

int main()
{
    Object0 *obj = new Object0;
    obj-&amp;gt;Print();
    return 0;
}

0040101E  mov         dword ptr [eax],offset Object0::`vftable' (402104h)
00401024  mov         edx,dword ptr [eax] 
00401026  mov         ecx,eax 
00401028  mov         eax,dword ptr [edx] 
0040102A  call        eax
&lt;/code&gt;&lt;/pre&gt;vftable? Another failure. Generated code can be fixed by writing obj-&amp;gt;Object0::Print(); or removing the virtual keyword.&lt;br /&gt;
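A sketch of the first fix, with a hypothetical Id() method in place of Print() so that the snippet needs no headers. The qualified name tells the compiler exactly which function to call, so no vftable lookup is required:&lt;br /&gt;

```cpp
class Object0
{
public:
    virtual int Id() { return 0; }
};

// Hypothetical helper: the class-qualified call is bound statically.
int callQualified( Object0 *obj )
{
    return obj->Object0::Id();
}
```

Keep in mind that a qualified call also skips any override in a derived class, so it is only safe when You know the dynamic type of the object.&lt;br /&gt;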
&lt;br /&gt;
Remember to hit alt+8 next time to open the disasm window :).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-7796343611906664086?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/i_D0a5dqAHc" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/7796343611906664086/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/05/c-compiler-vs-human.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/7796343611906664086?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/7796343611906664086?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/i_D0a5dqAHc/c-compiler-vs-human.html" title="C++ compiler VS human" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/05/c-compiler-vs-human.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEQDQHc5fSp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-1704072819422878400</id><published>2010-02-27T00:45:00.004+01:00</published><updated>2010-09-08T16:32:51.925+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:32:51.925+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="particle" /><category scheme="http://www.blogger.com/atom/ns#" term="circular buffer" /><category 
scheme="http://www.blogger.com/atom/ns#" term="particle data structure" /><title>Particle data structure</title><content type="html">Particles are usually some kind of simple structs stored in an array. Then, every frame, the vertex buffer is filled from that array. You really don't want to sort them (at least not all particles, and sometimes sorted particles don't look good). You also need to retain insertion order to prevent particle popping, so using a simple array and replacing a deleted element with the last one isn't a good idea.&lt;br /&gt;
&lt;br /&gt;
A straightforward solution is to use a &lt;a href="http://en.wikipedia.org/wiki/Circular_buffer"&gt;circular buffer&lt;/a&gt;. This has some issues: if particles don't share a uniform lifetime, it needs some checks in order to detect dead particles when filling the vertex buffer. There will also be some holes in the circular buffer, so memory won't be used effectively.&lt;br /&gt;
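A minimal sketch of that circular buffer, assuming a fixed capacity and a hypothetical Particle that only stores its remaining lifetime. It shows both issues: the liveness check in the fill loop and the dead slots that still occupy space.&lt;br /&gt;

```cpp
struct Particle { float timeLeft; };   // hypothetical minimal particle

const int kCapacity = 4;               // assumed fixed pool size

struct ParticleRing
{
    Particle items[kCapacity];
    int head;    // index of the oldest slot
    int count;   // number of used slots, dead ones included
};

// Appends a particle; when the ring is full the oldest slot is overwritten.
void emit( ParticleRing *ring, float lifetime )
{
    int slot = ( ring->head + ring->count ) % kCapacity;
    ring->items[slot].timeLeft = lifetime;
    if ( ring->count == kCapacity )
        ring->head = ( ring->head + 1 ) % kCapacity;
    else
        ++ring->count;
}

// Copies live particles, oldest first, and returns how many were written.
int fillVertexData( const ParticleRing *ring, float *out )
{
    int written = 0;
    for ( int i = 0; i != ring->count; ++i )
    {
        const Particle *p = ring->items + ( ring->head + i ) % kCapacity;
        if ( p->timeLeft > 0.0f )   // per-particle check for dead particles
            out[written++] = p->timeLeft;
    }
    return written;
}
```

Here out stands in for the real vertex buffer; a dead particle in the middle keeps its slot until head moves past it.&lt;br /&gt;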
&lt;br /&gt;
Another solution is to use an array and defragment it every frame. This means no conditional statements when filling the vertex buffer, but could lead to a lot of memory copying (depending on the lifetime of particles).&lt;br /&gt;
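A sketch of the per-frame defragmentation, assuming a hypothetical Particle with age and lifetime fields. Live particles are copied towards the front in their original order, so insertion order is preserved and the liveness test happens once here rather than in the vertex buffer fill.&lt;br /&gt;

```cpp
struct Particle
{
    float age;        // seconds since spawn
    float lifetime;   // seconds until the particle dies
};

// Removes dead particles in place, keeps the order of the survivors
// and returns the new particle count.
int compactParticles( Particle *particles, int count )
{
    int alive = 0;
    for ( int i = 0; i != count; ++i )
    {
        if ( particles[i].lifetime > particles[i].age )
        {
            particles[alive] = particles[i];
            ++alive;
        }
    }
    return alive;
}
```

After the call the first returned-count elements are all alive, so the fill loop can copy them without any checks.&lt;br /&gt;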
&lt;br /&gt;
The interesting thing is that above methods can be combined. Third method uses an assumption that in case of the defragmented array You usually remove particles from the beginning and add to the end. So why not just have large array and two floating pointers for marking begin and end of particle data. Now when You add something - increase end pointer. When You remove an element - increase begin pointer. In case of removal from the middle of data - defragment by shifting data from left or right. When end pointer can't be increased just defragment by copying data to the beginning of the array. This doesn't change the worst case - removal from the array middle, but should greatly speed up the average case.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-1704072819422878400?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/BOowclO9_v0" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/1704072819422878400/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2010/02/particle-data-structure.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/1704072819422878400?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/1704072819422878400?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/BOowclO9_v0/particle-data-structure.html" title="Particle data structure" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" 
src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>3</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2010/02/particle-data-structure.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEMFR3o5eSp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-833356397474520224</id><published>2009-11-18T21:17:00.013+01:00</published><updated>2010-09-08T16:33:36.421+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:33:36.421+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="premultiplied alpha" /><category scheme="http://www.blogger.com/atom/ns#" term="blending" /><category scheme="http://www.blogger.com/atom/ns#" term="compression" /><title>Premultiplied alpha</title><content type="html">If You don't know what premultiplied alpha is and why it solves all the world's problems, just read &lt;a href="http://home.comcast.net/~tom_forsyth/blog.wiki.html#%5B%5BPremultiplied%20alpha%5D%5D"&gt;Tom Forsyth on premultiplied alpha&lt;/a&gt; or &lt;a href="http://blogs.msdn.com/shawnhar/archive/2009/11/06/premultiplied-alpha.aspx"&gt;Shawn Hargreaves on premultiplied alpha&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
Pros:&lt;br /&gt;
1. Better quality for textures with sharp alpha cutouts.&lt;br /&gt;
2. Removes some blending state changes (which can be important if You are coding for a PC).&lt;br /&gt;
3. Can mix additive blending with alpha blending without a state/shader/texture change.&lt;br /&gt;
&lt;br /&gt;
What are the hidden cons, which aren't mentioned by Tom or Shawn?&lt;br /&gt;
1. Fixed pipeline fog doesn't work with it.&lt;br /&gt;
2. Worse quality for smooth alpha with DXT5 compression.&lt;br /&gt;
&lt;br /&gt;
Why doesn't premultiplied alpha work very well with DXT5 for textures with smooth alpha gradients? In DXT5 a texture is divided into 4x4 pixel blocks. For every block, colors (and alphas) are approximated by equidistant points on a line between two end points (each pixel picks a point through an index table). Color end points are quantized to 5:6:5 bits, while alpha end points are stored with 8-bit precision. This means that for most textures we get better precision for the alpha channel than for the color channels.&lt;br /&gt;
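&lt;br /&gt;
To make the "equidistant points between two end points" idea concrete, here is a small Python sketch for a single channel. The helper name is made up for this example. The counts of 4 color values and 8 alpha values per block come from the DXT5 format (its 8-value alpha mode), and the end point values below are arbitrary:&lt;br /&gt;
&lt;br /&gt;
```python
def interpolated_palette(end0, end1, count):
    """Return `count` equidistant values on the line from end0 to end1."""
    steps = count - 1
    return [end0 + (end1 - end0) * i / steps for i in range(count)]


# One channel of a block whose end points happen to be 0.0 and 1.0.
color_palette = interpolated_palette(0.0, 1.0, 4)  # 2-bit index per pixel
alpha_palette = interpolated_palette(0.0, 1.0, 8)  # 3-bit index per pixel
```
&lt;br /&gt;
Every pixel in the block has to snap to one of these values, so alpha gets both finer end points and twice as many steps between them. Real decoders store the entries in a different order (end points first), which doesn't matter for this argument.&lt;br /&gt;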
&lt;br /&gt;
Furthermore, compressing values that span a broader range gives us better precision. For example, if alpha is filled with 1/255 and RGB values are in the range [0; 1], then the RGB channels of the premultiplied texture will contain only two different numbers: 0 or 1/255. This means that standard alpha blending gives better precision when we want to multiply RGB by a factor greater than 1 or tonemap the final results.&lt;br /&gt;
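&lt;br /&gt;
The example above can be checked with a few lines of Python. This sketch assumes plain 8-bit storage per channel and ignores block compression entirely; the helper name is invented for the illustration:&lt;br /&gt;
&lt;br /&gt;
```python
def to_8bit(value):
    """Quantize a value in [0, 1] to an 8-bit integer code (0..255)."""
    return int(round(value * 255))


alpha = 1 / 255
rgb_samples = [i / 255 for i in range(256)]  # every 8-bit color value

# Codes stored when RGB is kept separate from alpha.
straight_codes = {to_8bit(c) for c in rgb_samples}

# Codes stored when RGB is multiplied by alpha before quantization.
premultiplied_codes = {to_8bit(c * alpha) for c in rgb_samples}
```
&lt;br /&gt;
Here straight_codes still holds all 256 values, while premultiplied_codes collapses to the two codes 0 and 1 (that is, 0 and 1/255). Scaling the result up later cannot bring the lost gradient back.&lt;br /&gt;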
&lt;br /&gt;
Standard alpha blending:&lt;br /&gt;
&amp;nbsp; &amp;nbsp; srcAlpha = Compress_color( src.rgb ) * Compress_alpha( src.a )&lt;br /&gt;
&lt;br /&gt;
Premultiplied alpha blending:&lt;br /&gt;
&amp;nbsp; &amp;nbsp; srcAlpha = Compress_color( src.rgb * src.a )&lt;br /&gt;
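&lt;br /&gt;
A rough Python sketch of the two formulas above. As a deliberate simplification, "compression" is modelled only as rounding to 5 bits for a color channel and 8 bits for alpha; real DXT5 additionally interpolates between end points per 4x4 block. The function names are invented for this example, and the input is a faint texel picked to show the effect:&lt;br /&gt;
&lt;br /&gt;
```python
def quantize(value, bits):
    """Round a value in [0, 1] to the nearest level representable in `bits` bits."""
    levels = 2 ** bits - 1
    return round(value * levels) / levels


def standard_source(color, alpha):
    # Compress_color( src.rgb ) * Compress_alpha( src.a )
    return quantize(color, 5) * quantize(alpha, 8)


def premultiplied_source(color, alpha):
    # Compress_color( src.rgb * src.a )
    return quantize(color * alpha, 5)


# A bright but almost transparent texel (values chosen for illustration).
color, alpha = 1.0, 4 / 255
exact = color * alpha
standard_error = abs(standard_source(color, alpha) - exact)
premultiplied_error = abs(premultiplied_source(color, alpha) - exact)
```
&lt;br /&gt;
For this texel the standard path reproduces the product exactly, while the premultiplied path rounds it down to 0, because 4/255 is less than half of one 5-bit step.&lt;br /&gt;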
&lt;br /&gt;
Let's see how it looks in practice. I created a sample texture with a "lighting" gradient in the RGB channels and a smoke puff in alpha (from left to right: RGB, alpha, and RGB premultiplied by alpha):&lt;br /&gt;
&lt;br /&gt;
&lt;div align="left" class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_PbTMzfd4SSg/SwMfsHn0YsI/AAAAAAAAAAk/sLNU4o3GlJg/s1600/srcTile.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_PbTMzfd4SSg/SwMfsHn0YsI/AAAAAAAAAAk/sLNU4o3GlJg/s640/srcTile.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now let's alpha blend the two compressed textures (DXT5), zoom in and compare the results (left image: standard alpha blending, right image: premultiplied alpha blending):&lt;br /&gt;
&lt;br /&gt;
&lt;div align="left" class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_PbTMzfd4SSg/SwMkUeQ-C6I/AAAAAAAAAA0/-x2xf0rfVak/s1600/cmp2.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_PbTMzfd4SSg/SwMkUeQ-C6I/AAAAAAAAAA0/-x2xf0rfVak/s640/cmp2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This looks like a small difference, but it can be quite visible in a real game, especially when smoke particles are big, when their color is multiplied by a factor greater than 1, or when some kind of tone mapping is used.&lt;br /&gt;
&lt;br /&gt;
BTW there is an interesting feature/bug in the NVIDIA Photoshop texture tools. You can't save DXT5 with full black alpha (it just creates a DDS without an alpha channel). This ensures that those lazy artists use DXT1 compression for additive premultiplied alpha blending :).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-833356397474520224?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/A9xlqCbC7qA" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/833356397474520224/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2009/11/premultiplied-alpha.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/833356397474520224?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/833356397474520224?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/A9xlqCbC7qA/premultiplied-alpha.html" title="Premultiplied alpha" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_PbTMzfd4SSg/SwMfsHn0YsI/AAAAAAAAAAk/sLNU4o3GlJg/s72-c/srcTile.png" height="72" width="72" /><thr:total>2</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2009/11/premultiplied-alpha.html</feedburner:origLink></entry><entry 
gd:etag="W/&quot;DEQMRX8zcSp7ImA9Wx5QGUg.&quot;"><id>tag:blogger.com,1999:blog-8517115461722256631.post-6893562277323227193</id><published>2009-11-18T00:20:00.009+01:00</published><updated>2010-09-08T16:33:04.189+02:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2010-09-08T16:33:04.189+02:00</app:edited><category scheme="http://www.blogger.com/atom/ns#" term="Software Rendering" /><category scheme="http://www.blogger.com/atom/ns#" term="Linux" /><category scheme="http://www.blogger.com/atom/ns#" term="PS3" /><category scheme="http://www.blogger.com/atom/ns#" term="SPU" /><title>SPU programming on a retail "fat" PS3 with Linux</title><content type="html">&lt;div align="left"&gt;Larrabee is approaching, so it's a good time to learn more about coding a modern multi-core software renderer. Thanks to Sony, everyone with a retail "fat" PS3 can install a Linux distro and have fun with the SPUs. It's not as much fun as with a real devkit: no Visual Studio with ProDG, no access to RSX and no close-to-the-metal SPU libs. You can't even manually assign SPU tasks to physical SPUs.&lt;/div&gt;&lt;div align="left"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div align="left"&gt;It took me some time to install and configure Linux (YDL 6.1). A small hint: if you have to use a window manager, use Fluxbox. It's much faster on the retail PS3 than GNOME (slow), KDE (very slow) or Enlightenment (very slow). You can also work remotely using ssh/PuTTY/Eclipse IDE (Linux only).&lt;/div&gt;&lt;div align="left"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div align="left"&gt;For people without a retail PS3 there is a &lt;a href="http://www.alphaworks.ibm.com/tech/cellsystemsim"&gt;Cell simulator&lt;/a&gt; on the IBM site (again Linux only). Currently only the toolchain is ported to Windows (&lt;a href="http://sourceforge.net/projects/cellwindowssdk/"&gt;Windows Cell SDK&lt;/a&gt;), so you can compile SPU code on Windows, but can't run it there.&lt;/div&gt;&lt;div align="left"&gt;&lt;br /&gt;
&lt;/div&gt;&lt;div align="left"&gt;Be prepared for some posts about software rendering and low level SPU coding :).&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8517115461722256631-6893562277323227193?l=kriscg.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Kriscg/~4/hiLjCoxgLaA" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://kriscg.blogspot.com/feeds/6893562277323227193/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://kriscg.blogspot.com/2009/11/spu-programming-on-ps3-with-linux.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/6893562277323227193?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/8517115461722256631/posts/default/6893562277323227193?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Kriscg/~3/hiLjCoxgLaA/spu-programming-on-ps3-with-linux.html" title="SPU programming on a retail &quot;fat&quot; PS3 with Linux" /><author><name>KriS</name><uri>http://www.blogger.com/profile/01055035127272189489</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="16" height="16" src="http://img2.blogblog.com/img/b16-rounded.gif" /></author><thr:total>0</thr:total><feedburner:origLink>http://kriscg.blogspot.com/2009/11/spu-programming-on-ps3-with-linux.html</feedburner:origLink></entry></feed>

