<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Siarhei Siamashka</title>
 <link href="http://ssvb.github.io/"/>
 <link type="application/atom+xml" rel="self" href="http://ssvb.github.io/atom.xml"/>
 <updated>2014-11-12T16:49:18+00:00</updated>
 <id>http://ssvb.github.io/</id>
 <author>
   <name>Siarhei Siamashka</name>
   <email>siarhei.siamashka@gmail.com</email>
 </author>

 
 <entry>
   <id>http://ssvb.github.io/2014/11/11/revisiting-fullhd-x11-desktop-performance-of-the-allwinner-a10</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2014/11/11/revisiting-fullhd-x11-desktop-performance-of-the-allwinner-a10.html"/>
   <title>Revisiting FullHD X11 desktop performance of the Allwinner A10</title>
   <updated>2014-11-11T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;In my &lt;a href=&quot;http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html&quot;&gt;previous blog post&lt;/a&gt;,
I discussed the pathologically bad Linux desktop performance with FullHD monitors on Allwinner A10 hardware.&lt;/p&gt;

&lt;p&gt;A lot of time has passed since then. Thanks to the availability of Rockchip
&lt;a href=&quot;https://github.com/ssvb/Rockchip-GPL-Kernel/blob/master/arch/arm/mach-rk29/ddr.c&quot;&gt;sources&lt;/a&gt;
and &lt;a href=&quot;http://www.cnx-software.com/2012/11/04/rockchip-rk3066-rk30xx-processor-documentation-source-code-and-tools/&quot;&gt;documentation&lt;/a&gt;,
we have learned a lot about the DRAM controller in Allwinner A10/A13/A20 SoCs.
Both Allwinner and Rockchip apparently license the DRAM controller IP from
the same &lt;a href=&quot;http://www.synopsys.com/dw/ipdir.php?ds=dwc_ddr2-lite_mem&quot;&gt;third-party vendor&lt;/a&gt;,
and their DRAM controller hardware registers share a lot of similarities (though
unfortunately not an exact match).&lt;/p&gt;

&lt;p&gt;Much better knowledge of the hardware allowed us to revisit
this problem, investigate it in more detail and
&lt;a href=&quot;https://github.com/linux-sunxi/u-boot-sunxi/commit/4e1532df5ebc6e0dd56c09dddb3d116979a2c49b&quot;&gt;come up with a solution back in April 2014&lt;/a&gt;.
The only missing part was an update in this blog, at least
to make it clear that the problem has been resolved now. So here we go...&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html"/>
   <title>FullHD X11 desktop performance of the Allwinner A10</title>
   <updated>2013-06-27T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;This blog post is assuming that you are a happy owner of one of the devices,
based on the Allwinner A10 SoC (with a single core ARM Cortex-A8 1GHz). But
hopefully the owners of the other low end ARM based devices may also find
something interesting here.&lt;/p&gt;

&lt;p&gt;There are plenty of user friendly Linux distributions
available for Allwinner A10 devices (for example, &lt;a href=&quot;https://fedoraproject.org/wiki/Architectures/ARM/AllwinerA10&quot;&gt;Fedora&lt;/a&gt;
is a nice one). Basically you just write an image to the SD card, plug an HDMI cable
into your TV or monitor, connect a keyboard and a mouse, and power the device on. Then a
nice GUI wizard guides you through the initial configuration, like setting passwords, etc.
Part of the magic, which allows these user friendly distros to just work out-of-the
box, is the automatic detection of the monitor capabilities via
&lt;a href=&quot;http://en.wikipedia.org/wiki/Extended_display_identification_data&quot;&gt;EDID&lt;/a&gt; and
setting the preferred screen resolution, suggested by the monitor. Many monitors
are &lt;a href=&quot;https://en.wikipedia.org/wiki/1080p&quot;&gt;FullHD&lt;/a&gt; capable, hence you are likely to
end up with a 1920x1080 screen resolution. And that&#39;s where it may become a challenge
for a low end device.&lt;/p&gt;

&lt;p&gt;First of all, a 1920x1080 screen has 2.25 times as many pixels as 1280x720, and the amount
of pixels to be processed naturally affects the performance. So expect 1920x1080 graphics
to be at least twice as slow as 1280x720 when redrawing anything that covers the whole
screen.&lt;/p&gt;

&lt;p&gt;But additionally, as part of the monitor refresh, pixels are read from the framebuffer
and sent over HDMI to the monitor 60 times per second. As there is no dedicated video
memory for the framebuffer, the screen refresh is competing with the CPU, DMA and various
hardware accelerators for the access to the system memory. We can estimate how much system
memory bandwidth is wasted for just maintaining the monitor refresh:
            1920x1080 * 4 bytes per pixel * 60Hz = ~500 MB/s&lt;/p&gt;

&lt;p&gt;And we should double this amount if the system is driving two monitors at once (HDMI and VGA), but
the dual monitor support is outside of the scope of this blog post. Anyway, is 500 MB/s significant
or not? Allwinner A10 uses 32-bit DDR3 memory, clocked between 360 MHz and
480 MHz (the default memory clock speed differs between devices). This means that
the theoretical memory bandwidth limit is between 2.9 GB/s and 3.8 GB/s. So in theory we should
be perfectly fine?&lt;/p&gt;
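&lt;p&gt;These estimates are easy to double-check with a few lines of Python (a rough sketch: decimal MB/GB, DDR3 transferring twice per clock on a 32-bit bus; these are raw bus limits, not achievable throughput):&lt;/p&gt;

```python
# Memory bandwidth consumed by scanning out the framebuffer:
# every pixel is read 'refresh_hz' times per second.
def scanout_bandwidth(width, height, bytes_per_pixel, refresh_hz):
    return width * height * bytes_per_pixel * refresh_hz

# Theoretical peak bandwidth of DDR3 on a 32-bit (4-byte) bus:
# two transfers per clock cycle (double data rate).
def ddr3_peak_bandwidth(clock_hz, bus_bytes=4):
    return clock_hz * 2 * bus_bytes

refresh = scanout_bandwidth(1920, 1080, 4, 60)
print(refresh / 1e6)                         # ~498 MB/s, the "~500 MB/s" estimate
print(ddr3_peak_bandwidth(360e6) / 1e9)      # 2.88, i.e. "2.9 GB/s"
print(ddr3_peak_bandwidth(480e6) / 1e9)      # 3.84, i.e. "3.8 GB/s"
print(refresh / ddr3_peak_bandwidth(360e6))  # ~0.17
```

&lt;p&gt;So the screen refresh alone eats roughly 13-17% of the theoretical bandwidth, which does not sound fatal by itself.&lt;/p&gt;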

&lt;h2&gt;Synthetic tests for the monitor refresh induced memory bandwidth loss&lt;/h2&gt;

&lt;p&gt;We can simply try to boot the system with different combinations of monitor refresh
rate, desktop color depth and memory clock frequency. Then do the measurements for
each with &lt;a href=&quot;https://github.com/ssvb/tinymembench&quot;&gt;tinymembench&lt;/a&gt; and put the results
into tables. The performance of memset appears to be the most affected, hence it is
the most interesting to observe. There are also &quot;backwards memset&quot; performance numbers
for the sake of completeness (it does the same job as memset, but is implemented by
decrementing the pointer after each write instead of incrementing it).&lt;/p&gt;
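&lt;p&gt;To illustrate what &quot;backwards memset&quot; means (a sketch of the access pattern only, not the actual ARM/NEON assembly used by tinymembench):&lt;/p&gt;

```python
# Forward fill: writes walk up through memory; this is the pattern
# that collides badly with screen refresh on this hardware.
def memset_forward(buf, value):
    for i in range(len(buf)):
        buf[i] = value

# "Backwards memset": same job, but the write address is decremented
# after each store instead of incremented.
def memset_backwards(buf, value):
    for i in range(len(buf) - 1, -1, -1):
        buf[i] = value

a = bytearray(16)
b = bytearray(16)
memset_forward(a, 0xFF)
memset_backwards(b, 0xFF)
print(a == b)  # True: identical result, only the access order differs
```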

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;caption&gt;&lt;b&gt;Table 1. Memory write bandwidth available to the CPU (memset performance)&lt;/b&gt;&lt;/caption&gt;
&lt;tr&gt;&lt;th&gt;&lt;th colspan=6&gt;Memory clock speed
&lt;tr&gt;&lt;th&gt;Video mode&lt;th&gt;360MHz&lt;th&gt;384MHz&lt;th&gt;408MHz&lt;th&gt;432MHz&lt;th&gt;456MHz&lt;th&gt;480MHz
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz&lt;td bgcolor=&#39;red&#39;&gt;450 MB/s&lt;td bgcolor=&#39;red&#39;&gt;480 MB/s&lt;td bgcolor=&#39;red&#39;&gt;509 MB/s&lt;td bgcolor=&#39;red&#39;&gt;537 MB/s&lt;td bgcolor=&#39;red&#39;&gt;556 MB/s&lt;td bgcolor=&#39;red&#39;&gt;556 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz (scaler mode)&lt;td bgcolor=&#39;red&#39;&gt;548 MB/s&lt;td bgcolor=&#39;red&#39;&gt;550 MB/s&lt;td bgcolor=&#39;red&#39;&gt;554 MB/s&lt;td bgcolor=&#39;red&#39;&gt;554 MB/s&lt;td bgcolor=&#39;red&#39;&gt;558 MB/s&lt;td bgcolor=&#39;red&#39;&gt;558 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz&lt;td bgcolor=&#39;red&#39;&gt;449 MB/s&lt;td bgcolor=&#39;red&#39;&gt;479 MB/s&lt;td bgcolor=&#39;red&#39;&gt;510 MB/s&lt;td bgcolor=&#39;red&#39;&gt;522 MB/s&lt;td bgcolor=&#39;red&#39;&gt;533 MB/s&lt;td bgcolor=&#39;#5DDC5D&#39;&gt;812 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz (scaler mode)&lt;td bgcolor=&#39;red&#39;&gt;514 MB/s&lt;td bgcolor=&#39;red&#39;&gt;620 MB/s&lt;td bgcolor=&#39;#65E465&#39;&gt;764 MB/s&lt;td bgcolor=&#39;#64E364&#39;&gt;769 MB/s&lt;td bgcolor=&#39;#63E263&#39;&gt;774 MB/s&lt;td bgcolor=&#39;#4FCE4F&#39;&gt;896 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz&lt;td bgcolor=&#39;red&#39;&gt;449 MB/s&lt;td bgcolor=&#39;red&#39;&gt;467 MB/s&lt;td bgcolor=&#39;red&#39;&gt;576 MB/s&lt;td bgcolor=&#39;#5DDC5D&#39;&gt;815 MB/s&lt;td bgcolor=&#39;#37B637&#39;&gt;1041 MB/s&lt;td bgcolor=&#39;#29A829&#39;&gt;1122 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz (scaler mode)&lt;td bgcolor=&#39;#66E566&#39;&gt;759 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;885 MB/s&lt;td bgcolor=&#39;#4BCA4B&#39;&gt;921 MB/s&lt;td bgcolor=&#39;#44C344&#39;&gt;964 MB/s&lt;td bgcolor=&#39;#3BBA3B&#39;&gt;1018 MB/s&lt;td bgcolor=&#39;#28A728&#39;&gt;1130 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 60Hz&lt;td bgcolor=&#39;red&#39;&gt;421 MB/s&lt;td bgcolor=&#39;red&#39;&gt;430 MB/s&lt;td bgcolor=&#39;#58D758&#39;&gt;842 MB/s&lt;td bgcolor=&#39;#42C142&#39;&gt;972 MB/s&lt;td bgcolor=&#39;#31B031&#39;&gt;1074 MB/s&lt;td bgcolor=&#39;#199819&#39;&gt;1219 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 56Hz&lt;td bgcolor=&#39;red&#39;&gt;417 MB/s&lt;td bgcolor=&#39;#55D455&#39;&gt;860 MB/s&lt;td bgcolor=&#39;#47C647&#39;&gt;947 MB/s&lt;td bgcolor=&#39;#39B839&#39;&gt;1030 MB/s&lt;td bgcolor=&#39;#22A122&#39;&gt;1168 MB/s&lt;td bgcolor=&#39;#1B9A1B&#39;&gt;1210 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 50Hz&lt;td bgcolor=&#39;#5DDC5D&#39;&gt;813 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;887 MB/s&lt;td bgcolor=&#39;#3AB93A&#39;&gt;1023 MB/s&lt;td bgcolor=&#39;#209F20&#39;&gt;1180 MB/s&lt;td bgcolor=&#39;#159415&#39;&gt;1247 MB/s&lt;td bgcolor=&#39;#149314&#39;&gt;1252 MB/s&lt;/tr&gt;
&lt;/table&gt;




&lt;p&gt;&lt;/p&gt;


&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;caption&gt;&lt;b&gt;Table 2. Memory write bandwidth available to the CPU (backwards memset performance)&lt;/b&gt;&lt;/caption&gt;
&lt;tr&gt;&lt;th&gt;&lt;th colspan=6&gt;Memory clock speed
&lt;tr&gt;&lt;th&gt;Video mode&lt;th&gt;360MHz&lt;th&gt;384MHz&lt;th&gt;408MHz&lt;th&gt;432MHz&lt;th&gt;456MHz&lt;th&gt;480MHz
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz&lt;td bgcolor=&#39;#72F172&#39;&gt;688 MB/s&lt;td bgcolor=&#39;#5CDB5C&#39;&gt;817 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;882 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;883 MB/s&lt;td bgcolor=&#39;#42C142&#39;&gt;974 MB/s&lt;td bgcolor=&#39;#37B637&#39;&gt;1040 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz (scaler mode)&lt;td bgcolor=&#39;#6BEA6B&#39;&gt;726 MB/s&lt;td bgcolor=&#39;#63E263&#39;&gt;779 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;882 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;884 MB/s&lt;td bgcolor=&#39;#4AC94A&#39;&gt;925 MB/s&lt;td bgcolor=&#39;#39B839&#39;&gt;1030 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz&lt;td bgcolor=&#39;#64E364&#39;&gt;769 MB/s&lt;td bgcolor=&#39;#5BDA5B&#39;&gt;824 MB/s&lt;td bgcolor=&#39;#53D253&#39;&gt;873 MB/s&lt;td bgcolor=&#39;#47C647&#39;&gt;947 MB/s&lt;td bgcolor=&#39;#3FBE3F&#39;&gt;995 MB/s&lt;td bgcolor=&#39;#29A829&#39;&gt;1123 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz (scaler mode)&lt;td bgcolor=&#39;#65E465&#39;&gt;762 MB/s&lt;td bgcolor=&#39;#5BDA5B&#39;&gt;825 MB/s&lt;td bgcolor=&#39;#53D253&#39;&gt;874 MB/s&lt;td bgcolor=&#39;#45C445&#39;&gt;959 MB/s&lt;td bgcolor=&#39;#3EBD3E&#39;&gt;996 MB/s&lt;td bgcolor=&#39;#34B334&#39;&gt;1060 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz&lt;td bgcolor=&#39;#65E465&#39;&gt;763 MB/s&lt;td bgcolor=&#39;#55D455&#39;&gt;863 MB/s&lt;td bgcolor=&#39;#48C748&#39;&gt;941 MB/s&lt;td bgcolor=&#39;#3AB93A&#39;&gt;1021 MB/s&lt;td bgcolor=&#39;#2AA92A&#39;&gt;1119 MB/s&lt;td bgcolor=&#39;#1E9D1E&#39;&gt;1188 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz (scaler mode)&lt;td bgcolor=&#39;#5FDE5F&#39;&gt;799 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;887 MB/s&lt;td bgcolor=&#39;#4BCA4B&#39;&gt;919 MB/s&lt;td bgcolor=&#39;#3EBD3E&#39;&gt;996 MB/s&lt;td bgcolor=&#39;#32B132&#39;&gt;1071 MB/s&lt;td bgcolor=&#39;#1F9E1F&#39;&gt;1183 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 60Hz&lt;td bgcolor=&#39;#5CDB5C&#39;&gt;819 MB/s&lt;td bgcolor=&#39;#4BCA4B&#39;&gt;919 MB/s&lt;td bgcolor=&#39;#40BF40&#39;&gt;986 MB/s&lt;td bgcolor=&#39;#26A526&#39;&gt;1143 MB/s&lt;td bgcolor=&#39;#21A021&#39;&gt;1175 MB/s&lt;td bgcolor=&#39;#209F20&#39;&gt;1177 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 56Hz&lt;td bgcolor=&#39;#56D556&#39;&gt;856 MB/s&lt;td bgcolor=&#39;#48C748&#39;&gt;938 MB/s&lt;td bgcolor=&#39;#2EAD2E&#39;&gt;1097 MB/s&lt;td bgcolor=&#39;#2DAC2D&#39;&gt;1098 MB/s&lt;td bgcolor=&#39;#209F20&#39;&gt;1178 MB/s&lt;td bgcolor=&#39;#169516&#39;&gt;1239 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 50Hz&lt;td bgcolor=&#39;#4AC94A&#39;&gt;925 MB/s&lt;td bgcolor=&#39;#41C041&#39;&gt;983 MB/s&lt;td bgcolor=&#39;#34B334&#39;&gt;1060 MB/s&lt;td bgcolor=&#39;#26A526&#39;&gt;1144 MB/s&lt;td bgcolor=&#39;#1F9E1F&#39;&gt;1182 MB/s&lt;td bgcolor=&#39;#129112&#39;&gt;1263 MB/s&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;The &lt;a href=&quot;http://linux-sunxi.org/Fex_Guide#disp_init_configuration&quot;&gt;&quot;scaler mode&quot;&lt;/a&gt; needs an
additional explanation. The display controller in Allwinner A10 consists of two parts:
Display Engine Front End (DEFE) and Display Engine Back End (DEBE). DEBE can provide up
to 4 hardware layers (which are composited together for the final picture on screen) and
supports a large variety of pixel formats. DEFE is connected in front of DEBE and can
optionally provide scaling for 2 of these hardware layers, the drawback is that DEFE
supports only a limited set of pixel formats. All this information can be found in the
&lt;a href=&quot;http://free-electrons.com/~maxime/pub/datasheet/A13%20user%20manual%20v1.2%2020130108.pdf&quot;&gt;Allwinner A13 manual&lt;/a&gt;,
which is &lt;a href=&quot;http://irclog.whitequark.org/linux-sunxi/2013-05-17#3830239&quot;&gt;now available in the unrestricted public access&lt;/a&gt;.
The framebuffer memory is read by the DEFE hardware when &quot;scaler mode&quot; is enabled, and
by the DEBE hardware otherwise. In practice, the DEFE and DEBE implementations of
fetching pixels for screen refresh appear to affect memset performance
differently.&lt;/p&gt;

&lt;p&gt;One thing is obvious even without running any tests, and the measurements just confirm
it: more memory bandwidth drained by screen refresh means less bandwidth left for
the CPU. But the most interesting observation is that the memset performance abruptly
degrades upon reaching a certain threshold. The abnormally low memset performance
results are highlighted in red in table 1. But the backwards memset is not affected.
There is certainly something odd in the memory controller or in the display controller.&lt;/p&gt;

&lt;p&gt;Attentive readers may argue that the same resolution and refresh rate can be achieved
using different timings. The detailed modelines used in this test were the following:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;Mode &amp;quot;1920x1080_50&amp;quot; 148.5 1920 2448 2492 2640 1080 1084 1089 1125 +HSync +VSync
Mode &amp;quot;1920x1080_56&amp;quot; 148.5 1920 2165 2209 2357 1080 1084 1089 1125 +HSync +VSync
Mode &amp;quot;1920x1080_60&amp;quot; 148.5 1920 2008 2052 2200 1080 1084 1089 1125 +HSync +VSync&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
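&lt;p&gt;For reference, the refresh rate implied by a modeline is simply the pixel clock divided by the total frame size including blanking (htotal * vtotal, the last value of each timing group). A quick sanity check of the modelines above:&lt;/p&gt;

```python
# Refresh rate of a modeline: pixel clock divided by the total number
# of pixel periods per frame (htotal * vtotal), blanking included.
def modeline_refresh(dotclock_mhz, htotal, vtotal):
    return dotclock_mhz * 1e6 / (htotal * vtotal)

print(round(modeline_refresh(148.5, 2640, 1125), 2))  # 50.0
print(round(modeline_refresh(148.5, 2357, 1125), 2))  # 56.0
print(round(modeline_refresh(148.5, 2200, 1125), 2))  # 60.0
```

&lt;p&gt;Note that all three modes share the same 148.5 MHz pixel clock and vertical timings; only the horizontal blanking differs.&lt;/p&gt;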


&lt;p&gt;Empirical tests show that in order to have less impact on the memory bandwidth, we
need to maximize pixel clock, minimize vertical blanking and select the target
refresh rate by adjusting horizontal blanking. That is assuming that the monitor
will accept these extreme timings. The &quot;red zones&quot; in table 1 may drift a bit
as a result.&lt;/p&gt;

&lt;h2&gt;Benchmarks by replaying the traces of real applications (cairo-perf-trace)&lt;/h2&gt;

&lt;p&gt;The numbers in table 1 look scary, but do they have any significant impact on real
applications? Let&#39;s try the &lt;a href=&quot;https://github.com/ssvb/trimmed-cairo-traces&quot;&gt;trimmed cairo traces&lt;/a&gt;
again to see how it affects the performance of software rendered 2D graphics.&lt;/p&gt;

&lt;p&gt;This benchmark is using gcc 4.8.1, pixman 0.30.0, cairo 1.12.14, linux kernel 3.4 with
&lt;a href=&quot;http://lists.infradead.org/pipermail/linux-arm-kernel/2012-February/084359.html&quot;&gt;ARM hugetlb&lt;/a&gt;
patches added. HugeTLB is very interesting by itself, because it provides a nice performance
improvement for memory heavy workloads. But in this particular case it also helps to
make benchmark results reproducible across multiple runs (the variance apparently
results from differences in physical memory fragmentation and cache associativity
effects). The cairo-perf-trace results from the &quot;red zone&quot; seem to be poorly reproducible
with the standard 4K pages.&lt;/p&gt;

&lt;p&gt;We can&#39;t test all the possible configurations, so we just need to pick a few interesting ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1920x1080-60Hz, DDR3 360MHz (default for Mele A2000 HTPC box)&lt;/li&gt;
&lt;li&gt;1920x1080-60Hz, DDR3 480MHz (default for CubieBoard)&lt;/li&gt;
&lt;li&gt;1920x1080-50Hz, DDR3 480MHz (CubieBoard, &#39;disp.screen0_output_mode=1920x1080p50&#39; in the kernel cmdline)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;div class=&quot;image&quot;&gt;
&lt;center&gt;&lt;b&gt;Chart 1. The results of cairo-perf-trace using &#39;image&#39; backend (on Allwinner A10, ARM Cortex-A8 @1GHz)&lt;/b&gt;&lt;/center&gt;
&lt;a href=&quot;http://ssvb.github.io/images/2013-06-27-cairo-perf-chart.png&quot;&gt;&lt;img src =&quot;http://ssvb.github.io/images/2013-06-27-cairo-perf-chart-lowres.png&quot; alt=&quot;2013-06-27-cairo-perf-chart.png&quot;&gt;&lt;/a&gt;
&lt;/div&gt;&lt;/p&gt;


&lt;p&gt;Chart 1 shows the performance improvements relative to the Mele A2000 with its more than
conservative default 360MHz memory clock frequency, using a 60Hz monitor refresh rate.
The green bars show how much of the performance improvement can be provided by changing the
memory clock frequency from 360MHz to 480MHz (by replacing the Mele A2000 with a CubieBoard
or just overclocking the memory). The blue bars show the performance improvement resulting
from additionally reducing the monitor refresh rate from 60Hz to 50Hz (and thus moving
out of the &quot;red zone&quot; in table 1).&lt;/p&gt;

&lt;p&gt;The results for the t-swfdec-giant-steps.trace replay show the biggest performance
dependency on the monitor refresh rate, so it definitely deserves some profiling.
Perf reports the following:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;59.93%  cairo-perf-trac  libpixman-1.so.0.30.0  [.] pixman_composite_src_n_8888_asm_neon
 14.06%  cairo-perf-trac  libcairo.so.2.11200.14 [.] _fill_xrgb32_lerp_opaque_spans
 10.20%  cairo-perf-trac  libcairo.so.2.11200.14 [.] _cairo_tor_scan_converter_generate
  3.35%  cairo-perf-trac  libcairo.so.2.11200.14 [.] cell_list_render_edge
  0.82%  cairo-perf-trac  libcairo.so.2.11200.14 [.] _cairo_tor_scan_converter_add_polygon&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Bingo! Most of the time is spent in &#39;pixman_composite_src_n_8888_asm_neon&#39; function (solid fill),
which is nothing else but a glorified memset. No surprise that it likes the 50Hz monitor refresh
rate so much.&lt;/p&gt;

&lt;h2&gt;An obligatory note about HugeTLB (and THP) on ARM&lt;/h2&gt;

&lt;p&gt;Chart 1 lists the results with a more than a year old set of HugeTLB patches
applied, but this feature has not reached the mainline Linux kernel yet. I&#39;m not
providing a separate cairo-perf-trace chart, but individual traces are up to 30%
faster when HugeTLB+libhugetlbfs is taken into use. And the geometric mean shows ~10%
overall improvement. These results seem to agree with
&lt;a href=&quot;http://lists.infradead.org/pipermail/linux-arm-kernel/2013-February/148835.html&quot;&gt;the reports from the other people&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let&#39;s hope that ARM and Linaro manage to &lt;a href=&quot;http://lists.infradead.org/pipermail/linux-arm-kernel/2013-June/173051.html&quot;&gt;push this feature in&lt;/a&gt;.
The 256 TLB entries in Cortex-A7 compared to just 32 in Cortex-A8 look very much
like a hardware workaround for a software problem :-) But even older processors
such as Cortex-A8 still need to be fast.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Update&lt;/b&gt;: it turns out that the significantly better benchmark results
can&#39;t be credited to the use of huge pages alone. The &quot;hugectl&quot; tool from
libhugetlbfs overrides glibc heap allocation and by default never returns
memory to the system, while the heap shrink/grow operations performed in normal
conditions (without hugectl) are not particularly cheap in some cases.
In any case, the primary purpose of using huge pages via hugectl
was to ensure reproducible cairo-perf-trace benchmark results, and it did
the job. Still, TLB misses are a major problem for some 2D graphics
operations: something like drawing a vertical scrollbar, where accessing each
new scanline triggers a TLB miss with 4KiB pages, or image rotation.&lt;/p&gt;
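&lt;p&gt;The vertical scrollbar example is easy to quantify (a rough back-of-the-envelope estimate, ignoring the micro-TLBs and assuming a linear 32bpp framebuffer):&lt;/p&gt;

```python
# With a 1920x1080 32bpp framebuffer, one scanline is wider than a
# 4KiB page, so stepping down one pixel column touches a new page
# on every scanline.
stride = 1920 * 4            # bytes per scanline: 7680
page = 4096
print(stride > page)         # True: every scanline lands in a new page

# 32 TLB entries (Cortex-A8) with 4KiB pages map only 128KiB...
tlb_entries = 32
coverage = tlb_entries * page
print(coverage // stride)    # only ~17 scanlines before the TLB thrashes

# ...while a single 2MiB huge page covers many scanlines per TLB entry.
print((2 * 1024 * 1024) // stride)  # 273 scanlines per entry
```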

&lt;h2&gt;So what can be done?&lt;/h2&gt;

&lt;p&gt;The 32bpp color depth with 1920x1080 resolution on Allwinner A10 is unfortunate enough
to hit this hardware quirk.&lt;/p&gt;

&lt;p&gt;First, a fantastic option :-) We could try to implement backwards solid fill in pixman
and use it on the problematic hardware (using the icky /proc/cpuinfo text parsing to
fish out the relevant bits of information and do runtime detection). Still, the problem
does not go away: some other operations may be affected (memcpy is also affected,
albeit to a lesser extent), memset is used in other software, ...&lt;/p&gt;
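&lt;p&gt;For the curious, such runtime detection usually boils down to fishing the &quot;CPU part&quot; field out of /proc/cpuinfo (0xc08 is the ARM part number of the Cortex-A8). A minimal sketch, written to operate on the file contents as text so it can be demonstrated with sample input:&lt;/p&gt;

```python
def cpu_part(cpuinfo_text):
    """Extract the 'CPU part' field from /proc/cpuinfo contents."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("CPU part"):
            return int(line.split(":")[1].strip(), 16)
    return None

# Sample /proc/cpuinfo fragment from an ARM Cortex-A8 system
sample = "Processor\t: ARMv7 Processor rev 2 (v7l)\nCPU part\t: 0xc08\n"
if cpu_part(sample) == 0xC08:  # 0xc08 is the Cortex-A8 part number
    print("Cortex-A8 detected")
```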

&lt;p&gt;We could also try the 24bpp color depth for the framebuffer. It provides the same
16777216 colors as 32bpp, but is much less affected as seen in table 1. A practical
problem is that this is quite an unorthodox pixel format, which is poorly supported
by software (even if it works without bugs, it definitely does not enjoy many
optimizations). This implies the use of ShadowFB with a 32bpp shadow framebuffer
backing the real 24bpp framebuffer. But ShadowFB itself solves some problems and
introduces new ones.&lt;/p&gt;

&lt;p&gt;If your monitor supports the 50Hz refresh rate - just go for it! Additionally enabling
the &quot;scaler mode&quot; surely helps (but wastes one scaled layer). The tricky part is
that we want linux distros to remain user friendly and preferably still do automatic
configuration. Automatic configuration means using EDID to check whether the monitor
supports 50Hz. However the monitor manufacturers don&#39;t seem to be very sane and the
EDID data may sometimes look like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;[  1133.553] (WW) NVIDIA(GPU-0): The EDID for Samsung SMBX2231 (DFP-1) contradicts itself: mode
[  1133.553] (WW) NVIDIA(GPU-0):     &amp;quot;1920x1080&amp;quot; is specified in the EDID; however, the EDID&amp;#39;s
[  1133.553] (WW) NVIDIA(GPU-0):     valid VertRefresh range (56.000-75.000 Hz) would exclude
[  1133.553] (WW) NVIDIA(GPU-0):     this mode&amp;#39;s VertRefresh (50.0 Hz); ignoring VertRefresh
[  1133.553] (WW) NVIDIA(GPU-0):     check for mode &amp;quot;1920x1080&amp;quot;.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Movie lovers also seem to have some
&lt;a href=&quot;http://www.codecpage.com/50HzLCD.html&quot;&gt;problems with 56Hz specified as the lowest supported&lt;/a&gt;.
The 56Hz tests in table 1 are actually there to see whether the 56Hz monitor
refresh rate would be any good.&lt;/p&gt;

&lt;p&gt;And as the last resort you can either reduce the screen resolution, or reduce the color
depth to 16bpp. This actually may be the best option, unless you are interested in
viewing high resolution photos with great colors and can&#39;t tolerate any image quality
loss.&lt;/p&gt;

&lt;h2&gt;Final words&lt;/h2&gt;

&lt;p&gt;That&#39;s basically a summary of what has already been known for a while, and what I kept
telling people on the mailing lists and IRC.
Intuitively, everyone probably understands that higher memory clock frequency
must be somewhat better. But is it important enough to care? Isn&#39;t the CPU
clock frequency the only primary factor that determines system performance?
After all, it is the CPU clock frequency that is advertised in the device
specs and is a popular target for overclockers. Hopefully the colorful tables
and charts here are providing a convincing answer. In any case, if you are
interested in FullHD desktop resolution on Allwinner A10, it makes sense to
try your best to stay away from the &quot;red zone&quot; in table 1.&lt;/p&gt;

&lt;p&gt;The performance of software rendering for 2D graphics scales very nicely
with memory speed on ARM processors equipped with a fast NEON
unit (Cortex-A8, Cortex-A9, Cortex-A15). But the cairo-perf-trace benchmarks
are only simulating offscreen rendering, which is just a part of the whole
pipeline. The picture still needs to be delivered to the framebuffer for
the user to see it. And it&#39;s better to be done without screw-ups.&lt;/p&gt;

&lt;p&gt;To be continued about what&#39;s wrong with the ShadowFB layer.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2013/02/01/new-xf86-video-sunxifb-ddx-driver</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2013/02/01/new-xf86-video-sunxifb-ddx-driver.html"/>
   <title>New xf86-video-sunxifb DDX driver for Xorg</title>
   <updated>2013-02-01T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;h2&gt;A short introduction&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Allwinner_A1X&quot;&gt;Allwinner A10/A13 SoC&lt;/a&gt; is very interesting
because it is used in a lot of very affordable electronic devices from China, such as USB
dongles, media boxes, tablets, netbooks and even the &lt;a href=&quot;http://cubieboard.org/&quot;&gt;cubieboard.org development board&lt;/a&gt;.
Because of a very competitive price, these devices make a good alternative to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Raspberry_Pi&quot;&gt;Raspberry Pi&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One rather unique and somewhat attractive feature is that this platform does
not have a corporate backing and does not suffer from &quot;too many cooks&quot;
problem :-) All the hardware adaptation support is provided by the
community at &lt;a href=&quot;http://linux-sunxi.org/&quot;&gt;http://linux-sunxi.org/&lt;/a&gt;, where the people
are currently trying to clean up the kernel and fix numerous bugs.&lt;/p&gt;

&lt;h2&gt;3D graphics performance&lt;/h2&gt;

&lt;p&gt;Allwinner A10 uses a single-core &lt;a href=&quot;http://en.wikipedia.org/wiki/Mali_%28GPU%29&quot;&gt;Mali-400 GPU&lt;/a&gt; running
at 320MHz, which provides OpenGL ES 2.0 acceleration. The OpenGL ES implementation itself relies
on the &lt;a href=&quot;http://forums.arm.com/index.php?/topic/16259-how-can-i-upgrade-mali-device-driver/page__p__39744#entry39744&quot;&gt;proprietary closed source libMali.so library&lt;/a&gt;.
But the integration with the X server is provided by the &lt;a href=&quot;http://malideveloper.arm.com/develop-for-mali/drivers/open-source-mali-gpus-linux-exadri2-and-x11-display-drivers/&quot;&gt;open source reference driver xf86-video-mali&lt;/a&gt;.
Many users might assume that it&#39;s a ready-to-use complete solution and a natural choice for their devices.
However this is not quite true. The performance of the system is also largely dependent on the
optimal integration with the display controller hardware, because Mali itself can only render
3D images to memory buffers. Here is a quote from the readme file included with xf86-video-mali:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;xf86-video-mali&amp;quot; is provided as a basis for creating your own X Display
Driver. It requires a recent version of the xorg-server, as well as a
successfull integration of UMP with your display device driver.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;As such, a more complete implementation of an X11 driver is needed, and my attempt to develop one
(based on xf86-video-fbdev) is available here:
&lt;a href=&quot;https://github.com/ssvb/xf86-video-sunxifb&quot;&gt;xf86-video-sunxifb&lt;/a&gt;. Below is a screenshot
of it running on a &lt;a href=&quot;https://plus.google.com/u/0/113201731981878354205/posts/daJfhBRvWjk&quot;&gt;Mele A2000 TV box&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2013-02-01-mali400-acceleration.png&quot;&gt;&lt;img src=&quot;/images/2013-02-01-mali400-acceleration-lowres.png&quot; alt=&quot;2013-02-01-mali400-acceleration.png&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The glmark2 2012.12 scores with 1280x720-32@60Hz monitor resolution look like this:&lt;/p&gt;

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;tr&gt;&lt;th&gt;X11 DDX driver&lt;th&gt;Fullscreen (1280x720)&lt;th&gt;Window (800x600)&lt;th&gt;Partially obscured window (800x600)&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;xf86-video-mali r3p0&lt;td&gt;38&lt;td&gt;65&lt;td&gt;66&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;xf86-video-sunxifb-0.2.0&lt;td&gt;115&lt;td&gt;165&lt;td&gt;50&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;As expected from the implementation which is aware of the hardware overlays
supported by the display controller, the performance of xf86-video-sunxifb
in fullscreen mode or working with fully visible windows is significantly better
than xf86-video-mali. Though rendering to partially obscured window
currently goes through the fallback path involving many memory copy
operations, and the overhead of these memory copy operations is even
higher than for xf86-video-mali (mostly because of the use of
the shadow framebuffer).&lt;/p&gt;

&lt;h2&gt;2D graphics performance&lt;/h2&gt;

&lt;p&gt;Now this is the most interesting part, because surprisingly 2D tends
to be rather problematic for many drivers. Below is the chart based
on the results from &lt;a href=&quot;https://github.com/ssvb/trimmed-cairo-traces&quot;&gt;cairo-perf-trace running trimmed-cairo-traces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2013-02-01-cairo-perf-chart-sunxifb.png&quot;&gt;&lt;img src=&quot;/images/2013-02-01-cairo-perf-chart-sunxifb-lowres.png&quot; alt=&quot;2013-02-01-cairo-perf-chart-sunxifb.png&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks like xf86-video-sunxifb is implementing some great performance optimizations?
I wish this was the case, but in fact it is basically just the functionality entirely
provided by the original xf86-video-fbdev code, which was used as
the base for xf86-video-sunxifb. It merely tries not to get in the
way and just lets ARM NEON software rendering code from &lt;a href=&quot;http://www.pixman.org/&quot;&gt;pixman&lt;/a&gt;
run without too much extra overhead.&lt;/p&gt;

&lt;p&gt;So what is wrong with xf86-video-mali? It appears to suffer from the same
problem as many other X11 drivers for ARM hardware. The DRI2 extension
(the thing which is used for the integration of GLES acceleration)
needs some hardware-specific buffer allocation
(&lt;a href=&quot;http://malideveloper.arm.com/develop-for-mali/drivers/open-source-mali-gpus-ump-user-space-drivers-source-code-2/&quot;&gt;UMP&lt;/a&gt;
in the case of xf86-video-mali). And the EXA framework (a convenience
layer for adding 2D acceleration hooks) supports overriding pixmap
buffer allocation as part of its functionality. So the guys apparently
decided that it&#39;s a good idea to override the allocation of absolutely
all pixmaps without exception, not just the ones needed for DRI2. This was a total
2D performance disaster for the &lt;a href=&quot;http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html&quot;&gt;SGX PVR driver&lt;/a&gt;.
And it is also killing 2D performance for xf86-video-mali. Because
the sources of xf86-video-mali are available, it was possible to run one
more somewhat artificial test. With a minor tweak, xf86-video-mali can
be changed to do allocations of pixmaps in cached UMP buffers (let&#39;s for
a moment just ignore the potential cache coherency issues for the buffers
shared with Mali hardware via DRI2 and only look at the performance). The
benchmark results for this modified xf86-video-mali driver are shown as
green bars on the chart above. In some cases (t-firefox-fishtank), the
performance for cached UMP allocations managed to catch up with
xf86-video-sunxifb (and xf86-video-fbdev). But many other traces are still
slow, which suggests that uncached memory allocation is not the only
reason for poor performance. The UMP itself also requires expensive ioctls
and has very heavy overhead. So sorry, the following suggestion
from xf86-video-mali readme file is simply not going to fly:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;The provided &amp;quot;xf86-video-mali&amp;quot; driver contains an EXA module which has been
integrated with the UMP system. Your 2D driver may therefore require an
integration with UMP as well. The suggestion is to pass the secure ID down to
the kernel device driver for your hardware, but it is also possible to get the
CPU-mapped address for the memory by calling ump_mapped_pointer_get.

Please refer to UMP documentation for more information regarding this.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;By the way, if anyone doubts whether the colored bars
in the chart really correlate with reality, I suggest checking
my YouTube video about &lt;a href=&quot;http://www.youtube.com/watch?v=Vzmckw3fAQo&quot;&gt;Linux on ARM Chromebook: xf86-video-armsoc vs. xf86-video-fbdev&lt;/a&gt;.
The xf86-video-armsoc driver has all the same 2D performance problems :-(&lt;/p&gt;

&lt;p&gt;It really puzzles me why nearly all X11 drivers for ARM hardware
make the same mistake. An &lt;a href=&quot;http://maemo.org/packages/view/xserver-xorg-video-fbdev/&quot;&gt;old X11 DDX driver from the Nokia N900&lt;/a&gt;
could at least allocate DRI2 buffers and normal pixmaps separately,
even if it was hardly the best or cleanest implementation.&lt;/p&gt;

&lt;h2&gt;The future of xf86-video-sunxifb&lt;/h2&gt;

&lt;p&gt;XV and XRANDR still need to be implemented. And of course there is still
a lot of room for real 2D performance improvements :-)&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Edited within a few hours after posting to fix some obvious typos, broken
links and poor wording&lt;/small&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2012/05/04/xorg-drivers-and-software-rendering</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2012/05/04/xorg-drivers-and-software-rendering.html"/>
   <title>Xorg drivers, software rendering for 2D graphics and cairo 1.12 performance</title>
   <updated>2012-05-04T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Recently the &lt;a href=&quot;http://en.wikipedia.org/wiki/Cairo_%28graphics%29&quot;&gt;cairo graphics library&lt;/a&gt; got an &lt;a href=&quot;http://cairographics.org/news/cairo-1.12.0/&quot;&gt;update to version 1.12&lt;/a&gt;.
It brings some nice performance improvements, as demonstrated in
&lt;a href=&quot;http://ickle.wordpress.com/2012/03/28/cairo-1-12-let-the-releases-roll/&quot;&gt;three&lt;/a&gt;
&lt;a href=&quot;http://ickle.wordpress.com/2012/03/30/cairo-performance-on-ion/&quot;&gt;blog&lt;/a&gt;
&lt;a href=&quot;http://ickle.wordpress.com/2012/04/02/cairo-performance-on-radeon/&quot;&gt;posts&lt;/a&gt; from Chris Wilson.
These blog posts additionally showcase &lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=OTUyOQ&quot;&gt;Intel SNA&lt;/a&gt;, which
happens to be quite an impressive &lt;a href=&quot;http://www.x.org/wiki/Development/Documentation/Glossary#DDX&quot;&gt;DDX&lt;/a&gt; driver. It
provides 2D graphics hardware acceleration for X applications via the
&lt;a href=&quot;http://en.wikipedia.org/wiki/X_Rendering_Extension&quot;&gt;XRender extension&lt;/a&gt; and is
clearly doing this faster than software rendering.&lt;/p&gt;

&lt;p&gt;It may really surprise some people, but graphics drivers generally do not
handle 2D acceleration particularly well on Linux desktop systems. This has
been known at least since 2003, when Carsten Haitzler (aka Rasterman) started
&lt;a href=&quot;http://comments.gmane.org/gmane.comp.xfree86.devel/2786&quot;&gt;a thread about XRender performance&lt;/a&gt;
and posted the &lt;a href=&quot;http://www.rasterman.com/files/render_bench.tar.gz&quot;&gt;render_bench&lt;/a&gt;
test program. Also, &lt;a href=&quot;http://blogs.gnome.org/otte/2010/06/26/fun-with-benchmarks/&quot;&gt;hardware acceleration did not have a clear advantage over software rendering&lt;/a&gt;
two years ago for many cairo traces (which are &lt;a href=&quot;http://cworth.org/intel/performance_measurement/&quot;&gt;much more relevant for 2D benchmarking&lt;/a&gt;
than render_bench). There are some old slides from 2010 presented by Intel
folks, &lt;a href=&quot;http://www.lca2010.org.nz/slides/50153.pdf&quot;&gt;&quot;Making the GPU do its Job&quot;&lt;/a&gt;, explaining
the challenges they were facing at that time. But now this long quest seems to be over, and we finally have really
good 2D drivers, at least for Intel hardware.&lt;/p&gt;

&lt;p&gt;But enough with the historical overview. The purpose of this blog post
is to look into the cairo &quot;image backend&quot; in a bit more detail and try to explain
why it has managed to stay competitive for such a long time (and is
still able to wipe the floor with some poorly implemented GPU accelerated
drivers even now). The cairo image backend uses the &lt;a href=&quot;http://pixman.org/&quot;&gt;pixman library&lt;/a&gt;
as a software rasteriser. To speed up graphics operations, pixman uses SIMD
optimizations. The most relevant are SSE2 on x86 and NEON on ARM. There are also
optimizations for MIPS32 DSP ASE, Loongson SIMD and ARM IWMMXT being worked on. The
latest pixman 0.25.2 development snapshot makes it possible to
&lt;a href=&quot;http://cgit.freedesktop.org/pixman/commit/?id=fcea053561893d116a79f41a113993f1f61b58cf&quot;&gt;selectively disable SIMD optimizations&lt;/a&gt;
without recompiling the library, which is convenient for benchmarking or testing.
I&#39;m going to run the &lt;a href=&quot;http://cworth.org/intel/performance_measurement/&quot;&gt;cairo-perf-trace benchmark&lt;/a&gt;
on a few devices I have at home, testing the image backend both with and without SIMD optimizations
enabled. This makes it possible to see how much performance is gained by using &quot;SIMD acceleration&quot;
in pixman, and to benchmark it against &quot;GPU acceleration&quot; in the Xorg drivers.&lt;/p&gt;

&lt;h2&gt;Test setup&lt;/h2&gt;

&lt;p&gt;A 32bpp desktop color depth is used in all tests. Cairo 1.12.0 and pixman 0.25.2 are compiled with gcc 4.7.0 with &quot;-O2&quot;
optimizations and &quot;-march/-mcpu/-mtune&quot; options set to match the target processor. The standard
set of &lt;a href=&quot;http://cgit.freedesktop.org/cairo-traces/tree/benchmark&quot;&gt;cairo benchmark traces&lt;/a&gt; is used,
but the &quot;ocitysmap&quot; trace is removed (it is a memory hog and runs out of memory on 512MB systems without swap).
Detailed instructions are available in the last section of this blog post.&lt;/p&gt;

&lt;h2&gt;ARM Cortex-A9 1.2GHz (Origenboard)&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a9.png&quot;&gt;&lt;img src=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a9-lowres.png&quot; alt=&quot;2012-05-04-cairo-perf-chart-cortex-a9.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the chart above, everything is compared to the cairo image backend with SIMD optimizations disabled in pixman (the PIXMAN_DISABLE environment variable is set to &quot;arm-simd arm-iwmmxt arm-neon&quot;). The
green bars on the left show the performance improvement gained by enabling ARM NEON in pixman when running the tests with the cairo image backend. The
blue bars on the right show the performance of the xlib cairo backend when the rendering is done on the X server side by the xf86-video-fbdev driver
(which in turn uses pixman with NEON optimizations enabled).&lt;/p&gt;

&lt;p&gt;Looking at these colored bars, we can see that the xlib backend generally performs worse than the image backend. This is understandable,
because there is some inter-process communication overhead between the test application and the X server, X11 protocol marshalling, etc.
But a few tests (firefox-asteroids, gnome-terminal-vim, gvim, xfce4-terminal-a1) showed an improvement. The explanation is
that this system has a dual-core processor. The X server running on one CPU core acts as a 2D accelerator, leaving the
other CPU core free for the test application. If we look at the CPU usage in htop while running the tests, we see that
the CPU core running the Xorg server is ~100% loaded, while the other core running the cairo-perf-trace process is typically only ~15-30% loaded.&lt;/p&gt;

&lt;p&gt;So in the end, the xlib backend is not so bad on multi-core systems. We just need to ensure that we are
not hit by any unnecessary inter-process communication overhead. Are we actually doing well here? Not even close!
Just look at &lt;a href=&quot;http://cgit.freedesktop.org/xorg/xserver/tree/fb/fbpict.c?id=xorg-server-1.12.1#n38&quot;&gt;this part of the code&lt;/a&gt;.
There we see how the X server wraps its internal &lt;a href=&quot;http://cgit.freedesktop.org/xorg/xserver/tree/render/picturestr.h?id=xorg-server-1.12.1#n123&quot;&gt;Picture&lt;/a&gt;
structures into temporary &lt;a href=&quot;http://cgit.freedesktop.org/pixman/tree/pixman/pixman-private.h?id=pixman-0.25.2#n65&quot;&gt;pixman_image_t&lt;/a&gt; structures,
involving lots of overhead, validity checks and malloc/free activity. No surprise that we take a serious performance
hit, with the firefox-canvas trace being the worst.&lt;/p&gt;

&lt;p&gt;The colored bars on the performance chart above surely look nice, but the system also needs to
be snappy and responsive in normal use. Believe it or not, it is quite OK. For example, I
can use text editors in the terminal and move windows around without perceivable lag. But what
about the ARM system with similar specs, also used with the xf86-video-fbdev driver
and reviewed in a &lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=MTA5MDg&quot;&gt;Phoronix article&lt;/a&gt;?
I don&#39;t know, but it looks like somebody just screwed something up. When we move
windows around, it is just a memcpy/memmove-like operation. Origenboard can reach
~700-750 MB/s for memcpy, and OMAP4460 should be quite similar.
Even at FullHD resolution and 32bpp desktop color depth (16bpp is more common on ARM systems),
we are moving around at most 1920 * 1080 * 4 = ~8.3 MB of pixel data per frame. Dividing memcpy speed
by data size, we get ~80-90 FPS. Even if we assume that the shadow framebuffer gets
in the way and further halve the FPS number, that is still more than enough to avoid
any problems when moving or scrolling windows. Sure, this fully occupies
one CPU core with something as dumb as a memory copy, but the other CPU core is free
and the whole system is not affected that badly.&lt;/p&gt;
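&lt;p&gt;The window-moving arithmetic above can be sketched in a few lines (a rough estimate only; the ~700-750 MB/s memcpy figure is the one quoted in this post):&lt;/p&gt;

```python
# Rough FPS estimate for moving a FullHD 32bpp window via a plain memory copy.
frame_bytes = 1920 * 1080 * 4            # = 8294400 bytes, ~8.3 MB per frame
for memcpy_mb_s in (700, 750):           # measured memcpy bandwidth range
    fps = memcpy_mb_s * 1_000_000 / frame_bytes
    print(f"{memcpy_mb_s} MB/s -> {fps:.0f} FPS")   # prints roughly 84 and 90
```

Even after halving these numbers to account for the shadow framebuffer, the result stays above 40 FPS.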

&lt;p&gt;Finally what about GPU acceleration? This board uses Exynos4210 SoC, which has Mali-400 MP4 GPU.
Right now I&#39;m waiting for &lt;a href=&quot;http://limadriver.org/&quot;&gt;limadriver&lt;/a&gt; or
&lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=MTA3MDE&quot;&gt;FIMG2D&lt;/a&gt;
based DDX. There are proprietary drivers for Mali GPU, but I don&#39;t want to taint this system
with proprietary blobs yet, and also don&#39;t want to taint myself by agreeing to any licenses
accompanying them.&lt;/p&gt;

&lt;h2&gt;ARM Cortex-A8 1GHz, GPU SGX530 200MHz (IGEPv2 board)&lt;/h2&gt;

&lt;p&gt;The same tests as for Cortex-A9, but also adding the results for 2D graphics hardware acceleration provided
by the latest &lt;a href=&quot;http://tigraphics.blogspot.com/2012/04/1q-sgx-driver-update-package-available.html&quot;&gt;2012 1Q SGX driver release&lt;/a&gt;.
First of all, not all tests are even able to run with the SGX pvr Xorg driver. It looks like it has
a limit of only around ~60MB for the total pixmap data allocated on the X server side, and this
prevents many cairo traces from running:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;X Error of failed request:  BadAlloc (insufficient resources for operation)
  Major opcode of failed request:  53 (X_CreatePixmap)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I tried to &lt;a href=&quot;http://www.beagleboard.org/irclogs/index.php?date=2012-04-28#T22:28:52&quot;&gt;increase this limit&lt;/a&gt;
by using an undocumented &quot;PixmapPoolSizeMB&quot; option in xorg.conf, but that did not help much and caused
some additional stability issues. In the end I decided not to touch this stuff and ran it as-is in
the default configuration (only upgrading pixman from the &lt;a href=&quot;http://lists.x.org/archives/xorg-announce/2010-August/001388.html&quot;&gt;ancient version 0.18.4&lt;/a&gt;
to &lt;a href=&quot;http://lists.x.org/archives/xorg-announce/2012-March/001872.html&quot;&gt;0.25.2&lt;/a&gt;).
Hence the pvr driver only has results for 8 out of 21 tests on the chart below, due
to the restricted pixmap pool size.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a8.png&quot;&gt;&lt;img src=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a8-lowres.png&quot; alt=&quot;2012-05-04-cairo-perf-chart-cortex-a8.png&quot; /&gt;&lt;/a&gt;
Ouch! The performance results do not look good for the pvr driver. It never got
anywhere close to the fbdev driver, let alone to the client side rendering via the cairo
image backend. And this time the fbdev driver was always slower than the image backend, which is not
surprising because this device only has a single ARM Cortex-A8 core.&lt;/p&gt;

&lt;p&gt;But let&#39;s forget about the traces of real applications for a moment. Is the pvr driver
even able to accelerate anything? Now we can take a look at synthetic benchmarks like
render_bench (with a &lt;a href=&quot;https://github.com/ssvb/render_bench/commit/a72b75c23bf56053b901380a6a067cf1324d0011&quot;&gt;bugfix&lt;/a&gt; applied),
which stresses simple scaled and non-scaled compositing using the &lt;a href=&quot;http://en.wikipedia.org/wiki/Alpha_compositing&quot;&gt;Over operator&lt;/a&gt;.
In other words, that is one of the most basic operations in 2D graphics (commonly used for translucency effects),
which any driver is expected to accelerate properly. Test results for the fbdev driver and for the pvr
driver (with and without the &quot;NoAccel&quot; option set in xorg.conf) are listed in the table below (&lt;a href=&quot;https://github.com/ssvb/ssvb.github.com/tree/master/files/2012-05-04/render-bench-cortex-a8&quot;&gt;render_bench logs are here&lt;/a&gt;).
Each test was also repeated with and without NEON SIMD optimizations enabled in pixman. As an interesting bonus, there is a comparison of
imlib2 vs. the pixman C implementation (CFLAGS=&quot;-O2 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon&quot; for both pixman and imlib2):&lt;/p&gt;

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;tr&gt;&lt;th&gt;&lt;th colspan=&#39;3&#39;&gt;pixman 0.25.2 with NEON&lt;th colspan=&#39;3&#39;&gt;pixman 0.25.2 without NEON&lt;th colspan=&#39;2&#39;&gt;imlib2 1.4.4&lt;tr&gt;&lt;th&gt;&lt;th&gt;fbdev&lt;th&gt;pvr&lt;br&gt;(NoAccel)&lt;th&gt;pvr&lt;th&gt;fbdev&lt;th&gt;pvr&lt;br&gt;(NoAccel)&lt;th&gt;pvr&lt;th&gt;built with&lt;br&gt;gcc 4.5.3&lt;th&gt;built with&lt;br&gt;gcc 4.7.0&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing non-scaled&lt;br&gt;Over blends&lt;td bgcolor=lightgreen&gt;0.56 sec&lt;td bgcolor=#F0F0F0&gt;0.76 sec&lt;td bgcolor=#6666FF&gt;1.33 sec&lt;td bgcolor=#F0F0F0&gt;1.58 sec&lt;td bgcolor=#FF3333&gt;3.86 sec&lt;td bgcolor=#6666FF&gt;1.33 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing&lt;br&gt;non-scaled Over blends&lt;td bgcolor=lightgreen&gt;0.44 sec&lt;td bgcolor=#F0F0F0&gt;0.44 sec&lt;td bgcolor=#6666FF&gt;1.23 sec&lt;td bgcolor=#F0F0F0&gt;1.40 sec&lt;td bgcolor=#FF3333&gt;1.41 sec&lt;td bgcolor=#6666FF&gt;1.23 sec&lt;td bgcolor=#F0F0F0&gt;1.16 sec&lt;td bgcolor=#F0F0F0&gt;1.21 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing 1/2 scaled&lt;br&gt;Over blends&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;td bgcolor=lightgreen&gt;0.40 sec&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;td bgcolor=#F0F0F0&gt;0.55 sec&lt;td bgcolor=#F0F0F0&gt;1.02 sec&lt;td bgcolor=#FF3333&gt;1.07 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing&lt;br&gt;1/2 scaled Over blends&lt;td bgcolor=lightgreen&gt;0.27 sec&lt;td bgcolor=#F0F0F0&gt;0.27 sec&lt;td bgcolor=#F0F0F0&gt;0.32 sec&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;td bgcolor=#F0F0F0&gt;0.43 sec&lt;td bgcolor=#FF3333&gt;0.48 sec&lt;td bgcolor=#F0F0F0&gt;0.40 sec&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing 2* smooth&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;3.65 sec&lt;td bgcolor=#F0F0F0&gt;8.74 sec&lt;td bgcolor=#F0F0F0&gt;8.76 sec&lt;td bgcolor=#F0F0F0&gt;25.45 sec&lt;td bgcolor=#F0F0F0&gt;50.63 sec&lt;td bgcolor=#FF3333&gt;50.69 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing 2*&lt;br&gt;smooth scaled Over blends&lt;td bgcolor=lightgreen&gt;3.44 sec&lt;td bgcolor=#F0F0F0&gt;3.45 sec&lt;td bgcolor=#F0F0F0&gt;3.62 sec&lt;td bgcolor=#F0F0F0&gt;25.02 sec&lt;td bgcolor=#F0F0F0&gt;25.04 sec&lt;td bgcolor=#FF3333&gt;25.25 sec&lt;td bgcolor=#F0F0F0&gt;14.21 sec&lt;td bgcolor=#F0F0F0&gt;12.92 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing 2* nearest&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;2.26 sec&lt;td bgcolor=#F0F0F0&gt;3.68 sec&lt;td bgcolor=#F0F0F0&gt;3.72 sec&lt;td bgcolor=#F0F0F0&gt;4.27 sec&lt;td bgcolor=#F0F0F0&gt;14.00 sec&lt;td bgcolor=#FF3333&gt;14.04 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing 2*&lt;br&gt;nearest scaled Over blends&lt;td bgcolor=lightgreen&gt;2.01 sec&lt;td bgcolor=#F0F0F0&gt;2.04 sec&lt;td bgcolor=#F0F0F0&gt;2.24 sec&lt;td bgcolor=#F0F0F0&gt;4.01 sec&lt;td bgcolor=#F0F0F0&gt;4.02 sec&lt;td bgcolor=#F0F0F0&gt;4.15 sec&lt;td bgcolor=#F0F0F0&gt;5.26 sec&lt;td bgcolor=#FF3333&gt;5.65 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing general nearest&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;5.57 sec&lt;td bgcolor=#F0F0F0&gt;7.68 sec&lt;td bgcolor=#F0F0F0&gt;7.72 sec&lt;td bgcolor=#F0F0F0&gt;6.18 sec&lt;td bgcolor=#F0F0F0&gt;19.92 sec&lt;td bgcolor=#FF3333&gt;19.96 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing general&lt;br&gt;nearest scaled Over blends&lt;td bgcolor=lightgreen&gt;5.23 sec&lt;td bgcolor=#F0F0F0&gt;5.37 sec&lt;td bgcolor=#F0F0F0&gt;5.60 sec&lt;td bgcolor=#F0F0F0&gt;5.96 sec&lt;td bgcolor=#F0F0F0&gt;5.97 sec&lt;td bgcolor=#F0F0F0&gt;6.04 sec&lt;td bgcolor=#F0F0F0&gt;8.90 sec&lt;td bgcolor=#FF3333&gt;9.59 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing general smooth&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;8.66 sec&lt;td bgcolor=#F0F0F0&gt;18.40 sec&lt;td bgcolor=#F0F0F0&gt;18.42 sec&lt;td bgcolor=#F0F0F0&gt;55.98 sec&lt;td bgcolor=#F0F0F0&gt;111.73 sec&lt;td bgcolor=#FF3333&gt;111.78 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing general&lt;br&gt;smooth scaled Over blends&lt;td bgcolor=#F0F0F0&gt;8.44 sec&lt;td bgcolor=lightgreen&gt;8.44 sec&lt;td bgcolor=#F0F0F0&gt;8.58 sec&lt;td bgcolor=#F0F0F0&gt;55.18 sec&lt;td bgcolor=#F0F0F0&gt;55.31 sec&lt;td bgcolor=#F0F0F0&gt;55.50 sec&lt;td bgcolor=#FF3333&gt;57.04 sec&lt;td bgcolor=#F0F0F0&gt;43.71 sec&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;The best results in the table above are highlighted in green, the worst in red.
Only the non-scaled tests showed signs of hardware acceleration
(low CPU load, same performance regardless of whether NEON is enabled in pixman);
they are highlighted in blue. All the &quot;non-blue&quot; pvr driver tests fall back
to pixman for software rendering. Other observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fbdev is the fastest driver, showing performance equal to or significantly better than the pvr driver&#39;s&lt;/li&gt;
&lt;li&gt;disabling acceleration in the pvr driver is not enough to get really well performing software rendering (and this may also be true for many other Xorg drivers)&lt;/li&gt;
&lt;li&gt;non-offscreen rendering is particularly slow for the pvr driver, especially when NEON is disabled, which suggests that its fallbacks to pixman software rendering may be working with non-cached memory buffers in this case&lt;/li&gt;
&lt;li&gt;pixman without NEON and imlib2 have similar performance (&quot;2* smooth scaling&quot; stands out, but it probably has its own special optimized path in imlib2); NEON is significantly faster&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Now let&#39;s have a closer look at the non-scaled test and do some &lt;a href=&quot;http://ssvb.github.com/2011/08/23/yet-another-oprofile-tutorial.html&quot;&gt;profiling&lt;/a&gt;
for it. In the original render_bench test, a 100x100 image is blended over a 320x320 window. This means
the size of the working set is just ~450KB, which is a bit too small by today&#39;s standards.
ARM Cortex-A8 has 256KiB of L2 cache, and the L2 cache is apparently providing a performance boost
for the fbdev driver here (0.56 sec vs. 1.33 sec, more than twice as fast as the GPU). In order
to make the test more fair and make the CPU cache less useful, let&#39;s increase the window size to
1000x1000, increase the number of repetitions, and run only the &quot;Xrender doing non-scaled Over
blends&quot; test, first for the fbdev driver and then for the pvr:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;=== fbdev driver (Time: 32.588 sec.) ===&lt;/b&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;samples|      %|
------------------
   148407 94.2817 Xorg
              TIMER:0|
      samples|      %|
    ------------------
       112860 76.0476 libpixman-1.so.0.25.2
        13326  8.9794 libshadow.so
        12679  8.5434 Xorg
         6072  4.0915 libc-2.13.so
         1976  1.3315 libfb.so
          787  0.5303 vmlinux
          369  0.2486 [vectors] (tgid:1719 range:0xffff0000-0xffff1000)
          217  0.1462 ld-2.13.so
           87  0.0586 fbdev_drv.so
           13  0.0088 libglx.so
           12  0.0081 libXfont.so.1.4.1
     4044  2.5691 vmlinux&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;b&gt;=== pvr driver (Time: 41.911 sec.) ===&lt;/b&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;samples|      %|
------------------
   137617 76.5964 vmlinux
    35739 19.8920 Xorg
              TIMER:0|
      samples|      %|
    ------------------
         7455 20.8596 libsrv_um.so.1.7.783851
         6776 18.9597 Xorg
         4889 13.6797 vmlinux
         4699 13.1481 pvrsrvkm
         3474  9.7205 libc-2.13.so
         2857  7.9941 libpixman-1.so.0.25.2
         2224  6.2229 libexa.so
         1554  4.3482 pvr_drv.so
          743  2.0790 drm
          528  1.4774 libfb.so
          334  0.9346 libdrm.so.2.4.0
          126  0.3526 [vectors] (tgid:1690 range:0xffff0000-0xffff1000)
           38  0.1063 libpvr2d.so.1.7.783851
           14  0.0392 libglx.so&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
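&lt;p&gt;As a quick aside, the working-set arithmetic behind this &quot;more fair&quot; setup can be sketched as follows (a rough estimate: it counts only the source and destination pixel buffers at 4 bytes per pixel and ignores any X server overhead):&lt;/p&gt;

```python
# Working set of the render_bench non-scaled Over test, in KB.
def working_set_kb(src_side, dst_side):
    # Over reads the source buffer and reads+writes the destination buffer.
    return (src_side * src_side + dst_side * dst_side) * 4 / 1024

small = working_set_kb(100, 320)    # ~440 KB: mostly covered by the 256 KiB L2
big = working_set_kb(100, 1000)     # ~3945 KB: far beyond any Cortex-A8 cache
```

The small working set largely fits in cache, which is exactly why the original test flattered the CPU path.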


&lt;p&gt;Based on the profiling results above we see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now that the CPU cache is not helping much when working with large buffers,
the performance difference between the CPU and the GPU has shrunk significantly.
The CPU is still somewhat faster.&lt;/li&gt;
&lt;li&gt;There is a &quot;shadow framebuffer&quot; impacting software rendering performance when
drawing on screen, but I&#39;ll write more about it next time.&lt;/li&gt;
&lt;li&gt;The average CPU load is only ~20% when GPU acceleration is used, and the total
amount of CPU time spent in the Xorg process to complete the test
is ~4x lower (148407 oprofile samples vs. 35739) with GPU acceleration.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So we can clearly say that hardware acceleration is indeed used by the
pvr driver. It just needs to be improved a lot before it can
provide any practical benefit and successfully pass the trial by
cairo traces.&lt;/p&gt;

&lt;p&gt;At the risk of boring the readers even more, I&#39;ll provide some more data
regarding how CPU caches affect performance.
The pixman library includes a simple, crude test program
called &lt;b&gt;lowlevel-blt-bench&lt;/b&gt; in the &quot;test&quot; directory. It can
roughly estimate the performance of various 2D graphics
operations depending on the size of the working set (L1 - data
fits the L1 cache, L2 - data fits the L2 cache, M - data does not fit any cache).
I already mentioned it in &lt;a href=&quot;http://ssvb.github.com/2011/09/13/origenboard-memory-performance.html&quot;&gt;my older blog post&lt;/a&gt;,
but it probably does no harm to repeat a bit. For
this particular IGEPv2 board (Cortex-A8 processor running at 1GHz), I can
measure the following performance numbers (in MPix/s) with lowlevel-blt-bench:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;add_8888_8888 =  L1: 487.07  L2: 441.24  M: 76.53
    over_8888_8888 =  L1: 342.18  L2: 294.20  M: 75.50&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Both the &quot;Add&quot; and &quot;Over&quot; operators have exactly the same memory access pattern
per pixel: read the source pixel (4 bytes), read the destination pixel
(4 bytes), do some calculations and write the result back to the
destination (4 bytes). Processing one pixel thus involves reading 8 bytes
and writing 4 bytes, or 12 bytes in total. The expected memory performance
is a bit difficult to predict, because the bandwidth for memory reads and
writes is not equal (memory writes are faster). This device can do ~500-550 MB/s
memcpy (1000-1100 MB/s of total read+write bandwidth) and ~1500-1550 MB/s
memset. The &quot;Add&quot; and &quot;Over&quot; operators stress memory reads a bit more than writes,
so the total cumulative achievable memory bandwidth is slightly worse than
that of memcpy: ~76 MPix/s * 12 bytes ~= ~900 MB/s. But what matters most,
this synthetic benchmark also shows that the CPU could easily crunch at
least 4x more pixels if the memory subsystem could feed it the
needed data in time! When the data is not available
in the CPU L1/L2 caches, the CPU works at just 1/4 of its capability
and idles the rest of the time. I wish ARM processors supported SMT
(or hyper-threading, as Intel calls it). In that case the other
hardware thread would be able to do a lot of work in parallel. Did I say
something about a dedicated CPU core being able to act as a 2D accelerator
in the previous Cortex-A9 section? Forget that. Even just an extra hardware
thread might be enough (if we are doing simple non-scaled 2D stuff like
drawing rectangular windows, using alpha blending for translucency effects and
moving them around).&lt;/p&gt;
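&lt;p&gt;A minimal sketch of this bandwidth arithmetic, using the over_8888_8888 numbers from the lowlevel-blt-bench output above:&lt;/p&gt;

```python
# Each Add/Over pixel touches 12 bytes: read src (4) + read dst (4) + write dst (4).
bytes_per_pixel = 4 + 4 + 4
m_rate, l1_rate = 75.50, 342.18        # over_8888_8888 "M" and "L1" figures
bandwidth_mb_s = m_rate * bytes_per_pixel   # ~906 MB/s cumulative read+write
headroom = l1_rate / m_rate                 # ~4.5x: what the CPU could do if
                                            # memory were not the bottleneck
```

The ~4.5x headroom is exactly the "at least 4x more pixels" figure quoted above.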

&lt;p&gt;As it turns out, the CPU is much faster than memory for simple non-scaled 2D
graphics (this includes YUV-&gt;RGB conversion, alpha blending, simple copy,
fill, ...). Caches help a lot, but they are relatively small
and work best when memory accesses have good locality.
The cairo library is an immediate mode renderer, which is easy to use, but
also gives users the freedom to shoot themselves in the foot. For
example, if the user wants to composite many translucent screen-sized
layers (bigger than the L2 cache) on top of each other, then they will be
rendered exactly that way, going through the slow memory interface for each
of these layers over and over again. An obvious optimization is to split
the picture into a number of tiles, each small enough to fit the L2 or even
L1 cache, and then blend all the layers within each tile.
This is effective, but requires some effort from the user.&lt;/p&gt;
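&lt;p&gt;To illustrate the tiling idea, here is a small, self-contained toy sketch (not cairo or pixman code: plain Python lists, with a saturating add standing in for a real Over blend). Both traversal orders produce the same image, but the tiled one applies every layer to a destination tile while that tile is still cache-hot:&lt;/p&gt;

```python
# Blend several "layers" into a destination, naively (full pass per layer)
# vs. tiled (all layers applied to one small tile before moving on).
W, H, TILE = 64, 64, 16

def blend_naive(dst, layers):
    for layer in layers:                      # one full pass over dst per layer
        for i in range(W * H):
            dst[i] = min(255, dst[i] + layer[i])

def blend_tiled(dst, layers):
    for ty in range(0, H, TILE):              # visit each tile once...
        for tx in range(0, W, TILE):
            for layer in layers:              # ...and apply every layer to it
                for y in range(ty, ty + TILE):
                    row = y * W
                    for x in range(tx, tx + TILE):
                        i = row + x
                        dst[i] = min(255, dst[i] + layer[i])

layers = [[(i * 7 + k) % 256 for i in range(W * H)] for k in range(4)]
a = [0] * (W * H)
b = [0] * (W * H)
blend_naive(a, layers)
blend_tiled(b, layers)
assert a == b   # same image, but the tiled walk has far better cache locality
```

In a real renderer the per-pixel work would be SIMD code and the tile size would be chosen to fit the L1 or L2 cache, but the traversal-order idea is the same.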

&lt;p&gt;What is the solution? A modern approach is to simply take that freedom away
from the users (so that they don&#39;t hurt themselves) and enforce a certain
performance-friendly rendering model. Some people think that a
&lt;a href=&quot;http://qt.nokia.com/learning/online/talks/developerdays2010/tech-talks/scene-graph-a-different-approach-to-graphics-in-qt/&quot;&gt;scene graph&lt;/a&gt;
is the silver bullet.&lt;/p&gt;

&lt;p&gt;But I have strayed from the original topic already. The pvr driver is what we have for a 2D
hardware accelerated Linux desktop on OMAP3 devices, but it is more of a technical
demo and hardly suitable for any practical use. On the positive side, work is ongoing
and &lt;a href=&quot;https://github.com/robclark/xf86-video-omap&quot;&gt;xf86-video-omap&lt;/a&gt; may eventually
become a better 2D driver for this hardware. OMAP4470 is even more promising, as it is going
to have &lt;a href=&quot;http://pandaboard.org/pbirclogs/index.php?date=2012-04-14#T13:17:12&quot;&gt;real 2D blitter hardware&lt;/a&gt;
with open source drivers for it.&lt;/p&gt;

&lt;p&gt;The current 2D driver may be disappointing, but we should not forget that SGX530 is
primarily a 3D accelerator with mature and well optimized drivers for OpenGL ES 2.0
(the demos and examples run fine). It is also worth mentioning that cairo has an OpenGL
ES 2.0 backend, but it cannot be used on SGX530 yet because of
&lt;a href=&quot;http://comments.gmane.org/gmane.comp.lib.cairo/22605&quot;&gt;missing GL_OES_texture_npot extension&lt;/a&gt;
support.&lt;/p&gt;

&lt;h2&gt;Intel Atom N450 1.67GHz (Samsung N220 netbook)&lt;/h2&gt;

&lt;p&gt;And for the sake of completeness, here are the results from the Intel Atom. They merely confirm
Chris Wilson&#39;s results, and additionally show the effect of SSE2 optimizations
on software rendering.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2012-05-04-cairo-perf-chart-atom.png&quot;&gt;&lt;img src=&quot;/images/2012-05-04-cairo-perf-chart-atom-lowres.png&quot; alt=&quot;2012-05-04-cairo-perf-chart-atom.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also run lowlevel-blt-bench from pixman for the same &quot;Add&quot; and &quot;Over&quot; operations:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;add_8888_8888 =  L1: 607.08  L2: 375.34  M:259.53
    over_8888_x888 =  L1: 123.73  L2: 117.10  M:113.56&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now the memory bandwidth is only fully utilized for the &quot;Add&quot; operator, but
not for &quot;Over&quot;. Using a modified variant of render_bench which calculates and
reports MPix/s statistics, we can put the MPix/s rates for different operations
into the following table:&lt;/p&gt;

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;tr&gt;&lt;th&gt;Compositing operation&lt;th&gt;performance on Intel Atom N450
&lt;tr&gt;&lt;td&gt;pixman non-scaled Add&lt;td bgcolor=&#39;lightgreen&#39;&gt;~260 MPix/s
&lt;tr&gt;&lt;td&gt;pixman non-scaled Over&lt;td bgcolor=&#39;yellow&#39;&gt;~110 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated non-scaled Add&lt;td bgcolor=&#39;lightgreen&#39;&gt;~270 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated non-scaled Over&lt;td bgcolor=&#39;lightgreen&#39;&gt;~270 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated nearest scaled Over&lt;td bgcolor=&#39;lightgreen&#39;&gt;~260 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated bilinear scaled Over&lt;td bgcolor=&#39;lightgreen&#39;&gt;~260 MPix/s
&lt;/table&gt;


&lt;p&gt;While all the operations performed on the GPU and also the &lt;a href=&quot;http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.22.0#n1321&quot;&gt;software rendered
Add&lt;/a&gt; run at approximately
the same speed, the &lt;a href=&quot;http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.22.0#n630&quot;&gt;software rendered Over&lt;/a&gt;
falls behind. This is integrated graphics: the CPU and GPU use the same memory,
so it is not surprising that they both hit the same memory performance limit. The GPU&#39;s strength
is in handling operations which need heavier computations. And it is able to fully utilize
memory bandwidth regardless of whether scaling is used. This is how a really good hardware-accelerated
driver should behave.&lt;/p&gt;

&lt;h2&gt;Reproducing these test results and charts&lt;/h2&gt;

&lt;p&gt;People are generally lazy (me included), so precise step-by-step instructions may save
time and/or encourage somebody to actually try reproducing the tests on their
system. First we can try:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;wget http://cairographics.org/releases/cairo-1.12.0.tar.gz
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;tar -xzf cairo-1.12.0.tar.gz
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;cairo-1.12.0
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./configure
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;make
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;perf
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;make cairo-perf-chart&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This gives us the &quot;cairo-perf-chart&quot; tool, which can be used to generate nice PNG charts from cairo-perf-trace logs.
The cairo-perf-trace logs used for the charts in this blog post are &lt;a href=&quot;https://github.com/ssvb/ssvb.github.com/tree/master/files/2012-05-04/cairo-perf-trace&quot;&gt;available here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Compiling the cairo library and running the benchmarks can be done in the following way.
Obviously, the system needs to have a compiler and the build dependencies
installed (watch for error messages from the configure scripts). Cross-compilation
is also easy, but I have intentionally left it out in order not to add extra confusion.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;c&quot;&gt;# set cairo/pixman version and compilation options&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.12.0
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PIXMAN_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;0.25.2
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CFLAGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;-O2 -g&amp;quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;gcc
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_TEST_TARGET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;image

&lt;span class=&quot;c&quot;&gt;# setup build environment&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PREFIX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;/tmp
mkdir &lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/cairo/lib:&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/pixman/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PKG_CONFIG_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/cairo/lib/pkgconfig:&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/pixman/lib/pkgconfig

&lt;span class=&quot;c&quot;&gt;# download and unpack cairo/pixman sources&lt;/span&gt;

wget http://cairographics.org/snapshots/pixman-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PIXMAN_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz
wget http://cairographics.org/releases/cairo-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz
tar -xzf pixman-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PIXMAN_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz
tar -xzf cairo-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz

&lt;span class=&quot;c&quot;&gt;# build pixman and cairo&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;pushd &lt;/span&gt;pixman-&lt;span class=&quot;nv&quot;&gt;$PIXMAN_VERSION&lt;/span&gt;
./configure --prefix&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/pixman &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make install &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;exit &lt;/span&gt;1
&lt;span class=&quot;nb&quot;&gt;popd&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;pushd &lt;/span&gt;cairo-&lt;span class=&quot;nv&quot;&gt;$CAIRO_VERSION&lt;/span&gt;
./configure --prefix&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/cairo &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make install &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;exit &lt;/span&gt;1
&lt;span class=&quot;nb&quot;&gt;popd&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# download and build cairo traces (warning: this is a HUGE git repository)&lt;/span&gt;

git clone git://anongit.freedesktop.org/cairo-traces
&lt;span class=&quot;nb&quot;&gt;pushd &lt;/span&gt;cairo-traces
make
&lt;span class=&quot;nb&quot;&gt;popd&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# run cairo-perf-trace benchmarks&lt;/span&gt;

cairo-&lt;span class=&quot;nv&quot;&gt;$CAIRO_VERSION&lt;/span&gt;/perf/cairo-perf-trace -i3 -r cairo-traces/benchmark &amp;gt; results.txt&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This gives us a &quot;results.txt&quot; file in raw format, which can be used as an input for
the cairo-perf-chart tool. If the &lt;b&gt;-r&lt;/b&gt; option is not used, the output of
cairo-perf-trace is in a more human-readable text format. The CAIRO_TEST_TARGET environment variable can be set to &quot;image&quot;, &quot;xlib&quot; or any other supported backend.&lt;/p&gt;

&lt;h2&gt;Final words&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Your mileage may vary, but many simple and very common 2D operations do not need much
processing power (even one CPU core is excessive). On the other hand, memory bandwidth is
critical and directly affects performance.&lt;/li&gt;
&lt;li&gt;On multi-core systems, software rendering in the X server can play the role of a 2D accelerator to some extent&lt;/li&gt;
&lt;li&gt;Good-quality scaling, rotation, radial gradients, convolution filters and other processing
power hungry operations benefit from GPU acceleration. The CPU could obviously also use
multithreaded rendering for these operations to take advantage of all CPU
cores, but multithreaded rendering is still not supported in pixman.&lt;/li&gt;
&lt;li&gt;The pvr xorg driver is not ready for OMAP3 hardware yet; do not use it&lt;/li&gt;
&lt;li&gt;Disabled acceleration does not always mean full-speed software rendering, so if your driver
provides an option to disable acceleration, that option can&#39;t be fully trusted&lt;/li&gt;
&lt;li&gt;Immediate-mode renderers such as cairo are a hard challenge for hardware acceleration&lt;/li&gt;
&lt;/ul&gt;

</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2012/04/10/cpuburn-arm-cortex-a9</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2012/04/10/cpuburn-arm-cortex-a9.html"/>
   <title>Is your ARM Cortex-A9 hot enough?</title>
   <updated>2012-04-10T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Inspired by the &lt;a href=&quot;https://plus.google.com/u/0/100242854243155306943/posts/QCpWUZEkF9i&quot;&gt;google+ post&lt;/a&gt; by Koen Kooi, I decided to check whether NEON is also hot in Cortex-A9.
Appears that &lt;a href=&quot;http://packages.debian.org/sid/cpuburn&quot;&gt;cpuburn tool&lt;/a&gt; supports ARM since 2010. And openembedded uses an alternative
&lt;a href=&quot;http://cgit.openembedded.org/openembedded/commit/?id=7bc322831d1ed3487d36dee4687b7fa3b5cc81e4&quot;&gt;cpuburn-neon&lt;/a&gt; implementation.
As we have at least two implementations, naturally one of them might be more efficient on Cortex-A9 than the other.
So I tested both of them on my old OMAP4430 based &lt;a href=&quot;http://pandaboard.org/&quot;&gt;pandaboard&lt;/a&gt;  (I would not miss this board too much
if it actually burns). The results of this comparison are provided in the table at the bottom.&lt;/p&gt;

&lt;p&gt;I could have stopped at this point, but that would be no fun :) So I tried to experiment a bit with Cortex-A9 power consumption myself. It turns out
that Cortex-A9 can actually run a bit hotter. On the NEON side, &lt;b&gt;VLDx&lt;/b&gt; instructions seem to be more power hungry than anything else
by a large margin. And aligned 128-bit reads are the best at generating heat. Using the &lt;b&gt;VLD2&lt;/b&gt; variant with
post-increment makes it do a bit more work than plain &lt;b&gt;VLD1&lt;/b&gt;. Moving to the ARM side, conditional branches and &lt;b&gt;SMLAL&lt;/b&gt;
instructions are also rather hot. Mixing everything together, we get &lt;a href=&quot;http://github.com/downloads/ssvb/ssvb.github.com/ssvb-cpuburn-a9.S&quot;&gt;one more implementation of cpuburn for Cortex-A9&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;.syntax unified
    .text
    .arch armv7-a
    .fpu neon
    .arm

    .global main
    .global sysconf
    .global fork

/* optimal value for LOOP_UNROLL_FACTOR seems to be BTB size dependent */
#define LOOP_UNROLL_FACTOR   110
/* 64 seems to be a good choice */
#define STEP                 64

.func main
main:

#ifdef __linux__
        mov         r0, 84 /* _SC_NPROCESSORS_ONLN */
        blx         sysconf
        mov         r4, r0
        cmp         r4, #2
        blt         1f
        blx         fork /* have at least 2 cores */
        cmp         r4, #4
        blt         1f
        blx         fork /* have at least 4 cores */
1:
#endif

        ldr         lr, =(STEP * 4 + 15)
        subs        lr, sp, lr
        bic         lr, lr, #15
        mov         ip, #STEP
        mov         r0, #0
        mov         r1, #0
        mov         r2, #0
        mov         r3, #0
        ldr         r4, =0xFFFFFFFF
        b           0f
    .ltorg
0:
    .rept LOOP_UNROLL_FACTOR
        vld2.8      {q0}, [lr, :128], ip
        it          ne
        smlalne     r0, r1, lr, r4
        bne         1f
1:
        vld2.8      {q1}, [lr, :128], ip
        it          ne
        smlalne     r2, r3, lr, r4
        bne         1f
1:
        vld2.8      {q2}, [lr, :128], ip
        vld2.8      {q3}, [lr, :128], ip
        it          ne
        subsne      lr, lr, #(STEP * 4)
    .endr
        bne         0b
.endfunc&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
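
&lt;p&gt;The fork logic at the top of &lt;b&gt;main&lt;/b&gt; deserves a note: one fork from the initial process yields two instances, and a second fork, executed by both of those processes, yields four. Here is a tiny Python sketch of the resulting instance count (the function name is my own, purely for illustration, and not part of the assembly):&lt;/p&gt;

```python
# Mirrors the sysconf/fork logic in the assembly above:
# fork once if there are at least 2 cores, then once more
# (in every running process) if there are at least 4 cores.
def cpuburn_instances(ncores):
    instances = 1
    if ncores >= 2:
        instances *= 2  # first fork: 1 process becomes 2
    if ncores >= 4:
        instances *= 2  # second fork runs in both: 2 become 4
    return instances

print([cpuburn_instances(n) for n in (1, 2, 3, 4, 8)])  # [1, 2, 2, 4, 4]
```

&lt;p&gt;Note that the assembly only ever forks twice, so a hypothetical 8-core system would still get just 4 instances.&lt;/p&gt;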


&lt;p&gt;Maybe more improvements are still possible if I overlooked some better instructions, tricks with L2-&gt;L1 prefetches or anything else.
Also I have not tried running any tests on Cortex-A8 yet. But Cortex-A8 needs different tuning and I would not be
surprised if the older cpuburn implementations can actually do a better job there. Finally,
the obligatory warning: &lt;b&gt;This program tries to stress the processor, attempting to generate
as much heat as possible. Improperly cooled or otherwise flawed hardware may potentially overheat and fail. Use at your own risk!&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;As for the table below, each implementation has been tested with both Cortex-A9 cores fully loaded (starting two instances of
cpuburn if needed). Current draw values were measured after running the test uninterrupted for 10-15 minutes.
Honestly, the total ~1640 mA sustained current draw by the pandaboard looks quite scary to me. At least I would
not dare to even try additionally stressing the GPU and/or the hardware video decoder at the same time.&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;cpuburn implementation, running on both A9 cores
&lt;th&gt;current draw from 5V PSU (whole board, not just CPU)
&lt;tr&gt;&lt;td&gt;idle system (this kernel has no power management)
&lt;td&gt;~550 mA
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://hardwarebug.org/files/burn.S&quot;&gt;cpuburn-neon&lt;/a&gt;
&lt;td&gt;~1130 mA
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://packages.debian.org/sid/cpuburn&quot;&gt;cpuburn-1.4a&lt;/a&gt; (burnCortexA9.s)
&lt;td&gt;~1180 mA
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://github.com/downloads/ssvb/ssvb.github.com/ssvb-cpuburn-a9.S&quot;&gt;ssvb-cpuburn-a9.S&lt;/a&gt;
&lt;td&gt;&lt;b&gt;~1640 mA&lt;/b&gt;
&lt;/table&gt;
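
&lt;p&gt;To put the current draw figures from the table into perspective, they can be converted to watts. This is a back-of-the-envelope estimate for the whole board at the nominal 5 V supply voltage, not a precise CPU power measurement:&lt;/p&gt;

```python
# Whole-board power estimate at the nominal 5 V supply voltage.
SUPPLY_VOLTAGE = 5.0

def watts(milliamps):
    return SUPPLY_VOLTAGE * milliamps / 1000.0

idle = watts(550)   # idle system
burn = watts(1640)  # ssvb-cpuburn-a9.S on both cores
print(round(idle, 2), round(burn, 2), round(burn - idle, 2))
```

&lt;p&gt;That is roughly 2.75 W idle versus ~8.2 W under load, so around 5.45 W of extra board-level power dissipation is attributable to the stress test.&lt;/p&gt;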


&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;And also a cpuburn tweak for ARM Cortex-A8 (added on 2012-04-11)&lt;/h3&gt;

&lt;p&gt;A quick test on Cortex-A8 shows that using &lt;b&gt;SMLAL&lt;/b&gt; is a bad idea there, but extra NEON arithmetic instructions
can be added because Cortex-A8 supports dual issue for NEON.&lt;/p&gt;

&lt;p&gt;This time experimenting with DM3730 based &lt;a href=&quot;http://igep.es/index.php?option=com_content&amp;amp;view=article&amp;amp;id=46&amp;amp;Itemid=55&quot;&gt;IGEPv2 board&lt;/a&gt;
(ARM Cortex-A8 @1GHz) and using &lt;a href=&quot;https://github.com/mrj10/dm3730-temp-sensor&quot;&gt;dm3730-temp-sensor&lt;/a&gt; for temperature measurements:&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;cpuburn implementation
&lt;th&gt;temperature
&lt;tr&gt;&lt;td&gt;idle system (this kernel has no power management)
&lt;td&gt;~57.75 C
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://hardwarebug.org/files/burn.S&quot;&gt;cpuburn-neon&lt;/a&gt;
&lt;td&gt;~92.75 C
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://packages.debian.org/sid/cpuburn&quot;&gt;cpuburn-1.4a&lt;/a&gt; (burnCortexA8.s)
&lt;td&gt;~96.00 C
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://github.com/downloads/ssvb/ssvb.github.com/ssvb-cpuburn-a8.S&quot;&gt;ssvb-cpuburn-a8.S&lt;/a&gt;
&lt;td&gt;&lt;b&gt;~104.25 C&lt;/b&gt;
&lt;/table&gt;


&lt;p&gt;&lt;strike&gt;If the sensor is not lying, then maybe using a plastic case for this board was not a good choice after all.&lt;/strike&gt; The sensor is most likely lying as explained by Nishanth Menon in the &lt;a href=&quot;https://plus.google.com/u/0/113201731981878354205/posts/44WtAFbQcaK&quot;&gt;google+ comments&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Final words (added on 2012-04-11)&lt;/h3&gt;

&lt;p&gt;Before anybody jumps to wild conclusions, I would like to note that:&lt;ul&gt;
&lt;li&gt;Pandaboard is not a mobile device and it is not designed for really low power consumption. It is a known fact that it &lt;a href=&quot;http://omappedia.org/wiki/PandaBoard_FAQ#What_are_the_specs_of_the_Power_supply_I_should_use_with_a_PandaBoard.3F&quot;&gt;requires a PSU rated at 4A&lt;/a&gt;. I don&#39;t have any idea where most of the heat is dissipated, but it is quite likely that not only the OMAP chip is involved.&lt;/li&gt;
&lt;li&gt;Cpuburn is very different from any typical workload and can&#39;t be used for estimating power consumption. It is just a hardware reliability testing tool.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/09/13/origenboard-memory-performance</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/09/13/origenboard-memory-performance.html"/>
   <title>Origenboard, memory performance</title>
   <updated>2011-09-13T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Those who have read my old
&lt;a href=&quot;http://ssvb.github.com/2011/07/30/origenboard-early-adopter.html&quot;&gt;Origenboard, early adopter impressions&lt;/a&gt;
blog post may wonder why I bought this board in the first place. As far as I know, there is no
freely available public documentation for Exynos 4210 SoC so the &quot;if you want something done, do
it yourself&quot; approach does not work well, and the support provided at
&lt;a href=&quot;http://www.origenboard.org/&quot;&gt;origenboard.org&lt;/a&gt; has not been very stellar so far.
&lt;a href=&quot;http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?contentId=53243&amp;amp;navigationId=12843&amp;amp;templateId=6123&quot;&gt;OMAP4&lt;/a&gt;
based &lt;a href=&quot;http://pandaboard.org/&quot;&gt;pandaboard&lt;/a&gt; is a lot more open source friendly, has a great community
around it and would have been a no-brainer choice, right?
Well, pandaboard is a great piece of hardware, but the early boards, based on the initial OMAP4 revisions,
used to have a rather
&lt;a href=&quot;http://computerarch.com/log/2011/03/01/pandaboard/&quot;&gt;poor&lt;/a&gt;
&lt;a href=&quot;http://groups.google.com/group/pandaboard/browse_thread/thread/24d80cc66f52b789/b977c1ee5eb5a78c?#b977c1ee5eb5a78c&quot;&gt;memory&lt;/a&gt;
&lt;a href=&quot;http://groups.google.com/group/pandaboard/browse_thread/thread/2d4d82eb530e8195&quot;&gt;performance&lt;/a&gt;.
According to the information from the pandaboard mailing list, &lt;a href=&quot;http://groups.google.com/group/pandaboard/msg/dfd2d2e1336d435b&quot;&gt;OMAP4460 is expected to address these problems&lt;/a&gt;.
Too bad that there are no OMAP4460 powered pandaboards available for sale yet. And that&#39;s why I decided to check the new alternative
solution from Samsung to see what they can offer.&lt;/p&gt;

&lt;h3&gt;But who cares about memory performance?&lt;/h3&gt;

&lt;p&gt;Any software which works with large data sets not fitting into the L1/L2 caches
benefits from fast memory. I&#39;m particularly interested in having fast software-rendered
2D graphics, and this is exactly the case where fast memory is
critical for getting good performance.&lt;/p&gt;

&lt;p&gt;Just to give an example, let&#39;s take some numbers from my older
&lt;a href=&quot;http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00695.html&quot;&gt;post in the pixman mailing list&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;== Intel Atom N450 @1667MHz, DDR2-667 (64-bit) ==

           add_8888_8888 =  L1: 607.08  L2: 375.34  M:259.53
          over_8888_x888 =  L1: 123.73  L2: 117.10  M:113.56
          over_8888_0565 =  L1: 106.11  L2:  98.91  M: 99.07

== TI OMAP3430/3530, ARM Cortex-A8 @500MHz, LPDDR @166MHz (32-bit) ==

    default build:
           add_8888_8888 =  L1: 227.26  L2:  84.71  M: 44.54
          over_8888_x888 =  L1: 161.06  L2:  88.20  M: 44.86
          over_8888_0565 =  L1: 127.02  L2:  93.99  M: 61.25

    software prefetch disabled (*):
           add_8888_8888 =  L1: 351.44  L2:  97.29  M: 25.35
          over_8888_x888 =  L1: 168.72  L2:  95.04  M: 24.81
          over_8888_0565 =  L1: 128.06  L2:  98.96  M: 32.16&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;All the numbers are provided by lowlevel-blt-bench test program from &lt;a href=&quot;http://pixman.org/&quot;&gt;pixman&lt;/a&gt;
and are measured in MPix/s.
There are three cases benchmarked for each 2D graphics operation: L1 (data set which fits L1 cache),
L2 (data set which fits L2 cache) and M (data set does not fit caches and has to work with memory).
It becomes very clear that the ARM NEON optimized code was memory bandwidth limited, at least on
early OMAP3 devices. And the Intel Atom surely had much better memory bandwidth:
~260 MPix/s * 4 bytes per pixel * (2 reads and 1 write per pixel for add_8888_8888), which is ~3.1 GB/s
total. These are just microbenchmark numbers, but actual software rendered 2D graphics performance
is also heavily affected by memory speed. And fast memory is important for having a responsive
and fast Linux desktop even without GPU acceleration. And as far as I know, there are still
&lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=OTgyMA&quot;&gt;no open source GPU drivers available for mobile devices&lt;/a&gt;.&lt;/p&gt;
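
&lt;p&gt;The bandwidth arithmetic above is easy to double-check. A small Python sketch of the same calculation (the function name is my own invention, and GB here means decimal gigabytes):&lt;/p&gt;

```python
# Estimate the memory traffic implied by a lowlevel-blt-bench "M:" result.
def implied_bandwidth_gb(mpix_per_s, bytes_per_pixel=4, accesses_per_pixel=3):
    # MPix/s * bytes per pixel * memory accesses per pixel, converted to GB/s
    return mpix_per_s * 1e6 * bytes_per_pixel * accesses_per_pixel / 1e9

# Intel Atom N450, add_8888_8888 (2 reads + 1 write per pixel): M: 259.53 MPix/s
print(round(implied_bandwidth_gb(259.53), 1))  # 3.1
```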

&lt;h3&gt;Introducing yet another memory benchmark program&lt;/h3&gt;

&lt;p&gt;If we want to know whether the memory is fast in our system, we need to benchmark it somehow.
There is a popular &lt;a href=&quot;http://www.cs.virginia.edu/stream/FTP/Code/stream.c&quot;&gt;STREAM&lt;/a&gt; benchmark,
but its results are apparently
&lt;a href=&quot;http://groups.google.com/group/pandaboard/msg/1e5f08c949d4bf5d&quot;&gt;very much compiler dependent when run on ARM&lt;/a&gt;.
Moreover, it uses floating point, making it unsuitable for
devices which don&#39;t have an FPU (there it would test anything but memory bandwidth).&lt;/p&gt;

&lt;p&gt;So I tried to make my own memory benchmark program, which measures the peak
bandwidth of sequential memory accesses and the latency of random memory accesses.
Bandwidth is measured by running different assembly code on aligned memory blocks
and attempting different prefetch strategies. This benchmark program also integrates
some of my old &lt;a href=&quot;http://permalink.gmane.org/gmane.comp.graphics.pixman/1104&quot;&gt;ARM&lt;/a&gt; and
&lt;a href=&quot;http://permalink.gmane.org/gmane.comp.graphics.pixman/1026&quot;&gt;MIPS32&lt;/a&gt; memory bandwidth
test code.&lt;/p&gt;

&lt;p&gt;There are some potential pitfalls when implementing benchmarks. A popular mistake is
forgetting to initialize the buffers and having the results distorted by &lt;a href=&quot;http://en.wikipedia.org/wiki/Copy-on-write&quot;&gt;COW&lt;/a&gt;.
But copying data from one memory buffer to another is also not so simple. Depending
on the relative alignment of the source and destination buffers, the
performance may vary a lot. This was noticed by
Måns Rullgård
(mru)
in the &lt;a href=&quot;http://pandaboard.org/pbirclogs/index.php?date=2010-11-04#T21:52:53&quot;&gt;#pandaboard irc&lt;/a&gt; almost a year ago. And
the effect of the offset between the arrays is also mentioned in the &lt;a href=&quot;http://www.cs.virginia.edu/stream/ref.html&quot;&gt;STREAM benchmark FAQ&lt;/a&gt;.
Moreover, physical memory fragmentation also plays
a role because the caches in modern processors are physically tagged. So exactly
the same program may provide different results depending on whether it is run on
a freshly rebooted system (with almost no memory fragmentation) or on a system
which has been running for a while. Overall, this looks like some kind of aliasing in the
memory subsystem. And ironically, the performance on a freshly rebooted system
is typically worse.&lt;/p&gt;

&lt;p&gt;An empirical solution is to ensure that memory accesses to the source
and destination buffers which happen close together in time use addresses
that differ in as many bits as possible. So I&#39;m using the 0xAAAAAAAA,
0x55555555, 0xCCCCCCCC and 0x33333333 patterns for the lowest bits
of the buffer addresses. This seems to be quite effective: the memory copy
benchmark results are now well reproducible and show high numbers.&lt;/p&gt;
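
&lt;p&gt;The idea behind choosing these particular bit patterns can be illustrated with a few lines of Python: any two of them differ in at least half of their low bit positions. The snippet below only demonstrates this property; it is not code from the benchmark itself:&lt;/p&gt;

```python
# Pairwise hamming distances between the low 16 bits of the offset patterns.
patterns = [0xAAAAAAAA, 0x55555555, 0xCCCCCCCC, 0x33333333]

def hamming16(a, b):
    # number of differing bits among the 16 lowest address bits
    return bin((a ^ b) & 0xFFFF).count("1")

distances = [hamming16(a, b)
             for i, a in enumerate(patterns)
             for b in patterns[i + 1:]]
print(distances)  # [16, 8, 8, 8, 8, 16]
```

&lt;p&gt;So any two buffers placed at these offsets disagree in at least 8 of the 16 low address bits, which is what helps avoid the aliasing effects described above.&lt;/p&gt;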

&lt;p&gt;The initial release of this benchmark program can be downloaded here: &lt;a href=&quot;http://github.com/downloads/ssvb/ssvb-membench/ssvb-membench-0.1.tar.gz&quot;&gt;ssvb-membench-0.1.tar.gz&lt;/a&gt;&lt;br&gt;
And the git repository is at &lt;a href=&quot;http://github.com/ssvb/ssvb-membench&quot;&gt;http://github.com/ssvb/ssvb-membench&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Origenboard memory benchmark results and performance tuning&lt;/h3&gt;

&lt;p&gt;The table below shows how the memory performance is affected by different settings in&lt;br&gt;
&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/CHDHIECI.html&quot;&gt;L2C-310 Level 2 Cache Controller, Prefetch Control Register&lt;/a&gt;&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;Prefetch Control Register settings
&lt;th&gt;Memory copy performance
&lt;th&gt;Latency of random accesses in 64 MiB block
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-1.txt&quot;&gt;0x30000007 (linaro kernel default)&lt;/a&gt;
&lt;td&gt;761.86 MB/s&lt;td&gt;167.9 ns
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-2.txt&quot;&gt;0x30000007 + &quot;Double linefill enable&quot;&lt;/a&gt;
&lt;td&gt;1179.17 MB/s&lt;td&gt;183.9 ns
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-3.txt&quot;&gt;0x30000007 + &quot;Double linefill enable&quot; +&lt;br&gt;&quot;Double linefill on WRAP read disable&quot;&lt;/a&gt;
&lt;td&gt;1174.32 MB/s&lt;td&gt;174.0 ns
&lt;/table&gt;


&lt;p&gt;Setting &quot;Double linefill on WRAP read disable&quot; recovers some of the random access
latency with no regression in sequential copy performance. Assuming that there are
no hardware bugs related to this setup, enabling double linefill is a no-brainer.
I have submitted &lt;a href=&quot;http://lists.linaro.org/pipermail/linaro-dev/2011-September/007462.html&quot;&gt;a patch to the linaro-dev mailing list&lt;/a&gt;
(&lt;b&gt;update from 2011-09-19:&lt;/b&gt; according to the provided feedback, it appears that &lt;a href=&quot;http://lists.linaro.org/pipermail/linaro-dev/2011-September/007506.html&quot;&gt;double linefill is not used for a good reason&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Probably some more memory performance tweaks can be still applied and
a better configuration can be found by trying different permutations
of the bits in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388g/CIHCHFCG.html&quot;&gt;Cortex-A9, Auxiliary Control Register&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/Beifcidc.html&quot;&gt;L2C-310 Level 2 Cache Controller, Auxiliary Control Register&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/CHDHIECI.html&quot;&gt;L2C-310 Level 2 Cache Controller, Prefetch Control Register&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;And finally STREAM benchmark as a bonus&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/stream-origen.txt&quot;&gt;Origenboard, Samsung Exynos 4210, dual ARM Cortex-A9 @1.2GHz&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;$ gcc -O2 -fopenmp -mcpu=cortex-a9 -o stream stream.c
$ ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2284.9071       0.0281       0.0280       0.0282
Scale:       2339.6942       0.0274       0.0274       0.0275
Add:         2028.8679       0.0474       0.0473       0.0474
Triad:       1992.7801       0.0482       0.0482       0.0483
-------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/stream-atom.txt&quot;&gt;Intel Atom N450 @1.67GHz&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;$ gcc -O2 -fopenmp -march=atom -mtune=atom -o stream stream.c
$ ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2236.8130       0.0143       0.0143       0.0144
Scale:       2230.3084       0.0144       0.0143       0.0144
Add:         2656.0587       0.0181       0.0181       0.0182
Triad:       2679.3174       0.0180       0.0179       0.0180
-------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Overall, the &lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-3.txt&quot;&gt;memory performance of Origenboard&lt;/a&gt;
appears not to be much inferior to the &lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/atom-membench.txt&quot;&gt;memory performance of Intel Atom N450&lt;/a&gt;
(&lt;b&gt;update from 2011-09-19&lt;/b&gt;: when/if we get Exynos 4212 based boards in our hands).&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/08/23/yet-another-oprofile-tutorial</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/08/23/yet-another-oprofile-tutorial.html"/>
   <title>Yet another oprofile tutorial</title>
   <updated>2011-08-23T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Recently it came as a surprise to me that many people don&#39;t know how to use
&lt;a href=&quot;http://oprofile.sourceforge.net/&quot;&gt;oprofile&lt;/a&gt; efficiently when working on
performance optimizations. I&#39;m not going to duplicate
&lt;a href=&quot;http://oprofile.sourceforge.net/doc/index.html&quot;&gt;the oprofile manual&lt;/a&gt;
here in details, but at least will try to explain some basic usage.&lt;/p&gt;

&lt;h3&gt;A bit of theory&lt;/h3&gt;

&lt;p&gt;Oprofile does its magic by using statistical sampling. The processor
gets interrupted at regular intervals (the interrupts happen after a
certain amount of time has elapsed, or after some hardware performance counter has
accumulated a certain number of events) and the oprofile driver identifies which
code had control at that moment. The part of the code which was &#39;lucky&#39; enough to be
interrupted by oprofile gets an oprofile sample attributed to it. The
parts of the code which take a lot of execution time are naturally more
likely to accumulate many oprofile samples. In fact, the number of collected
oprofile samples for a function tends to be directly proportional
to the execution time taken by that function. All of this is somewhat
similar to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Monte_Carlo_method&quot;&gt;Monte Carlo method&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The collection of samples done by oprofile for each individual function is a
&lt;a href=&quot;http://en.wikipedia.org/wiki/Poisson_process&quot;&gt;Poisson process&lt;/a&gt;.
The standard deviation of a &lt;a href=&quot;http://en.wikipedia.org/wiki/Poisson_distribution&quot;&gt;Poisson distribution&lt;/a&gt;
is the square root of the number of samples. So the more samples are collected,
the lower the relative error. The following diagram shows
the confidence intervals for the &lt;a href=&quot;http://en.wikipedia.org/wiki/Normal_distribution&quot;&gt;normal distribution&lt;/a&gt;
(the Poisson distribution is approximately normal for a large number of samples):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg&quot;&gt;
&lt;img src=&quot;/images/2011-08-23-500px-Standard_deviation_diagram.svg.png&quot;
alt=&quot;Standard_deviation_diagram.svg from wikipedia, created by Petter Strandmark and licensed under CC BY 2.5&quot;
title=&quot;Standard_deviation_diagram.svg from wikipedia, created by Petter Strandmark and licensed under CC BY 2.5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the &lt;b&gt;3-sigma rule&lt;/b&gt;, we can be fairly confident that the actual time spent in each
function (measured in oprofile samples) is within a &lt;b&gt;±3*sqrt(N)&lt;/b&gt; interval,
where N is the number of samples reported by oprofile for that function.&lt;/p&gt;
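&lt;p&gt;As a small sketch, this is how the interval and the relative error can be computed for a function with N = 21548 samples (the number is only illustrative):&lt;/p&gt;

```python
import math

# 3-sigma confidence interval sketch: the Poisson stddev is sqrt(N),
# so the actual value lies within roughly N +/- 3*sqrt(N)
def three_sigma_interval(n_samples):
    delta = 3 * math.sqrt(n_samples)
    return (n_samples - delta, n_samples + delta)

low, high = three_sigma_interval(21548)
print(round(high - 21548))                   # half-width: ~440 samples
print(round(100 * 3 / math.sqrt(21548), 1))  # relative 3-sigma error: ~2.0%
```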

&lt;h3&gt;A simple profiling and code optimization workflow&lt;/h3&gt;

&lt;p&gt;Let&#39;s suppose that we have some small command line tool which does
something useful, and we want to optimize this tool to spend
less time doing the same work. First of all, it makes sense to identify
the parts of the program which are the performance bottlenecks and can
be optimized. This can easily be done using oprofile:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;c&quot;&gt;# opcontrol --deinit&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --separate=kernel&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --init&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --reset&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --start&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# ./test-program&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --stop&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opreport -l ./test-program&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Going through all of the above steps will configure and start oprofile, then execute
the program to be profiled (./test-program), and finally stop oprofile and show
the profiling report. This report contains exactly the information we want, and its
interpretation is explained a bit in the next section.
The opcontrol tool needs to be run as root or via sudo. It is also quite
important to use the &lt;b&gt;--separate=kernel&lt;/b&gt; option. This option is
&lt;a href=&quot;http://oprofile.sourceforge.net/doc/controlling.html&quot;&gt;described in detail here&lt;/a&gt;,
but basically it ensures that all the CPU activity happening in the kernel
and in the shared libraries is also attributed to the test program and shown
in the log.&lt;/p&gt;

&lt;p&gt;Once we have the oprofile report, it is only a matter of checking which parts
of the code are reported to take a lot of time, improving them, and finally running
oprofile again to verify the results. This process can be repeated multiple times.
That&#39;s quite simple, though there are two main cases when it may be difficult to
interpret oprofile logs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Oprofile reports that just one large function (possibly even &#39;main&#39;) is taking most of the time.&lt;/li&gt;
&lt;li&gt;Oprofile reports a multitude of tiny functions, each taking only a small fraction of the time.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;In the former case it is a good idea to split the large function into
a few smaller ones. If the large function is already calling some
other functions which are inlined, then naturally disabling
inlining will provide a more informative profiling report.
Another alternative is to use &lt;a href=&quot;http://oprofile.sourceforge.net/doc/opannotate.html&quot;&gt;source annotation&lt;/a&gt;.
But be sure to read about all the caveats in the &lt;a href=&quot;http://oprofile.sourceforge.net/doc/interpreting.html&quot;&gt;oprofile manual&lt;/a&gt;.
In the latter case, generating a callgraph may provide some insights. Some nice callgraph pictures can be generated by
&lt;a href=&quot;http://code.google.com/p/jrfonseca/wiki/Gprof2Dot&quot;&gt;Gprof2Dot&lt;/a&gt; from the data collected by oprofile.&lt;/p&gt;

&lt;h3&gt;A real practical example&lt;/h3&gt;

&lt;p&gt;I&#39;m going to use &lt;a href=&quot;http://git.kernel.org/?p=bluetooth/bluez.git;a=commit;h=e1ea3e76c72d56041c30b317818e8d7b5a0c7350&quot;&gt;one of my old performance patches&lt;/a&gt;
as an example. The oprofile report for the &#39;sbcenc&#39; program looked like this before the optimization:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;samples  %        image name               symbol name
26083    25.0856  sbcenc                   sbc_pack_frame
21548    20.7240  sbcenc                   sbc_calc_scalefactors_j
19910    19.1486  sbcenc                   sbc_analyze_4b_8s_neon
14377    13.8272  sbcenc                   sbc_calculate_bits
9990      9.6080  sbcenc                   sbc_enc_process_input_8s_be
8667      8.3356  no-vmlinux               /no-vmlinux
2263      2.1765  sbcenc                   sbc_encode
696       0.6694  libc-2.10.1.so           memcpy&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Because of the use of the --separate=kernel option, we can see ~8% of CPU time
attributed to the no-vmlinux image, which is the time spent in the kernel,
mostly doing input/output activity (reading the input file from disk).
Also, less than 1% is spent in the memcpy function, which belongs
to the libc-2.10.1.so shared library. Without the --separate=kernel option, this
information would not be present in the log.&lt;/p&gt;

&lt;p&gt;Now our focus is on the &lt;b&gt;sbc_calc_scalefactors_j&lt;/b&gt; function, which got 21548
oprofile samples collected, representing ~20.7% of the time spent
in the &#39;sbcenc&#39; process. Note again that this percentage would not
be a realistic estimate without also having the kernel and libc information
in the picture. If the CPU consumption were dominated by
library functions or by the kernel, the statistics could be severely skewed.&lt;/p&gt;

&lt;p&gt;After performing the optimizations, we get a new profiling report:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;samples  %        image name               symbol name
26234    29.9625  sbcenc                   sbc_pack_frame
20057    22.9076  sbcenc                   sbc_analyze_4b_8s_neon
14306    16.3393  sbcenc                   sbc_calculate_bits
9866     11.2682  sbcenc                   sbc_enc_process_input_8s_be
8506      9.7149  no-vmlinux               /no-vmlinux
5219      5.9608  sbcenc                   sbc_calc_scalefactors_j_neon
2280      2.6040  sbcenc                   sbc_encode
661       0.7549  libc-2.10.1.so           memcpy&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;It shows that the &lt;b&gt;sbc_calc_scalefactors_j_neon&lt;/b&gt; function now takes 5219
samples instead of the 21548 samples for &lt;b&gt;sbc_calc_scalefactors_j&lt;/b&gt; earlier.
That is approximately a 4.1x speedup for this particular function. Samples are
more important than percentages in the log, because the absolute number of samples
represents the actual time spent in the function, while the percentages are relative
to the whole process (as the whole program takes less time to execute after
the optimization, the percentages naturally drift).&lt;/p&gt;
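&lt;p&gt;The speedup estimate, and the fact that it is well outside the measurement noise, can be double-checked with trivial arithmetic on the sample counts from the two logs above:&lt;/p&gt;

```python
import math

before = 21548  # samples for sbc_calc_scalefactors_j (old log)
after = 5219    # samples for sbc_calc_scalefactors_j_neon (new log)

print(round(before / after, 1))  # ~4.1x speedup

# the combined 3-sigma measurement noise is far smaller than the
# observed difference, so the speedup is statistically significant
noise = 3 * (math.sqrt(before) + math.sqrt(after))
print(round(noise))    # ~657 samples of noise
print(before - after)  # 16329 samples of actual difference
```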

&lt;p&gt;For another example, we can look at the &lt;b&gt;sbc_pack_frame&lt;/b&gt; function
statistics in both logs. The number of samples remained about the
same: 26083 vs. 26234 (see the 3-sigma rule from the &#39;A bit of theory&#39; section).
But the percentage of time relative to the whole program
increased from ~25% to ~30%, even though this function
itself has not changed. That&#39;s a nice side effect
of optimization: after eliminating the obvious
bottlenecks, the other functions become more
attractive optimization targets too :)&lt;/p&gt;

&lt;p&gt;The precision of the measurements can always be increased by running
the test program more than once between the &#39;opcontrol --start&#39;
and &#39;opcontrol --stop&#39; invocations, because more samples will
be accumulated and the relative error will become smaller.&lt;/p&gt;
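&lt;p&gt;A quick sketch of why this works: the relative error of a Poisson sample count scales as 1/sqrt(N), so collecting k times more samples shrinks it by a factor of sqrt(k):&lt;/p&gt;

```python
import math

# 1-sigma relative error of a Poisson sample count is sqrt(N)/N = 1/sqrt(N)
def relative_error(n_samples):
    return 1 / math.sqrt(n_samples)

# quadrupling the number of samples (e.g. running the program 4 times
# between --start and --stop) halves the relative error
ratio = relative_error(4 * 10000) / relative_error(10000)
print(round(ratio, 2))  # 0.5
```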

&lt;p&gt;Still, other methods of benchmarking the code may be
more suitable for very tiny performance tweaks, such as
saving just a few CPU cycles. Some tricks for
benchmarking small sequences of instructions are
described in my older &lt;a href=&quot;http://ssvb.github.com/2011/08/03/discovering-instructions-scheduling-secrets.html&quot;&gt;Discovering instructions scheduling secrets&lt;/a&gt;
blog post.&lt;/p&gt;

&lt;h3&gt;ARM Cortex-A8 performance monitoring unit disaster&lt;/h3&gt;

&lt;p&gt;If you tried to follow the instructions described above but got
bizarre results, then the chances are quite high that you are using
some hardware with an ARM Cortex-A8 processor. The problem is that
the ARM Cortex-A8 has a broken performance monitoring unit (this is
described as erratum #628216 in the ARM Cortex-A8 errata list).
Earlier revisions were badly broken. Later revisions are a
bit better, but still not suitable for use with oprofile.&lt;/p&gt;

&lt;p&gt;For collecting samples, oprofile relies on the interrupts generated
by the performance monitoring unit. The interrupts are supposed
to happen on overflows of the 32-bit hardware performance counters.
But with the older ARM Cortex-A8 revisions (for example, the one used in the &lt;a href=&quot;http://beagleboard.org/hardware&quot;&gt;beagleboard&lt;/a&gt;),
the PMU state may occasionally get messed up on a counter overflow.
With the newer ARM Cortex-A8 revisions (for example, the one used in the &lt;a href=&quot;http://beagleboard.org/hardware-xM&quot;&gt;beagleboard-xm&lt;/a&gt;),
the counter may just overflow without triggering an interrupt. The outcome is disastrous in both cases.
A skipped interrupt may be difficult to notice, because it takes
slightly more than 4 seconds to count from 0 to 0xFFFFFFFF on
a 1GHz processor. So the performance monitoring unit recovers
itself, but each skipped interrupt results in approximately
4 seconds dropped from the profiling session.
Longer profiling runs have a higher chance of eventually triggering this
hardware bug. And considering how important it is
to collect a really large number of samples for good precision,
the Cortex-A8 performance monitoring unit cycle counter is
a really bad option.&lt;/p&gt;
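&lt;p&gt;The &quot;slightly more than 4 seconds&quot; figure is easy to verify: a 32-bit cycle counter incremented at 1 GHz wraps around after 2^32 cycles:&lt;/p&gt;

```python
# time for a 32-bit cycle counter to overflow on a 1 GHz processor
cycles_to_overflow = 2 ** 32
clock_hz = 10 ** 9
seconds = cycles_to_overflow / clock_hz
print(round(seconds, 2))  # 4.29 seconds lost per skipped overflow interrupt
```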

&lt;p&gt;The solution for all these troubles is simple: &lt;a href=&quot;http://en.wikipedia.org/wiki/May_the_Force_be_with_you&quot;&gt;use the timer interrupt, Luke&lt;/a&gt; :)
The hardware performance counters are actually more of a red
herring. The timer interrupt works perfectly fine for
simple profiling tasks, so there is no point in trying to
use the performance monitoring unit no matter what.
Admittedly, I have wasted quite a lot of time myself
&lt;a href=&quot;http://www.mail-archive.com/linux-omap@vger.kernel.org/msg14092.html&quot;&gt;trying to work around this pesky issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In order to override the Cortex-A8 performance monitoring unit with a
simple timer driver, add &quot;oprofile.timer=1&quot; to the kernel command
line, or use the &quot;timer=1&quot; module parameter if oprofile
is built as a module.&lt;/p&gt;

&lt;p&gt;Also, when using the simple timer driver, it makes sense to tweak it a
bit if we don&#39;t want to collect samples at the pitiful
default rate of around 128 Hz. The following hack can be applied to the
Linux kernel to solve this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-diff&quot; data-lang=&quot;diff&quot;&gt;&lt;span class=&quot;gh&quot;&gt;diff --git a/drivers/oprofile/timer_int.c b/drivers/oprofile/timer_int.c&lt;/span&gt;
&lt;span class=&quot;gh&quot;&gt;index 3ef4462..56fb6c3 100644&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;--- a/drivers/oprofile/timer_int.c&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+++ b/drivers/oprofile/timer_int.c&lt;/span&gt;
&lt;span class=&quot;gu&quot;&gt;@@ -20,13 +20,15 @@&lt;/span&gt;
 
 #include &amp;quot;oprof.h&amp;quot;
 
&lt;span class=&quot;gi&quot;&gt;+#define OPROFILE_TIMER_TICK_NSEC 244141 /* ~4096 Hz */&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+&lt;/span&gt;
 static DEFINE_PER_CPU(struct hrtimer, oprofile_hrtimer);
 static int ctr_running;
 
 static enum hrtimer_restart oprofile_hrtimer_notify(struct hrtimer *hrtimer)
 {
    oprofile_add_sample(get_irq_regs(), 0);
&lt;span class=&quot;gd&quot;&gt;-  hrtimer_forward_now(hrtimer, ns_to_ktime(TICK_NSEC));&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+  hrtimer_forward_now(hrtimer, ns_to_ktime(OPROFILE_TIMER_TICK_NSEC));&lt;/span&gt;
    return HRTIMER_RESTART;
 }
 
&lt;span class=&quot;gu&quot;&gt;@@ -40,7 +42,7 @@ static void __oprofile_hrtimer_start(void *unused)&lt;/span&gt;
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer-&amp;gt;function = oprofile_hrtimer_notify;
 
&lt;span class=&quot;gd&quot;&gt;-  hrtimer_start(hrtimer, ns_to_ktime(TICK_NSEC),&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+  hrtimer_start(hrtimer, ns_to_ktime(OPROFILE_TIMER_TICK_NSEC),&lt;/span&gt;
              HRTIMER_MODE_REL_PINNED);
 }&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
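&lt;p&gt;The magic constant in this patch comes from simple arithmetic: the hrtimer period in nanoseconds for the desired ~4096 Hz sampling rate is just 10^9 divided by the target frequency:&lt;/p&gt;

```python
# hrtimer period in nanoseconds for a ~4096 Hz oprofile sampling rate
target_hz = 4096
tick_nsec = round(10 ** 9 / target_hz)
print(tick_nsec)                   # 244141, the OPROFILE_TIMER_TICK_NSEC value
print(round(10 ** 9 / tick_nsec))  # ~4096 Hz back-check
```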


&lt;h3&gt;Additional verification for the Poisson based stddev estimate (added on 2011-08-28)&lt;/h3&gt;

&lt;p&gt;Let&#39;s take the following profiling session as an example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;Profiling through timer interrupt
samples  %        image name               symbol name
&lt;span class=&quot;m&quot;&gt;52105&lt;/span&gt;    40.0715  djpeg                    jpeg_idct_islow
&lt;span class=&quot;m&quot;&gt;41281&lt;/span&gt;    31.7473  djpeg                    ycc_rgb_convert
&lt;span class=&quot;m&quot;&gt;15126&lt;/span&gt;    11.6327  djpeg                    decode_mcu
&lt;span class=&quot;m&quot;&gt;15001&lt;/span&gt;    11.5366  djpeg                    h2v1_fancy_upsample
&lt;span class=&quot;m&quot;&gt;2029&lt;/span&gt;      1.5604  djpeg                    decompress_onepass
&lt;span class=&quot;m&quot;&gt;1470&lt;/span&gt;      1.1305  libc-2.12.2.so           memset
&lt;span class=&quot;m&quot;&gt;1118&lt;/span&gt;      0.8598  no-vmlinux               /no-vmlinux
&lt;span class=&quot;m&quot;&gt;967&lt;/span&gt;       0.7437  libc-2.12.2.so           _wordcopy_fwd_dest_aligned
&lt;span class=&quot;m&quot;&gt;333&lt;/span&gt;       0.2561  djpeg                    jpeg_fill_bit_buffer
&lt;span class=&quot;m&quot;&gt;69&lt;/span&gt;        0.0531  libc-2.12.2.so           fwrite
&lt;span class=&quot;m&quot;&gt;69&lt;/span&gt;        0.0531  libc-2.12.2.so           write&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The Poisson distribution gives us a theoretical estimate for the standard deviation as the square
root of the number of samples. But just to be sure, we can verify it by running
the same profiling session 10 times and calculating the &lt;a href=&quot;http://en.wikipedia.org/wiki/Standard_deviation#With_sample_standard_deviation&quot;&gt;sample standard deviation&lt;/a&gt;
of the number of samples attributed to each function.&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;function&lt;th colspan=&quot;10&quot;&gt;time spent in the function, measured in oprofile samples&lt;th&gt;mean&lt;th&gt;sample&lt;br&gt;stddev&lt;th&gt;sqrt(mean)
&lt;tr&gt;&lt;td&gt;jpeg_idct_islow&lt;td&gt;52105&lt;td&gt;52171&lt;td&gt;51968&lt;td&gt;52243&lt;td&gt;52389&lt;td&gt;52126&lt;td&gt;52347&lt;td&gt;52217&lt;td&gt;52078&lt;td&gt;52543&lt;td&gt;52218.7&lt;td&gt;169.2&lt;td&gt;228.5
&lt;tr&gt;&lt;td&gt;decode_mcu&lt;td&gt;15126&lt;td&gt;15119&lt;td&gt;15315&lt;td&gt;15060&lt;td&gt;15108&lt;td&gt;15397&lt;td&gt;15227&lt;td&gt;15017&lt;td&gt;15175&lt;td&gt;15138&lt;td&gt;15168.2&lt;td&gt;115.8&lt;td&gt;123.2
&lt;tr&gt;&lt;td&gt;decompress_onepass&lt;td&gt;2029&lt;td&gt;2042&lt;td&gt;2070&lt;td&gt;2012&lt;td&gt;2057&lt;td&gt;2127&lt;td&gt;2022&lt;td&gt;2074&lt;td&gt;2048&lt;td&gt;1992&lt;td&gt;2047.3&lt;td&gt;37.98&lt;td&gt;45.25
&lt;tr&gt;&lt;td&gt;fill_bit_buffer&lt;td&gt;333&lt;td&gt;311&lt;td&gt;333&lt;td&gt;311&lt;td&gt;334&lt;td&gt;297&lt;td&gt;309&lt;td&gt;309&lt;td&gt;336&lt;td&gt;304&lt;td&gt;317.7&lt;td&gt;14.63&lt;td&gt;17.82
&lt;/table&gt;
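&lt;p&gt;For instance, the first row of the table can be reproduced with a few lines of Python (the numbers are the sample counts listed above):&lt;/p&gt;

```python
import math
import statistics

# 10 repeated measurements of jpeg_idct_islow, in oprofile samples
runs = [52105, 52171, 51968, 52243, 52389,
        52126, 52347, 52217, 52078, 52543]

mean = statistics.mean(runs)
stddev = statistics.stdev(runs)   # sample standard deviation
print(round(mean, 1))             # 52218.7
print(round(stddev, 1))           # ~169.2, measured noise
print(round(math.sqrt(mean), 1))  # ~228.5, the Poisson sqrt(N) estimate
```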


&lt;p&gt;By comparing the last two columns in the table, we can see that the values there
are reasonably close to each other. So, assuming a stable test environment with no
background activity from other processes, etc., we can run just one profiling
session and already have a good estimate of the measurement precision for
each function. Still, it is a good idea to repeat the profiling at least one more time
and check whether the results are consistent between runs, in order to rule out any
possible interference from external factors or problems in the whole
setup (see the &#39;ARM Cortex-A8 performance monitoring unit disaster&#39; section).
If the results are not consistent across runs, it makes sense to identify and
eliminate the source of this noise.&lt;/p&gt;

&lt;p&gt;Also, the applicability of the Poisson based standard deviation estimate is limited
to the functions which take a reasonably small percentage of time (as the wikipedia
article says: &lt;i&gt;&quot;The Poisson distribution can be applied to systems with a large
number of possible events, each of which is rare. A classic example is the nuclear decay
of atoms&quot;&lt;/i&gt;). Taking a corner case as an example: if the oprofile log shows
that all the samples belong to a single function (&#39;main&#39;), then the precision
of this measurement would be very high and would only depend on the timer resolution.
The number of samples would be equal to the time taken by the process multiplied
by the oprofile sample collection frequency. But on the positive side, sqrt(N)
still provides a reliable pessimistic estimate, with the real standard deviation
being lower than that.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/08/22/simd-idct-libjpeg-turbo-bitexactness</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/08/22/simd-idct-libjpeg-turbo-bitexactness.html"/>
   <title>SIMD DCT/IDCT in libjpeg-turbo and bit-exactness</title>
   <updated>2011-08-22T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://libjpeg-turbo.virtualgl.org/&quot;&gt;libjpeg-turbo&lt;/a&gt; is currently the fastest
open source jpeg encoder/decoder, to the best of my knowledge. Achieving good
performance in libjpeg-turbo would be impossible without using the SIMD instructions
available in modern processors. The optimizations for MMX/SSE2 capable x86
processors have existed in libjpeg-turbo for a while, and now &lt;a href=&quot;http://sourceforge.net/mailarchive/message.php?msg_id=27971725&quot;&gt;support for ARM NEON is also coming in the next
libjpeg-turbo 1.2 release&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of the important parts of libjpeg-turbo which benefits from SIMD
optimizations is &lt;a href=&quot;http://en.wikipedia.org/wiki/JPEG#Discrete_cosine_transform&quot;&gt;DCT/IDCT&lt;/a&gt;.
For obvious practical reasons (easier testing and maintenance and full
compatibility with the older versions), it makes a lot of sense to ensure that the SIMD
optimized code produces exactly the same results as the C code.
That is, unless there are some really good reasons not to do so (for example,
if the algorithm is a bad match for the instruction set of some particular processor).&lt;/p&gt;

&lt;p&gt;And there are naturally some potential pitfalls on the road to bit-exactness. In order
to use SIMD efficiently, it is important to use the smallest possible data type
in calculations. The C code is happy to use 32-bit variables and
&quot;32-bit * 32-bit -&gt; 32-bit&quot; multiplications. But for the SIMD code,
using 16-bit data means that we can pack more information into a single
register and process more of it in parallel, saving CPU cycles. When using 16-bit
calculations, we need to be sure that there are no unwanted overflows. And doing
things somewhat differently from C always carries the risk of getting somewhat
different results in the end.&lt;/p&gt;

&lt;p&gt;DCT takes 8x8 blocks of samples with values in the [-128, 127] range and produces
8x8 blocks of DCT coefficients in the [-1024, 1023] range. IDCT can convert
the DCT coefficients back to the original 8-bit samples. Mathematically,
the original samples can be perfectly reconstructed. But in practice,
there may be rounding errors and some extra loss of precision due to
quantization. And there is one very important thing to note. Any arbitrary 8x8 block of [-128, 127] samples
passed through DCT produces an 8x8 block of coefficients in the [-1024, 1023] range. But an
arbitrary 8x8 block of [-1024, 1023] coefficients does not necessarily produce an
8x8 block of [-128, 127] samples when passed through IDCT. Some of the samples
may be well outside the [-128, 127] range. Searching on the Internet reveals
&lt;a href=&quot;http://www.sciencedirect.com/science/article/pii/S0923596596000422&quot;&gt;some information&lt;/a&gt; which says
that the range of the IDCT output may be as large as [-1805, 1805]. Obviously, there is no
way for such arbitrarily selected DCT coefficients to have been generated by
the forward DCT from normal [-128, 127] input in the first place. However, it is
possible to hand craft JPEG bitstreams and embed arbitrary DCT coefficients
there, so the decoder has to handle them somehow.&lt;/p&gt;

&lt;p&gt;When developing SIMD optimized IDCT implementation, apparently there are two separate cases to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decoding the files generated by a normal jpeg encoder (DCT coefficients are generated by a normal forward DCT from [-128, 127] samples)&lt;/li&gt;
&lt;li&gt;decoding some bogus out-of-range data (DCT coefficients are generated in some arbitrary way)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;For the former, the decoding result should be well defined and bit-exact when
compared to the C implementation. The latter is a bit of a gray area. On one hand, still
producing the same results as C would be nice. On the other hand, if producing
the same results as C regresses performance, then it is clearly not so desirable.
We may also need to look carefully at the spec, just to see how the out-of-range
DCT coefficient data fits into it and whether it is allowed. What if some cleverly
optimized jpeg encoder tries to use such coefficients for some purpose?&lt;/p&gt;

&lt;p&gt;But now it&#39;s time for some experiments. Generating hand crafted DCT coefficients
is actually quite easy by modifying the libjpeg code and using the cjpeg tool. It is a simple matter of hacking the
&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jcdctmgr.c?revision=658&amp;amp;view=markup&quot;&gt;convsamp&lt;/a&gt; function
and injecting the sample data there.&lt;/p&gt;

&lt;h3&gt;Quirks in the C code&lt;/h3&gt;

&lt;p&gt;The first victim of these experiments is actually not SIMD, but C implementation. The comment from
&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jdmaster.c?revision=658&amp;amp;view=markup&quot;&gt;jdmaster.c&lt;/a&gt;
explains:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;MASK is 2 bits wider than legal sample data, ie 10 bits for 8-bit
samples.  Under normal circumstances this is more than enough range and
a correct output will be generated; with bogus input data the mask will
cause wraparound, and we will safely generate a bogus-but-in-range output.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;So what happens if we deliberately generate a jpeg file which decodes to such
badly out-of-range samples? One variant of the 8x8 DCT coefficients for this purpose is the following:&lt;/p&gt;

&lt;table class=&quot;matrix&quot; style=&quot;table-layout:fixed;&quot;&gt;
&lt;tr&gt;&lt;td style=&quot;width: 30px;&quot;&gt;-1024&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;/table&gt;


&lt;p&gt;And the results of decoding this hand crafted sample are below. You may want to pay
special attention to the leftmost image, because it links to the bogus jpeg file itself
and gets decoded by the jpeg library used by your browser.&lt;/p&gt;

&lt;table class=&quot;standard&quot;&gt;
&lt;td&gt;original file, decoded&lt;br&gt; by your browser
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;view=markup&quot;&gt;jpeg_idct_islow&lt;/a&gt;
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/simd/jiss2int-64.asm?revision=658&amp;view=markup&quot;&gt;jsimd_idct_islow_sse2&lt;/a&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-range-mask.jpg&quot; alt=&quot;2011-08-22-range-mask.jpg&quot; /&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-range-mask-c.png&quot; alt=&quot;2011-08-22-range-mask-c.png&quot; /&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-range-mask-sse2.png&quot; alt=&quot;2011-08-22-range-mask-sse2.png&quot; /&gt;
&lt;/table&gt;


&lt;p&gt;The rightmost image (decoded by the SSE2 implementation from libjpeg-turbo 1.1.1) does not have
any of these range limitation quirks and always performs correct clamping to bring the color into the [0, 255] range.
So the color of some 8x8 tiles gets saturated to white. The C implementation wraps
around and shows the same tiles as black.&lt;/p&gt;
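&lt;p&gt;The difference can be illustrated with a toy model. Note that this deliberately simplifies the real libjpeg range-limit machinery (which uses a 10-bit mask and a lookup table) down to plain saturation vs. 8-bit wraparound:&lt;/p&gt;

```python
def saturate(v):
    # clamp a decoded sample into [0, 255], like the SSE2 code path
    return min(max(v, 0), 255)

def wraparound(v):
    # keep only the low 8 bits, a simplified model of mask-based wraparound
    return v % 256

sample = 300  # a hypothetical out-of-range decoded sample value
print(saturate(sample))    # 255: the tile saturates towards white
print(wraparound(sample))  # 44: the tile comes out dark instead
```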

&lt;h3&gt;Quirks in the SIMD optimized code&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, SIMD relies a lot on 16-bit arithmetic. And looking at the
&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;amp;view=markup&quot;&gt;ISLOW IDCT&lt;/a&gt; C code,
there is an obvious case of potential overflow:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span class=&quot;cm&quot;&gt;/* Odd part per figure 8; the matrix is unitary and hence its&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;     * transpose is its inverse.  i0..i3 are y7,y5,y3,y1 respectively.&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;     */&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;tmp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;z1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MULTIPLY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FIX_1_175875602&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* sqrt(2) * c3 */&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The 16-bit values from wsptr[1], wsptr[3], wsptr[5] and wsptr[7] are all added
together and passed as an argument to the MULTIPLY macro, which is allowed to
treat its arguments as 16-bit values (so this sum must fit in 16 bits).
This can easily overflow on the second pass if the DCT coefficients
fed to the IDCT function contain arbitrary [-1024, 1023] input. The comment
stating that&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;The outputs of the first pass are scaled up by PASS1_BITS bits so that
they are represented to better-than-integral precision. These outputs
require BITS_IN_JSAMPLE + PASS1_BITS + 3 bits; this fits in a 16-bit word
with the recommended scaling.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;clearly applies only to the case of handling &quot;normal&quot; DCT coefficients
data. Because &quot;BITS_IN_JSAMPLE + PASS1_BITS + 3&quot; is equal to 13, there is
enough headroom to add 4 such values together without
overflowing 16 bits. But again, this is not true for arbitrary hand-crafted
[-1024, 1023] coefficients. In any case, the C implementation
uses 32-bit variables, so this overflow cannot be reproduced with it :)&lt;/p&gt;

&lt;p&gt;The equivalent SSE2 code is a little bit different:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-nasm&quot; data-lang=&quot;nasm&quot;&gt;&lt;span class=&quot;c1&quot;&gt;; -- Odd part&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm6&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm4&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;paddw&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;; xmm5=z3&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;paddw&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;; xmm7=z4&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;; (Original)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z5 = (z3 + z4) * 1.175875602;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z3 += z5;  z4 += z5;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; (This implementation)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;punpcklwd&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;punpckhwd&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_MF078_F117&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;; xmm2=z3L&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_MF078_F117&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;; xmm0=z3H&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_F117_F078&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;; xmm5=z4L&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_F117_F078&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;; xmm7=z4H&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Here only the values of z3 (wsptr[3] + wsptr[7])
and z4 (wsptr[1] + wsptr[5]) are calculated using 16-bit additions and then
used as 16-bit operands for multiplication. The following DCT coefficients
have been hand-crafted to trigger a &quot;wsptr[3] + wsptr[7]&quot; overflow:&lt;/p&gt;

&lt;table class=&quot;matrix&quot;&gt;
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;-299&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;300
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-1004&lt;td&gt;0&lt;td&gt;-416&lt;td&gt;0&lt;td&gt;-1004&lt;td&gt;0&lt;td&gt;416
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-946&lt;td&gt;0&lt;td&gt;-391&lt;td&gt;0&lt;td&gt;-946&lt;td&gt;0&lt;td&gt;392
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-851&lt;td&gt;0&lt;td&gt;-352&lt;td&gt;0&lt;td&gt;-851&lt;td&gt;0&lt;td&gt;352
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;-299&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;300
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-569&lt;td&gt;0&lt;td&gt;-235&lt;td&gt;0&lt;td&gt;-569&lt;td&gt;0&lt;td&gt;235
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-391&lt;td&gt;0&lt;td&gt;-162&lt;td&gt;0&lt;td&gt;-391&lt;td&gt;0&lt;td&gt;162
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-199&lt;td&gt;0&lt;td&gt;-82&lt;td&gt;0&lt;td&gt;-199&lt;td&gt;0&lt;td&gt;82
&lt;/table&gt;


&lt;p&gt;And the decoding results of the generated sample are below:&lt;/p&gt;

&lt;table class=&quot;standard&quot; style=&quot;align: center;&quot;&gt;
&lt;td&gt;original file, decoded&lt;br&gt; by your browser
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;view=markup&quot;&gt;jpeg_idct_islow&lt;/a&gt;&lt;br&gt;
(correctly clamped)
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;view=markup&quot;&gt;jpeg_idct_islow&lt;/a&gt;
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/simd/jiss2int-64.asm?revision=658&amp;view=markup&quot;&gt;jsimd_idct_islow_sse2&lt;/a&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow.jpg&quot; alt=&quot;2011-08-22-z3-overflow.jpg&quot;&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow-c-clamped.png&quot; alt=&quot;2011-08-22-z3-overflow-c-clamped.png&quot;&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow-c.png&quot; alt=&quot;2011-08-22-z3-overflow-c.png&quot;&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow-sse2.png&quot; alt=&quot;2011-08-22-z3-overflow-sse2.png&quot;&gt;
&lt;/table&gt;


&lt;p&gt;Funnily enough, the three images on the right are all different (&quot;correctly clamped&quot; is
the case when the C code is tweaked to solve the range problem described in the previous
section). Comparing the leftmost image with each of them can give some
idea about what kind of IDCT implementation might be used on your computer.&lt;/p&gt;

&lt;p&gt;I think it&#39;s necessary to add a disclaimer just in case: all of this only applies to decoding bogus out-of-range data,
so the differences in decoding results can&#39;t be immediately considered a bug.&lt;/p&gt;

&lt;h3&gt;ARM NEON&lt;/h3&gt;

&lt;p&gt;This whole blog post is actually the result of a mini-investigation, intended to clear up the doubts
I had shortly after submitting an
&lt;a href=&quot;http://sourceforge.net/tracker/?func=detail&amp;amp;aid=3394306&amp;amp;group_id=303195&amp;amp;atid=1278160&quot;&gt;ARM NEON optimized ISLOW iDCT patch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just like the SSE2 IDCT, the ARM NEON code also has some overflows for out-of-range data, but
should be perfectly fine for normal JPEG files. It can still be easily tweaked
to ensure no overflows even when handling arbitrary [-1024, 1023] DCT coefficients,
but this may cost a few extra CPU cycles.&lt;/p&gt;

&lt;p&gt;And one more final disclaimer: I&#39;m not a hardcore multimedia expert, so I may easily be wrong. Comments and corrections are surely welcome.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/08/03/discovering-instructions-scheduling-secrets</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/08/03/discovering-instructions-scheduling-secrets.html"/>
   <title>Discovering instructions scheduling secrets</title>
   <updated>2011-08-03T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Knowing the instructions scheduling rules is quite important when implementing
assembly optimizations. That&#39;s especially true for the simple embedded processors
such as ARM or MIPS, which don&#39;t typically implement &lt;a href=&quot;http://en.wikipedia.org/wiki/Out-of-order_execution&quot;&gt;out-of-order execution&lt;/a&gt;
or where the out-of-order instructions execution is just rudimentary at best. Instruction cycle timings are quite well documented
for some processors such as &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0211k/Cjaedced.html&quot;&gt;ARM11&lt;/a&gt;
or &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Cfacfihf.html&quot;&gt;ARM Cortex-A8&lt;/a&gt;,
even sometimes providing a comprehensive &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Babeghic.html&quot;&gt;scheduling example&lt;/a&gt;.
But some processors such as &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388f/Cjaedcef.html&quot;&gt;ARM Cortex-A9&lt;/a&gt;
are apparently either too complex or maybe just too new to be described in more detail, and the
instruction cycle timings information is rather poor (more about Cortex-A9 maybe in another blog post).
And some ARM compatible processors even don&#39;t seem to have any public documentation at all.&lt;/p&gt;

&lt;p&gt;Even with good documentation, there can always be some ambiguity or omission of fine details.
For example, ARM Cortex-A8 supports limited dual-issue for NEON instructions. But
can it really sustain an execution rate of 2 instructions per cycle over a long sequence of instructions?
Another example is accumulator forwarding for multiply-accumulate instructions. Using
back-to-back multiply-accumulate instructions is fine, but will the forwarding still work
if an unrelated instruction is inserted between them?&lt;/p&gt;

&lt;p&gt;The solution is really simple. In addition to just reading and (mis)interpreting the manuals,
it makes a lot of sense to verify every important detail by running some tests
and benchmarks, especially considering that it is actually not very difficult at all.
The easy way to do this is to create a *.S file with the sequence
of instructions to be investigated, placing them in a simple loop. Then compile
and run this test program, measuring how much time it takes. Very simple.
And in order to make it easier to convert time into CPU cycles, it makes sense
to choose the iteration count so that the tested instruction sequence executes
a number of times equal to the CPU clock frequency. In this case, the execution
time of the test program in seconds is equal to the number of cycles spent in
one repetition of that sequence.&lt;/p&gt;

&lt;p&gt;Below is a trivial test program (tried on different CPU architectures, not just ARM)
which benchmarks a long sequence of back-to-back ADD instructions.
Addition is a simple and fast operation, which typically takes just 1 cycle to produce
its result. And because each instruction depends on the result of the previous one,
they can&#39;t dual-issue. So on most processors (with some exceptions) this code
will run at exactly 1 cycle per ADD instruction.&lt;/p&gt;

&lt;h3&gt;ARM&lt;/h3&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-gas&quot; data-lang=&quot;gas&quot;&gt;&lt;span class=&quot;na&quot;&gt;.text&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.arch&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;armv7-a&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifndef CPU_CLOCK_FREQUENCY&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#error CPU_CLOCK_FREQUENCY must be defined&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#define LOOP_UNROLL_FACTOR   100&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;main:&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;push&lt;/span&gt;        &lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;r4-r12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;ldr&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;CPU_CLOCK_FREQUENCY&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;b&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;f&lt;/span&gt;

    &lt;span class=&quot;na&quot;&gt;.balign&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;1:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.rept&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.endr&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;subs&lt;/span&gt;        &lt;span class=&quot;no&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;bne&lt;/span&gt;         &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;b&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pop&lt;/span&gt;         &lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;r4-r12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;pc&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;And the results of this benchmark from ARM Cortex-A8 @1GHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1000000000&lt;/span&gt; bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m5.017s
user    0m5.016s
sys     0m0.000s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;A few more explanations about this test program and the interpretation of its results. The &#39;.rept LOOP_UNROLL_FACTOR / ... / .endr&#39; block repeats the code contained
inside it LOOP_UNROLL_FACTOR times (more information about GNU assembler directives can be found by reading &#39;info as&#39;).
This reduces the loop overhead to the point where it becomes insignificant and can be ignored. Unrolling even more would help, though we need to be careful
not to exceed the instruction cache size. The end result is that the block of 5 ADD
instructions is executed CPU_CLOCK_FREQUENCY times when running this test program.
If the test program takes 5 seconds to execute, then the sequence of instructions
inside the .rept block needs 5 cycles. A non-integer number of seconds would
mean that something likely went wrong.&lt;/p&gt;

&lt;p&gt;Multiple variations are also possible. Earlier I posted some &lt;a href=&quot;http://lists.freedesktop.org/archives/pixman/attachments/20110410/d6062de3/attachment.obj&quot;&gt;code template for experimenting with NEON instructions scheduling&lt;/a&gt;,
tailored for tuning ARM NEON optimizations specifically for the &lt;a href=&quot;http://cgit.freedesktop.org/pixman&quot;&gt;pixman library&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;MIPS&lt;/h3&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-gas&quot; data-lang=&quot;gas&quot;&gt;&lt;span class=&quot;na&quot;&gt;.text&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.set&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;noreorder&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifndef CPU_CLOCK_FREQUENCY&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#error CPU_CLOCK_FREQUENCY must be defined&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#define LOOP_UNROLL_FACTOR  100&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.type&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;@function&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;main:&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;li&lt;/span&gt;      &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;CPU_CLOCK_FREQUENCY&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;1:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.rept&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.endr&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;bnez&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;b&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addiu&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;j&lt;/span&gt;       &lt;span class=&quot;no&quot;&gt;$ra&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;li&lt;/span&gt;      &lt;span class=&quot;no&quot;&gt;$v0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;MIPS74K @480MHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;480000000&lt;/span&gt; bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m10.064s
user    0m10.060s
sys     0m0.003s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;MIPS24Kc @680MHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;680000000&lt;/span&gt; bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m5.040s
user    0m5.030s
sys     0m0.000s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This is a MIPS variant of the same benchmarking code. The results show that the MIPS74K has a higher addition latency than the MIPS24Kc and needs 2 cycles per addition.&lt;/p&gt;
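&lt;p&gt;The latency figure can be recovered from the timings with simple arithmetic. This is a minimal sketch, assuming the MIPS loop is sized to run for exactly 5 seconds at one cycle per addition (an inference from the measured timings, not something taken from the benchmark source):&lt;/p&gt;

```python
# Estimate ADD latency from wall-clock time. ASSUMPTION: the benchmark
# executes a chain of dependent additions sized so that a run takes
# 5 seconds at 1 cycle per addition; each extra cycle of latency then
# adds another 5 seconds of runtime.

def cycles_per_add(elapsed_seconds, baseline_seconds=5.0):
    return elapsed_seconds / baseline_seconds

print(round(cycles_per_add(10.064)))  # MIPS74K  at 480MHz
print(round(cycles_per_add(5.040)))   # MIPS24Kc at 680MHz
```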

&lt;h3&gt;x86, and also taking a look at SMT&lt;/h3&gt;

&lt;p&gt;A similar benchmarking method can also be extended to analyze the efficiency of &lt;a href=&quot;http://en.wikipedia.org/wiki/Simultaneous_multithreading&quot;&gt;SMT&lt;/a&gt;-capable
processors (Intel Atom, IBM Cell PPE and friends). Because the resources of a single CPU core are shared
between two hardware threads, perfect scalability is impossible, and it is interesting
to see how much SMT can actually help on real or artificial workloads. The test program for x86 may look like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-gas&quot; data-lang=&quot;gas&quot;&gt;&lt;span class=&quot;na&quot;&gt;.intel_syntax&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;noprefix&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.text&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;fork&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;wait&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifndef CPU_CLOCK_FREQUENCY&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#error CPU_CLOCK_FREQUENCY must be defined&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#define LOOP_UNROLL_FACTOR  100&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;main:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#ifdef TWO_THREADS&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;fork&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;ecx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;CPU_CLOCK_FREQUENCY&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;jmp&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;f&lt;/span&gt;

    &lt;span class=&quot;na&quot;&gt;.balign&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;1:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.rept&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addps&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;xmm1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addps&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;xmm2&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addps&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;xmm3&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.endr&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;dec&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;ecx&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;jnz&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;b&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifdef TWO_THREADS&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;push&lt;/span&gt;    &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;wait&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;esp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;ret&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;And the results of this benchmark from Intel Atom N450 @1.66GHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -m32 -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1660000000&lt;/span&gt; ht-bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m6.034s
user    0m6.032s
sys     0m0.000s

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -m32 -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1660000000&lt;/span&gt; -DTWO_THREADS ht-bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m9.088s
user    0m18.097s
sys     0m0.028s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;When running just one thread, 6 cycles are needed for each group of 9 instructions
in the loop body (the ADDPS instructions can dual-issue with the ADD instructions,
so the whole loop is limited only by the performance of the ADD instructions). Two
threads need 9 cycles for each 2 * 9 = 18 instructions, reaching the maximum
theoretically possible IPC = 2 for this processor.&lt;/p&gt;
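&lt;p&gt;The single-thread timing can be double-checked with arithmetic taken directly from the loop structure: the only input not visible in the code is the 6-cycles-per-rep figure quoted above.&lt;/p&gt;

```python
# Predicted wall-clock time of the single-thread run on Atom N450,
# derived from the loop structure of the benchmark above.

CPU_CLOCK_FREQUENCY = 1_660_000_000   # matches the -DCPU_CLOCK_FREQUENCY build flag
LOOP_UNROLL_FACTOR = 100
CYCLES_PER_REP = 6                    # 9 instructions, limited by the 6 dependent ADDs

iterations = CPU_CLOCK_FREQUENCY // LOOP_UNROLL_FACTOR   # initial value of ecx
total_cycles = iterations * LOOP_UNROLL_FACTOR * CYCLES_PER_REP
predicted_seconds = total_cycles / CPU_CLOCK_FREQUENCY
print(predicted_seconds)              # close to the measured 6.034s

ipc = 9 / CYCLES_PER_REP              # instructions per cycle for one thread
print(round(ipc, 2))
```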

&lt;p&gt;This particular benchmark is quite interesting, because I used it to verify
a hypothesis suggested by another person: that at any given CPU cycle,
instructions from only one hardware thread may be processed (either a single
instruction or a pair of instructions), but never from both threads at once.
However, since 12 ADD instructions are executed in 9 cycles and they can&#39;t dual
issue within a single thread, the processor has no choice but to occasionally
execute a pair of ADD instructions fetched from different threads simultaneously.&lt;/p&gt;

&lt;p&gt;Still, there is something wrong with the Intel Atom hyper-threading
implementation, because removing all the ADDPS instructions from the
benchmark program actually causes a performance regression in the multithreaded case.
It regresses to 12 cycles per 2 * 6 = 12 remaining ADD instructions,
so hyper-threading becomes useless: two threads running simultaneously need
exactly the same time to complete as running just a single thread twice.
Those extra ADDPS instructions work as a kind of &quot;catalyst&quot; and
improve multithreaded performance for this particular code
sequence!&lt;/p&gt;
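&lt;p&gt;Putting the numbers together: a small sketch using only the per-rep cycle counts quoted above (smt_speedup is just a hypothetical helper name for this comparison):&lt;/p&gt;

```python
# SMT throughput gain: time to run the same total work on two hardware
# threads together, versus running the single-thread case twice in a row.

def smt_speedup(cycles_per_rep_one_thread, cycles_per_rep_two_threads):
    return (2 * cycles_per_rep_one_thread) / cycles_per_rep_two_threads

# With the ADDPS "catalyst": 6 cycles/rep alone, 9 cycles/rep per thread
# when both threads run -> a modest gain from hyper-threading.
print(round(smt_speedup(6, 9), 2))

# Without ADDPS: 12 cycles per rep per thread, i.e. exactly twice the
# single-thread time -> no gain at all.
print(smt_speedup(6, 12))
```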

&lt;h3&gt;But what about the hardware performance counters available in modern processors?&lt;/h3&gt;

&lt;p&gt;The hardware performance counters are surely useful. Moreover, they can monitor many
interesting events in addition to a simple cycle counter, which exposes
additional information about what is happening inside the processor
and helps to understand it better.&lt;/p&gt;

&lt;p&gt;However, simple time-based tests are just fine and may even be preferable in some
cases. The most important one is when you want to ask somebody else to
run a benchmark on their hardware, but the performance counters are not
accessible from userspace by default and that person is reluctant
to touch the kernel.&lt;/p&gt;

&lt;p&gt;On the other hand, the simple timer-based tests described here are problematic
when something like &lt;a href=&quot;http://en.wikipedia.org/wiki/Intel_Turbo_Boost&quot;&gt;turbo-boost&lt;/a&gt;
is supported by the hardware and enabled, causing the CPU clock frequency to drift.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/07/30/origenboard-early-adopter</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/07/30/origenboard-early-adopter.html"/>
   <title>Origenboard, early adopter impressions</title>
   <updated>2011-07-30T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;h3&gt;A little bit of rant&lt;/h3&gt;

&lt;p&gt;Since a few days ago, I&#39;m a somewhat happy owner of an &lt;a href=&quot;http://www.origenboard.org/&quot;&gt;origenboard&lt;/a&gt; from the first batch.
So why am I not totally happy yet? I expected the board to be easy to get up
and running, considering that the same
&lt;a href=&quot;http://www.samsung.com/global/business/semiconductor/productInfo.do?fmly_id=844&amp;amp;partnum=Exynos%204210&quot;&gt;Exynos 4210 SoC&lt;/a&gt;
is used in the rather popular
&lt;a href=&quot;http://en.wikipedia.org/wiki/Samsung_Galaxy_S_II&quot;&gt;Samsung Galaxy S2&lt;/a&gt; smartphone already
available on the market (which means that the SoC itself should not have any serious hardware
problems by now), and also because of &lt;a href=&quot;http://www.youtube.com/watch?v=vLUne-yDzVE&quot;&gt;demos like this&lt;/a&gt; (which means that at least Linaro should have some usable Linux kernel to run them on).
So there was every reason to expect a validation SD card image readily available
for download, along with some basic getting started instructions, right?&lt;/p&gt;

&lt;p&gt;The reality is that the only support area on the &lt;a href=&quot;http://www.origenboard.org/&quot;&gt;origenboard&lt;/a&gt; website is a pre-moderated forum, where
a few fellow users &lt;a href=&quot;http://www.origenboard.org/forum/viewtopic.php?f=8&amp;amp;t=4&quot;&gt;have asked about the sources of u-boot&lt;/a&gt;.
My reply to that topic, trying to share the information with them, has not yet passed moderation as of today.
Hopefully the initial mess will be resolved soon and there will be a usable communication
channel for origenboard users. But considering that there are only &lt;a href=&quot;http://www.origenboard.org/news/?p=18&quot;&gt;30 days of warranty&lt;/a&gt;,
it is a bit disturbing not to be able to run a validation image and test the board for hardware defects right away.&lt;/p&gt;

&lt;p&gt;Because the origenboard website refers to &lt;a href=&quot;http://www.linaro.org/&quot;&gt;Linaro&lt;/a&gt; as the intended provider
of the software part, I tried to see whether Linaro can offer something usable for the origenboard right now.
The information currently seems to be scarce and scattered (I looked at the downloads area, the wiki
pages and asked around on the #linaro IRC channel). And the downside is that the maturity of
the &lt;a href=&quot;http://git.linaro.org/gitweb?p=people/angus/linux-linaro-2.6.39.git;a=shortlog;h=refs/tags/2.6.39-2011.07&quot;&gt;currently provided linaro kernel 2.6.39-2011.07&lt;/a&gt; does not
appear to be very good yet.&lt;/p&gt;

&lt;p&gt;My experience with this board so far is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://git.linaro.org/gitweb?p=people/angus/linux-linaro-2.6.39.git;a=shortlog;h=refs/tags/2.6.39-2011.07&quot;&gt;linaro kernel&lt;/a&gt;: USB does not work (so no USB ethernet), and only a single CPU core is available. There is also some output on HDMI, but the monitor reports an &quot;out of range&quot; error&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://git.insignal.co.kr/?p=linux-2.6-insignal-dev.git;a=shortlog;h=3645a1cb402be68b83feb9f9c8d7af2728cc8878&quot;&gt;insignal kernel&lt;/a&gt;: USB works, both CPU cores are available (though running at only 1GHz), no HDMI output to monitor at all (and a few random configuration tweaks did not help)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;But in any case, the insignal kernel at least provides a usable headless configuration,
and this is surely better than nothing. On the positive side, the current situation
inspired me to finally start a blog and post about something. Hopefully blogging will
be entertaining for both me and the prospective readers :)&lt;/p&gt;

&lt;h3&gt;Board setup notes&lt;/h3&gt;

&lt;p&gt;The instructions below are not complete, but they are supposed to highlight the most important
steps. All of this has been discovered by trial and error
and by bugging the relevant people on the #linaro IRC channel (thanks for their patience). A total
newbie may still get stuck, but this information should be sufficient for anyone
with some experience installing Linux on other ARM development boards.&lt;/p&gt;

&lt;p&gt;Also, this information is likely to become outdated very soon (assuming it was useful in the first place).&lt;/p&gt;

&lt;h4&gt;u-boot and linux kernel sources&lt;/h4&gt;

&lt;p&gt;The combination of u-boot and kernel that I&#39;m using at the moment is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;u-boot: &lt;a href=&quot;http://git.linaro.org/gitweb?p=people/angus/u-boot.git;a=shortlog;h=refs/tags/linaro-origen-2011.07&quot;&gt;linaro-origen-2011.07&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;kernel: &lt;a href=&quot;http://git.insignal.co.kr/?p=linux-2.6-insignal-dev.git;a=shortlog;h=3645a1cb402be68b83feb9f9c8d7af2728cc8878&quot;&gt;insignal 3645a1cb402be68b83feb9f9c8d7af2728cc8878&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This kernel needs to be patched when used with this particular u-boot (as advised by linaro guys):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-diff&quot; data-lang=&quot;diff&quot;&gt;&lt;span class=&quot;gh&quot;&gt;diff --git a/arch/arm/mach-s5pv310/mach-origen.c b/arch/arm/mach-s5pv310/mach-origen.c&lt;/span&gt;
&lt;span class=&quot;gh&quot;&gt;index e24e8d1..977f0c9 100644&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;--- a/arch/arm/mach-s5pv310/mach-origen.c&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+++ b/arch/arm/mach-s5pv310/mach-origen.c&lt;/span&gt;
&lt;span class=&quot;gu&quot;&gt;@@ -549,7 +549,7 @@ static void __init origen_fixup(struct machine_desc *desc,&lt;/span&gt;
    mi-&amp;gt;nr_banks = 2;
 }
 
&lt;span class=&quot;gd&quot;&gt;-#if 0&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+#if 1&lt;/span&gt;
 MACHINE_START(ORIGEN, &amp;quot;ORIGEN&amp;quot;)
 #else
 MACHINE_START(SMDKV310, &amp;quot;SMDKV310&amp;quot;)
&lt;span class=&quot;gd&quot;&gt;-- &lt;/span&gt;
1.7.3.4&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Compile u-boot (to get u-boot-mmc-spl.bin and u-boot.bin):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;make &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- mrproper
make &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- origen_config
make &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Compile the kernel (to get uImage):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- mrproper
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- origen_android_defconfig
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- menuconfig
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- -j8 uImage
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- -j8 modules
scp arch/arm/boot/uImage root@origen:/mnt/mmcblk0p1/uImage
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- modules_install &lt;span class=&quot;nv&quot;&gt;INSTALL_MOD_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/mnt/origen-nfs-root&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Be sure to tweak configuration options as needed (add drivers for USB ethernet adapters, statically compile in ext3 support, disable CONFIG_ANDROID_PARANOID_NETWORK, etc.)&lt;/p&gt;

&lt;h4&gt;SD card layout&lt;/h4&gt;

&lt;p&gt;This section is based on the information from &lt;a href=&quot;https://wiki.linaro.org/Boards/Origen/Setup&quot;&gt;linaro wiki&lt;/a&gt;.
In order to successfully boot the system, u-boot binary needs to be put into certain predefined areas on SD card.&lt;/p&gt;

&lt;table border=1&gt;&lt;tr&gt;
&lt;td colspan=&quot;4&quot; style=&quot;text-align:center&quot;&gt;Raw Sectors (sector size = 512 bytes)&lt;/td&gt;
  &lt;td colspan=&quot;3&quot; style=&quot;text-align:center&quot;&gt;Partitions &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1 to 32&lt;/td&gt;
  &lt;td&gt;33 to 64&lt;/td&gt;
  &lt;td&gt;65 to 1088&lt;/td&gt;
  &lt;td&gt;FAT partition&lt;/td&gt;
  &lt;td&gt;any linux partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;MBR&lt;/td&gt;
  &lt;td&gt;u-boot-mmc-spl.bin&lt;/td&gt;
  &lt;td&gt;u-boot environment &lt;/td&gt;
  &lt;td&gt;u-boot.bin &lt;/td&gt;
  &lt;td&gt;uImage (kernel)&lt;/td&gt;
  &lt;td&gt;root filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;Writing u-boot into raw sectors of SD card (assuming that SD card is detected as /dev/sdb):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;c&quot;&gt;# dd if=u-boot-mmc-spl.bin of=/dev/sdb bs=512 seek=1&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# dd if=u-boot.bin of=/dev/sdb bs=512 seek=65&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
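&lt;p&gt;For reference, the sector numbers translate into dd byte offsets and maximum area sizes as follows (a small sketch of the arithmetic behind the seek values, using the table above):&lt;/p&gt;

```python
# With bs=512, "dd seek=N" starts writing at byte N * 512, so each raw
# area of the SD card layout maps to a fixed byte range.

SECTOR_SIZE = 512

# (first sector, number of sectors) for each raw area, from the table above
layout = {
    "u-boot-mmc-spl.bin": (1, 32),      # dd seek=1
    "u-boot environment": (33, 32),
    "u-boot.bin":         (65, 1024),   # dd seek=65
}

for name, (first, count) in layout.items():
    offset = first * SECTOR_SIZE
    max_size = count * SECTOR_SIZE
    print(f"{name}: starts at byte {offset}, fits at most {max_size} bytes")
```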


&lt;h4&gt;Install rootfs for the distro of your choice and boot the system&lt;/h4&gt;

&lt;p&gt;Typical u-boot environment (when using rootfs from SD card instead of NFS):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;baudrate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;115200
&lt;span class=&quot;nv&quot;&gt;bootargs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/mmcblk0p2 rw rootwait &lt;span class=&quot;nv&quot;&gt;console&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ttySAC2,115200
&lt;span class=&quot;nv&quot;&gt;bootcmd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;fatload mmc &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;40007000&lt;/span&gt; uImage&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; bootm 40007000
&lt;span class=&quot;nv&quot;&gt;bootdelay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3
&lt;span class=&quot;nv&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;serial
&lt;span class=&quot;nv&quot;&gt;stdin&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;serial
&lt;span class=&quot;nv&quot;&gt;stdout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;serial&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;But in order to get a login prompt on the serial console, &lt;b&gt;s3c2410_serial2&lt;/b&gt; (not &lt;b&gt;ttySAC2&lt;/b&gt;) needs to be added to /etc/inittab and /etc/securetty. That&#39;s a bit weird, but I have not looked into it yet.&lt;/p&gt;

&lt;p&gt;Finally turn on the board by pressing &lt;b&gt;switch&lt;/b&gt; and then &lt;b&gt;power&lt;/b&gt; button.&lt;/p&gt;

&lt;h4&gt;Update from 2011-09-19&lt;/h4&gt;

&lt;p&gt;The Linaro kernel is getting better. It now supports cpufreq (so the 1.2GHz CPU clock frequency is usable), has
somewhat working USB support (it is very slow and sometimes gets stuck for a few seconds), and somewhat
usable HDMI output, which is hardcoded to a 1920x1080 resolution and uses only a small 1024x600 area in
the center. Still, compared to the initial state, this is a major improvement.&lt;/p&gt;

&lt;p&gt;I guess, everything is going to be in a much better shape in a few more months.&lt;/p&gt;
</content>
 </entry>
 

</feed>
