<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Siarhei Siamashka</title>
 <link href="http://ssvb.github.io/"/>
 <link type="application/atom+xml" rel="self" href="http://ssvb.github.io/atom.xml"/>
 <updated>2014-11-12T16:49:18+00:00</updated>
 <id>http://ssvb.github.io/</id>
 <author>
   <name>Siarhei Siamashka</name>
   <email>siarhei.siamashka@gmail.com</email>
 </author>

 
 <entry>
   <id>http://ssvb.github.io/2014/11/11/revisiting-fullhd-x11-desktop-performance-of-the-allwinner-a10</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2014/11/11/revisiting-fullhd-x11-desktop-performance-of-the-allwinner-a10.html"/>
   <title>Revisiting FullHD X11 desktop performance of the Allwinner A10</title>
   <updated>2014-11-11T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;In my &lt;a href=&quot;http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html&quot;&gt;previous blog post&lt;/a&gt;,
I discussed the pathologically bad Linux desktop performance with FullHD monitors on Allwinner A10 hardware.&lt;/p&gt;

&lt;p&gt;A lot of time has passed since then. Thanks to the availability of Rockchip
&lt;a href=&quot;https://github.com/ssvb/Rockchip-GPL-Kernel/blob/master/arch/arm/mach-rk29/ddr.c&quot;&gt;sources&lt;/a&gt;
and &lt;a href=&quot;http://www.cnx-software.com/2012/11/04/rockchip-rk3066-rk30xx-processor-documentation-source-code-and-tools/&quot;&gt;documentation&lt;/a&gt;,
we have learned a lot about the DRAM controller in Allwinner A10/A13/A20 SoCs.
Both Allwinner and Rockchip apparently license the DRAM controller IP from
the same &lt;a href=&quot;http://www.synopsys.com/dw/ipdir.php?ds=dwc_ddr2-lite_mem&quot;&gt;third-party vendor&lt;/a&gt;,
and their DRAM controller hardware registers share a lot of similarities (though
unfortunately not an exact match).&lt;/p&gt;

&lt;p&gt;Much better knowledge of the hardware allowed us to revisit
this problem, investigate it in more detail and
&lt;a href=&quot;https://github.com/linux-sunxi/u-boot-sunxi/commit/4e1532df5ebc6e0dd56c09dddb3d116979a2c49b&quot;&gt;come up with a solution back in April 2014&lt;/a&gt;.
The only missing part was an update in this blog, at least
to make it clear that the problem has been resolved now. So here we go...&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html"/>
   <title>FullHD X11 desktop performance of the Allwinner A10</title>
   <updated>2013-06-27T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;This blog post is assuming that you are a happy owner of one of the devices,
based on the Allwinner A10 SoC (with a single core ARM Cortex-A8 1GHz). But
hopefully the owners of the other low end ARM based devices may also find
something interesting here.&lt;/p&gt;

&lt;p&gt;There are plenty of user friendly Linux distributions
available for Allwinner A10 devices (for example, &lt;a href=&quot;https://fedoraproject.org/wiki/Architectures/ARM/AllwinerA10&quot;&gt;Fedora&lt;/a&gt;
is a nice one). Basically you just write an image to the SD card, plug an HDMI cable
into your TV or monitor, connect a keyboard and a mouse, and power the device on. Then a
nice GUI wizard guides you through the initial configuration, like setting passwords, etc.
Part of the magic, which allows these user friendly distros to just work out-of-the
box, is the automatic detection of the monitor capabilities via
&lt;a href=&quot;http://en.wikipedia.org/wiki/Extended_display_identification_data&quot;&gt;EDID&lt;/a&gt; and
setting the preferred screen resolution, suggested by the monitor. Many monitors
are &lt;a href=&quot;https://en.wikipedia.org/wiki/1080p&quot;&gt;FullHD&lt;/a&gt; capable, hence you are likely to
end up with a 1920x1080 screen resolution. And that&#39;s where it may become a challenge
for a low end device.&lt;/p&gt;

&lt;p&gt;First of all, a 1920x1080 screen has 2.25 times as many pixels as 1280x720, and the amount
of pixels to be processed naturally affects the performance. So expect 1920x1080 graphics
to be at least twice as slow as 1280x720 when redrawing anything that covers the whole
screen.&lt;/p&gt;

&lt;p&gt;But additionally, as part of the monitor refresh, pixels are read from the framebuffer
and sent over HDMI to the monitor 60 times per second. As there is no dedicated video
memory for the framebuffer, the screen refresh is competing with the CPU, DMA and various
hardware accelerators for the access to the system memory. We can estimate how much system
memory bandwidth is wasted for just maintaining the monitor refresh:
            1920x1080 * 4 bytes per pixel * 60Hz = ~500 MB/s&lt;/p&gt;

&lt;p&gt;And we should double this amount if the system is driving two monitors at once (HDMI and VGA), but
the dual monitor support is outside of the scope of this blog post. Anyway, is 500 MB/s significant
or not? Allwinner A10 uses 32-bit DDR3 memory, clocked between 360 MHz and
480 MHz (the default memory clock speed differs between devices). This means that
the theoretical memory bandwidth limit is between 2.9 GB/s and 3.8 GB/s. So in theory we should
be perfectly fine?&lt;/p&gt;
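&lt;p&gt;These estimates are easy to double-check with a few lines of Python (a rough sketch: decimal MB/GB, DDR3 transferring twice per clock on a 32-bit bus; these are raw bus limits, not achievable throughput):&lt;/p&gt;

```python
# Memory bandwidth consumed by scanning out the framebuffer:
# every pixel is read 'refresh_hz' times per second.
def scanout_bandwidth(width, height, bytes_per_pixel, refresh_hz):
    return width * height * bytes_per_pixel * refresh_hz

# Theoretical peak bandwidth of DDR3 on a 32-bit (4-byte) bus:
# two transfers per clock cycle (double data rate).
def ddr3_peak_bandwidth(clock_hz, bus_bytes=4):
    return clock_hz * 2 * bus_bytes

refresh = scanout_bandwidth(1920, 1080, 4, 60)
print(refresh / 1e6)                         # ~498 MB/s, the "~500 MB/s" estimate
print(ddr3_peak_bandwidth(360e6) / 1e9)      # 2.88, i.e. "2.9 GB/s"
print(ddr3_peak_bandwidth(480e6) / 1e9)      # 3.84, i.e. "3.8 GB/s"
print(refresh / ddr3_peak_bandwidth(360e6))  # ~0.17
```

&lt;p&gt;So the screen refresh alone eats roughly 13-17% of the theoretical bandwidth, which does not sound fatal by itself.&lt;/p&gt;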

&lt;h2&gt;Synthetic tests for the monitor refresh induced memory bandwidth loss&lt;/h2&gt;

&lt;p&gt;We can simply try to boot the system with different combinations of monitor refresh
rate, desktop color depth and memory clock frequency. Then do the measurements for
each with &lt;a href=&quot;https://github.com/ssvb/tinymembench&quot;&gt;tinymembench&lt;/a&gt; and put the results
into tables. The performance of memset appears to be the most affected, hence it is
the most interesting to observe. There are also &quot;backwards memset&quot; performance numbers
for the sake of completeness (it does the same job as memset, but is implemented by
decrementing the pointer after each write instead of incrementing it).&lt;/p&gt;
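&lt;p&gt;To illustrate what &quot;backwards memset&quot; means (a sketch of the access pattern only, not the actual ARM/NEON assembly used by tinymembench):&lt;/p&gt;

```python
# Forward fill: writes walk up through memory; this is the pattern
# that collides badly with screen refresh on this hardware.
def memset_forward(buf, value):
    for i in range(len(buf)):
        buf[i] = value

# "Backwards memset": same job, but the write address is decremented
# after each store instead of incremented.
def memset_backwards(buf, value):
    for i in range(len(buf) - 1, -1, -1):
        buf[i] = value

a = bytearray(16)
b = bytearray(16)
memset_forward(a, 0xFF)
memset_backwards(b, 0xFF)
print(a == b)  # True: identical result, only the access order differs
```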

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;caption&gt;&lt;b&gt;Table 1. Memory write bandwidth available to the CPU (memset performance)&lt;/b&gt;&lt;/caption&gt;
&lt;tr&gt;&lt;th&gt;&lt;th colspan=6&gt;Memory clock speed
&lt;tr&gt;&lt;th&gt;Video mode&lt;th&gt;360MHz&lt;th&gt;384MHz&lt;th&gt;408MHz&lt;th&gt;432MHz&lt;th&gt;456MHz&lt;th&gt;480MHz
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz&lt;td bgcolor=&#39;red&#39;&gt;450 MB/s&lt;td bgcolor=&#39;red&#39;&gt;480 MB/s&lt;td bgcolor=&#39;red&#39;&gt;509 MB/s&lt;td bgcolor=&#39;red&#39;&gt;537 MB/s&lt;td bgcolor=&#39;red&#39;&gt;556 MB/s&lt;td bgcolor=&#39;red&#39;&gt;556 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz (scaler mode)&lt;td bgcolor=&#39;red&#39;&gt;548 MB/s&lt;td bgcolor=&#39;red&#39;&gt;550 MB/s&lt;td bgcolor=&#39;red&#39;&gt;554 MB/s&lt;td bgcolor=&#39;red&#39;&gt;554 MB/s&lt;td bgcolor=&#39;red&#39;&gt;558 MB/s&lt;td bgcolor=&#39;red&#39;&gt;558 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz&lt;td bgcolor=&#39;red&#39;&gt;449 MB/s&lt;td bgcolor=&#39;red&#39;&gt;479 MB/s&lt;td bgcolor=&#39;red&#39;&gt;510 MB/s&lt;td bgcolor=&#39;red&#39;&gt;522 MB/s&lt;td bgcolor=&#39;red&#39;&gt;533 MB/s&lt;td bgcolor=&#39;#5DDC5D&#39;&gt;812 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz (scaler mode)&lt;td bgcolor=&#39;red&#39;&gt;514 MB/s&lt;td bgcolor=&#39;red&#39;&gt;620 MB/s&lt;td bgcolor=&#39;#65E465&#39;&gt;764 MB/s&lt;td bgcolor=&#39;#64E364&#39;&gt;769 MB/s&lt;td bgcolor=&#39;#63E263&#39;&gt;774 MB/s&lt;td bgcolor=&#39;#4FCE4F&#39;&gt;896 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz&lt;td bgcolor=&#39;red&#39;&gt;449 MB/s&lt;td bgcolor=&#39;red&#39;&gt;467 MB/s&lt;td bgcolor=&#39;red&#39;&gt;576 MB/s&lt;td bgcolor=&#39;#5DDC5D&#39;&gt;815 MB/s&lt;td bgcolor=&#39;#37B637&#39;&gt;1041 MB/s&lt;td bgcolor=&#39;#29A829&#39;&gt;1122 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz (scaler mode)&lt;td bgcolor=&#39;#66E566&#39;&gt;759 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;885 MB/s&lt;td bgcolor=&#39;#4BCA4B&#39;&gt;921 MB/s&lt;td bgcolor=&#39;#44C344&#39;&gt;964 MB/s&lt;td bgcolor=&#39;#3BBA3B&#39;&gt;1018 MB/s&lt;td bgcolor=&#39;#28A728&#39;&gt;1130 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 60Hz&lt;td bgcolor=&#39;red&#39;&gt;421 MB/s&lt;td bgcolor=&#39;red&#39;&gt;430 MB/s&lt;td bgcolor=&#39;#58D758&#39;&gt;842 MB/s&lt;td bgcolor=&#39;#42C142&#39;&gt;972 MB/s&lt;td bgcolor=&#39;#31B031&#39;&gt;1074 MB/s&lt;td bgcolor=&#39;#199819&#39;&gt;1219 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 56Hz&lt;td bgcolor=&#39;red&#39;&gt;417 MB/s&lt;td bgcolor=&#39;#55D455&#39;&gt;860 MB/s&lt;td bgcolor=&#39;#47C647&#39;&gt;947 MB/s&lt;td bgcolor=&#39;#39B839&#39;&gt;1030 MB/s&lt;td bgcolor=&#39;#22A122&#39;&gt;1168 MB/s&lt;td bgcolor=&#39;#1B9A1B&#39;&gt;1210 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 50Hz&lt;td bgcolor=&#39;#5DDC5D&#39;&gt;813 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;887 MB/s&lt;td bgcolor=&#39;#3AB93A&#39;&gt;1023 MB/s&lt;td bgcolor=&#39;#209F20&#39;&gt;1180 MB/s&lt;td bgcolor=&#39;#159415&#39;&gt;1247 MB/s&lt;td bgcolor=&#39;#149314&#39;&gt;1252 MB/s&lt;/tr&gt;
&lt;/table&gt;




&lt;p&gt;&lt;/p&gt;


&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;caption&gt;&lt;b&gt;Table 2. Memory write bandwidth available to the CPU (backwards memset performance)&lt;/b&gt;&lt;/caption&gt;
&lt;tr&gt;&lt;th&gt;&lt;th colspan=6&gt;Memory clock speed
&lt;tr&gt;&lt;th&gt;Video mode&lt;th&gt;360MHz&lt;th&gt;384MHz&lt;th&gt;408MHz&lt;th&gt;432MHz&lt;th&gt;456MHz&lt;th&gt;480MHz
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz&lt;td bgcolor=&#39;#72F172&#39;&gt;688 MB/s&lt;td bgcolor=&#39;#5CDB5C&#39;&gt;817 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;882 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;883 MB/s&lt;td bgcolor=&#39;#42C142&#39;&gt;974 MB/s&lt;td bgcolor=&#39;#37B637&#39;&gt;1040 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 60Hz (scaler mode)&lt;td bgcolor=&#39;#6BEA6B&#39;&gt;726 MB/s&lt;td bgcolor=&#39;#63E263&#39;&gt;779 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;882 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;884 MB/s&lt;td bgcolor=&#39;#4AC94A&#39;&gt;925 MB/s&lt;td bgcolor=&#39;#39B839&#39;&gt;1030 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz&lt;td bgcolor=&#39;#64E364&#39;&gt;769 MB/s&lt;td bgcolor=&#39;#5BDA5B&#39;&gt;824 MB/s&lt;td bgcolor=&#39;#53D253&#39;&gt;873 MB/s&lt;td bgcolor=&#39;#47C647&#39;&gt;947 MB/s&lt;td bgcolor=&#39;#3FBE3F&#39;&gt;995 MB/s&lt;td bgcolor=&#39;#29A829&#39;&gt;1123 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 56Hz (scaler mode)&lt;td bgcolor=&#39;#65E465&#39;&gt;762 MB/s&lt;td bgcolor=&#39;#5BDA5B&#39;&gt;825 MB/s&lt;td bgcolor=&#39;#53D253&#39;&gt;874 MB/s&lt;td bgcolor=&#39;#45C445&#39;&gt;959 MB/s&lt;td bgcolor=&#39;#3EBD3E&#39;&gt;996 MB/s&lt;td bgcolor=&#39;#34B334&#39;&gt;1060 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz&lt;td bgcolor=&#39;#65E465&#39;&gt;763 MB/s&lt;td bgcolor=&#39;#55D455&#39;&gt;863 MB/s&lt;td bgcolor=&#39;#48C748&#39;&gt;941 MB/s&lt;td bgcolor=&#39;#3AB93A&#39;&gt;1021 MB/s&lt;td bgcolor=&#39;#2AA92A&#39;&gt;1119 MB/s&lt;td bgcolor=&#39;#1E9D1E&#39;&gt;1188 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 32bpp, 50Hz (scaler mode)&lt;td bgcolor=&#39;#5FDE5F&#39;&gt;799 MB/s&lt;td bgcolor=&#39;#51D051&#39;&gt;887 MB/s&lt;td bgcolor=&#39;#4BCA4B&#39;&gt;919 MB/s&lt;td bgcolor=&#39;#3EBD3E&#39;&gt;996 MB/s&lt;td bgcolor=&#39;#32B132&#39;&gt;1071 MB/s&lt;td bgcolor=&#39;#1F9E1F&#39;&gt;1183 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 60Hz&lt;td bgcolor=&#39;#5CDB5C&#39;&gt;819 MB/s&lt;td bgcolor=&#39;#4BCA4B&#39;&gt;919 MB/s&lt;td bgcolor=&#39;#40BF40&#39;&gt;986 MB/s&lt;td bgcolor=&#39;#26A526&#39;&gt;1143 MB/s&lt;td bgcolor=&#39;#21A021&#39;&gt;1175 MB/s&lt;td bgcolor=&#39;#209F20&#39;&gt;1177 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 56Hz&lt;td bgcolor=&#39;#56D556&#39;&gt;856 MB/s&lt;td bgcolor=&#39;#48C748&#39;&gt;938 MB/s&lt;td bgcolor=&#39;#2EAD2E&#39;&gt;1097 MB/s&lt;td bgcolor=&#39;#2DAC2D&#39;&gt;1098 MB/s&lt;td bgcolor=&#39;#209F20&#39;&gt;1178 MB/s&lt;td bgcolor=&#39;#169516&#39;&gt;1239 MB/s&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1920x1080, 24bpp, 50Hz&lt;td bgcolor=&#39;#4AC94A&#39;&gt;925 MB/s&lt;td bgcolor=&#39;#41C041&#39;&gt;983 MB/s&lt;td bgcolor=&#39;#34B334&#39;&gt;1060 MB/s&lt;td bgcolor=&#39;#26A526&#39;&gt;1144 MB/s&lt;td bgcolor=&#39;#1F9E1F&#39;&gt;1182 MB/s&lt;td bgcolor=&#39;#129112&#39;&gt;1263 MB/s&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;The &lt;a href=&quot;http://linux-sunxi.org/Fex_Guide#disp_init_configuration&quot;&gt;&quot;scaler mode&quot;&lt;/a&gt; needs an
additional explanation. The display controller in Allwinner A10 consists of two parts:
Display Engine Front End (DEFE) and Display Engine Back End (DEBE). DEBE can provide up
to 4 hardware layers (which are composited together for the final picture on screen) and
supports a large variety of pixel formats. DEFE is connected in front of DEBE and can
optionally provide scaling for 2 of these hardware layers, the drawback is that DEFE
supports only a limited set of pixel formats. All this information can be found in the
&lt;a href=&quot;http://free-electrons.com/~maxime/pub/datasheet/A13%20user%20manual%20v1.2%2020130108.pdf&quot;&gt;Allwinner A13 manual&lt;/a&gt;,
which is &lt;a href=&quot;http://irclog.whitequark.org/linux-sunxi/2013-05-17#3830239&quot;&gt;now available in the unrestricted public access&lt;/a&gt;.
The framebuffer memory is read by the DEFE hardware when &quot;scaler mode&quot; is enabled, and
by the DEBE hardware otherwise. In practice, the DEFE and DEBE implementations of
fetching pixels for screen refresh appear to affect memset performance
differently.&lt;/p&gt;

&lt;p&gt;One thing is obvious even without running any tests, and the measurements just confirm
it: more memory bandwidth drained by screen refresh means less bandwidth left for
the CPU. But the most interesting observation is that the memset performance abruptly
degrades upon reaching a certain threshold. The abnormally low memset performance
results are highlighted in red in table 1. But the backwards memset is not affected.
There is certainly something odd in the memory controller or in the display controller.&lt;/p&gt;

&lt;p&gt;Attentive readers may argue that the same resolution and refresh rate can be achieved
using different timings. The detailed modelines used in this test were the following:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;Mode &amp;quot;1920x1080_50&amp;quot; 148.5 1920 2448 2492 2640 1080 1084 1089 1125 +HSync +VSync
Mode &amp;quot;1920x1080_56&amp;quot; 148.5 1920 2165 2209 2357 1080 1084 1089 1125 +HSync +VSync
Mode &amp;quot;1920x1080_60&amp;quot; 148.5 1920 2008 2052 2200 1080 1084 1089 1125 +HSync +VSync&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
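&lt;p&gt;For reference, the refresh rate implied by a modeline is simply the pixel clock divided by the total frame size including blanking (htotal * vtotal, the last value of each timing group). A quick sanity check of the modelines above:&lt;/p&gt;

```python
# Refresh rate of a modeline: pixel clock divided by the total number
# of pixel periods per frame (htotal * vtotal), blanking included.
def modeline_refresh(dotclock_mhz, htotal, vtotal):
    return dotclock_mhz * 1e6 / (htotal * vtotal)

print(round(modeline_refresh(148.5, 2640, 1125), 2))  # 50.0
print(round(modeline_refresh(148.5, 2357, 1125), 2))  # 56.0
print(round(modeline_refresh(148.5, 2200, 1125), 2))  # 60.0
```

&lt;p&gt;Note that all three modes share the same 148.5 MHz pixel clock and vertical timings; only the horizontal blanking differs.&lt;/p&gt;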


&lt;p&gt;Empirical tests show that in order to have less impact on the memory bandwidth, we
need to maximize pixel clock, minimize vertical blanking and select the target
refresh rate by adjusting horizontal blanking. That is assuming that the monitor
will accept these extreme timings. The &quot;red zones&quot; in table 1 may drift a bit
as a result.&lt;/p&gt;

&lt;h2&gt;Benchmarks by replaying the traces of real applications (cairo-perf-trace)&lt;/h2&gt;

&lt;p&gt;The numbers in table 1 look scary, but do they have any significant impact on real
applications? Let&#39;s try the &lt;a href=&quot;https://github.com/ssvb/trimmed-cairo-traces&quot;&gt;trimmed cairo traces&lt;/a&gt;
again to see how it affects the performance of software rendered 2D graphics.&lt;/p&gt;

&lt;p&gt;This benchmark is using gcc 4.8.1, pixman 0.30.0, cairo 1.12.14, linux kernel 3.4 with
&lt;a href=&quot;http://lists.infradead.org/pipermail/linux-arm-kernel/2012-February/084359.html&quot;&gt;ARM hugetlb&lt;/a&gt;
patches added. HugeTLB is very interesting by itself, because it provides a nice performance
improvement for memory heavy workloads. But in this particular case it also helps to
make benchmark results reproducible across multiple runs (the variance apparently
results from differences in physical memory fragmentation and cache associativity
effects). The cairo-perf-trace results from the &quot;red zone&quot; seem to be poorly reproducible
with the standard 4K pages.&lt;/p&gt;

&lt;p&gt;We can&#39;t test all the possible configurations, so we just need to pick a few interesting ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1920x1080-60Hz, DDR3 360MHz (default for Mele A2000 HTPC box)&lt;/li&gt;
&lt;li&gt;1920x1080-60Hz, DDR3 480MHz (default for CubieBoard)&lt;/li&gt;
&lt;li&gt;1920x1080-50Hz, DDR3 480MHz (CubieBoard, &#39;disp.screen0_output_mode=1920x1080p50&#39; in the kernel cmdline)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;div class=&quot;image&quot;&gt;
&lt;center&gt;&lt;b&gt;Chart 1. The results of cairo-perf-trace using &#39;image&#39; backend (on Allwinner A10, ARM Cortex-A8 @1GHz)&lt;/b&gt;&lt;/center&gt;
&lt;a href=&quot;http://ssvb.github.io/images/2013-06-27-cairo-perf-chart.png&quot;&gt;&lt;img src =&quot;http://ssvb.github.io/images/2013-06-27-cairo-perf-chart-lowres.png&quot; alt=&quot;2013-06-27-cairo-perf-chart.png&quot;&gt;&lt;/a&gt;
&lt;/div&gt;&lt;/p&gt;


&lt;p&gt;Chart 1 shows the performance improvements relative to the Mele A2000 with its more than
conservative default 360MHz memory clock frequency, using a 60Hz monitor refresh rate.
The green bars show how much of the performance improvement can be provided by changing the
memory clock frequency from 360MHz to 480MHz (by replacing the Mele A2000 with a CubieBoard
or just overclocking the memory). The blue bars show the performance improvement resulting
from additionally reducing the monitor refresh rate from 60Hz to 50Hz (and thus moving
out of the &quot;red zone&quot; in table 1).&lt;/p&gt;

&lt;p&gt;The results for the t-swfdec-giant-steps.trace replay show the biggest performance
dependency on the monitor refresh rate, so it definitely deserves some profiling.
Perf reports the following:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;59.93%  cairo-perf-trac  libpixman-1.so.0.30.0  [.] pixman_composite_src_n_8888_asm_neon
 14.06%  cairo-perf-trac  libcairo.so.2.11200.14 [.] _fill_xrgb32_lerp_opaque_spans
 10.20%  cairo-perf-trac  libcairo.so.2.11200.14 [.] _cairo_tor_scan_converter_generate
  3.35%  cairo-perf-trac  libcairo.so.2.11200.14 [.] cell_list_render_edge
  0.82%  cairo-perf-trac  libcairo.so.2.11200.14 [.] _cairo_tor_scan_converter_add_polygon&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Bingo! Most of the time is spent in &#39;pixman_composite_src_n_8888_asm_neon&#39; function (solid fill),
which is nothing else but a glorified memset. No surprise that it likes the 50Hz monitor refresh
rate so much.&lt;/p&gt;

&lt;h2&gt;An obligatory note about HugeTLB (and THP) on ARM&lt;/h2&gt;

&lt;p&gt;Chart 1 lists the results with a more than a year old set of HugeTLB patches
applied, but this feature has not reached the mainline Linux kernel yet. I&#39;m not
providing a separate cairo-perf-trace chart, but individual traces are up to 30%
faster when HugeTLB+libhugetlbfs is taken into use. And the geometric mean shows ~10%
overall improvement. These results seem to agree with
&lt;a href=&quot;http://lists.infradead.org/pipermail/linux-arm-kernel/2013-February/148835.html&quot;&gt;the reports from the other people&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let&#39;s hope that ARM and Linaro manage to &lt;a href=&quot;http://lists.infradead.org/pipermail/linux-arm-kernel/2013-June/173051.html&quot;&gt;push this feature in&lt;/a&gt;.
The 256 TLB entries in Cortex-A7 compared to just 32 in Cortex-A8 look very much
like a hardware workaround for a software problem :-) But even older processors
such as Cortex-A8 still need to be fast.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Update&lt;/b&gt;: it turns out that the significantly better benchmark results
can&#39;t be credited to the use of huge pages alone. The &quot;hugectl&quot; tool from
libhugetlbfs overrides glibc heap allocation and by default never returns
memory to the system, while the heap shrink/grow operations performed in normal
conditions (without hugectl) are not particularly cheap in some cases.
In any case, the primary purpose of using huge pages via hugectl
was to ensure reproducible cairo-perf-trace benchmark results, and it did
the job. Still, TLB misses are a major problem for some 2D graphics
operations: something like drawing a vertical scrollbar, where accessing each
new scanline triggers a TLB miss with 4KiB pages, or image rotation.&lt;/p&gt;
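&lt;p&gt;The vertical scrollbar example is easy to quantify (a rough back-of-the-envelope estimate, ignoring the micro-TLBs and assuming a linear 32bpp framebuffer):&lt;/p&gt;

```python
# With a 1920x1080 32bpp framebuffer, one scanline is wider than a
# 4KiB page, so stepping down one pixel column touches a new page
# on every scanline.
stride = 1920 * 4            # bytes per scanline: 7680
page = 4096
print(stride > page)         # True: every scanline lands in a new page

# 32 TLB entries (Cortex-A8) with 4KiB pages map only 128KiB...
tlb_entries = 32
coverage = tlb_entries * page
print(coverage // stride)    # only ~17 scanlines before the TLB thrashes

# ...while a single 2MiB huge page covers many scanlines per TLB entry.
print((2 * 1024 * 1024) // stride)  # 273 scanlines per entry
```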

&lt;h2&gt;So what can be done?&lt;/h2&gt;

&lt;p&gt;The 32bpp color depth with 1920x1080 resolution on Allwinner A10 is unfortunate enough
to hit this hardware quirk.&lt;/p&gt;

&lt;p&gt;First, a fantastic option :-) We could try to implement backwards solid fill in pixman
and use it on the problematic hardware (using the icky /proc/cpuinfo text parsing to
fish out the relevant bits of information and do runtime detection). Still, the problem
does not go away: some other operations may be affected (memcpy is also affected,
albeit to a lesser extent), memset is used in other software, ...&lt;/p&gt;
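&lt;p&gt;For the curious, such runtime detection usually boils down to fishing the &quot;CPU part&quot; field out of /proc/cpuinfo (0xc08 is the ARM part number of the Cortex-A8). A minimal sketch, written to operate on the file contents as text so it can be demonstrated with sample input:&lt;/p&gt;

```python
def cpu_part(cpuinfo_text):
    """Extract the 'CPU part' field from /proc/cpuinfo contents."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("CPU part"):
            return int(line.split(":")[1].strip(), 16)
    return None

# Sample /proc/cpuinfo fragment from an ARM Cortex-A8 system
sample = "Processor\t: ARMv7 Processor rev 2 (v7l)\nCPU part\t: 0xc08\n"
if cpu_part(sample) == 0xC08:  # 0xc08 is the Cortex-A8 part number
    print("Cortex-A8 detected")
```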

&lt;p&gt;We could also try the 24bpp color depth for the framebuffer. It provides the same
16777216 colors as 32bpp, but is much less affected as seen in table 1. A practical
problem is that this is quite an unorthodox pixel format, which is poorly supported
by software (even if it works without bugs, it definitely does not enjoy many
optimizations). This implies the use of ShadowFB with a 32bpp shadow framebuffer
backing the real 24bpp framebuffer. But ShadowFB itself solves some problems and
introduces new ones.&lt;/p&gt;

&lt;p&gt;If your monitor supports the 50Hz refresh rate - just go for it! Additionally enabling
the &quot;scaler mode&quot; surely helps (but wastes one scaled layer). The tricky part is
that we want linux distros to remain user friendly and preferably still do automatic
configuration. Automatic configuration means using EDID to check whether the monitor
supports 50Hz. However the monitor manufacturers don&#39;t seem to be very sane and the
EDID data may sometimes look like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;[  1133.553] (WW) NVIDIA(GPU-0): The EDID for Samsung SMBX2231 (DFP-1) contradicts itself: mode
[  1133.553] (WW) NVIDIA(GPU-0):     &amp;quot;1920x1080&amp;quot; is specified in the EDID; however, the EDID&amp;#39;s
[  1133.553] (WW) NVIDIA(GPU-0):     valid VertRefresh range (56.000-75.000 Hz) would exclude
[  1133.553] (WW) NVIDIA(GPU-0):     this mode&amp;#39;s VertRefresh (50.0 Hz); ignoring VertRefresh
[  1133.553] (WW) NVIDIA(GPU-0):     check for mode &amp;quot;1920x1080&amp;quot;.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Movie lovers also seem to have some
&lt;a href=&quot;http://www.codecpage.com/50HzLCD.html&quot;&gt;problems with 56Hz specified as the lowest supported&lt;/a&gt;.
The 56Hz tests in table 1 are actually there to see whether the 56Hz monitor
refresh rate would be any good.&lt;/p&gt;

&lt;p&gt;And as the last resort you can either reduce the screen resolution, or reduce the color
depth to 16bpp. This actually may be the best option, unless you are interested in
viewing high resolution photos with great colors and can&#39;t tolerate any image quality
loss.&lt;/p&gt;

&lt;h2&gt;Final words&lt;/h2&gt;

&lt;p&gt;That&#39;s basically a summary of what has already been known for a while, and what I kept
telling people on the mailing lists and IRC.
Intuitively, everyone probably understands that higher memory clock frequency
must be somewhat better. But is it important enough to care? Isn&#39;t the CPU
clock frequency the only primary factor that determines system performance?
After all, it is the CPU clock frequency that is advertised in the device
specs and is a popular target for overclockers. Hopefully the colorful tables
and charts here are providing a convincing answer. In any case, if you are
interested in FullHD desktop resolution on Allwinner A10, it makes sense to
try your best to stay away from the &quot;red zone&quot; in table 1.&lt;/p&gt;

&lt;p&gt;The performance of software rendering for 2D graphics scales very nicely
with memory speed on ARM processors equipped with a fast NEON
unit (Cortex-A8, Cortex-A9, Cortex-A15). But the cairo-perf-trace benchmarks
are only simulating offscreen rendering, which is just a part of the whole
pipeline. The picture still needs to be delivered to the framebuffer for
the user to see it. And it&#39;s better to be done without screw-ups.&lt;/p&gt;

&lt;p&gt;To be continued about what&#39;s wrong with the ShadowFB layer.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2013/02/01/new-xf86-video-sunxifb-ddx-driver</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2013/02/01/new-xf86-video-sunxifb-ddx-driver.html"/>
   <title>New xf86-video-sunxifb DDX driver for Xorg</title>
   <updated>2013-02-01T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;h2&gt;A short introduction&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Allwinner_A1X&quot;&gt;Allwinner A10/A13 SoC&lt;/a&gt; is very interesting
because it is used in a lot of very affordable electronic devices from China, such as USB
dongles, media boxes, tablets, netbooks and even the &lt;a href=&quot;http://cubieboard.org/&quot;&gt;cubieboard.org development board&lt;/a&gt;.
Because of a very competitive price, these devices make a good alternative to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Raspberry_Pi&quot;&gt;Raspberry Pi&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One rather unique and somewhat attractive feature is that this platform does
not have a corporate backing and does not suffer from &quot;too many cooks&quot;
problem :-) All the hardware adaptation support is provided by the
community at &lt;a href=&quot;http://linux-sunxi.org/&quot;&gt;http://linux-sunxi.org/&lt;/a&gt;, where the people
are currently trying to clean up the kernel and fix numerous bugs.&lt;/p&gt;

&lt;h2&gt;3D graphics performance&lt;/h2&gt;

&lt;p&gt;Allwinner A10 uses a single-core &lt;a href=&quot;http://en.wikipedia.org/wiki/Mali_%28GPU%29&quot;&gt;Mali-400 GPU&lt;/a&gt; running
at 320MHz, which provides OpenGL ES 2.0 acceleration. The OpenGL ES implementation itself relies
on the &lt;a href=&quot;http://forums.arm.com/index.php?/topic/16259-how-can-i-upgrade-mali-device-driver/page__p__39744#entry39744&quot;&gt;proprietary closed source libMali.so library&lt;/a&gt;.
But the integration with the X server is provided by the &lt;a href=&quot;http://malideveloper.arm.com/develop-for-mali/drivers/open-source-mali-gpus-linux-exadri2-and-x11-display-drivers/&quot;&gt;open source reference driver xf86-video-mali&lt;/a&gt;.
Many users might assume that it&#39;s a ready-to-use complete solution and a natural choice for their devices.
However this is not quite true. The performance of the system is also largely dependent on the
optimal integration with the display controller hardware, because Mali itself can only render
3D images to memory buffers. Here is a quote from the readme file included with xf86-video-mali:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;xf86-video-mali&amp;quot; is provided as a basis for creating your own X Display
Driver. It requires a recent version of the xorg-server, as well as a
successfull integration of UMP with your display device driver.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;As such, a more complete implementation of an X11 driver is needed, and my attempt to develop one
(based on xf86-video-fbdev) is available here:
&lt;a href=&quot;https://github.com/ssvb/xf86-video-sunxifb&quot;&gt;xf86-video-sunxifb&lt;/a&gt;. Below is a screenshot
of it running on a &lt;a href=&quot;https://plus.google.com/u/0/113201731981878354205/posts/daJfhBRvWjk&quot;&gt;Mele A2000 TV box&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2013-02-01-mali400-acceleration.png&quot;&gt;&lt;img src=&quot;/images/2013-02-01-mali400-acceleration-lowres.png&quot; alt=&quot;2013-02-01-mali400-acceleration.png&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The glmark2 2012.12 scores with 1280x720-32@60Hz monitor resolution look like this:&lt;/p&gt;

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;tr&gt;&lt;th&gt;X11 DDX driver&lt;th&gt;Fullscreen (1280x720)&lt;th&gt;Window (800x600)&lt;th&gt;Partially obscured window (800x600)&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;xf86-video-mali r3p0&lt;td&gt;38&lt;td&gt;65&lt;td&gt;66&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;xf86-video-sunxifb-0.2.0&lt;td&gt;115&lt;td&gt;165&lt;td&gt;50&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;As expected from the implementation which is aware of the hardware overlays
supported by the display controller, the performance of xf86-video-sunxifb
in fullscreen mode or working with fully visible windows is significantly better
than xf86-video-mali. Though rendering to partially obscured window
currently goes through the fallback path involving many memory copy
operations, and the overhead of these memory copy operations is even
higher than for xf86-video-mali (mostly because of the use of
the shadow framebuffer).&lt;/p&gt;

&lt;h2&gt;2D graphics performance&lt;/h2&gt;

&lt;p&gt;Now this is the most interesting part, because surprisingly 2D tends
to be rather problematic for many drivers. Below is the chart based
on the results from &lt;a href=&quot;https://github.com/ssvb/trimmed-cairo-traces&quot;&gt;cairo-perf-trace running trimmed-cairo-traces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2013-02-01-cairo-perf-chart-sunxifb.png&quot;&gt;&lt;img src=&quot;/images/2013-02-01-cairo-perf-chart-sunxifb-lowres.png&quot; alt=&quot;2013-02-01-cairo-perf-chart-sunxifb.png&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looks like xf86-video-sunxifb is implementing some great performance optimizations?
I wish this was the case, but in fact it is basically just the functionality entirely
provided by the original xf86-video-fbdev code, which was used as
the base for xf86-video-sunxifb. It merely tries not to get in the
way and just lets ARM NEON software rendering code from &lt;a href=&quot;http://www.pixman.org/&quot;&gt;pixman&lt;/a&gt;
run without too much extra overhead.&lt;/p&gt;

&lt;p&gt;So what is wrong with xf86-video-mali? It appears to suffer from the same
problem as many other X11 drivers for ARM hardware. The DRI2 extension
(the thing which is used for the integration of GLES acceleration)
needs some hardware-specific buffer allocation
(&lt;a href=&quot;http://malideveloper.arm.com/develop-for-mali/drivers/open-source-mali-gpus-ump-user-space-drivers-source-code-2/&quot;&gt;UMP&lt;/a&gt;
in the case of xf86-video-mali). And the EXA framework (a convenience
layer for adding 2D acceleration hooks) supports overriding pixmap
buffer allocation as part of its functionality. So the guys apparently
decided that it&#39;s a good idea to override the allocation of absolutely
all pixmaps without exception, not just the ones needed for DRI2. This was a total
2D performance disaster for the &lt;a href=&quot;http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html&quot;&gt;SGX PVR driver&lt;/a&gt;.
And it is also killing 2D performance for xf86-video-mali. Because
the sources of xf86-video-mali are available, it was possible to run one
more somewhat artificial test. With a minor tweak, xf86-video-mali can
be changed to do allocations of pixmaps in cached UMP buffers (let&#39;s for
a moment just ignore the potential cache coherency issues for the buffers
shared with Mali hardware via DRI2 and only look at the performance). The
benchmark results for this modified xf86-video-mali driver are shown as
green bars on the chart above. In some cases (t-firefox-fishtank), the
performance for cached UMP allocations managed to catch up with
xf86-video-sunxifb (and xf86-video-fbdev). But many other traces are still
slow, which suggests that uncached memory allocation is not the only
reason for poor performance. The UMP itself also requires expensive ioctls
and has very heavy overhead. So sorry, the following suggestion
from xf86-video-mali readme file is simply not going to fly:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;The provided &amp;quot;xf86-video-mali&amp;quot; driver contains an EXA module which has been
integrated with the UMP system. Your 2D driver may therefore require an
integration with UMP as well. The suggestion is to pass the secure ID down to
the kernel device driver for your hardware, but it is also possible to get the
CPU-mapped address for the memory by calling ump_mapped_pointer_get.

Please refer to UMP documentation for more information regarding this.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;By the way, if anyone doubts whether the colored bars
in the chart really correlate with reality, I suggest checking
my YouTube video about &lt;a href=&quot;http://www.youtube.com/watch?v=Vzmckw3fAQo&quot;&gt;Linux on ARM Chromebook: xf86-video-armsoc vs. xf86-video-fbdev&lt;/a&gt;.
The xf86-video-armsoc driver has all the same 2D performance problems :-(&lt;/p&gt;

&lt;p&gt;It really puzzles me why nearly all X11 drivers for ARM hardware
make the same mistake. An &lt;a href=&quot;http://maemo.org/packages/view/xserver-xorg-video-fbdev/&quot;&gt;old X11 DDX driver from the Nokia N900&lt;/a&gt;
could at least allocate DRI2 buffers and normal pixmaps separately,
even if it was hardly the best or cleanest implementation.&lt;/p&gt;

&lt;h2&gt;The future of xf86-video-sunxifb&lt;/h2&gt;

&lt;p&gt;XV and XRANDR still need to be implemented. And of course there is still
a lot of room for real 2D performance improvements :-)&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Edited within a few hours after posting to fix some obvious typos, broken
links and poor wording&lt;/small&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2012/05/04/xorg-drivers-and-software-rendering</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2012/05/04/xorg-drivers-and-software-rendering.html"/>
   <title>Xorg drivers, software rendering for 2D graphics and cairo 1.12 performance</title>
   <updated>2012-05-04T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Recently the &lt;a href=&quot;http://en.wikipedia.org/wiki/Cairo_%28graphics%29&quot;&gt;cairo graphics library&lt;/a&gt; got an &lt;a href=&quot;http://cairographics.org/news/cairo-1.12.0/&quot;&gt;update to version 1.12&lt;/a&gt;.
It brings some nice performance improvements, as demonstrated in
&lt;a href=&quot;http://ickle.wordpress.com/2012/03/28/cairo-1-12-let-the-releases-roll/&quot;&gt;three&lt;/a&gt;
&lt;a href=&quot;http://ickle.wordpress.com/2012/03/30/cairo-performance-on-ion/&quot;&gt;blog&lt;/a&gt;
&lt;a href=&quot;http://ickle.wordpress.com/2012/04/02/cairo-performance-on-radeon/&quot;&gt;posts&lt;/a&gt; from Chris Wilson.
These blog posts additionally showcase &lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=OTUyOQ&quot;&gt;Intel SNA&lt;/a&gt;, which
happens to be quite an impressive &lt;a href=&quot;http://www.x.org/wiki/Development/Documentation/Glossary#DDX&quot;&gt;DDX&lt;/a&gt; driver. It
provides 2D graphics hardware acceleration for X applications via the
&lt;a href=&quot;http://en.wikipedia.org/wiki/X_Rendering_Extension&quot;&gt;XRender extension&lt;/a&gt; and is
clearly doing this faster than software rendering.&lt;/p&gt;

&lt;p&gt;It may really surprise some people, but graphics drivers generally do not
handle 2D acceleration particularly well on Linux desktop systems. This has
been known at least since 2003, when Carsten Haitzler (aka Rasterman) started
&lt;a href=&quot;http://comments.gmane.org/gmane.comp.xfree86.devel/2786&quot;&gt;a thread about XRender performance&lt;/a&gt;
and posted the &lt;a href=&quot;http://www.rasterman.com/files/render_bench.tar.gz&quot;&gt;render_bench&lt;/a&gt;
test program. Also, &lt;a href=&quot;http://blogs.gnome.org/otte/2010/06/26/fun-with-benchmarks/&quot;&gt;hardware acceleration did not have a clear advantage over software rendering&lt;/a&gt;
two years ago for many cairo traces (which are &lt;a href=&quot;http://cworth.org/intel/performance_measurement/&quot;&gt;much more relevant for 2D benchmarking&lt;/a&gt;
than render_bench). There are some old slides from 2010 presented by Intel
folks, &lt;a href=&quot;http://www.lca2010.org.nz/slides/50153.pdf&quot;&gt;&quot;Making the GPU do its Job&quot;&lt;/a&gt;, explaining
the challenges they were facing at that time. But now this long quest seems to be over, and we finally have really
good 2D drivers, at least for Intel hardware.&lt;/p&gt;

&lt;p&gt;But enough with the historical overview. The purpose of this blog post
is to look into the cairo &quot;image backend&quot; in a bit more detail and try to explain
why it has managed to stay competitive for such a long time (and is
still able to wipe the floor with some poorly implemented GPU accelerated
drivers even now). The cairo image backend uses the &lt;a href=&quot;http://pixman.org/&quot;&gt;pixman library&lt;/a&gt;
as a software rasteriser. To speed up graphics operations, pixman uses SIMD
optimizations. The most relevant are SSE2 on x86 and NEON on ARM. There are also
optimizations for MIPS32 DSP ASE, Loongson SIMD and ARM IWMMXT being worked on. The
latest pixman 0.25.2 development snapshot makes it possible to
&lt;a href=&quot;http://cgit.freedesktop.org/pixman/commit/?id=fcea053561893d116a79f41a113993f1f61b58cf&quot;&gt;selectively disable SIMD optimizations&lt;/a&gt;
without recompiling the library, which is convenient for benchmarking or testing.
I&#39;m going to run the &lt;a href=&quot;http://cworth.org/intel/performance_measurement/&quot;&gt;cairo-perf-trace benchmark&lt;/a&gt;
on a few devices I have at home, testing the image backend both with and without SIMD optimizations
enabled. This makes it possible to see how much performance is gained by using &quot;SIMD acceleration&quot;
in pixman, and to benchmark it against &quot;GPU acceleration&quot; in the Xorg drivers.&lt;/p&gt;

&lt;h2&gt;Test setup&lt;/h2&gt;

&lt;p&gt;A 32bpp desktop color depth is used in all tests. Cairo 1.12.0 and pixman 0.25.2 are compiled with gcc 4.7.0 with &quot;-O2&quot;
optimizations and &quot;-march/-mcpu/-mtune&quot; options set to match the target processor. The standard
set of &lt;a href=&quot;http://cgit.freedesktop.org/cairo-traces/tree/benchmark&quot;&gt;cairo benchmark traces&lt;/a&gt; is used,
but the &quot;ocitysmap&quot; trace is removed (it is a memory hog and runs out of memory on 512MB systems without swap).
Detailed instructions are available in the last section of this blog post.&lt;/p&gt;

&lt;h2&gt;ARM Cortex-A9 1.2GHz (Origenboard)&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a9.png&quot;&gt;&lt;img src=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a9-lowres.png&quot; alt=&quot;2012-05-04-cairo-perf-chart-cortex-a9.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the chart above, everything is compared to the cairo image backend with SIMD optimizations disabled in pixman (the PIXMAN_DISABLE environment variable is set to &quot;arm-simd arm-iwmmxt arm-neon&quot;). The
green bars on the left show the performance improvement gained by enabling ARM NEON in pixman when running the tests with the cairo image backend. The
blue bars on the right show the performance of the xlib cairo backend when the rendering is done on the X server side by the xf86-video-fbdev driver
(which in turn uses pixman with NEON optimizations enabled).&lt;/p&gt;

&lt;p&gt;Looking at these colored bars, we can see that the xlib backend generally performs worse than the image backend. This is understandable,
because there is some inter-process communication overhead between the test application and the X server, X11 protocol marshalling, etc.
But a few tests (firefox-asteroids, gnome-terminal-vim, gvim, xfce4-terminal-a1) showed an improvement. The explanation is
that this system has a dual-core processor. The X server running on one CPU core acts as a 2D accelerator, leaving the
other CPU core free for the test application. If we look at the CPU usage in htop while running the tests, we see that
the CPU core running the Xorg server is ~100% loaded, while the other core running the cairo-perf-trace process is typically only ~15-30% loaded.&lt;/p&gt;

&lt;p&gt;So in the end, the xlib backend is not so bad on multi-core systems. We just need to ensure that we are
not hit by any unnecessary inter-process communication overhead. Are we actually doing well here? Not even close!
Just look at &lt;a href=&quot;http://cgit.freedesktop.org/xorg/xserver/tree/fb/fbpict.c?id=xorg-server-1.12.1#n38&quot;&gt;this part of the code&lt;/a&gt;.
There we see how the X server wraps its internal &lt;a href=&quot;http://cgit.freedesktop.org/xorg/xserver/tree/render/picturestr.h?id=xorg-server-1.12.1#n123&quot;&gt;Picture&lt;/a&gt;
structures into temporary &lt;a href=&quot;http://cgit.freedesktop.org/pixman/tree/pixman/pixman-private.h?id=pixman-0.25.2#n65&quot;&gt;pixman_image_t&lt;/a&gt; structures,
involving lots of overhead, validity checks and malloc/free activity. No surprise that we take a serious performance
hit, with the firefox-canvas trace being the worst.&lt;/p&gt;

&lt;p&gt;The colored bars on the performance chart above surely look nice, but the system also needs to
be snappy and responsive in normal use. Believe it or not, it is quite OK. For example, I
can use text editors in the terminal and move windows around without perceivable lag. But what
about the ARM system with similar specs, also used with the xf86-video-fbdev driver
and reviewed in a &lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=MTA5MDg&quot;&gt;Phoronix article&lt;/a&gt;?
I don&#39;t know, but it looks like somebody just screwed something up. When we move
windows around, it is just a memcpy/memmove-like operation. Origenboard can reach
~700-750 MB/s for memcpy, and OMAP4460 should be quite similar.
Even at FullHD resolution and 32bpp desktop color depth (16bpp is more common on ARM systems),
we are moving around at most 1920 * 1080 * 4 = ~8.3 MB of pixel data per frame. Dividing memcpy speed
by data size, we get ~80-90 FPS. Even if we assume that the shadow framebuffer gets
in the way and further halve the FPS number, that is still more than enough to avoid
any problems when moving or scrolling windows. Sure, this fully occupies
one CPU core with something as dumb as a memory copy, but the other CPU core is free
and the whole system is not affected that badly.&lt;/p&gt;
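&lt;p&gt;The window-moving arithmetic above can be sketched in a few lines (a rough estimate only; the ~700-750 MB/s memcpy figure is the one quoted in this post):&lt;/p&gt;

```python
# Rough FPS estimate for moving a FullHD 32bpp window via a plain memory copy.
frame_bytes = 1920 * 1080 * 4            # = 8294400 bytes, ~8.3 MB per frame
for memcpy_mb_s in (700, 750):           # measured memcpy bandwidth range
    fps = memcpy_mb_s * 1_000_000 / frame_bytes
    print(f"{memcpy_mb_s} MB/s -> {fps:.0f} FPS")   # prints roughly 84 and 90
```

Even after halving these numbers to account for the shadow framebuffer, the result stays above 40 FPS.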

&lt;p&gt;Finally what about GPU acceleration? This board uses Exynos4210 SoC, which has Mali-400 MP4 GPU.
Right now I&#39;m waiting for &lt;a href=&quot;http://limadriver.org/&quot;&gt;limadriver&lt;/a&gt; or
&lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=MTA3MDE&quot;&gt;FIMG2D&lt;/a&gt;
based DDX. There are proprietary drivers for Mali GPU, but I don&#39;t want to taint this system
with proprietary blobs yet, and also don&#39;t want to taint myself by agreeing to any licenses
accompanying them.&lt;/p&gt;

&lt;h2&gt;ARM Cortex-A8 1GHz, GPU SGX530 200MHz (IGEPv2 board)&lt;/h2&gt;

&lt;p&gt;The same tests as for Cortex-A9, but also adding the results for 2D graphics hardware acceleration provided
by the latest &lt;a href=&quot;http://tigraphics.blogspot.com/2012/04/1q-sgx-driver-update-package-available.html&quot;&gt;2012 1Q SGX driver release&lt;/a&gt;.
First of all, not all tests are even able to run with the SGX pvr Xorg driver. It looks like it has
a limit of only around ~60MB for the total pixmap data allocated on the X server side, and this
prevents many cairo traces from running:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;X Error of failed request:  BadAlloc (insufficient resources for operation)
  Major opcode of failed request:  53 (X_CreatePixmap)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I tried to &lt;a href=&quot;http://www.beagleboard.org/irclogs/index.php?date=2012-04-28#T22:28:52&quot;&gt;increase this limit&lt;/a&gt;
by using an undocumented &quot;PixmapPoolSizeMB&quot; option in xorg.conf, but that did not help much and caused
some additional stability issues. In the end I decided not to touch this stuff and ran it as-is in
the default configuration (only upgrading pixman from the &lt;a href=&quot;http://lists.x.org/archives/xorg-announce/2010-August/001388.html&quot;&gt;ancient version 0.18.4&lt;/a&gt;
to &lt;a href=&quot;http://lists.x.org/archives/xorg-announce/2012-March/001872.html&quot;&gt;0.25.2&lt;/a&gt;).
Hence the pvr driver only has results for 8 out of 21 tests on the chart below, due
to the restricted pixmap pool size.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a8.png&quot;&gt;&lt;img src=&quot;/images/2012-05-04-cairo-perf-chart-cortex-a8-lowres.png&quot; alt=&quot;2012-05-04-cairo-perf-chart-cortex-a8.png&quot; /&gt;&lt;/a&gt;
Ouch! The performance results do not look good for the pvr driver. It never got
anywhere close to the fbdev driver, let alone to the client side rendering via the cairo
image backend. And this time the fbdev driver was always slower than the image backend, which is not
surprising because this device only has a single ARM Cortex-A8 core.&lt;/p&gt;

&lt;p&gt;But let&#39;s forget about the traces of real applications for a moment. Is the pvr driver
even able to accelerate anything? Now we can take a look at synthetic benchmarks like
render_bench (with a &lt;a href=&quot;https://github.com/ssvb/render_bench/commit/a72b75c23bf56053b901380a6a067cf1324d0011&quot;&gt;bugfix&lt;/a&gt; applied),
which stresses simple scaled and non-scaled compositing using the &lt;a href=&quot;http://en.wikipedia.org/wiki/Alpha_compositing&quot;&gt;Over operator&lt;/a&gt;.
In other words, that is one of the most basic operations in 2D graphics (commonly used for translucency effects),
which any driver is expected to accelerate properly. Test results for the fbdev driver and for the pvr
driver (with and without the &quot;NoAccel&quot; option set in xorg.conf) are listed in the table below (&lt;a href=&quot;https://github.com/ssvb/ssvb.github.com/tree/master/files/2012-05-04/render-bench-cortex-a8&quot;&gt;render_bench logs are here&lt;/a&gt;).
Each test was also repeated with and without NEON SIMD optimizations enabled in pixman. As an interesting bonus, there is a comparison of
imlib2 vs. the pixman C implementation (CFLAGS=&quot;-O2 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon&quot; for both pixman and imlib2):&lt;/p&gt;

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;tr&gt;&lt;th&gt;&lt;th colspan=&#39;3&#39;&gt;pixman 0.25.2 with NEON&lt;th colspan=&#39;3&#39;&gt;pixman 0.25.2 without NEON&lt;th colspan=&#39;2&#39;&gt;imlib2 1.4.4&lt;tr&gt;&lt;th&gt;&lt;th&gt;fbdev&lt;th&gt;pvr&lt;br&gt;(NoAccel)&lt;th&gt;pvr&lt;th&gt;fbdev&lt;th&gt;pvr&lt;br&gt;(NoAccel)&lt;th&gt;pvr&lt;th&gt;built with&lt;br&gt;gcc 4.5.3&lt;th&gt;built with&lt;br&gt;gcc 4.7.0&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing non-scaled&lt;br&gt;Over blends&lt;td bgcolor=lightgreen&gt;0.56 sec&lt;td bgcolor=#F0F0F0&gt;0.76 sec&lt;td bgcolor=#6666FF&gt;1.33 sec&lt;td bgcolor=#F0F0F0&gt;1.58 sec&lt;td bgcolor=#FF3333&gt;3.86 sec&lt;td bgcolor=#6666FF&gt;1.33 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing&lt;br&gt;non-scaled Over blends&lt;td bgcolor=lightgreen&gt;0.44 sec&lt;td bgcolor=#F0F0F0&gt;0.44 sec&lt;td bgcolor=#6666FF&gt;1.23 sec&lt;td bgcolor=#F0F0F0&gt;1.40 sec&lt;td bgcolor=#FF3333&gt;1.41 sec&lt;td bgcolor=#6666FF&gt;1.23 sec&lt;td bgcolor=#F0F0F0&gt;1.16 sec&lt;td bgcolor=#F0F0F0&gt;1.21 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing 1/2 scaled&lt;br&gt;Over blends&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;td bgcolor=lightgreen&gt;0.40 sec&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;td bgcolor=#F0F0F0&gt;0.55 sec&lt;td bgcolor=#F0F0F0&gt;1.02 sec&lt;td bgcolor=#FF3333&gt;1.07 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing&lt;br&gt;1/2 scaled Over blends&lt;td bgcolor=lightgreen&gt;0.27 sec&lt;td bgcolor=#F0F0F0&gt;0.27 sec&lt;td bgcolor=#F0F0F0&gt;0.32 sec&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;td bgcolor=#F0F0F0&gt;0.43 sec&lt;td bgcolor=#FF3333&gt;0.48 sec&lt;td bgcolor=#F0F0F0&gt;0.40 sec&lt;td bgcolor=#F0F0F0&gt;0.42 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing 2* smooth&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;3.65 sec&lt;td bgcolor=#F0F0F0&gt;8.74 sec&lt;td bgcolor=#F0F0F0&gt;8.76 sec&lt;td bgcolor=#F0F0F0&gt;25.45 sec&lt;td bgcolor=#F0F0F0&gt;50.63 sec&lt;td bgcolor=#FF3333&gt;50.69 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing 2*&lt;br&gt;smooth scaled Over blends&lt;td bgcolor=lightgreen&gt;3.44 sec&lt;td bgcolor=#F0F0F0&gt;3.45 sec&lt;td bgcolor=#F0F0F0&gt;3.62 sec&lt;td bgcolor=#F0F0F0&gt;25.02 sec&lt;td bgcolor=#F0F0F0&gt;25.04 sec&lt;td bgcolor=#FF3333&gt;25.25 sec&lt;td bgcolor=#F0F0F0&gt;14.21 sec&lt;td bgcolor=#F0F0F0&gt;12.92 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing 2* nearest&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;2.26 sec&lt;td bgcolor=#F0F0F0&gt;3.68 sec&lt;td bgcolor=#F0F0F0&gt;3.72 sec&lt;td bgcolor=#F0F0F0&gt;4.27 sec&lt;td bgcolor=#F0F0F0&gt;14.00 sec&lt;td bgcolor=#FF3333&gt;14.04 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing 2*&lt;br&gt;nearest scaled Over blends&lt;td bgcolor=lightgreen&gt;2.01 sec&lt;td bgcolor=#F0F0F0&gt;2.04 sec&lt;td bgcolor=#F0F0F0&gt;2.24 sec&lt;td bgcolor=#F0F0F0&gt;4.01 sec&lt;td bgcolor=#F0F0F0&gt;4.02 sec&lt;td bgcolor=#F0F0F0&gt;4.15 sec&lt;td bgcolor=#F0F0F0&gt;5.26 sec&lt;td bgcolor=#FF3333&gt;5.65 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing general nearest&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;5.57 sec&lt;td bgcolor=#F0F0F0&gt;7.68 sec&lt;td bgcolor=#F0F0F0&gt;7.72 sec&lt;td bgcolor=#F0F0F0&gt;6.18 sec&lt;td bgcolor=#F0F0F0&gt;19.92 sec&lt;td bgcolor=#FF3333&gt;19.96 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing general&lt;br&gt;nearest scaled Over blends&lt;td bgcolor=lightgreen&gt;5.23 sec&lt;td bgcolor=#F0F0F0&gt;5.37 sec&lt;td bgcolor=#F0F0F0&gt;5.60 sec&lt;td bgcolor=#F0F0F0&gt;5.96 sec&lt;td bgcolor=#F0F0F0&gt;5.97 sec&lt;td bgcolor=#F0F0F0&gt;6.04 sec&lt;td bgcolor=#F0F0F0&gt;8.90 sec&lt;td bgcolor=#FF3333&gt;9.59 sec&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender doing general smooth&lt;br&gt;scaled Over blends&lt;td bgcolor=lightgreen&gt;8.66 sec&lt;td bgcolor=#F0F0F0&gt;18.40 sec&lt;td bgcolor=#F0F0F0&gt;18.42 sec&lt;td bgcolor=#F0F0F0&gt;55.98 sec&lt;td bgcolor=#F0F0F0&gt;111.73 sec&lt;td bgcolor=#FF3333&gt;111.78 sec&lt;td&gt;-&lt;td&gt;-&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Xrender (offscreen) doing general&lt;br&gt;smooth scaled Over blends&lt;td bgcolor=#F0F0F0&gt;8.44 sec&lt;td bgcolor=lightgreen&gt;8.44 sec&lt;td bgcolor=#F0F0F0&gt;8.58 sec&lt;td bgcolor=#F0F0F0&gt;55.18 sec&lt;td bgcolor=#F0F0F0&gt;55.31 sec&lt;td bgcolor=#F0F0F0&gt;55.50 sec&lt;td bgcolor=#FF3333&gt;57.04 sec&lt;td bgcolor=#F0F0F0&gt;43.71 sec&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;The best results in the table above are highlighted in green, the worst in red.
Only the non-scaled tests showed signs of hardware acceleration
(low CPU load, same performance regardless of whether NEON is enabled in pixman);
they are highlighted in blue. All the &quot;non-blue&quot; pvr driver tests fall back
to pixman for software rendering. Other observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fbdev is the fastest driver, showing performance equal to or significantly better than the pvr driver&#39;s&lt;/li&gt;
&lt;li&gt;disabling acceleration in the pvr driver is not enough to get really well performing software rendering (and this may also be true for many other Xorg drivers)&lt;/li&gt;
&lt;li&gt;non-offscreen rendering is particularly slow for the pvr driver, especially when NEON is disabled, which suggests that its fallbacks to pixman software rendering may be working with non-cached memory buffers in this case&lt;/li&gt;
&lt;li&gt;pixman without NEON and imlib2 have similar performance (&quot;2* smooth scaling&quot; stands out, but it probably has its own special optimized path in imlib2); NEON is significantly faster&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Now let&#39;s have a closer look at the non-scaled test and do some &lt;a href=&quot;http://ssvb.github.com/2011/08/23/yet-another-oprofile-tutorial.html&quot;&gt;profiling&lt;/a&gt;
for it. In the original render_bench test, a 100x100 image is blended over a 320x320 window. This means
the size of the working set is just ~450KB, which is a bit too small by today&#39;s standards.
ARM Cortex-A8 has 256KiB of L2 cache, and the L2 cache is apparently providing a performance boost
for the fbdev driver here (0.56 sec vs. 1.33 sec, more than twice as fast as the GPU). In order
to make the test more fair and make the CPU cache less useful, let&#39;s increase the window size to
1000x1000, increase the number of repetitions, and run only the &quot;Xrender doing non-scaled Over
blends&quot; test, first for the fbdev driver and then for the pvr:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;=== fbdev driver (Time: 32.588 sec.) ===&lt;/b&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;samples|      %|
------------------
   148407 94.2817 Xorg
              TIMER:0|
      samples|      %|
    ------------------
       112860 76.0476 libpixman-1.so.0.25.2
        13326  8.9794 libshadow.so
        12679  8.5434 Xorg
         6072  4.0915 libc-2.13.so
         1976  1.3315 libfb.so
          787  0.5303 vmlinux
          369  0.2486 [vectors] (tgid:1719 range:0xffff0000-0xffff1000)
          217  0.1462 ld-2.13.so
           87  0.0586 fbdev_drv.so
           13  0.0088 libglx.so
           12  0.0081 libXfont.so.1.4.1
     4044  2.5691 vmlinux&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;b&gt;=== pvr driver (Time: 41.911 sec.) ===&lt;/b&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;samples|      %|
------------------
   137617 76.5964 vmlinux
    35739 19.8920 Xorg
              TIMER:0|
      samples|      %|
    ------------------
         7455 20.8596 libsrv_um.so.1.7.783851
         6776 18.9597 Xorg
         4889 13.6797 vmlinux
         4699 13.1481 pvrsrvkm
         3474  9.7205 libc-2.13.so
         2857  7.9941 libpixman-1.so.0.25.2
         2224  6.2229 libexa.so
         1554  4.3482 pvr_drv.so
          743  2.0790 drm
          528  1.4774 libfb.so
          334  0.9346 libdrm.so.2.4.0
          126  0.3526 [vectors] (tgid:1690 range:0xffff0000-0xffff1000)
           38  0.1063 libpvr2d.so.1.7.783851
           14  0.0392 libglx.so&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
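&lt;p&gt;As a quick aside, the working-set arithmetic behind this &quot;more fair&quot; setup can be sketched as follows (a rough estimate: it counts only the source and destination pixel buffers at 4 bytes per pixel and ignores any X server overhead):&lt;/p&gt;

```python
# Working set of the render_bench non-scaled Over test, in KB.
def working_set_kb(src_side, dst_side):
    # Over reads the source buffer and reads+writes the destination buffer.
    return (src_side * src_side + dst_side * dst_side) * 4 / 1024

small = working_set_kb(100, 320)    # ~440 KB: mostly covered by the 256 KiB L2
big = working_set_kb(100, 1000)     # ~3945 KB: far beyond any Cortex-A8 cache
```

The small working set largely fits in cache, which is exactly why the original test flattered the CPU path.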


&lt;p&gt;Based on the profiling results above we see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now that the CPU cache is not helping much when working with large buffers,
the performance difference between the CPU and the GPU has shrunk significantly.
The CPU is still somewhat faster.&lt;/li&gt;
&lt;li&gt;There is a &quot;shadow framebuffer&quot; impacting software rendering performance when
drawing on screen, but I&#39;ll write more about it next time.&lt;/li&gt;
&lt;li&gt;The average CPU load is only ~20% when GPU acceleration is used, and the total
amount of CPU time spent in the Xorg process to complete the test
is ~4x lower (148407 oprofile samples vs. 35739) with GPU acceleration.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So we can clearly say that hardware acceleration is indeed used by the
pvr driver. It just needs to be improved a lot before it can
provide any practical benefit and successfully pass the trial by
cairo traces.&lt;/p&gt;

&lt;p&gt;At the risk of boring the readers even more, I&#39;ll provide some more data
regarding how CPU caches affect performance.
The pixman library includes a simple, crude test program
called &lt;b&gt;lowlevel-blt-bench&lt;/b&gt; in the &quot;test&quot; directory. It can
roughly estimate the performance of various 2D graphics
operations depending on the size of the working set (L1 - data
fits the L1 cache, L2 - data fits the L2 cache, M - data does not fit any cache).
I already mentioned it in &lt;a href=&quot;http://ssvb.github.com/2011/09/13/origenboard-memory-performance.html&quot;&gt;my older blog post&lt;/a&gt;,
but it probably does no harm to repeat a bit. For
this particular IGEPv2 board (Cortex-A8 processor running at 1GHz), I can
measure the following performance numbers (in MPix/s) with lowlevel-blt-bench:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;add_8888_8888 =  L1: 487.07  L2: 441.24  M: 76.53
    over_8888_8888 =  L1: 342.18  L2: 294.20  M: 75.50&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Both the &quot;Add&quot; and &quot;Over&quot; operators have exactly the same memory access pattern
per pixel: read the source pixel (4 bytes), read the destination pixel
(4 bytes), do some calculations and write the result back to the
destination (4 bytes). Processing one pixel thus involves reading 8 bytes
and writing 4 bytes, or 12 bytes in total. The expected memory performance
is a bit difficult to predict, because the bandwidth for memory reads and
writes is not equal (memory writes are faster). This device can do ~500-550 MB/s
memcpy (1000-1100 MB/s of total read+write bandwidth) and ~1500-1550 MB/s
memset. The &quot;Add&quot; and &quot;Over&quot; operators stress memory reads a bit more than writes,
so the total cumulative achievable memory bandwidth is slightly worse than
that of memcpy: ~76 MPix/s * 12 bytes ~= ~900 MB/s. But what matters most,
this synthetic benchmark also shows that the CPU could easily crunch at
least 4x more pixels if the memory subsystem could feed it the
needed data in time! When the data is not available
in the CPU L1/L2 caches, the CPU works at just 1/4 of its capability
and idles the rest of the time. I wish ARM processors supported SMT
(or hyper-threading, as Intel calls it). In that case the other
hardware thread would be able to do a lot of work in parallel. Did I say
something about a dedicated CPU core being able to act as a 2D accelerator
in the previous Cortex-A9 section? Forget that. Even just an extra hardware
thread might be enough (if we are doing simple non-scaled 2D stuff like
drawing rectangular windows, using alpha blending for translucency effects and
moving them around).&lt;/p&gt;
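&lt;p&gt;A minimal sketch of this bandwidth arithmetic, using the over_8888_8888 numbers from the lowlevel-blt-bench output above:&lt;/p&gt;

```python
# Each Add/Over pixel touches 12 bytes: read src (4) + read dst (4) + write dst (4).
bytes_per_pixel = 4 + 4 + 4
m_rate, l1_rate = 75.50, 342.18        # over_8888_8888 "M" and "L1" figures
bandwidth_mb_s = m_rate * bytes_per_pixel   # ~906 MB/s cumulative read+write
headroom = l1_rate / m_rate                 # ~4.5x: what the CPU could do if
                                            # memory were not the bottleneck
```

The ~4.5x headroom is exactly the "at least 4x more pixels" figure quoted above.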

&lt;p&gt;As it turns out, the CPU is much faster than memory for simple non-scaled 2D
graphics (this includes YUV-&gt;RGB conversion, alpha blending, simple copy,
fill, ...). Caches help a lot, but they are relatively small
and work best when memory accesses have good locality.
The cairo library is an immediate mode renderer, which is easy to use, but
also gives users the freedom to shoot themselves in the foot. For
example, if the user wants to composite many translucent screen-sized
layers (bigger than the L2 cache) on top of each other, then they will be
rendered exactly that way, going through the slow memory interface for each
of these layers over and over again. An obvious optimization is to split
the picture into a number of tiles, each small enough to fit the L2 or even
L1 cache, and then blend all the layers within each tile.
This is effective, but requires some effort from the user.&lt;/p&gt;
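&lt;p&gt;To illustrate the tiling idea, here is a small, self-contained toy sketch (not cairo or pixman code: plain Python lists, with a saturating add standing in for a real Over blend). Both traversal orders produce the same image, but the tiled one applies every layer to a destination tile while that tile is still cache-hot:&lt;/p&gt;

```python
# Blend several "layers" into a destination, naively (full pass per layer)
# vs. tiled (all layers applied to one small tile before moving on).
W, H, TILE = 64, 64, 16

def blend_naive(dst, layers):
    for layer in layers:                      # one full pass over dst per layer
        for i in range(W * H):
            dst[i] = min(255, dst[i] + layer[i])

def blend_tiled(dst, layers):
    for ty in range(0, H, TILE):              # visit each tile once...
        for tx in range(0, W, TILE):
            for layer in layers:              # ...and apply every layer to it
                for y in range(ty, ty + TILE):
                    row = y * W
                    for x in range(tx, tx + TILE):
                        i = row + x
                        dst[i] = min(255, dst[i] + layer[i])

layers = [[(i * 7 + k) % 256 for i in range(W * H)] for k in range(4)]
a = [0] * (W * H)
b = [0] * (W * H)
blend_naive(a, layers)
blend_tiled(b, layers)
assert a == b   # same image, but the tiled walk has far better cache locality
```

In a real renderer the per-pixel work would be SIMD code and the tile size would be chosen to fit the L1 or L2 cache, but the traversal-order idea is the same.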

&lt;p&gt;What is the solution? A modern approach is to simply take that freedom away
from the users (so that they don&#39;t hurt themselves) and enforce a certain
performance-friendly rendering model. Some people think that a
&lt;a href=&quot;http://qt.nokia.com/learning/online/talks/developerdays2010/tech-talks/scene-graph-a-different-approach-to-graphics-in-qt/&quot;&gt;scene graph&lt;/a&gt;
is the silver bullet.&lt;/p&gt;

&lt;p&gt;But I have strayed from the original topic already. The pvr driver is what we have for a 2D
hardware accelerated Linux desktop on OMAP3 devices, but it is more of a technical
demo and hardly suitable for any practical use. On the positive side, work is ongoing
and &lt;a href=&quot;https://github.com/robclark/xf86-video-omap&quot;&gt;xf86-video-omap&lt;/a&gt; may eventually
become a better 2D driver for this hardware. OMAP4470 is even more promising, as it is going
to have &lt;a href=&quot;http://pandaboard.org/pbirclogs/index.php?date=2012-04-14#T13:17:12&quot;&gt;real 2D blitter hardware&lt;/a&gt;
with open source drivers for it.&lt;/p&gt;

&lt;p&gt;The current 2D driver may be disappointing, but we should not forget that SGX530 is
primarily a 3D accelerator with mature and well optimized drivers for OpenGL ES 2.0
(the demos and examples run fine). It is also worth mentioning that cairo has an OpenGL
ES 2.0 backend, but it cannot be used on SGX530 yet because of
&lt;a href=&quot;http://comments.gmane.org/gmane.comp.lib.cairo/22605&quot;&gt;missing GL_OES_texture_npot extension&lt;/a&gt;
support.&lt;/p&gt;

&lt;h2&gt;Intel Atom N450 1.67GHz (Samsung N220 netbook)&lt;/h2&gt;

&lt;p&gt;And for the sake of completeness, here are the results from the Intel Atom. They merely confirm
Chris Wilson&#39;s results, and additionally show the effect of SSE2 optimizations
on software rendering.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/images/2012-05-04-cairo-perf-chart-atom.png&quot;&gt;&lt;img src=&quot;/images/2012-05-04-cairo-perf-chart-atom-lowres.png&quot; alt=&quot;2012-05-04-cairo-perf-chart-atom.png&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also run lowlevel-blt-bench from pixman for the same &quot;Add&quot; and &quot;Over&quot; operations:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;add_8888_8888 =  L1: 607.08  L2: 375.34  M:259.53
    over_8888_x888 =  L1: 123.73  L2: 117.10  M:113.56&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now the memory bandwidth is only fully utilized for the &quot;Add&quot; operator, but
not for &quot;Over&quot;. Using a modified variant of render_bench which calculates and
reports MPix/s statistics, we can put the MPix/s rates for different operations
into the following table:&lt;/p&gt;

&lt;table border=1 style=&#39;border-collapse: collapse; empty-cells: show; font-family: arial; font-size: small; white-space: nowrap; background: #F0F0F0;&#39;&gt;
&lt;tr&gt;&lt;th&gt;Compositing operation&lt;th&gt;performance on Intel Atom N450
&lt;tr&gt;&lt;td&gt;pixman non-scaled Add&lt;td bgcolor=&#39;lightgreen&#39;&gt;~260 MPix/s
&lt;tr&gt;&lt;td&gt;pixman non-scaled Over&lt;td bgcolor=&#39;yellow&#39;&gt;~110 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated non-scaled Add&lt;td bgcolor=&#39;lightgreen&#39;&gt;~270 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated non-scaled Over&lt;td bgcolor=&#39;lightgreen&#39;&gt;~270 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated nearest scaled Over&lt;td bgcolor=&#39;lightgreen&#39;&gt;~260 MPix/s
&lt;tr&gt;&lt;td&gt;GPU accelerated bilinear scaled Over&lt;td bgcolor=&#39;lightgreen&#39;&gt;~260 MPix/s
&lt;/table&gt;


&lt;p&gt;While all the operations performed on the GPU and also the &lt;a href=&quot;http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.22.0#n1321&quot;&gt;software rendered
Add&lt;/a&gt; run at approximately
the same speed, the &lt;a href=&quot;http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.22.0#n630&quot;&gt;software rendered Over&lt;/a&gt;
falls behind. This is integrated graphics: the CPU and GPU use the same memory,
so it is not surprising that they both hit the same memory performance limit. The GPU&#39;s strength
is in handling operations which need heavier computations. And it is able to fully utilize
memory bandwidth regardless of whether scaling is used. This is how a really good hardware-accelerated
driver should behave.&lt;/p&gt;

&lt;h2&gt;Reproducing these test results and charts&lt;/h2&gt;

&lt;p&gt;People are generally lazy (me included), so precise step-by-step instructions may save
time and/or encourage somebody to actually try reproducing the tests on their
system. First we can try:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;wget http://cairographics.org/releases/cairo-1.12.0.tar.gz
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;tar -xzf cairo-1.12.0.tar.gz
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;cairo-1.12.0
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;./configure
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;make
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;perf
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;make cairo-perf-chart&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This gives us the &quot;cairo-perf-chart&quot; tool, which can be used to generate nice PNG charts from cairo-perf-trace logs.
The cairo-perf-trace logs used for the charts in this blog post are &lt;a href=&quot;https://github.com/ssvb/ssvb.github.com/tree/master/files/2012-05-04/cairo-perf-trace&quot;&gt;available here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Compiling the cairo library and running the benchmarks can be done in the following way.
Obviously, the system needs to have a compiler and the build dependencies
installed (watch for error messages from the configure scripts). Cross-compilation
is also easy, but I have intentionally left it out in order not to add extra confusion.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;c&quot;&gt;# set cairo/pixman version and compilation options&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.12.0
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PIXMAN_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;0.25.2
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CFLAGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;-O2 -g&amp;quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;gcc
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_TEST_TARGET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;image

&lt;span class=&quot;c&quot;&gt;# setup build environment&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PREFIX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;pwd&lt;/span&gt;&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;/tmp
mkdir &lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/cairo/lib:&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/pixman/lib
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PKG_CONFIG_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/cairo/lib/pkgconfig:&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/pixman/lib/pkgconfig

&lt;span class=&quot;c&quot;&gt;# download and unpack cairo/pixman sources&lt;/span&gt;

wget http://cairographics.org/snapshots/pixman-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PIXMAN_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz
wget http://cairographics.org/releases/cairo-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz
tar -xzf pixman-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;PIXMAN_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz
tar -xzf cairo-&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CAIRO_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;.tar.gz

&lt;span class=&quot;c&quot;&gt;# build pixman and cairo&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;pushd &lt;/span&gt;pixman-&lt;span class=&quot;nv&quot;&gt;$PIXMAN_VERSION&lt;/span&gt;
./configure --prefix&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/pixman &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make install &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;exit &lt;/span&gt;1
&lt;span class=&quot;nb&quot;&gt;popd&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;pushd &lt;/span&gt;cairo-&lt;span class=&quot;nv&quot;&gt;$CAIRO_VERSION&lt;/span&gt;
./configure --prefix&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PREFIX&lt;/span&gt;/cairo &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; make install &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;exit &lt;/span&gt;1
&lt;span class=&quot;nb&quot;&gt;popd&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# download and build cairo traces (warning: this is a HUGE git repository)&lt;/span&gt;

git clone git://anongit.freedesktop.org/cairo-traces
&lt;span class=&quot;nb&quot;&gt;pushd &lt;/span&gt;cairo-traces
make
&lt;span class=&quot;nb&quot;&gt;popd&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# run cairo-perf-trace benchmarks&lt;/span&gt;

cairo-&lt;span class=&quot;nv&quot;&gt;$CAIRO_VERSION&lt;/span&gt;/perf/cairo-perf-trace -i3 -r cairo-traces/benchmark &amp;gt; results.txt&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This gives us a &quot;results.txt&quot; file in raw format, which can be used as an input for
the cairo-perf-chart tool. If the &lt;b&gt;-r&lt;/b&gt; option is not used, the output of
cairo-perf-trace is in a more human-readable text format. The CAIRO_TEST_TARGET environment variable can be set to &quot;image&quot;, &quot;xlib&quot; or any other supported backend.&lt;/p&gt;

&lt;h2&gt;Final words&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Your mileage may vary, but many simple and very common 2D operations do not need much
processing power (even one CPU core is excessive). On the other hand, memory bandwidth is
critical and directly affects performance.&lt;/li&gt;
&lt;li&gt;On multi-core systems, software rendering in the X server can play the role of a 2D accelerator to some extent&lt;/li&gt;
&lt;li&gt;Good-quality scaling, rotation, radial gradients, convolution filters and other processing
power hungry operations benefit from GPU acceleration. The CPU could obviously also use
multithreaded rendering for these operations to take advantage of all CPU
cores, but multithreaded rendering is still not supported in pixman.&lt;/li&gt;
&lt;li&gt;The pvr xorg driver is not ready for OMAP3 hardware yet; do not use it&lt;/li&gt;
&lt;li&gt;Disabled acceleration does not always mean full-speed software rendering, so if your driver
provides an option to disable acceleration, that option can&#39;t be fully trusted&lt;/li&gt;
&lt;li&gt;Immediate-mode renderers such as cairo are a hard challenge for hardware acceleration&lt;/li&gt;
&lt;/ul&gt;

</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2012/04/10/cpuburn-arm-cortex-a9</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2012/04/10/cpuburn-arm-cortex-a9.html"/>
   <title>Is your ARM Cortex-A9 hot enough?</title>
   <updated>2012-04-10T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Inspired by the &lt;a href=&quot;https://plus.google.com/u/0/100242854243155306943/posts/QCpWUZEkF9i&quot;&gt;google+ post&lt;/a&gt; by Koen Kooi, I decided to check whether NEON is also hot in Cortex-A9.
Appears that &lt;a href=&quot;http://packages.debian.org/sid/cpuburn&quot;&gt;cpuburn tool&lt;/a&gt; supports ARM since 2010. And openembedded uses an alternative
&lt;a href=&quot;http://cgit.openembedded.org/openembedded/commit/?id=7bc322831d1ed3487d36dee4687b7fa3b5cc81e4&quot;&gt;cpuburn-neon&lt;/a&gt; implementation.
As we have at least two implementations, naturally one of them might be more efficient on Cortex-A9 than the other.
So I tested both of them on my old OMAP4430 based &lt;a href=&quot;http://pandaboard.org/&quot;&gt;pandaboard&lt;/a&gt;  (I would not miss this board too much
if it actually burns). The results of this comparison are provided in the table at the bottom.&lt;/p&gt;

&lt;p&gt;I could have stopped at this point, but that would be no fun :) So I tried to experiment a bit with Cortex-A9 power consumption myself. It turns out
that Cortex-A9 can actually run a bit hotter. On the NEON side, &lt;b&gt;VLDx&lt;/b&gt; instructions seem to be more power hungry than anything else
by a large margin. And aligned 128-bit reads are the best at generating heat. Using the &lt;b&gt;VLD2&lt;/b&gt; variant with
post-increment makes it do a bit more work than plain &lt;b&gt;VLD1&lt;/b&gt;. Moving to the ARM side, conditional branches and &lt;b&gt;SMLAL&lt;/b&gt;
instructions are also rather hot. Mixing everything together, we get &lt;a href=&quot;http://github.com/downloads/ssvb/ssvb.github.com/ssvb-cpuburn-a9.S&quot;&gt;one more implementation of cpuburn for Cortex-A9&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;.syntax unified
    .text
    .arch armv7-a
    .fpu neon
    .arm

    .global main
    .global sysconf
    .global fork

/* optimal value for LOOP_UNROLL_FACTOR seems to be BTB size dependent */
#define LOOP_UNROLL_FACTOR   110
/* 64 seems to be a good choice */
#define STEP                 64

.func main
main:

#ifdef __linux__
        mov         r0, 84 /* _SC_NPROCESSORS_ONLN */
        blx         sysconf
        mov         r4, r0
        cmp         r4, #2
        blt         1f
        blx         fork /* have at least 2 cores */
        cmp         r4, #4
        blt         1f
        blx         fork /* have at least 4 cores */
1:
#endif

        ldr         lr, =(STEP * 4 + 15)
        subs        lr, sp, lr
        bic         lr, lr, #15
        mov         ip, #STEP
        mov         r0, #0
        mov         r1, #0
        mov         r2, #0
        mov         r3, #0
        ldr         r4, =0xFFFFFFFF
        b           0f
    .ltorg
0:
    .rept LOOP_UNROLL_FACTOR
        vld2.8      {q0}, [lr, :128], ip
        it          ne
        smlalne     r0, r1, lr, r4
        bne         1f
1:
        vld2.8      {q1}, [lr, :128], ip
        it          ne
        smlalne     r2, r3, lr, r4
        bne         1f
1:
        vld2.8      {q2}, [lr, :128], ip
        vld2.8      {q3}, [lr, :128], ip
        it          ne
        subsne      lr, lr, #(STEP * 4)
    .endr
        bne         0b
.endfunc&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
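
&lt;p&gt;The fork logic at the top of &lt;b&gt;main&lt;/b&gt; deserves a note: one fork from the initial process yields two instances, and a second fork, executed by both of those processes, yields four. Here is a tiny Python sketch of the resulting instance count (the function name is my own, purely for illustration, and not part of the assembly):&lt;/p&gt;

```python
# Mirrors the sysconf/fork logic in the assembly above:
# fork once if there are at least 2 cores, then once more
# (in every running process) if there are at least 4 cores.
def cpuburn_instances(ncores):
    instances = 1
    if ncores >= 2:
        instances *= 2  # first fork: 1 process becomes 2
    if ncores >= 4:
        instances *= 2  # second fork runs in both: 2 become 4
    return instances

print([cpuburn_instances(n) for n in (1, 2, 3, 4, 8)])  # [1, 2, 2, 4, 4]
```

&lt;p&gt;Note that the assembly only ever forks twice, so a hypothetical 8-core system would still get just 4 instances.&lt;/p&gt;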


&lt;p&gt;Maybe more improvements are still possible if I overlooked some better instructions, tricks with L2-&gt;L1 prefetches or anything else.
Also I have not tried running any tests on Cortex-A8 yet. But Cortex-A8 needs different tuning and I would not be
surprised if the older cpuburn implementations can actually do a better job there. Finally,
the obligatory warning: &lt;b&gt;This program tries to stress the processor, attempting to generate
as much heat as possible. Improperly cooled or otherwise flawed hardware may potentially overheat and fail. Use at your own risk!&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;As for the table below, each implementation has been tested with both Cortex-A9 cores fully loaded (starting two instances of
cpuburn if needed). Current draw values were measured after running the test uninterrupted for 10-15 minutes.
Honestly, the total ~1640 mA sustained current draw by the pandaboard looks quite scary to me. At least I would
not dare to even try additionally stressing the GPU and/or the hardware video decoder at the same time.&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;cpuburn implementation, running on both A9 cores
&lt;th&gt;current draw from 5V PSU (whole board, not just CPU)
&lt;tr&gt;&lt;td&gt;idle system (this kernel has no power management)
&lt;td&gt;~550 mA
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://hardwarebug.org/files/burn.S&quot;&gt;cpuburn-neon&lt;/a&gt;
&lt;td&gt;~1130 mA
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://packages.debian.org/sid/cpuburn&quot;&gt;cpuburn-1.4a&lt;/a&gt; (burnCortexA9.s)
&lt;td&gt;~1180 mA
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://github.com/downloads/ssvb/ssvb.github.com/ssvb-cpuburn-a9.S&quot;&gt;ssvb-cpuburn-a9.S&lt;/a&gt;
&lt;td&gt;&lt;b&gt;~1640 mA&lt;/b&gt;
&lt;/table&gt;
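
&lt;p&gt;To put the current draw figures from the table into perspective, they can be converted to watts. This is a back-of-the-envelope estimate for the whole board at the nominal 5 V supply voltage, not a precise CPU power measurement:&lt;/p&gt;

```python
# Whole-board power estimate at the nominal 5 V supply voltage.
SUPPLY_VOLTAGE = 5.0

def watts(milliamps):
    return SUPPLY_VOLTAGE * milliamps / 1000.0

idle = watts(550)   # idle system
burn = watts(1640)  # ssvb-cpuburn-a9.S on both cores
print(round(idle, 2), round(burn, 2), round(burn - idle, 2))
```

&lt;p&gt;That is roughly 2.75 W idle versus ~8.2 W under load, so around 5.45 W of extra board-level power dissipation is attributable to the stress test.&lt;/p&gt;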


&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;And also a cpuburn tweak for ARM Cortex-A8 (added on 2012-04-11)&lt;/h3&gt;

&lt;p&gt;A quick test on Cortex-A8 shows that using &lt;b&gt;SMLAL&lt;/b&gt; is a bad idea there, but extra NEON arithmetic instructions
can be added because Cortex-A8 supports dual issue for NEON.&lt;/p&gt;

&lt;p&gt;This time experimenting with DM3730 based &lt;a href=&quot;http://igep.es/index.php?option=com_content&amp;amp;view=article&amp;amp;id=46&amp;amp;Itemid=55&quot;&gt;IGEPv2 board&lt;/a&gt;
(ARM Cortex-A8 @1GHz) and using &lt;a href=&quot;https://github.com/mrj10/dm3730-temp-sensor&quot;&gt;dm3730-temp-sensor&lt;/a&gt; for temperature measurements:&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;cpuburn implementation
&lt;th&gt;temperature
&lt;tr&gt;&lt;td&gt;idle system (this kernel has no power management)
&lt;td&gt;~57.75 C
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://hardwarebug.org/files/burn.S&quot;&gt;cpuburn-neon&lt;/a&gt;
&lt;td&gt;~92.75 C
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://packages.debian.org/sid/cpuburn&quot;&gt;cpuburn-1.4a&lt;/a&gt; (burnCortexA8.s)
&lt;td&gt;~96.00 C
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://github.com/downloads/ssvb/ssvb.github.com/ssvb-cpuburn-a8.S&quot;&gt;ssvb-cpuburn-a8.S&lt;/a&gt;
&lt;td&gt;&lt;b&gt;~104.25 C&lt;/b&gt;
&lt;/table&gt;


&lt;p&gt;&lt;strike&gt;If the sensor is not lying, then maybe using a plastic case for this board was not a good choice after all.&lt;/strike&gt; The sensor is most likely lying as explained by Nishanth Menon in the &lt;a href=&quot;https://plus.google.com/u/0/113201731981878354205/posts/44WtAFbQcaK&quot;&gt;google+ comments&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Final words (added on 2012-04-11)&lt;/h3&gt;

&lt;p&gt;Before anybody jumps to wild conclusions, I would like to note that:&lt;ul&gt;
&lt;li&gt;Pandaboard is not a mobile device and it is not designed for really low power consumption. It is a known fact that it &lt;a href=&quot;http://omappedia.org/wiki/PandaBoard_FAQ#What_are_the_specs_of_the_Power_supply_I_should_use_with_a_PandaBoard.3F&quot;&gt;requires a PSU rated at 4A&lt;/a&gt;. I don&#39;t have any idea where most of the heat is dissipated, but it is quite likely that not only the OMAP chip is involved.&lt;/li&gt;
&lt;li&gt;Cpuburn is very different from any typical workload and can&#39;t be used for estimating power consumption. It is just a hardware reliability testing tool.&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/09/13/origenboard-memory-performance</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/09/13/origenboard-memory-performance.html"/>
   <title>Origenboard, memory performance</title>
   <updated>2011-09-13T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Those who have read my old
&lt;a href=&quot;http://ssvb.github.com/2011/07/30/origenboard-early-adopter.html&quot;&gt;Origenboard, early adopter impressions&lt;/a&gt;
blog post may wonder why I bought this board in the first place. As far as I know, there is no
freely available public documentation for Exynos 4210 SoC so the &quot;if you want something done, do
it yourself&quot; approach does not work well, and the support provided at
&lt;a href=&quot;http://www.origenboard.org/&quot;&gt;origenboard.org&lt;/a&gt; has not been very stellar so far.
&lt;a href=&quot;http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?contentId=53243&amp;amp;navigationId=12843&amp;amp;templateId=6123&quot;&gt;OMAP4&lt;/a&gt;
based &lt;a href=&quot;http://pandaboard.org/&quot;&gt;pandaboard&lt;/a&gt; is a lot more open source friendly, has a great community
around it and would have been a no-brainer choice, right?
Well, pandaboard is a great piece of hardware, but the early boards, based on the initial OMAP4 revisions,
used to have a rather
&lt;a href=&quot;http://computerarch.com/log/2011/03/01/pandaboard/&quot;&gt;poor&lt;/a&gt;
&lt;a href=&quot;http://groups.google.com/group/pandaboard/browse_thread/thread/24d80cc66f52b789/b977c1ee5eb5a78c?#b977c1ee5eb5a78c&quot;&gt;memory&lt;/a&gt;
&lt;a href=&quot;http://groups.google.com/group/pandaboard/browse_thread/thread/2d4d82eb530e8195&quot;&gt;performance&lt;/a&gt;.
According to the information from the pandaboard mailing list, &lt;a href=&quot;http://groups.google.com/group/pandaboard/msg/dfd2d2e1336d435b&quot;&gt;OMAP4460 is expected to address these problems&lt;/a&gt;.
Too bad that there are no OMAP4460 powered pandaboards available for sale yet. And that&#39;s why I decided to check the new alternative
solution from Samsung to see what they can offer.&lt;/p&gt;

&lt;h3&gt;But who cares about memory performance?&lt;/h3&gt;

&lt;p&gt;Any software which works with large data sets not fitting into the L1/L2 caches
benefits from fast memory. I&#39;m particularly interested in having fast software-rendered
2D graphics, and this is exactly the case where fast memory is
critical for getting good performance.&lt;/p&gt;

&lt;p&gt;Just to give an example, let&#39;s take some numbers from my older
&lt;a href=&quot;http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00695.html&quot;&gt;post in the pixman mailing list&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;== Intel Atom N450 @1667MHz, DDR2-667 (64-bit) ==

           add_8888_8888 =  L1: 607.08  L2: 375.34  M:259.53
          over_8888_x888 =  L1: 123.73  L2: 117.10  M:113.56
          over_8888_0565 =  L1: 106.11  L2:  98.91  M: 99.07

== TI OMAP3430/3530, ARM Cortex-A8 @500MHz, LPDDR @166MHz (32-bit) ==

    default build:
           add_8888_8888 =  L1: 227.26  L2:  84.71  M: 44.54
          over_8888_x888 =  L1: 161.06  L2:  88.20  M: 44.86
          over_8888_0565 =  L1: 127.02  L2:  93.99  M: 61.25

    software prefetch disabled (*):
           add_8888_8888 =  L1: 351.44  L2:  97.29  M: 25.35
          over_8888_x888 =  L1: 168.72  L2:  95.04  M: 24.81
          over_8888_0565 =  L1: 128.06  L2:  98.96  M: 32.16&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;All the numbers are provided by lowlevel-blt-bench test program from &lt;a href=&quot;http://pixman.org/&quot;&gt;pixman&lt;/a&gt;
and are measured in MPix/s.
There are three cases benchmarked for each 2D graphics operation: L1 (data set which fits L1 cache),
L2 (data set which fits L2 cache) and M (data set does not fit caches and has to work with memory).
It becomes very clear that the ARM NEON optimized code was memory bandwidth limited, at least on
early OMAP3 devices. And the Intel Atom surely had much better memory bandwidth:
~260 MPix/s * 4 bytes per pixel * (2 reads and 1 write per pixel for add_8888_8888), which is ~3.1 GB/s
total. These are just microbenchmark numbers, but actual software rendered 2D graphics performance
is also heavily affected by memory speed. And fast memory is important for having a responsive
and fast Linux desktop even without GPU acceleration. And as far as I know, there are still
&lt;a href=&quot;http://www.phoronix.com/scan.php?page=news_item&amp;amp;px=OTgyMA&quot;&gt;no open source GPU drivers available for mobile devices&lt;/a&gt;.&lt;/p&gt;
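
&lt;p&gt;The bandwidth arithmetic above is easy to double-check. A small Python sketch of the same calculation (the function name is my own invention, and GB here means decimal gigabytes):&lt;/p&gt;

```python
# Estimate the memory traffic implied by a lowlevel-blt-bench "M:" result.
def implied_bandwidth_gb(mpix_per_s, bytes_per_pixel=4, accesses_per_pixel=3):
    # MPix/s * bytes per pixel * memory accesses per pixel, converted to GB/s
    return mpix_per_s * 1e6 * bytes_per_pixel * accesses_per_pixel / 1e9

# Intel Atom N450, add_8888_8888 (2 reads + 1 write per pixel): M: 259.53 MPix/s
print(round(implied_bandwidth_gb(259.53), 1))  # 3.1
```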

&lt;h3&gt;Introducing yet another memory benchmark program&lt;/h3&gt;

&lt;p&gt;If we want to know whether the memory is fast in our system, we need to benchmark it somehow.
There is a popular &lt;a href=&quot;http://www.cs.virginia.edu/stream/FTP/Code/stream.c&quot;&gt;STREAM&lt;/a&gt; benchmark,
but its results are apparently
&lt;a href=&quot;http://groups.google.com/group/pandaboard/msg/1e5f08c949d4bf5d&quot;&gt;very much compiler dependent when run on ARM&lt;/a&gt;.
Moreover, it uses floating point, making it unsuitable for
devices which don&#39;t have an FPU (there it would test anything but memory bandwidth).&lt;/p&gt;

&lt;p&gt;So I tried to make my own memory benchmark program, which measures the peak
bandwidth of sequential memory accesses and the latency of random memory accesses.
Bandwidth is measured by running different assembly code on aligned memory blocks
and attempting different prefetch strategies. This benchmark program also integrates
some of my old &lt;a href=&quot;http://permalink.gmane.org/gmane.comp.graphics.pixman/1104&quot;&gt;ARM&lt;/a&gt; and
&lt;a href=&quot;http://permalink.gmane.org/gmane.comp.graphics.pixman/1026&quot;&gt;MIPS32&lt;/a&gt; memory bandwidth
test code.&lt;/p&gt;

&lt;p&gt;There are some potential pitfalls when implementing benchmarks. A popular mistake is
forgetting to initialize the buffers and having the results distorted by &lt;a href=&quot;http://en.wikipedia.org/wiki/Copy-on-write&quot;&gt;COW&lt;/a&gt;.
But copying data from one memory buffer to another is also not so simple. Depending
on the relative alignment of the source and destination buffers, the
performance may vary a lot. This was noticed by
Måns Rullgård
(mru)
in the &lt;a href=&quot;http://pandaboard.org/pbirclogs/index.php?date=2010-11-04#T21:52:53&quot;&gt;#pandaboard irc&lt;/a&gt; almost a year ago. And
the effect of the offset between the arrays is also mentioned in the &lt;a href=&quot;http://www.cs.virginia.edu/stream/ref.html&quot;&gt;STREAM benchmark FAQ&lt;/a&gt;.
Moreover, physical memory fragmentation also plays
a role because the caches in modern processors are physically tagged. So exactly
the same program may provide different results depending on whether it is run on
a freshly rebooted system (with almost no memory fragmentation) or on a system
which has been running for a while. Overall, this looks like some kind of aliasing in the
memory subsystem. And ironically, the performance on a freshly rebooted system
is typically worse.&lt;/p&gt;

&lt;p&gt;An empirical solution is to ensure that memory accesses to the source
and destination buffers which happen close together in time use addresses
that differ in as many bits as possible. So I&#39;m using the 0xAAAAAAAA,
0x55555555, 0xCCCCCCCC and 0x33333333 patterns for the lowest bits
of the buffer addresses. This seems to be quite effective: the memory copy
benchmark results are now well reproducible and show high numbers.&lt;/p&gt;
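
&lt;p&gt;The idea behind choosing these particular bit patterns can be illustrated with a few lines of Python: any two of them differ in at least half of their low bit positions. The snippet below only demonstrates this property; it is not code from the benchmark itself:&lt;/p&gt;

```python
# Pairwise hamming distances between the low 16 bits of the offset patterns.
patterns = [0xAAAAAAAA, 0x55555555, 0xCCCCCCCC, 0x33333333]

def hamming16(a, b):
    # number of differing bits among the 16 lowest address bits
    return bin((a ^ b) & 0xFFFF).count("1")

distances = [hamming16(a, b)
             for i, a in enumerate(patterns)
             for b in patterns[i + 1:]]
print(distances)  # [16, 8, 8, 8, 8, 16]
```

&lt;p&gt;So any two buffers placed at these offsets disagree in at least 8 of the 16 low address bits, which is what helps avoid the aliasing effects described above.&lt;/p&gt;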

&lt;p&gt;The initial release of this benchmark program can be downloaded here: &lt;a href=&quot;http://github.com/downloads/ssvb/ssvb-membench/ssvb-membench-0.1.tar.gz&quot;&gt;ssvb-membench-0.1.tar.gz&lt;/a&gt;&lt;br&gt;
And the git repository is at &lt;a href=&quot;http://github.com/ssvb/ssvb-membench&quot;&gt;http://github.com/ssvb/ssvb-membench&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Origenboard memory benchmark results and performance tuning&lt;/h3&gt;

&lt;p&gt;The table below shows how the memory performance is affected by different settings in&lt;br&gt;
&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/CHDHIECI.html&quot;&gt;L2C-310 Level 2 Cache Controller, Prefetch Control Register&lt;/a&gt;&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;Prefetch Control Register settings
&lt;th&gt;Memory copy performance
&lt;th&gt;Latency of random accesses in 64 MiB block
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-1.txt&quot;&gt;0x30000007 (linaro kernel default)&lt;/a&gt;
&lt;td&gt;761.86 MB/s&lt;td&gt;167.9 ns
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-2.txt&quot;&gt;0x30000007 + &quot;Double linefill enable&quot;&lt;/a&gt;
&lt;td&gt;1179.17 MB/s&lt;td&gt;183.9 ns
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-3.txt&quot;&gt;0x30000007 + &quot;Double linefill enable&quot; +&lt;br&gt;&quot;Double linefill on WRAP read disable&quot;&lt;/a&gt;
&lt;td&gt;1174.32 MB/s&lt;td&gt;174.0 ns
&lt;/table&gt;


&lt;p&gt;Setting &quot;Double linefill on WRAP read disable&quot; recovers some of the random access
latency with no regression in sequential copy performance. Assuming that there are
no hardware bugs related to this setup, enabling double linefill is a no-brainer.
I have submitted &lt;a href=&quot;http://lists.linaro.org/pipermail/linaro-dev/2011-September/007462.html&quot;&gt;a patch to the linaro-dev mailing list&lt;/a&gt;
(&lt;b&gt;update from 2011-09-19:&lt;/b&gt; according to the provided feedback, it appears that &lt;a href=&quot;http://lists.linaro.org/pipermail/linaro-dev/2011-September/007506.html&quot;&gt;double linefill is not used for a good reason&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Probably some more memory performance tweaks can be still applied and
a better configuration can be found by trying different permutations
of the bits in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388g/CIHCHFCG.html&quot;&gt;Cortex-A9, Auxiliary Control Register&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/Beifcidc.html&quot;&gt;L2C-310 Level 2 Cache Controller, Auxiliary Control Register&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/CHDHIECI.html&quot;&gt;L2C-310 Level 2 Cache Controller, Prefetch Control Register&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;And finally STREAM benchmark as a bonus&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/stream-origen.txt&quot;&gt;Origenboard, Samsung Exynos 4210, dual ARM Cortex-A9 @1.2GHz&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;$ gcc -O2 -fopenmp -mcpu=cortex-a9 -o stream stream.c
$ ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2284.9071       0.0281       0.0280       0.0282
Scale:       2339.6942       0.0274       0.0274       0.0275
Add:         2028.8679       0.0474       0.0473       0.0474
Triad:       1992.7801       0.0482       0.0482       0.0483
-------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/stream-atom.txt&quot;&gt;Intel Atom N450 @1.67GHz&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;$ gcc -O2 -fopenmp -march=atom -mtune=atom -o stream stream.c
$ ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2236.8130       0.0143       0.0143       0.0144
Scale:       2230.3084       0.0144       0.0143       0.0144
Add:         2656.0587       0.0181       0.0181       0.0182
Triad:       2679.3174       0.0180       0.0179       0.0180
-------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Overall, the &lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/origen-membench-3.txt&quot;&gt;memory performance of Origenboard&lt;/a&gt;
appears not to be much inferior to the &lt;a href=&quot;http://ssvb.github.com/files/2011-09-13/atom-membench.txt&quot;&gt;memory performance of Intel Atom N450&lt;/a&gt;
(&lt;b&gt;update from 2011-09-19&lt;/b&gt;: when/if we get Exynos 4212 based boards in our hands).&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/08/23/yet-another-oprofile-tutorial</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/08/23/yet-another-oprofile-tutorial.html"/>
   <title>Yet another oprofile tutorial</title>
   <updated>2011-08-23T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Recently it came as a surprise to me that many people don&#39;t know how to use
&lt;a href=&quot;http://oprofile.sourceforge.net/&quot;&gt;oprofile&lt;/a&gt; efficiently when working on
performance optimizations. I&#39;m not going to duplicate
&lt;a href=&quot;http://oprofile.sourceforge.net/doc/index.html&quot;&gt;the oprofile manual&lt;/a&gt;
here in details, but at least will try to explain some basic usage.&lt;/p&gt;

&lt;h3&gt;A bit of theory&lt;/h3&gt;

&lt;p&gt;Oprofile does its magic by using statistical sampling. The processor
gets interrupted at regular intervals (the interrupts happen after a
certain amount of time has elapsed, or after some hardware performance counter has
accumulated a certain number of events) and the oprofile driver identifies which
code had control at that moment. The part of the code which was &#39;lucky&#39; enough to be
interrupted by oprofile gets an oprofile sample attributed to it. The
parts of the code which take a lot of execution time are naturally more
likely to accumulate many oprofile samples. In fact, the number of collected
oprofile samples for a function tends to be directly proportional
to the execution time taken by that function. All of this is somewhat
similar to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Monte_Carlo_method&quot;&gt;Monte Carlo method&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The collection of samples done by oprofile for each individual function is a
&lt;a href=&quot;http://en.wikipedia.org/wiki/Poisson_process&quot;&gt;Poisson process&lt;/a&gt;.
The standard deviation of a &lt;a href=&quot;http://en.wikipedia.org/wiki/Poisson_distribution&quot;&gt;Poisson distribution&lt;/a&gt;
is the square root of the number of samples. So the more samples are collected,
the lower the relative error. The following diagram shows
the confidence intervals for the &lt;a href=&quot;http://en.wikipedia.org/wiki/Normal_distribution&quot;&gt;normal distribution&lt;/a&gt;
(the Poisson distribution is approximately normal for a large number of samples):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg&quot;&gt;
&lt;img src=&quot;/images/2011-08-23-500px-Standard_deviation_diagram.svg.png&quot;
alt=&quot;Standard_deviation_diagram.svg from wikipedia, created by Petter Strandmark and licensed under CC BY 2.5&quot;
title=&quot;Standard_deviation_diagram.svg from wikipedia, created by Petter Strandmark and licensed under CC BY 2.5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the &lt;b&gt;3-sigma rule&lt;/b&gt;, we can be fairly confident that the actual time spent in each
function (measured in oprofile samples) is within a &lt;b&gt;±3*sqrt(N)&lt;/b&gt; interval,
where N is the number of samples reported by oprofile for that function.&lt;/p&gt;
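&lt;p&gt;As a small sketch, this is how the interval and the relative error can be computed for a function with N = 21548 samples (the number is only illustrative):&lt;/p&gt;

```python
import math

# 3-sigma confidence interval sketch: the Poisson stddev is sqrt(N),
# so the actual value lies within roughly N +/- 3*sqrt(N)
def three_sigma_interval(n_samples):
    delta = 3 * math.sqrt(n_samples)
    return (n_samples - delta, n_samples + delta)

low, high = three_sigma_interval(21548)
print(round(high - 21548))                   # half-width: ~440 samples
print(round(100 * 3 / math.sqrt(21548), 1))  # relative 3-sigma error: ~2.0%
```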

&lt;h3&gt;A simple profiling and code optimization workflow&lt;/h3&gt;

&lt;p&gt;Let&#39;s suppose that we have some small command line tool which does
something useful, and we want to optimize this tool to spend
less time doing the same work. First of all, it makes sense to identify
the parts of the program which are the performance bottlenecks and can
be optimized. This can easily be done using oprofile:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;c&quot;&gt;# opcontrol --deinit&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --separate=kernel&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --init&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --reset&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --start&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# ./test-program&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opcontrol --stop&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# opreport -l ./test-program&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Going through all of the above steps will configure and start oprofile, then execute
the program to be profiled (./test-program), and finally stop oprofile and show
the profiling report. This report contains exactly the information we want, and its
interpretation is explained a bit in the next section.
The opcontrol tool needs to be run as root or via sudo. It is also quite
important to use the &lt;b&gt;--separate=kernel&lt;/b&gt; option. This option is
&lt;a href=&quot;http://oprofile.sourceforge.net/doc/controlling.html&quot;&gt;described in detail here&lt;/a&gt;,
but basically it ensures that all the CPU activity happening in the kernel
and in the shared libraries is also attributed to the test program and shown
in the log.&lt;/p&gt;

&lt;p&gt;Once we have the oprofile report, it is only a matter of checking which parts
of the code are reported to take a lot of time, improving them, and finally running
oprofile again to verify the results. This process can be repeated multiple times.
That&#39;s quite simple, though there are two main cases when it may be difficult to
interpret oprofile logs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Oprofile reports that just one large function (possibly even &#39;main&#39;) is taking most of the time.&lt;/li&gt;
&lt;li&gt;Oprofile reports a multitude of tiny functions, each taking only a small fraction of the time.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;In the former case it is a good idea to split the large function into
a few smaller ones. If the large function is already calling some
other functions which are inlined, then naturally disabling
inlining will provide a more informative profiling report.
Another alternative is to use &lt;a href=&quot;http://oprofile.sourceforge.net/doc/opannotate.html&quot;&gt;source annotation&lt;/a&gt;.
But be sure to read about all the caveats in the &lt;a href=&quot;http://oprofile.sourceforge.net/doc/interpreting.html&quot;&gt;oprofile manual&lt;/a&gt;.
In the latter case, generating a callgraph may provide some insights. Some nice callgraph pictures can be generated by
&lt;a href=&quot;http://code.google.com/p/jrfonseca/wiki/Gprof2Dot&quot;&gt;Gprof2Dot&lt;/a&gt; from the data collected by oprofile.&lt;/p&gt;

&lt;h3&gt;A real practical example&lt;/h3&gt;

&lt;p&gt;I&#39;m going to use &lt;a href=&quot;http://git.kernel.org/?p=bluetooth/bluez.git;a=commit;h=e1ea3e76c72d56041c30b317818e8d7b5a0c7350&quot;&gt;one of my old performance patches&lt;/a&gt;
as an example. The oprofile report for the &#39;sbcenc&#39; program looked like this before the optimization:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;samples  %        image name               symbol name
26083    25.0856  sbcenc                   sbc_pack_frame
21548    20.7240  sbcenc                   sbc_calc_scalefactors_j
19910    19.1486  sbcenc                   sbc_analyze_4b_8s_neon
14377    13.8272  sbcenc                   sbc_calculate_bits
9990      9.6080  sbcenc                   sbc_enc_process_input_8s_be
8667      8.3356  no-vmlinux               /no-vmlinux
2263      2.1765  sbcenc                   sbc_encode
696       0.6694  libc-2.10.1.so           memcpy&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Because of the use of the --separate=kernel option, we can see ~8% of CPU time
attributed to the no-vmlinux image, which is the time spent in the kernel,
mostly doing input/output activity (reading the input file from disk).
Also, less than 1% is spent in the memcpy function, which belongs
to the libc-2.10.1.so shared library. Without the --separate=kernel option, this
information would not be present in the log.&lt;/p&gt;

&lt;p&gt;Now our focus is on the &lt;b&gt;sbc_calc_scalefactors_j&lt;/b&gt; function, which got 21548
oprofile samples collected, representing ~20.7% of the time spent
in the &#39;sbcenc&#39; process. Note again that this percentage would not
be a realistic estimate without also having the kernel and libc information
in the picture. If the CPU consumption were dominated by
library functions or by the kernel, the statistics could be severely skewed.&lt;/p&gt;

&lt;p&gt;After performing the optimizations, we get a new profiling report:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;samples  %        image name               symbol name
26234    29.9625  sbcenc                   sbc_pack_frame
20057    22.9076  sbcenc                   sbc_analyze_4b_8s_neon
14306    16.3393  sbcenc                   sbc_calculate_bits
9866     11.2682  sbcenc                   sbc_enc_process_input_8s_be
8506      9.7149  no-vmlinux               /no-vmlinux
5219      5.9608  sbcenc                   sbc_calc_scalefactors_j_neon
2280      2.6040  sbcenc                   sbc_encode
661       0.7549  libc-2.10.1.so           memcpy&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;It shows that the &lt;b&gt;sbc_calc_scalefactors_j_neon&lt;/b&gt; function now takes 5219
samples instead of the 21548 samples for &lt;b&gt;sbc_calc_scalefactors_j&lt;/b&gt; earlier.
That is approximately a 4.1x speedup for this particular function. Samples are
more important than percentages in the log, because the absolute number of samples
represents the actual time spent in the function, while the percentages are relative
to the whole process (as the whole program takes less time to execute after
the optimization, the percentages naturally drift).&lt;/p&gt;
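&lt;p&gt;The speedup estimate, and the fact that it is well outside the measurement noise, can be double-checked with trivial arithmetic on the sample counts from the two logs above:&lt;/p&gt;

```python
import math

before = 21548  # samples for sbc_calc_scalefactors_j (old log)
after = 5219    # samples for sbc_calc_scalefactors_j_neon (new log)

print(round(before / after, 1))  # ~4.1x speedup

# the combined 3-sigma measurement noise is far smaller than the
# observed difference, so the speedup is statistically significant
noise = 3 * (math.sqrt(before) + math.sqrt(after))
print(round(noise))    # ~657 samples of noise
print(before - after)  # 16329 samples of actual difference
```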

&lt;p&gt;For another example, we can look at the &lt;b&gt;sbc_pack_frame&lt;/b&gt; function
statistics in both logs. The number of samples remained about the
same: 26083 vs. 26234 (see the 3-sigma rule from the &#39;A bit of theory&#39; section).
But the percentage of time relative to the whole program
increased from ~25% to ~30%, even though this function
itself has not changed. That&#39;s a nice side effect
of optimization: after eliminating the obvious
bottlenecks, the other functions become more
attractive optimization targets too :)&lt;/p&gt;

&lt;p&gt;The precision of the measurements can always be increased by running
the test program more than once between the &#39;opcontrol --start&#39;
and &#39;opcontrol --stop&#39; invocations, because more samples will
be accumulated and the relative error will become smaller.&lt;/p&gt;
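&lt;p&gt;A quick sketch of why this works: the relative error of a Poisson sample count scales as 1/sqrt(N), so collecting k times more samples shrinks it by a factor of sqrt(k):&lt;/p&gt;

```python
import math

# 1-sigma relative error of a Poisson sample count is sqrt(N)/N = 1/sqrt(N)
def relative_error(n_samples):
    return 1 / math.sqrt(n_samples)

# quadrupling the number of samples (e.g. running the program 4 times
# between --start and --stop) halves the relative error
ratio = relative_error(4 * 10000) / relative_error(10000)
print(round(ratio, 2))  # 0.5
```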

&lt;p&gt;Still, other methods of benchmarking the code may be
more suitable for very tiny performance tweaks, such as
saving just a few CPU cycles. Some tricks for
benchmarking small sequences of instructions are
described in my older &lt;a href=&quot;http://ssvb.github.com/2011/08/03/discovering-instructions-scheduling-secrets.html&quot;&gt;Discovering instructions scheduling secrets&lt;/a&gt;
blog post.&lt;/p&gt;

&lt;h3&gt;ARM Cortex-A8 performance monitoring unit disaster&lt;/h3&gt;

&lt;p&gt;If you tried to follow the instructions described above but got
bizarre results, then the chances are quite high that you are using
some hardware with an ARM Cortex-A8 processor. The problem is that
the ARM Cortex-A8 has a broken performance monitoring unit (this is
described as erratum #628216 in the ARM Cortex-A8 errata list).
Earlier revisions were badly broken. Later revisions are a
bit better, but still not suitable for use with oprofile.&lt;/p&gt;

&lt;p&gt;For collecting samples, oprofile relies on the interrupts generated
by the performance monitoring unit. The interrupts are supposed
to happen on overflows of the 32-bit hardware performance counters.
But with the older ARM Cortex-A8 revisions (for example, the one used in the &lt;a href=&quot;http://beagleboard.org/hardware&quot;&gt;beagleboard&lt;/a&gt;),
the PMU state may occasionally get messed up on a counter overflow.
With the newer ARM Cortex-A8 revisions (for example, the one used in the &lt;a href=&quot;http://beagleboard.org/hardware-xM&quot;&gt;beagleboard-xm&lt;/a&gt;),
the counter may just overflow without triggering an interrupt. The outcome is disastrous in both cases.
A skipped interrupt may be difficult to notice, because it takes
slightly more than 4 seconds to count from 0 to 0xFFFFFFFF on
a 1GHz processor. So the performance monitoring unit recovers
itself, but each skipped interrupt results in approximately
4 seconds dropped from the profiling session.
Longer profiling runs have a higher chance of eventually triggering this
hardware bug. And considering how important it is
to collect a really large number of samples for good precision,
the Cortex-A8 performance monitoring unit cycle counter is
a really bad option.&lt;/p&gt;
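&lt;p&gt;The &quot;slightly more than 4 seconds&quot; figure is easy to verify: a 32-bit cycle counter incremented at 1 GHz wraps around after 2^32 cycles:&lt;/p&gt;

```python
# time for a 32-bit cycle counter to overflow on a 1 GHz processor
cycles_to_overflow = 2 ** 32
clock_hz = 10 ** 9
seconds = cycles_to_overflow / clock_hz
print(round(seconds, 2))  # 4.29 seconds lost per skipped overflow interrupt
```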

&lt;p&gt;The solution for all these troubles is simple: &lt;a href=&quot;http://en.wikipedia.org/wiki/May_the_Force_be_with_you&quot;&gt;use the timer interrupt, Luke&lt;/a&gt; :)
The hardware performance counters are actually more of a red
herring. The timer interrupt works perfectly fine for
simple profiling tasks, so there is no point in trying to
use the performance monitoring unit no matter what.
Admittedly, I have wasted quite a lot of time myself
&lt;a href=&quot;http://www.mail-archive.com/linux-omap@vger.kernel.org/msg14092.html&quot;&gt;trying to work around this pesky issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In order to override the Cortex-A8 performance monitoring unit with a
simple timer driver, add &quot;oprofile.timer=1&quot; to the kernel command
line, or use the &quot;timer=1&quot; module parameter if oprofile
is built as a module.&lt;/p&gt;

&lt;p&gt;Also, when using the simple timer driver, it makes sense to tweak it a
bit if we don&#39;t want to collect samples at the pitiful
default rate of around 128 Hz. The following hack can be applied to the
Linux kernel to solve this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-diff&quot; data-lang=&quot;diff&quot;&gt;&lt;span class=&quot;gh&quot;&gt;diff --git a/drivers/oprofile/timer_int.c b/drivers/oprofile/timer_int.c&lt;/span&gt;
&lt;span class=&quot;gh&quot;&gt;index 3ef4462..56fb6c3 100644&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;--- a/drivers/oprofile/timer_int.c&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+++ b/drivers/oprofile/timer_int.c&lt;/span&gt;
&lt;span class=&quot;gu&quot;&gt;@@ -20,13 +20,15 @@&lt;/span&gt;
 
 #include &amp;quot;oprof.h&amp;quot;
 
&lt;span class=&quot;gi&quot;&gt;+#define OPROFILE_TIMER_TICK_NSEC 244141 /* ~4096 Hz */&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+&lt;/span&gt;
 static DEFINE_PER_CPU(struct hrtimer, oprofile_hrtimer);
 static int ctr_running;
 
 static enum hrtimer_restart oprofile_hrtimer_notify(struct hrtimer *hrtimer)
 {
    oprofile_add_sample(get_irq_regs(), 0);
&lt;span class=&quot;gd&quot;&gt;-  hrtimer_forward_now(hrtimer, ns_to_ktime(TICK_NSEC));&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+  hrtimer_forward_now(hrtimer, ns_to_ktime(OPROFILE_TIMER_TICK_NSEC));&lt;/span&gt;
    return HRTIMER_RESTART;
 }
 
&lt;span class=&quot;gu&quot;&gt;@@ -40,7 +42,7 @@ static void __oprofile_hrtimer_start(void *unused)&lt;/span&gt;
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer-&amp;gt;function = oprofile_hrtimer_notify;
 
&lt;span class=&quot;gd&quot;&gt;-  hrtimer_start(hrtimer, ns_to_ktime(TICK_NSEC),&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+  hrtimer_start(hrtimer, ns_to_ktime(OPROFILE_TIMER_TICK_NSEC),&lt;/span&gt;
              HRTIMER_MODE_REL_PINNED);
 }&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
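&lt;p&gt;The magic constant in this patch comes from simple arithmetic: the hrtimer period in nanoseconds for the desired ~4096 Hz sampling rate is just 10^9 divided by the target frequency:&lt;/p&gt;

```python
# hrtimer period in nanoseconds for a ~4096 Hz oprofile sampling rate
target_hz = 4096
tick_nsec = round(10 ** 9 / target_hz)
print(tick_nsec)                   # 244141, the OPROFILE_TIMER_TICK_NSEC value
print(round(10 ** 9 / tick_nsec))  # ~4096 Hz back-check
```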


&lt;h3&gt;Additional verification for the Poisson based stddev estimate (added on 2011-08-28)&lt;/h3&gt;

&lt;p&gt;Let&#39;s take the following profiling session as an example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;Profiling through timer interrupt
samples  %        image name               symbol name
&lt;span class=&quot;m&quot;&gt;52105&lt;/span&gt;    40.0715  djpeg                    jpeg_idct_islow
&lt;span class=&quot;m&quot;&gt;41281&lt;/span&gt;    31.7473  djpeg                    ycc_rgb_convert
&lt;span class=&quot;m&quot;&gt;15126&lt;/span&gt;    11.6327  djpeg                    decode_mcu
&lt;span class=&quot;m&quot;&gt;15001&lt;/span&gt;    11.5366  djpeg                    h2v1_fancy_upsample
&lt;span class=&quot;m&quot;&gt;2029&lt;/span&gt;      1.5604  djpeg                    decompress_onepass
&lt;span class=&quot;m&quot;&gt;1470&lt;/span&gt;      1.1305  libc-2.12.2.so           memset
&lt;span class=&quot;m&quot;&gt;1118&lt;/span&gt;      0.8598  no-vmlinux               /no-vmlinux
&lt;span class=&quot;m&quot;&gt;967&lt;/span&gt;       0.7437  libc-2.12.2.so           _wordcopy_fwd_dest_aligned
&lt;span class=&quot;m&quot;&gt;333&lt;/span&gt;       0.2561  djpeg                    jpeg_fill_bit_buffer
&lt;span class=&quot;m&quot;&gt;69&lt;/span&gt;        0.0531  libc-2.12.2.so           fwrite
&lt;span class=&quot;m&quot;&gt;69&lt;/span&gt;        0.0531  libc-2.12.2.so           write&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The Poisson distribution gives us a theoretical estimate for the standard deviation as the square
root of the number of samples. But just to be sure, we can verify it by running
the same profiling session 10 times and calculating the &lt;a href=&quot;http://en.wikipedia.org/wiki/Standard_deviation#With_sample_standard_deviation&quot;&gt;sample standard deviation&lt;/a&gt;
of the number of samples attributed to each function.&lt;/p&gt;

&lt;table&gt;
&lt;th&gt;function&lt;th colspan=&quot;10&quot;&gt;time spent in the function, measured in oprofile samples&lt;th&gt;mean&lt;th&gt;sample&lt;br&gt;stddev&lt;th&gt;sqrt(mean)
&lt;tr&gt;&lt;td&gt;jpeg_idct_islow&lt;td&gt;52105&lt;td&gt;52171&lt;td&gt;51968&lt;td&gt;52243&lt;td&gt;52389&lt;td&gt;52126&lt;td&gt;52347&lt;td&gt;52217&lt;td&gt;52078&lt;td&gt;52543&lt;td&gt;52218.7&lt;td&gt;169.2&lt;td&gt;228.5
&lt;tr&gt;&lt;td&gt;decode_mcu&lt;td&gt;15126&lt;td&gt;15119&lt;td&gt;15315&lt;td&gt;15060&lt;td&gt;15108&lt;td&gt;15397&lt;td&gt;15227&lt;td&gt;15017&lt;td&gt;15175&lt;td&gt;15138&lt;td&gt;15168.2&lt;td&gt;115.8&lt;td&gt;123.2
&lt;tr&gt;&lt;td&gt;decompress_onepass&lt;td&gt;2029&lt;td&gt;2042&lt;td&gt;2070&lt;td&gt;2012&lt;td&gt;2057&lt;td&gt;2127&lt;td&gt;2022&lt;td&gt;2074&lt;td&gt;2048&lt;td&gt;1992&lt;td&gt;2047.3&lt;td&gt;37.98&lt;td&gt;45.25
&lt;tr&gt;&lt;td&gt;fill_bit_buffer&lt;td&gt;333&lt;td&gt;311&lt;td&gt;333&lt;td&gt;311&lt;td&gt;334&lt;td&gt;297&lt;td&gt;309&lt;td&gt;309&lt;td&gt;336&lt;td&gt;304&lt;td&gt;317.7&lt;td&gt;14.63&lt;td&gt;17.82
&lt;/table&gt;
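&lt;p&gt;For instance, the first row of the table can be reproduced with a few lines of Python (the numbers are the sample counts listed above):&lt;/p&gt;

```python
import math
import statistics

# 10 repeated measurements of jpeg_idct_islow, in oprofile samples
runs = [52105, 52171, 51968, 52243, 52389,
        52126, 52347, 52217, 52078, 52543]

mean = statistics.mean(runs)
stddev = statistics.stdev(runs)   # sample standard deviation
print(round(mean, 1))             # 52218.7
print(round(stddev, 1))           # ~169.2, measured noise
print(round(math.sqrt(mean), 1))  # ~228.5, the Poisson sqrt(N) estimate
```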


&lt;p&gt;By comparing the last two columns in the table, we can see that the values there
are reasonably close to each other. So, assuming a stable test environment with no
background activity from other processes, etc., we can run just one profiling
session and already have a good estimate of the measurement precision for
each function. Still, it is a good idea to repeat the profiling at least one more time
and check whether the results are consistent between runs, in order to rule out any
possible interference from external factors or problems in the whole
setup (see the &#39;ARM Cortex-A8 performance monitoring unit disaster&#39; section).
If the results are not consistent across runs, it makes sense to identify and
eliminate the source of this noise.&lt;/p&gt;

&lt;p&gt;Also, the applicability of the Poisson based standard deviation estimate is limited
to the functions which take a reasonably small percentage of time (as the wikipedia
article says: &lt;i&gt;&quot;The Poisson distribution can be applied to systems with a large
number of possible events, each of which is rare. A classic example is the nuclear decay
of atoms&quot;&lt;/i&gt;). Taking a corner case as an example: if the oprofile log shows
that all the samples belong to a single function (&#39;main&#39;), then the precision
of this measurement would be very high and would only depend on the timer resolution.
The number of samples would be equal to the time taken by the process multiplied
by the oprofile sample collection frequency. But on the positive side, sqrt(N)
still provides a reliable pessimistic estimate, with the real standard deviation
being lower than that.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/08/22/simd-idct-libjpeg-turbo-bitexactness</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/08/22/simd-idct-libjpeg-turbo-bitexactness.html"/>
   <title>SIMD DCT/IDCT in libjpeg-turbo and bit-exactness</title>
   <updated>2011-08-22T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://libjpeg-turbo.virtualgl.org/&quot;&gt;libjpeg-turbo&lt;/a&gt; is currently the fastest
open source jpeg encoder/decoder, to the best of my knowledge. Achieving good
performance in libjpeg-turbo would be impossible without using the SIMD instructions
available in modern processors. The optimizations for MMX/SSE2 capable x86
processors have existed in libjpeg-turbo for a while, and now &lt;a href=&quot;http://sourceforge.net/mailarchive/message.php?msg_id=27971725&quot;&gt;support for ARM NEON is also coming in the next
libjpeg-turbo 1.2 release&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of the important parts of libjpeg-turbo which benefits from SIMD
optimizations is &lt;a href=&quot;http://en.wikipedia.org/wiki/JPEG#Discrete_cosine_transform&quot;&gt;DCT/IDCT&lt;/a&gt;.
For obvious practical reasons (easier testing and maintenance and full
compatibility with the older versions), it makes a lot of sense to ensure that the SIMD
optimized code produces exactly the same results as the C code.
That is, unless there are some really good reasons not to do so (for example,
if the algorithm is a bad match for the instruction set of some particular processor).&lt;/p&gt;

&lt;p&gt;And there are naturally some potential pitfalls on the road to bit-exactness. In order
to use SIMD efficiently, it is important to use the smallest possible data type
in calculations. The C code is happy to use 32-bit variables and
&quot;32-bit * 32-bit -&gt; 32-bit&quot; multiplications. But for the SIMD code,
using 16-bit data means that we can pack more information into a single
register and process more of it in parallel, saving CPU cycles. When using 16-bit
calculations, we need to be sure that there are no unwanted overflows. And doing
things somewhat differently from C always carries the risk of getting somewhat
different results in the end.&lt;/p&gt;

&lt;p&gt;DCT takes 8x8 blocks of samples with values in the [-128, 127] range and produces
8x8 blocks of DCT coefficients in the [-1024, 1023] range. IDCT can convert
the DCT coefficients back to the original 8-bit samples. Mathematically,
the original samples can be perfectly reconstructed. But in practice,
there may be rounding errors and some extra loss of precision due to
quantization. And there is one very important thing to note. Any arbitrary 8x8 block of [-128, 127] samples
passed through DCT produces an 8x8 block of coefficients in the [-1024, 1023] range. But an
arbitrary 8x8 block of [-1024, 1023] coefficients does not necessarily produce an
8x8 block of [-128, 127] samples when passed through IDCT. Some of the samples
may be well outside the [-128, 127] range. Searching on the Internet reveals
&lt;a href=&quot;http://www.sciencedirect.com/science/article/pii/S0923596596000422&quot;&gt;some information&lt;/a&gt; which says
that the range of the IDCT output may be as large as [-1805, 1805]. Obviously, there is no
way for such arbitrarily selected DCT coefficients to have been generated by
the forward DCT from normal [-128, 127] input in the first place. However, it is
possible to hand craft JPEG bitstreams and embed arbitrary DCT coefficients
there, so the decoder has to handle them somehow.&lt;/p&gt;

&lt;p&gt;When developing SIMD optimized IDCT implementation, apparently there are two separate cases to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decoding the files generated by a normal jpeg encoder (DCT coefficients are generated by a normal forward DCT from [-128, 127] samples)&lt;/li&gt;
&lt;li&gt;decoding some bogus out-of-range data (DCT coefficients are generated in some arbitrary way)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;For the former, the decoding result should be well defined and bit-exact when
compared to the C implementation. The latter is a bit of a gray area. On one hand, still
producing the same results as C would be nice. On the other hand, if producing
the same results as C regresses performance, then it is clearly not so desirable.
We may also need to look carefully at the spec, just to see how the out-of-range
DCT coefficient data fits into it and whether it is allowed. What if some cleverly
optimized jpeg encoder tries to use such coefficients for some purpose?&lt;/p&gt;

&lt;p&gt;But now it&#39;s time for some experiments. Generating hand crafted DCT coefficients
is actually quite easy by modifying the libjpeg code and using the cjpeg tool. It is a simple matter of hacking the
&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jcdctmgr.c?revision=658&amp;amp;view=markup&quot;&gt;convsamp&lt;/a&gt; function
and injecting the sample data there.&lt;/p&gt;

&lt;h3&gt;Quirks in the C code&lt;/h3&gt;

&lt;p&gt;The first victim of these experiments is actually not SIMD, but C implementation. The comment from
&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jdmaster.c?revision=658&amp;amp;view=markup&quot;&gt;jdmaster.c&lt;/a&gt;
explains:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;MASK is 2 bits wider than legal sample data, ie 10 bits for 8-bit
samples.  Under normal circumstances this is more than enough range and
a correct output will be generated; with bogus input data the mask will
cause wraparound, and we will safely generate a bogus-but-in-range output.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;So what happens if we deliberately generate a jpeg file which decodes to such
badly out-of-range samples? One variant of the 8x8 DCT coefficients for this purpose is the following:&lt;/p&gt;

&lt;table class=&quot;matrix&quot; style=&quot;table-layout:fixed;&quot;&gt;
&lt;tr&gt;&lt;td style=&quot;width: 30px;&quot;&gt;-1024&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0&lt;td&gt;0
&lt;/table&gt;


&lt;p&gt;And the results of decoding this hand crafted sample are below. You may want to pay
special attention to the leftmost image, because it links to the bogus jpeg file itself
and gets decoded by the jpeg library used by your browser.&lt;/p&gt;

&lt;table class=&quot;standard&quot;&gt;
&lt;td&gt;original file, decoded&lt;br&gt; by your browser
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;view=markup&quot;&gt;jpeg_idct_islow&lt;/a&gt;
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/simd/jiss2int-64.asm?revision=658&amp;view=markup&quot;&gt;jsimd_idct_islow_sse2&lt;/a&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-range-mask.jpg&quot; alt=&quot;2011-08-22-range-mask.jpg&quot; /&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-range-mask-c.png&quot; alt=&quot;2011-08-22-range-mask-c.png&quot; /&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-range-mask-sse2.png&quot; alt=&quot;2011-08-22-range-mask-sse2.png&quot; /&gt;
&lt;/table&gt;


&lt;p&gt;The rightmost image (decoded by the SSE2 implementation from libjpeg-turbo 1.1.1) does not have
any of these range limitation quirks and always performs correct clamping to bring the color into the [0, 255] range.
So the color of some 8x8 tiles gets saturated to white. The C implementation wraps
around and shows the same tiles as black.&lt;/p&gt;
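&lt;p&gt;The difference can be illustrated with a toy model. Note that this deliberately simplifies the real libjpeg range-limit machinery (which uses a 10-bit mask and a lookup table) down to plain saturation vs. 8-bit wraparound:&lt;/p&gt;

```python
def saturate(v):
    # clamp a decoded sample into [0, 255], like the SSE2 code path
    return min(max(v, 0), 255)

def wraparound(v):
    # keep only the low 8 bits, a simplified model of mask-based wraparound
    return v % 256

sample = 300  # a hypothetical out-of-range decoded sample value
print(saturate(sample))    # 255: the tile saturates towards white
print(wraparound(sample))  # 44: the tile comes out dark instead
```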

&lt;h3&gt;Quirks in the SIMD optimized code&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, SIMD relies a lot on 16-bit arithmetic. And looking at the
&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;amp;view=markup&quot;&gt;ISLOW IDCT&lt;/a&gt; C code,
there is an obvious case of potential overflow:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span class=&quot;cm&quot;&gt;/* Odd part per figure 8; the matrix is unitary and hence its&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;     * transpose is its inverse.  i0..i3 are y7,y5,y3,y1 respectively.&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;     */&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;tmp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INT32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wsptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;z1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;z5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MULTIPLY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FIX_1_175875602&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;cm&quot;&gt;/* sqrt(2) * c3 */&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The 16-bit values from wsptr[1], wsptr[3], wsptr[5] and wsptr[7] are all added
together and passed as an argument to the MULTIPLY macro, which is allowed to
treat its arguments as 16-bit values (so this sum must fit in 16 bits).
This can easily overflow on the second pass if the DCT coefficients
fed to the IDCT function contain arbitrary [-1024, 1023] input. The comment
stating that&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-irc&quot; data-lang=&quot;irc&quot;&gt;The outputs of the first pass are scaled up by PASS1_BITS bits so that
they are represented to better-than-integral precision. These outputs
require BITS_IN_JSAMPLE + PASS1_BITS + 3 bits; this fits in a 16-bit word
with the recommended scaling.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;clearly applies only to the case of handling &quot;normal&quot; DCT coefficients
data. Because &quot;BITS_IN_JSAMPLE + PASS1_BITS + 3&quot; is equal to 13, there is
enough headroom to add 4 such values together without
overflowing 16 bits. But again, this is not true for arbitrary hand-crafted
[-1024, 1023] coefficients. In any case, the C implementation
uses 32-bit variables, so this overflow cannot be reproduced with it :)&lt;/p&gt;

&lt;p&gt;The equivalent SSE2 code is a little bit different:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-nasm&quot; data-lang=&quot;nasm&quot;&gt;&lt;span class=&quot;c1&quot;&gt;; -- Odd part&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_JCOEF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmullw&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;XMMWORD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;XMMBLOCK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rdx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;SI&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;ZEOF_ISLOW_MULT_TYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm6&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm4&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;paddw&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm3&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;; xmm5=z3&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;paddw&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm1&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;; xmm7=z4&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;; (Original)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z5 = (z3 + z4) * 1.175875602;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z3 = z3 * -1.961570560;  z4 = z4 * -0.390180644;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z3 += z5;  z4 += z5;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; (This implementation)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z3 = z3 * (1.175875602 - 1.961570560) + z4 * 1.175875602;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;; z4 = z3 * 1.175875602 + z4 * (1.175875602 - 0.390180644);&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;punpcklwd&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;punpckhwd&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;movdqa&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_MF078_F117&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;; xmm2=z3L&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_MF078_F117&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;; xmm0=z3H&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_F117_F078&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;; xmm5=z4L&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pmaddwd&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;xmm7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,[&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;rel&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;PW_F117_F078&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;; xmm7=z4H&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Here only the values of z3 (wsptr[3] + wsptr[7])
and z4 (wsptr[1] + wsptr[5]) are calculated using 16-bit additions and then
used as 16-bit operands for multiplication. The following DCT coefficients
have been hand-crafted to trigger a &quot;wsptr[3] + wsptr[7]&quot; overflow:&lt;/p&gt;

&lt;table class=&quot;matrix&quot;&gt;
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;-299&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;300
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-1004&lt;td&gt;0&lt;td&gt;-416&lt;td&gt;0&lt;td&gt;-1004&lt;td&gt;0&lt;td&gt;416
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-946&lt;td&gt;0&lt;td&gt;-391&lt;td&gt;0&lt;td&gt;-946&lt;td&gt;0&lt;td&gt;392
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-851&lt;td&gt;0&lt;td&gt;-352&lt;td&gt;0&lt;td&gt;-851&lt;td&gt;0&lt;td&gt;352
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;-299&lt;td&gt;0&lt;td&gt;-724&lt;td&gt;0&lt;td&gt;300
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-569&lt;td&gt;0&lt;td&gt;-235&lt;td&gt;0&lt;td&gt;-569&lt;td&gt;0&lt;td&gt;235
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-391&lt;td&gt;0&lt;td&gt;-162&lt;td&gt;0&lt;td&gt;-391&lt;td&gt;0&lt;td&gt;162
&lt;tr&gt;&lt;td&gt;0&lt;td&gt;-199&lt;td&gt;0&lt;td&gt;-82&lt;td&gt;0&lt;td&gt;-199&lt;td&gt;0&lt;td&gt;82
&lt;/table&gt;


&lt;p&gt;And the decoding results of the generated sample are below:&lt;/p&gt;

&lt;table class=&quot;standard&quot; style=&quot;align: center;&quot;&gt;
&lt;td&gt;original file, decoded&lt;br&gt; by your browser
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;view=markup&quot;&gt;jpeg_idct_islow&lt;/a&gt;&lt;br&gt;
(correctly clamped)
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/jidctint.c?revision=658&amp;view=markup&quot;&gt;jpeg_idct_islow&lt;/a&gt;
&lt;td&gt;decoded using&lt;br&gt;&lt;a href=&quot;http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/tags/1.1.1/simd/jiss2int-64.asm?revision=658&amp;view=markup&quot;&gt;jsimd_idct_islow_sse2&lt;/a&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow.jpg&quot; alt=&quot;2011-08-22-z3-overflow.jpg&quot;&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow-c-clamped.png&quot; alt=&quot;2011-08-22-z3-overflow-c-clamped.png&quot;&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow-c.png&quot; alt=&quot;2011-08-22-z3-overflow-c.png&quot;&gt;
&lt;td&gt;&lt;img src=&quot;/images/2011-08-22-z3-overflow-sse2.png&quot; alt=&quot;2011-08-22-z3-overflow-sse2.png&quot;&gt;
&lt;/table&gt;


&lt;p&gt;Funnily enough, the three images on the right are all different (&quot;correctly clamped&quot; is
the case when the C code is tweaked to solve the range problem described in the previous
section). Comparing the leftmost image with each of them can give some
idea about what kind of IDCT implementation might be used on your computer.&lt;/p&gt;

&lt;p&gt;I think it&#39;s necessary to add a disclaimer just in case: all of this only applies to decoding bogus out-of-range data,
so the differences in decoding results can&#39;t be immediately considered a bug.&lt;/p&gt;

&lt;h3&gt;ARM NEON&lt;/h3&gt;

&lt;p&gt;This whole blog post is actually the result of a mini-investigation, intended to clear up the doubts
I had shortly after submitting an
&lt;a href=&quot;http://sourceforge.net/tracker/?func=detail&amp;amp;aid=3394306&amp;amp;group_id=303195&amp;amp;atid=1278160&quot;&gt;ARM NEON optimized ISLOW iDCT patch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just like the SSE2 IDCT, the ARM NEON code also has some overflows for out-of-range data, but
should be perfectly fine for normal JPEG files. It can still be easily tweaked
to ensure no overflows even when handling arbitrary [-1024, 1023] DCT coefficients,
but this may cost a few extra CPU cycles.&lt;/p&gt;

&lt;p&gt;And one more final disclaimer: I&#39;m not a hardcore multimedia expert, so I may easily be wrong. Comments and corrections are surely welcome.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/08/03/discovering-instructions-scheduling-secrets</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/08/03/discovering-instructions-scheduling-secrets.html"/>
   <title>Discovering instructions scheduling secrets</title>
   <updated>2011-08-03T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;p&gt;Knowing the instructions scheduling rules is quite important when implementing
assembly optimizations. That&#39;s especially true for the simple embedded processors
such as ARM or MIPS, which don&#39;t typically implement &lt;a href=&quot;http://en.wikipedia.org/wiki/Out-of-order_execution&quot;&gt;out-of-order execution&lt;/a&gt;
or where the out-of-order instructions execution is just rudimentary at best. Instruction cycle timings are quite well documented
for some processors such as &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0211k/Cjaedced.html&quot;&gt;ARM11&lt;/a&gt;
or &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Cfacfihf.html&quot;&gt;ARM Cortex-A8&lt;/a&gt;,
even sometimes providing a comprehensive &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Babeghic.html&quot;&gt;scheduling example&lt;/a&gt;.
But some processors such as &lt;a href=&quot;http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388f/Cjaedcef.html&quot;&gt;ARM Cortex-A9&lt;/a&gt;
are apparently either too complex or maybe just too new to be described in more detail, and the
instruction cycle timings information is rather poor (more about Cortex-A9 maybe in another blog post).
And some ARM compatible processors even don&#39;t seem to have any public documentation at all.&lt;/p&gt;

&lt;p&gt;Even with good documentation, there can always be some ambiguity or omission of fine details.
For example, ARM Cortex-A8 supports limited dual-issue for NEON instructions. But
can it really sustain an execution rate of 2 instructions per cycle over a long sequence of instructions?
Another example is accumulator forwarding for multiply-accumulate instructions. Using
back-to-back multiply-accumulate instructions is fine, but will the forwarding still work
if an unrelated instruction is inserted between them?&lt;/p&gt;

&lt;p&gt;The solution is really simple. In addition to just reading and (mis)interpreting the manuals,
it makes a lot of sense to verify every important detail by running some tests
and benchmarks, especially considering that it is actually not very difficult at all.
The easy way to do this is to create a *.S file with the sequence
of instructions to be investigated, placing them in a simple loop. Then compile
and run this test program, measuring how much time it takes. Very simple.
And in order to make it easier to convert time into CPU cycles, it makes sense
to choose the iteration count so that the tested instruction sequence executes
a number of times equal to the CPU clock frequency. In this case, the execution
time of the test program in seconds is equal to the number of cycles spent in
one repetition of that sequence.&lt;/p&gt;

&lt;p&gt;Below is a trivial test program (tried on different CPU architectures, not just ARM)
which benchmarks a long sequence of back-to-back ADD instructions.
Addition is a simple and fast operation, which typically takes just 1 cycle to produce
its result. And because each instruction depends on the result of the previous one,
they can&#39;t dual-issue. So on most processors (with some exceptions) this code
will run at exactly 1 cycle per ADD instruction.&lt;/p&gt;

&lt;h3&gt;ARM&lt;/h3&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-gas&quot; data-lang=&quot;gas&quot;&gt;&lt;span class=&quot;na&quot;&gt;.text&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.arch&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;armv7-a&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifndef CPU_CLOCK_FREQUENCY&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#error CPU_CLOCK_FREQUENCY must be defined&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#define LOOP_UNROLL_FACTOR   100&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;main:&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;push&lt;/span&gt;        &lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;r4-r12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;ldr&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;CPU_CLOCK_FREQUENCY&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;b&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;f&lt;/span&gt;

    &lt;span class=&quot;na&quot;&gt;.balign&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;1:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.rept&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.endr&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;subs&lt;/span&gt;        &lt;span class=&quot;no&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;bne&lt;/span&gt;         &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;b&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;         &lt;span class=&quot;no&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;#0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;pop&lt;/span&gt;         &lt;span class=&quot;err&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;r4-r12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;pc&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;And the results of this benchmark from ARM Cortex-A8 @1GHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1000000000&lt;/span&gt; bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m5.017s
user    0m5.016s
sys     0m0.000s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;A few more explanations about this test program and the interpretation of its results. The &#39;.rept LOOP_UNROLL_FACTOR / ... / .endr&#39; block repeats the code contained
inside it LOOP_UNROLL_FACTOR times (more information about GNU assembler directives can be found by reading &#39;info as&#39;).
This reduces the loop overhead to the point where it becomes insignificant and can be ignored. Unrolling even more would help, though we need to be careful
not to exceed the instruction cache size. The end result is that the block of 5 ADD
instructions is executed CPU_CLOCK_FREQUENCY times when running this test program.
If the test program takes 5 seconds to execute, then the sequence of instructions
inside the .rept block needs 5 cycles. A non-integer number of seconds would
mean that something likely went wrong.&lt;/p&gt;

&lt;p&gt;Multiple variations are also possible. Earlier I posted some &lt;a href=&quot;http://lists.freedesktop.org/archives/pixman/attachments/20110410/d6062de3/attachment.obj&quot;&gt;code template for experimenting with NEON instructions scheduling&lt;/a&gt;,
tailored for tuning ARM NEON optimizations specifically for the &lt;a href=&quot;http://cgit.freedesktop.org/pixman&quot;&gt;pixman library&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;MIPS&lt;/h3&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-gas&quot; data-lang=&quot;gas&quot;&gt;&lt;span class=&quot;na&quot;&gt;.text&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.set&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;noreorder&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifndef CPU_CLOCK_FREQUENCY&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#error CPU_CLOCK_FREQUENCY must be defined&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#define LOOP_UNROLL_FACTOR  100&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.type&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;@function&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;main:&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;li&lt;/span&gt;      &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;CPU_CLOCK_FREQUENCY&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;1:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.rept&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addu&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t0&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.endr&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;bnez&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;b&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addiu&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;$t9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;

        &lt;span class=&quot;nf&quot;&gt;j&lt;/span&gt;       &lt;span class=&quot;no&quot;&gt;$ra&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;li&lt;/span&gt;      &lt;span class=&quot;no&quot;&gt;$v0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;MIPS74K @480MHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;480000000&lt;/span&gt; bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m10.064s
user    0m10.060s
sys     0m0.003s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;MIPS24Kc @680MHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;680000000&lt;/span&gt; bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m5.040s
user    0m5.030s
sys     0m0.000s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This is a MIPS variant of the same benchmarking code. The results show that the MIPS74K has a higher addition latency than the MIPS24Kc and needs 2 cycles per addition.&lt;/p&gt;
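&lt;p&gt;The latency figure can be recovered from the timings with simple arithmetic. This is a minimal sketch, assuming the MIPS loop is sized to run for exactly 5 seconds at one cycle per addition (an inference from the measured timings, not something taken from the benchmark source):&lt;/p&gt;

```python
# Estimate ADD latency from wall-clock time. ASSUMPTION: the benchmark
# executes a chain of dependent additions sized so that a run takes
# 5 seconds at 1 cycle per addition; each extra cycle of latency then
# adds another 5 seconds of runtime.

def cycles_per_add(elapsed_seconds, baseline_seconds=5.0):
    return elapsed_seconds / baseline_seconds

print(round(cycles_per_add(10.064)))  # MIPS74K  at 480MHz
print(round(cycles_per_add(5.040)))   # MIPS24Kc at 680MHz
```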

&lt;h3&gt;x86, and also taking a look at SMT&lt;/h3&gt;

&lt;p&gt;A similar benchmarking method can also be extended to analyze the efficiency of &lt;a href=&quot;http://en.wikipedia.org/wiki/Simultaneous_multithreading&quot;&gt;SMT&lt;/a&gt;-capable
processors (Intel Atom, IBM Cell PPE and friends). Because the resources of a single CPU core are shared
between two hardware threads, perfect scalability is impossible, and it is interesting
to see how much SMT can actually help on real or artificial workloads. The test program for x86 may look like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-gas&quot; data-lang=&quot;gas&quot;&gt;&lt;span class=&quot;na&quot;&gt;.intel_syntax&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;noprefix&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.text&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;main&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;fork&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;.global&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;wait&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifndef CPU_CLOCK_FREQUENCY&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#error CPU_CLOCK_FREQUENCY must be defined&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#define LOOP_UNROLL_FACTOR  100&lt;/span&gt;

&lt;span class=&quot;nl&quot;&gt;main:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#ifdef TWO_THREADS&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;fork&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;ecx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;CPU_CLOCK_FREQUENCY&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;jmp&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;f&lt;/span&gt;

    &lt;span class=&quot;na&quot;&gt;.balign&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;1:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.rept&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;LOOP_UNROLL_FACTOR&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addps&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;xmm1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;xmm1&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addps&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;xmm2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;xmm2&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;addps&lt;/span&gt;   &lt;span class=&quot;no&quot;&gt;xmm3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;xmm3&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;.endr&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;dec&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;ecx&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;jnz&lt;/span&gt;     &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;b&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;#ifdef TWO_THREADS&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;push&lt;/span&gt;    &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;    &lt;span class=&quot;no&quot;&gt;wait&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;esp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#endif&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;mov&lt;/span&gt;     &lt;span class=&quot;no&quot;&gt;eax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;ret&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;And the results of this benchmark from Intel Atom N450 @1.66GHz:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -m32 -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1660000000&lt;/span&gt; ht-bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m6.034s
user    0m6.032s
sys     0m0.000s

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc -m32 -DCPU_CLOCK_FREQUENCY&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1660000000&lt;/span&gt; -DTWO_THREADS ht-bench.S &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ./a.out

real    0m9.088s
user    0m18.097s
sys     0m0.028s&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;When running just one thread, 6 cycles are needed for each group of 9 instructions
in the loop body (the ADDPS instructions can dual-issue with the ADD instructions,
so the whole loop is limited only by the performance of the ADD instructions). Two
threads need 9 cycles for each 2 * 9 = 18 instructions, reaching the maximum
theoretically possible IPC = 2 for this processor.&lt;/p&gt;
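&lt;p&gt;The single-thread timing can be double-checked with arithmetic taken directly from the loop structure: the only input not visible in the code is the 6-cycles-per-rep figure quoted above.&lt;/p&gt;

```python
# Predicted wall-clock time of the single-thread run on Atom N450,
# derived from the loop structure of the benchmark above.

CPU_CLOCK_FREQUENCY = 1_660_000_000   # matches the -DCPU_CLOCK_FREQUENCY build flag
LOOP_UNROLL_FACTOR = 100
CYCLES_PER_REP = 6                    # 9 instructions, limited by the 6 dependent ADDs

iterations = CPU_CLOCK_FREQUENCY // LOOP_UNROLL_FACTOR   # initial value of ecx
total_cycles = iterations * LOOP_UNROLL_FACTOR * CYCLES_PER_REP
predicted_seconds = total_cycles / CPU_CLOCK_FREQUENCY
print(predicted_seconds)              # close to the measured 6.034s

ipc = 9 / CYCLES_PER_REP              # instructions per cycle for one thread
print(round(ipc, 2))
```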

&lt;p&gt;This particular benchmark is quite interesting, because I used it to verify
a hypothesis suggested by another person: that at any given CPU cycle,
instructions from only one hardware thread may be processed (either a single
instruction or a pair of instructions), but never from both threads at once.
However, since 12 ADD instructions are executed in 9 cycles and they can&#39;t dual
issue within a single thread, the processor has no choice but to occasionally
execute a pair of ADD instructions fetched from different threads simultaneously.&lt;/p&gt;

&lt;p&gt;Still, there is something wrong with the Intel Atom hyper-threading
implementation, because removing all the ADDPS instructions from the
benchmark program actually causes a performance regression in the multithreaded case.
It regresses to 12 cycles per 2 * 6 = 12 remaining ADD instructions,
so hyper-threading becomes useless: two threads running simultaneously need
exactly the same time to complete as running just a single thread twice.
Those extra ADDPS instructions work as a kind of &quot;catalyst&quot; and
improve multithreaded performance for this particular code
sequence!&lt;/p&gt;
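&lt;p&gt;Putting the numbers together: a small sketch using only the per-rep cycle counts quoted above (smt_speedup is just a hypothetical helper name for this comparison):&lt;/p&gt;

```python
# SMT throughput gain: time to run the same total work on two hardware
# threads together, versus running the single-thread case twice in a row.

def smt_speedup(cycles_per_rep_one_thread, cycles_per_rep_two_threads):
    return (2 * cycles_per_rep_one_thread) / cycles_per_rep_two_threads

# With the ADDPS "catalyst": 6 cycles/rep alone, 9 cycles/rep per thread
# when both threads run -> a modest gain from hyper-threading.
print(round(smt_speedup(6, 9), 2))

# Without ADDPS: 12 cycles per rep per thread, i.e. exactly twice the
# single-thread time -> no gain at all.
print(smt_speedup(6, 12))
```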

&lt;h3&gt;But what about the hardware performance counters available in modern processors?&lt;/h3&gt;

&lt;p&gt;The hardware performance counters are surely useful. Moreover, they can monitor many
interesting events in addition to a simple cycle counter, which exposes
additional information about what is happening inside the processor
and helps to understand it better.&lt;/p&gt;

&lt;p&gt;However, simple time-based tests are just fine and may even be preferable in some
cases. The most important one is when you want to ask somebody else to
run a benchmark on their hardware, but the performance counters are not
accessible from userspace by default and that person is reluctant
to touch the kernel.&lt;/p&gt;

&lt;p&gt;On the other hand, the simple timer-based tests described here are problematic
when something like &lt;a href=&quot;http://en.wikipedia.org/wiki/Intel_Turbo_Boost&quot;&gt;turbo-boost&lt;/a&gt;
is supported by the hardware and enabled, causing the CPU clock frequency to drift.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <id>http://ssvb.github.io/2011/07/30/origenboard-early-adopter</id>
   <link type="text/html" rel="alternate" href="http://ssvb.github.io/2011/07/30/origenboard-early-adopter.html"/>
   <title>Origenboard, early adopter impressions</title>
   <updated>2011-07-30T00:00:00+00:00</updated>
    <author>
      <name>Siarhei Siamashka</name>
      <uri>http://ssvb.github.io/</uri>
    </author>
   <content type="html">&lt;h3&gt;A little bit of rant&lt;/h3&gt;

&lt;p&gt;Since a few days ago, I&#39;m a somewhat happy owner of an &lt;a href=&quot;http://www.origenboard.org/&quot;&gt;origenboard&lt;/a&gt; from the first batch.
So why am I not totally happy yet? I expected the board to be easy to get up
and running, considering that the same
&lt;a href=&quot;http://www.samsung.com/global/business/semiconductor/productInfo.do?fmly_id=844&amp;amp;partnum=Exynos%204210&quot;&gt;Exynos 4210 SoC&lt;/a&gt;
is used in the rather popular
&lt;a href=&quot;http://en.wikipedia.org/wiki/Samsung_Galaxy_S_II&quot;&gt;Samsung Galaxy S2&lt;/a&gt; smartphone already
available on the market (which means that the SoC itself should not have any serious hardware
problems by now), and also because of &lt;a href=&quot;http://www.youtube.com/watch?v=vLUne-yDzVE&quot;&gt;demos like this&lt;/a&gt; (which means that at least Linaro should have some usable Linux kernel to run them on).
So there was every reason to expect a validation SD card image readily available
for download, along with some basic getting started instructions, right?&lt;/p&gt;

&lt;p&gt;The reality is that the only support area on the &lt;a href=&quot;http://www.origenboard.org/&quot;&gt;origenboard&lt;/a&gt; website is a pre-moderated forum, where
a few fellow users &lt;a href=&quot;http://www.origenboard.org/forum/viewtopic.php?f=8&amp;amp;t=4&quot;&gt;have asked about the sources of u-boot&lt;/a&gt;.
My reply to that topic, trying to share the information with them, has not yet passed moderation as of today.
Hopefully the initial mess will be resolved soon and there will be a usable communication
channel for origenboard users. But considering that there are only &lt;a href=&quot;http://www.origenboard.org/news/?p=18&quot;&gt;30 days of warranty&lt;/a&gt;,
it is a bit disturbing not to be able to run a validation image and test the board for hardware defects right away.&lt;/p&gt;

&lt;p&gt;Because the origenboard website refers to &lt;a href=&quot;http://www.linaro.org/&quot;&gt;Linaro&lt;/a&gt; as the intended provider
of the software part, I tried to see whether Linaro can offer something usable for the origenboard right now.
The information currently seems to be scarce and scattered (I looked at the downloads area, the wiki
pages and asked around on the #linaro IRC channel). And the downside is that the maturity of
the &lt;a href=&quot;http://git.linaro.org/gitweb?p=people/angus/linux-linaro-2.6.39.git;a=shortlog;h=refs/tags/2.6.39-2011.07&quot;&gt;currently provided linaro kernel 2.6.39-2011.07&lt;/a&gt; does not
appear to be very good yet.&lt;/p&gt;

&lt;p&gt;My experience with this board so far is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://git.linaro.org/gitweb?p=people/angus/linux-linaro-2.6.39.git;a=shortlog;h=refs/tags/2.6.39-2011.07&quot;&gt;linaro kernel&lt;/a&gt;: USB does not work (so no USB ethernet), and only a single CPU core is available. There is also some output on HDMI, but the monitor reports an &quot;out of range&quot; error&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://git.insignal.co.kr/?p=linux-2.6-insignal-dev.git;a=shortlog;h=3645a1cb402be68b83feb9f9c8d7af2728cc8878&quot;&gt;insignal kernel&lt;/a&gt;: USB works, both CPU cores are available (though running at only 1GHz), no HDMI output to monitor at all (and a few random configuration tweaks did not help)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;But in any case, the insignal kernel at least provides a usable headless configuration,
and this is surely better than nothing. On the positive side, the current situation
inspired me to finally start a blog and post about something. Hopefully blogging will
be entertaining for both me and the prospective readers :)&lt;/p&gt;

&lt;h3&gt;Board setup notes&lt;/h3&gt;

&lt;p&gt;The instructions below are not complete, but they are supposed to highlight the most important
steps. All of this has been discovered by trial and error
and by bugging the relevant people on the #linaro IRC channel (thanks for their patience). A total
newbie may still get stuck, but this information should be sufficient for anyone
with some experience installing Linux on other ARM development boards.&lt;/p&gt;

&lt;p&gt;Also, this information is likely to become outdated very soon (assuming it was useful in the first place).&lt;/p&gt;

&lt;h4&gt;u-boot and linux kernel sources&lt;/h4&gt;

&lt;p&gt;The combination of u-boot and kernel that I&#39;m using at the moment is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;u-boot: &lt;a href=&quot;http://git.linaro.org/gitweb?p=people/angus/u-boot.git;a=shortlog;h=refs/tags/linaro-origen-2011.07&quot;&gt;linaro-origen-2011.07&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;kernel: &lt;a href=&quot;http://git.insignal.co.kr/?p=linux-2.6-insignal-dev.git;a=shortlog;h=3645a1cb402be68b83feb9f9c8d7af2728cc8878&quot;&gt;insignal 3645a1cb402be68b83feb9f9c8d7af2728cc8878&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This kernel needs to be patched when used with this particular u-boot (as advised by linaro guys):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-diff&quot; data-lang=&quot;diff&quot;&gt;&lt;span class=&quot;gh&quot;&gt;diff --git a/arch/arm/mach-s5pv310/mach-origen.c b/arch/arm/mach-s5pv310/mach-origen.c&lt;/span&gt;
&lt;span class=&quot;gh&quot;&gt;index e24e8d1..977f0c9 100644&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;--- a/arch/arm/mach-s5pv310/mach-origen.c&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+++ b/arch/arm/mach-s5pv310/mach-origen.c&lt;/span&gt;
&lt;span class=&quot;gu&quot;&gt;@@ -549,7 +549,7 @@ static void __init origen_fixup(struct machine_desc *desc,&lt;/span&gt;
    mi-&amp;gt;nr_banks = 2;
 }
 
&lt;span class=&quot;gd&quot;&gt;-#if 0&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+#if 1&lt;/span&gt;
 MACHINE_START(ORIGEN, &amp;quot;ORIGEN&amp;quot;)
 #else
 MACHINE_START(SMDKV310, &amp;quot;SMDKV310&amp;quot;)
&lt;span class=&quot;gd&quot;&gt;-- &lt;/span&gt;
1.7.3.4&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Compile u-boot (to get u-boot-mmc-spl.bin and u-boot.bin):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;make &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- mrproper
make &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- origen_config
make &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi-&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Compile the kernel (to get uImage):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- mrproper
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- origen_android_defconfig
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- menuconfig
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- -j8 uImage
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- -j8 modules
scp arch/arm/boot/uImage root@origen:/mnt/mmcblk0p1/uImage
make &lt;span class=&quot;nv&quot;&gt;ARCH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm &lt;span class=&quot;nv&quot;&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;arm-none-linux-gnueabi- modules_install &lt;span class=&quot;nv&quot;&gt;INSTALL_MOD_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/mnt/origen-nfs-root&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Be sure to tweak configuration options as needed (add drivers for USB ethernet adapters, statically compile in ext3 support, disable CONFIG_ANDROID_PARANOID_NETWORK, etc.)&lt;/p&gt;

&lt;h4&gt;SD card layout&lt;/h4&gt;

&lt;p&gt;This section is based on the information from &lt;a href=&quot;https://wiki.linaro.org/Boards/Origen/Setup&quot;&gt;linaro wiki&lt;/a&gt;.
In order to successfully boot the system, u-boot binary needs to be put into certain predefined areas on SD card.&lt;/p&gt;

&lt;table border=1&gt;&lt;tr&gt;
&lt;td colspan=&quot;4&quot; style=&quot;text-align:center&quot;&gt;Raw Sectors (sector size = 512 bytes)&lt;/td&gt;
  &lt;td colspan=&quot;3&quot; style=&quot;text-align:center&quot;&gt;Partitions &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1 to 32&lt;/td&gt;
  &lt;td&gt;33 to 64&lt;/td&gt;
  &lt;td&gt;65 to 1088&lt;/td&gt;
  &lt;td&gt;FAT partition&lt;/td&gt;
  &lt;td&gt;any linux partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;MBR&lt;/td&gt;
  &lt;td&gt;u-boot-mmc-spl.bin&lt;/td&gt;
  &lt;td&gt;u-boot environment &lt;/td&gt;
  &lt;td&gt;u-boot.bin &lt;/td&gt;
  &lt;td&gt;uImage (kernel)&lt;/td&gt;
  &lt;td&gt;root filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;Writing u-boot into raw sectors of SD card (assuming that SD card is detected as /dev/sdb):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;c&quot;&gt;# dd if=u-boot-mmc-spl.bin of=/dev/sdb bs=512 seek=1&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# dd if=u-boot.bin of=/dev/sdb bs=512 seek=65&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
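&lt;p&gt;For reference, the sector numbers translate into dd byte offsets and maximum area sizes as follows (a small sketch of the arithmetic behind the seek values, using the table above):&lt;/p&gt;

```python
# With bs=512, "dd seek=N" starts writing at byte N * 512, so each raw
# area of the SD card layout maps to a fixed byte range.

SECTOR_SIZE = 512

# (first sector, number of sectors) for each raw area, from the table above
layout = {
    "u-boot-mmc-spl.bin": (1, 32),      # dd seek=1
    "u-boot environment": (33, 32),
    "u-boot.bin":         (65, 1024),   # dd seek=65
}

for name, (first, count) in layout.items():
    offset = first * SECTOR_SIZE
    max_size = count * SECTOR_SIZE
    print(f"{name}: starts at byte {offset}, fits at most {max_size} bytes")
```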


&lt;h4&gt;Install rootfs for the distro of your choice and boot the system&lt;/h4&gt;

&lt;p&gt;Typical u-boot environment (when using rootfs from SD card instead of NFS):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sh&quot; data-lang=&quot;sh&quot;&gt;&lt;span class=&quot;nv&quot;&gt;baudrate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;115200
&lt;span class=&quot;nv&quot;&gt;bootargs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/mmcblk0p2 rw rootwait &lt;span class=&quot;nv&quot;&gt;console&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ttySAC2,115200
&lt;span class=&quot;nv&quot;&gt;bootcmd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;fatload mmc &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;40007000&lt;/span&gt; uImage&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; bootm 40007000
&lt;span class=&quot;nv&quot;&gt;bootdelay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3
&lt;span class=&quot;nv&quot;&gt;stderr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;serial
&lt;span class=&quot;nv&quot;&gt;stdin&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;serial
&lt;span class=&quot;nv&quot;&gt;stdout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;serial&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;But in order to get a login prompt on the serial console, &lt;b&gt;s3c2410_serial2&lt;/b&gt; (not &lt;b&gt;ttySAC2&lt;/b&gt;) needs to be added to /etc/inittab and /etc/securetty. That&#39;s a bit weird, but I have not looked into it yet.&lt;/p&gt;

&lt;p&gt;Finally turn on the board by pressing &lt;b&gt;switch&lt;/b&gt; and then &lt;b&gt;power&lt;/b&gt; button.&lt;/p&gt;

&lt;h4&gt;Update from 2011-09-19&lt;/h4&gt;

&lt;p&gt;The Linaro kernel is getting better. It now supports cpufreq (so the 1.2GHz CPU clock frequency is usable), has
somewhat working USB support (it is very slow and sometimes gets stuck for a few seconds), and somewhat
usable HDMI output, which is hardcoded to a 1920x1080 resolution and uses only a small 1024x600 area in
the center. Still, compared to the initial state, this is a major improvement.&lt;/p&gt;

&lt;p&gt;I guess, everything is going to be in a much better shape in a few more months.&lt;/p&gt;
</content>
 </entry>
 

</feed>
