<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Computer Vision Software</title>
	<atom:link href="http://www.computer-vision-software.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.computer-vision-software.com/blog</link>
	<description>Rhonda Ltd., computer vision software blog</description>
	<lastBuildDate>Wed, 01 Jun 2022 02:55:39 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Face recognition</title>
		<link>http://www.computer-vision-software.com/blog/2022/06/face-recognition-2/</link>
					<comments>http://www.computer-vision-software.com/blog/2022/06/face-recognition-2/#respond</comments>
		
		<dc:creator><![CDATA[rhondasw]]></dc:creator>
		<pubDate>Wed, 01 Jun 2022 02:24:35 +0000</pubDate>
				<category><![CDATA[Demo]]></category>
		<category><![CDATA[Demo video]]></category>
		<category><![CDATA[Demo videos]]></category>
		<category><![CDATA[Face recognition]]></category>
		<category><![CDATA[YouTube]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=301</guid>

					<description><![CDATA[The face recognition demo shows training of a person&#8217;s facial features from a single photo and subsequent face matching on a live video stream, using VisionLabs&#8217; library integrated on the H22 System on a Module (SoM). The H22 SoM, designed in-house, is a power-efficient camera platform for high-resolution video encoding and live video streaming. The core of [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>The face recognition demo shows training of a person’s facial features from a single photo and subsequent face matching on a live video stream, using VisionLabs’ library integrated on the <strong>H22 System on a Module</strong> (SoM).</p>



<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Ambarella H22 SoC-based| Face Recognition Demo" width="500" height="281" src="https://www.youtube.com/embed/k7v8mfbmtxo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<span id="more-301"></span>



<p>The H22 SoM, designed in-house, is a power-efficient camera platform for high-resolution video encoding and live video streaming. The core of the SoM platform is an Ambarella H22S85N™ System on a Chip that integrates an advanced image processing pipeline, H.265 (HEVC) and H.264 (AVC) encoders, and a powerful quad-core ARM® Cortex™-A53 CPU for advanced business logic such as computer vision, flight control, WiFi streaming, and other user applications. The H22 SoM is supplied with the <em>SoM SDK</em> – a Linux-based toolchain that allows user-level applications to be executed on the ARM core. To speed up development, a series of reference code samples is provided for the SoM SDK.</p>



<p>The demo system implements a face training scenario using a simple mobile app. A single photo, captured with a mobile phone and added to the database, is enough for the neural network to learn a face. Photos with name tags stored in the mobile app are transferred via WiFi onto the SoM, which extracts face descriptors and carries on with the recognition scenario. Recognition occurs in real time on faces detected in the camera’s field of view. The on-screen markup is straightforward: red frames and “Unknown” tags for people not found in the database, green frames and a name tag for people from the database, and grey frames without a tag while a detected face is still being processed by the recognition algorithm.</p>



<p>This basic face detection and tracking algorithm was put together for demo purposes. More robust solutions are to be selected for practical usage scenarios; one such solution will be covered in an upcoming post.</p>



<p>Despite being only a demo implementation, the high resolution and decent image quality enable precise face detection and recognition under indoor lighting in crowded conditions. Face recognition could be a value-added feature for security applications such as seamless entry control.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2022/06/face-recognition-2/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Pose Estimation and Activity recognition demo</title>
		<link>http://www.computer-vision-software.com/blog/2022/04/pose-estimation-and-activity-recognition-demo/</link>
					<comments>http://www.computer-vision-software.com/blog/2022/04/pose-estimation-and-activity-recognition-demo/#respond</comments>
		
		<dc:creator><![CDATA[rhondasw]]></dc:creator>
		<pubDate>Fri, 22 Apr 2022 03:27:15 +0000</pubDate>
				<category><![CDATA[Demo]]></category>
		<category><![CDATA[Demo video]]></category>
		<category><![CDATA[Demo videos]]></category>
		<category><![CDATA[OpenCV]]></category>
		<category><![CDATA[YouTube]]></category>
		<category><![CDATA[Object Recognition]]></category>
		<category><![CDATA[Object Tracking]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=294</guid>

					<description><![CDATA[This demo showcases real-time Human Pose Estimation, based on the Open Pose library ported onto the camera platform, and Rhonda&#8217;s purpose-designed Activity Recognition neural network for human behavior recognition. The two Deep Learning Neural Networks (DNNs), along with the video pipeline, run on the Rhonda Software CV22 System on a Module (CV22 SoM). CV22 [&#8230;]]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Ambarella CV22 SoC-based Activity Recognition Demo" width="500" height="281" src="https://www.youtube.com/embed/nVAGCLpMNGk?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>This demo showcases real-time Human Pose Estimation, based on the Open Pose library ported onto the camera platform, and Rhonda’s purpose-designed Activity Recognition neural network for human behavior recognition. The two Deep Learning Neural Networks (DNNs), along with the video pipeline, run on the Rhonda Software CV22 System on a Module (CV22 SoM).</p>



<span id="more-294"></span>



<p>The CV22 SoM, designed in-house as a low-power camera platform, is capable of running multiple neural networks while providing superior image quality. The core of the SoM platform is an Ambarella CV22 System on a Chip – an ARM-based Image Signal Processor with a DNN inference acceleration engine, implemented on a single die.</p>



<p>Both CV applications run simultaneously. The <em>Pose Estimation</em> network performs human body detection in a full 4K frame, and the figures recognized in the camera’s field of view are visualized with “skeleton-like” pose markups. A blob of pixels around a foreground skeleton selected within the region of interest is passed to the <em>Activity Recognition DNN</em>.</p>



<p>The activity recognition algorithm is a simple yet robust demo built by Rhonda’s CV team from scratch and trained to identify several activity types: walking, standing, welcome hand gestures (high-five), jumping jacks, and body-weight squats. The recognized activity for the foreground body is displayed in the upper-left corner of the screen.</p>



<p>After the initial port onto the CV22 platform, the Open Pose algorithm delivered a frame rate of 1 frame per second. It took a number of optimization procedures performed by Rhonda’s CV experts (such as pruning, quantization, and dedicated retraining) to achieve a fifteenfold acceleration.</p>



<p>The system can be trained for different use cases, such as
security, elderly care, production automation, sports activity analysis, and
more.</p>



<p>For demo and testing purposes we&#8217;ve deployed a setup with HDMI video injection to show the platform&#8217;s recognition capabilities on additional activities.</p>



<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe title="Ambarella CV22 SoC-based Activity Recognition Demo with Injected Video Signal on the Camera Edge" width="500" height="281" src="https://www.youtube.com/embed/Gd7Ymw_d85c?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>



<p>As a road-safety application example, Rhonda Software has assembled the <em>Pedestrian Detection</em> demo, based on the same optimized port of the Open Pose library. The algorithm is applied under automotive conditions to detect pedestrians as road traffic participants.</p>



<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="Ambarella CV22 SoC-based Pedestrian Detection Demo" width="500" height="281" src="https://www.youtube.com/embed/5C4vQ-MIT2E?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div></figure>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2022/04/pose-estimation-and-activity-recognition-demo/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Character Generator for Lattice HDR-60 FPGA Board</title>
		<link>http://www.computer-vision-software.com/blog/2013/02/character-generator-for-lattice-hdr-60-fpga-board/</link>
					<comments>http://www.computer-vision-software.com/blog/2013/02/character-generator-for-lattice-hdr-60-fpga-board/#comments</comments>
		
		<dc:creator><![CDATA[Yuri Vashchenko]]></dc:creator>
		<pubDate>Fri, 08 Feb 2013 00:48:00 +0000</pubDate>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[HW]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=185</guid>

					<description><![CDATA[Introduction Rhonda Software specializes in developing video analytics algorithms, including hardware development for FPGAs. The Lattice HDR-60 evaluation board was selected as a development platform. A typical development cycle consists of implementing all required modules in the VHDL or Verilog hardware description language and then debugging them in a simulator. When debugging of individual components is complete, they [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><span style="font-size: 2em;">Introduction</span></p>
<p style="text-align: justify;">Rhonda software specializes in developing video analytic algorithms, including hardware development for FPGA. Lattice HDR-60 Evaluation board was selected as a development platform. A typical development cycle consists of implementing all required modules in VHDL or Verilog programming language and then debugging them in a simulator. When debugging of individual components is complete, they are integrated and tested on actual FPGA hardware. If something is not working as it should, debugging the hardware video analytics algorithms on the actual hardware can be a challenge, especially if no soft-core CPU is instantiated. HDR-60 board has a camera sensor (input) and an HDMI output. So, many video analytic algorithms take input video signal from the camera, process it and send resulted output video stream to HDMI. If something is not working and the results you see are not what you expected, you have very limited means of debugging.</p>
<p style="text-align: justify;"><span id="more-185"></span></p>
<p style="text-align: justify;">One of such means is the Reveal Analyzer. It can “record” values of different signals on predefined triggers and it is possible to see and analyze the results later. Although Reveal analyzer is a very useful tool but it also has some limitations. First, the learning curve is steep, and you need to spend a lot of time before it can record and show you the values of signals you wanted. Moreover, the more signals you are interested in, the more on-chip memory it is required to store these values. Finally, Reveal analyzer modifies design, uses extra resources and you need to rebuild the whole design every time some changes in monitored signals are made. For complex designs there could be not enough space or memory left on chip. Moreover, rebuilding the design could take hours.</p>
<p style="text-align: justify;">Besides Reveal analyzer, there is a programmable led light on EBR-60 that can be turned on or off. It is helpful, but it is only 1 bit of information and it is very difficult to output a number to that led.</p>
<p style="text-align: justify;">To make hardware debugging easier we designed a character generator. Using it, a developer can “print” on the HDMI output any numerical/text messages.</p>
<p style="text-align: justify;">Ability to print text and other data is useful not only for debugging purposes – output of data and messages like current system time, system uptime, number of processed objects, etc. improves usability of designs created for FPGA board.</p>
<p style="text-align: justify;">Data from the camera sensor, after passing the imaging pipeline (that may include sensor controller, debayer and tonemapper modules) arrives in the form of 8 bits (for grayscale) or 24 bits (for color video) per pixel. In addition, sensor provides 2 control signals, <i>lva</i>l (line valid, “1” for valid pixels in line or “0” for blanks between lines) and <i>fval</i> (frame valid, “1” for valid frame lines or “0” for blank lines). When <i>fval</i> = ‘1’ and <i>lval </i>= ‘1’ the pixel is valid and will be displayed on the screen. Pixel data, fval and lval change every clock tick. The frequency of pixel clock depends on sensor configuration. Default settings produce frequency of 74.25 MHz. HDR-60 sensor supports resolution of 1280&#215;720 at 60 frames per second (60 Hz).</p>
<p style="text-align: justify;">We want our character generator to print characters on top of the image from camera (overlay mode).</p>
<p style="text-align: justify;">A straightforward method of character output could be as follows:</p>
<ol>
<li>Copy the original frame into a frame buffer</li>
<li>For each character in the output text string:
<ol>
<li>Extract the corresponding font matrix (a graphical representation of this character) from the font ROM</li>
<li>For each pixel in the font matrix:
<ol>
<li>Calculate the corresponding screen coordinates and address in the frame buffer</li>
<li>If the pixel is visible, replace it in the frame buffer with the text color; otherwise leave it as is</li>
</ol>
</li>
</ol>
</li>
<li>Send the contents of the modified frame buffer to the video output</li>
</ol>
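<p>The steps above can be sketched in Python (a behavioral illustration only, not the VHDL used on the board; the tiny <code>FONT</code> table and helper names are hypothetical):</p>

```python
# Sketch of the frame-buffer approach: copy the frame, stamp each
# character's font matrix into it, then send the buffer to the output.
# The 1-bit 8x16 font here is a hypothetical stand-in for a font ROM.

CHAR_W, CHAR_H = 8, 16

# Minimal fake font: each character maps to 16 rows of 8 bits.
FONT = {
    "1": [0b00011000] * 16,   # a crude vertical bar standing in for '1'
}

def print_string(frame, text, x0, y0, color):
    """Overlay `text` onto `frame` (a list of lists of pixel values)."""
    out = [row[:] for row in frame]            # 1. copy frame into buffer
    for i, ch in enumerate(text):              # 2. for each character
        matrix = FONT[ch]                      # 2.1 font matrix from "ROM"
        for yc in range(CHAR_H):               # 2.2 for each font pixel
            for xc in range(CHAR_W):
                if (matrix[yc] >> (CHAR_W - 1 - xc)) & 1:
                    # visible pixel: replace with text color
                    out[y0 + yc][x0 + i * CHAR_W + xc] = color
    return out                                 # 3. buffer goes to output
```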
<p style="text-align: justify;">However sometimes we do not have a frame buffer in the design (for example, there is not enough memory for it). In such cases characters should be printed in a streaming (online) mode, i.e. module always deals with the only current pixel for which it decides if current pixel should go to output “as is” or it should be replaced with a text color.</p>
<p style="text-align: justify;">The following two pictures illustrate both approaches of printing characters. The first approach (with a frame buffer) is more suitable for software implementation. The second (online) approach is well-suited for the FPGA hardware design, and it was implemented. Both examples show the process of printing of the same number (1234). In both cases the printing is in progress.</p>
<p style="text-align: justify;">Each square cell represents a single pixel from the camera sensor. Blue pixels are outside the printed text area – they go to output as is. Green pixels belong to the printed character; their original brightness/color has been replaced with text color. Gray pixels belong to the printed text area, but don’t belong to character font matrix, so their brightness has been decreased according to transparency setting to provide greater contrast between printed text and background image to make it more readable. Finally, white pixels are not yet processed. They are here to illustrate the intermediate state of the process.</p>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture1.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-191" alt="frame_buffer_cg" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture1.png" width="796" height="470" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture1.png 796w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture1-300x177.png 300w" sizes="auto, (max-width: 796px) 100vw, 796px" /></a></p>
<p style="text-align: justify;">Picture1. Frame buffer implementation.<br />
In the traditional software implementation example, digits “1” and “2” (with corresponding transparency setting) have already been printed and printing digit “3” is in progress.</p>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture2.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-192" alt="pipelined_cg" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture2.png" width="792" height="469" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture2.png 792w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture2-300x177.png 300w" sizes="auto, (max-width: 792px) 100vw, 792px" /></a></p>
<p>Picture 2. Streaming (online) implementation.</p>
<p style="text-align: justify;">In this example upper part of the whole number 1234 is complete, and lower part will be complete when corresponding input pixels from sensor are processed.</p>
<h1>Binary 8-bit digitizer</h1>
<p style="text-align: justify;">The very first attempt to implement a character generator was a binary digitizer. It was very simple and was able to display 8-bit integers at the specified screen position in binary format. A developer had to manually interpret the output, e.g. a decimal number “157” was printed as “10011101”. Below are some technical implementation details:</p>
<ul>
<li>A pixel counter detects when every line starts and ends (using the line valid (<i>lval</i>) sensor signal) and counts pixels within the line (the <i>X</i> screen coordinate)</li>
<li>A line counter detects when each frame and each line start and end (using the same line valid (<i>lval</i>) and frame valid (<i>fval</i>) sensor signals) and counts frame lines (the <i>Y</i> screen coordinate).</li>
<li>Knowing the text window size and offset, the algorithm decides whether the current pixel belongs to any of the 8 printed bits and, if so, finds the value of the corresponding bit (“0” or “1”).</li>
<li>Then, checking the pixel coordinates (<i>x</i> and <i>y</i>), the algorithm finds out whether the pixel belongs to the edge of the character position (8&#215;16 pixels).</li>
<li>If a “0” is to be printed and the current pixel belongs to any character edge, its brightness/color is replaced with the text color. Otherwise, if a “1” is to be printed and the current pixel belongs to the right character edge, its brightness/color is also replaced with the text color. If neither condition is true, the pixel goes to the output as is.</li>
</ul>
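<p>The decision logic of that digitizer might be modeled in Python like this (a hedged sketch; the window offset and color values are assumptions, and the edge geometry follows the description above):</p>

```python
# Binary 8-bit digitizer: prints an 8-bit value as eight "0"/"1" cells.
# A "0" lights every edge of its 8x16 cell, a "1" lights only the right
# edge, so the developer reads the bits directly off the screen.

CHAR_W, CHAR_H = 8, 16
WIN_X, WIN_Y = 0, 0             # hypothetical text window offset
TEXT_COLOR = 255

def digitizer_pixel(value, xs, ys, pixel):
    xt, yt = xs - WIN_X, ys - WIN_Y
    if not (0 <= xt < 8 * CHAR_W and 0 <= yt < CHAR_H):
        return pixel                          # outside the 8 bit cells
    bit_index, xc = divmod(xt, CHAR_W)
    bit = (value >> (7 - bit_index)) & 1      # MSB printed first
    on_edge = xc in (0, CHAR_W - 1) or yt in (0, CHAR_H - 1)
    on_right_edge = xc == CHAR_W - 1
    if (bit == 0 and on_edge) or (bit == 1 and on_right_edge):
        return TEXT_COLOR
    return pixel                              # otherwise pass through
```

<p>With the value 157 (10011101), the first cell lights only its right edge (a “1”), while the second cell lights all of its edges (a “0”).</p>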
<p style="text-align: justify;">The module was resided in one .vhdl entity. The advantage of this implementation was its simplicity. The disadvantages were:</p>
<ul>
<li>Binary output (only “0”s and “1”s), which was not very convenient to use</li>
<li>The 8-bit limitation: to print a 10-bit number, the module had to be instantiated twice, consuming twice as many resources.</li>
</ul>
<h1>7-Segment hex number printer</h1>
<h2>One-digit printer</h2>
<p style="text-align: justify;">The idea of 7-segment hex printer is to implement character generator that could output hex digits like most cheap LCD displays in portable electronic devices work.</p>
<p> <a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture3.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-193" alt="7segment" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture3.png" width="397" height="132" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture3.png 397w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture3-300x99.png 300w" sizes="auto, (max-width: 397px) 100vw, 397px" /></a></p>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture4.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-186" alt="segments" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture4.png" width="314" height="441" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture4.png 314w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture4-213x300.png 213w" sizes="auto, (max-width: 314px) 100vw, 314px" /></a></p>
<p style="text-align: justify;">Every hex digit (<b>0-9</b>, <b>a-f</b>) is encoded into 7-bit bitmask B[0..6], where each bit controls corresponding segment. For example, as shown in the picture above, digit “8” should light all segments, so it’s bitmask is “1111111” (0x7f).</p>
<p style="text-align: justify;">First implementation was able to output a single hex digit for given 4-bit number. The algorithm is explained in details below:</p>
<ol>
<li>The screen coordinates (<i>Xs, Ys</i>), <i>0 ≤ Xs &lt; ScreenWidth, 0 ≤ Ys &lt; ScreenHeight</i>, of the current input pixel are used to find out whether it belongs to a character area (8&#215;16 pixels). If not, the pixel goes to the output as is</li>
<li>If the input pixel belongs to the character position, local coordinates inside the character are calculated: <i>(Xc, Yc), 0 ≤ Xc &lt; 8, 0 ≤ Yc &lt; 16.</i></li>
<li>The character-local coordinates are used to find out whether the pixel belongs to any segment. If it does not, its brightness is decreased according to the transparency setting.</li>
<li>If the pixel belongs to a segment, its character coordinates <i>Xc</i> and <i>Yc</i> are used to get the segment id <i>S, 0 ≤ S ≤ 6.</i></li>
<li>Segment <i>S</i> is then used as an index to get “0” or “1” from the input digit’s bitmask <i>B.</i> If it is “1”, the brightness of the pixel is increased; otherwise it is decreased according to the transparency setting.</li>
<li>The resulting pixel goes to the HDMI output.</li>
</ol>
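<p>The segment logic can be modeled in Python (a sketch: the bitmasks use the common a–g seven-segment hex encodings, and the cell geometry is an assumption rather than the exact layout used in the design):</p>

```python
# 7-segment hex digit: each digit maps to a 7-bit mask B[0..6].
# Segment order assumed here: bit 0 = a (top), 1 = b (upper right),
# 2 = c (lower right), 3 = d (bottom), 4 = e (lower left),
# 5 = f (upper left), 6 = g (middle).

SEG_MASKS = {                  # common seven-segment hex encodings
    0x0: 0b0111111, 0x1: 0b0000110, 0x2: 0b1011011, 0x3: 0b1001111,
    0x4: 0b1100110, 0x5: 0b1101101, 0x6: 0b1111101, 0x7: 0b0000111,
    0x8: 0b1111111, 0x9: 0b1101111, 0xA: 0b1110111, 0xB: 0b1111100,
    0xC: 0b0111001, 0xD: 0b1011110, 0xE: 0b1111001, 0xF: 0b1110001,
}

CHAR_W, CHAR_H = 8, 16

def segment_id(xc, yc):
    """Map character-local coordinates to a segment id, or None."""
    top, mid, bot = yc == 0, yc in (7, 8), yc == CHAR_H - 1
    left, right = xc == 0, xc == CHAR_W - 1
    if top:   return 0                    # a
    if bot:   return 3                    # d
    if mid:   return 6                    # g
    if right: return 1 if yc < 8 else 2   # b / c
    if left:  return 5 if yc < 8 else 4   # f / e
    return None

def segment_lit(digit, xc, yc):
    """Is the pixel at (xc, yc) part of a lit segment for `digit`?"""
    s = segment_id(xc, yc)
    return s is not None and (SEG_MASKS[digit] >> s) & 1 == 1
```

<p>Note that digit “8” yields the bitmask 0x7f (all segments lit), matching the example above.</p>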
<h2>Multiple-digits printer</h2>
<p style="text-align: justify;">Having one-digit hex character generator would make it possible to print any hex digit anywhere on the screen. It would be better than just binary output, but, especially when design occupies almost the whole chip, it could lead to ineffective resources (LUTs) usage &#8211; to print 32-bit integer, it would require to instantiate the hex digit printer 8 times (for each hex digit). Moreover, logic that computes segments was mostly combinational, so, in addition to extra space usage, every extra instance would negatively affect maximum design frequency (FMAX).</p>
<p style="text-align: justify;">So, the next step was to improve hex printer to make it print hex numbers of specified width with the single instance of the printer entity.</p>
<p style="text-align: justify;">To implement this, we designed a serializer module. Serializer takes n-bit input binary integer and index value and returns corresponding 4-bit digit. For example, for the given 11-bit input number “110’1011’1000” (0x6B8) serializer will return “0110” (0x6) for input index “0”, “1011” (0xB) for index “1” and “1000” (0x8) for index “2”.</p>
<p style="text-align: justify;">One-digit printer used provided coordinates to print a digit at the specified location. Multiple-digits printer uses the provided location to print first digit and calculates corresponding coordinates for remaining digits.</p>
<p style="text-align: justify;">As Picture 2 above illustrates, frame pixels arrive pixel by pixel, line by line. Multiple-digits printer module prints the input number accordingly, i.e. when pixels from first frame line arrive, the module modifies them to print first line of the whole number (not just first digit). When first image line ends, top line of all printed digits is ready. Then the second line is processed pixel by pixel, then the third, etc., until all lines are processed.</p>
<h1>7-segment decimal printer</h1>
<p style="text-align: justify;">Although hex printer that can print any number is much better than previous binary 8-bit printer, it still not very convenient to use because it outputs numbers in hex format, while people got used to see decimal numbers. To make life easier, the Decimal printer module was designed. It uses hex printer described above, but before going there, input number is converted to the corresponding BCD representation.</p>
<p> <a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture5.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-187" alt="binary_to_bcd" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture5.png" width="493" height="201" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture5.png 493w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture5-300x122.png 300w" sizes="auto, (max-width: 493px) 100vw, 493px" /></a></p>
<h2>Binary to BCD converter</h2>
<p style="text-align: justify;">Binary to BCD converter converts binary numbers into corresponding packed Binary-Coded Decimal (BCD) representation. For example, number 0x0A would be converted to 0x10 and number 0xFF would become 0x255. The module uses efficient [<a href="http://en.wikipedia.org/wiki/Double_dabble">Double dabble</a>] algorithm.</p>
<p style="text-align: justify;">Original version of <a href="http://en.wikipedia.org/wiki/Double_dabble">algorithm</a> is combinational, meaning that result of conversion is ready almost immediately. Unfortunately, for long binary inputs (e.g. 32-bit integers) it produces a lot of combinational logic which occupies significant amount of chip and dramatically decreases maximum frequency at which the design can work (FMAX). Considering the fact that during normal operation we don’t need converted BCD representation at the same clock as input binary number arrives, the algorithm was implemented in a sequential way to save FPGA area space and improve timing, produce less combinational logic and (FMAX).</p>
<p style="text-align: justify;">Resulting algorithm is not pipelined and for <i>n</i>-bit input integer it produces <i>m</i>-bit output BCD representation in <i>k</i> clock cycles, here <i>m = RoundUp(n / 3) * 4</i> and <i>k = n * RoundUp(n / 3)</i>.</p>
<p style="text-align: justify;">The Binary to BCD converter module does nothing with segments, fonts and/or pixels and can be instantiated in any context where converting from binary representation to corresponding BCD one is required and specified earlier latency is acceptable. The original (fully combinational) algorithm and description how it works can be found <a href="http://en.wikipedia.org/wiki/Double_dabble">here</a>. The modified version is more complex and employs a finite state machine to process each bit of input number in a sequential manner. You can download the source code of modified version <a href="http://www.computer-vision-software.com/files/bin2bcd.zip">here</a>.</p>
<h1>Character printer</h1>
<h2>Transparency setting</h2>
<p style="text-align: justify;">Next step was to implement a Character printer to be able to print anything including numbers, text or both. As you can see in the picture below, there are 3 main cases:</p>
<ul>
<li>A pixel does not belong to the output text window. In this case it goes to output as is. These are the pixels around the printed text.</li>
<li>A pixel is inside of the output text window, but it does not belong to a character. In this case pixel brightness is altered according to the transparency setting, to make the text more readable on different backgrounds. Transparency can be turned off (in this case green letters on the green background will not be seen), or set to “no background”, when all pixels around the text will be black.</li>
</ul>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture6.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-188" alt="cg_text_example" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture6.png" width="666" height="87" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture6.png 666w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture6-300x39.png 300w" sizes="auto, (max-width: 666px) 100vw, 666px" /></a></p>
<h2>Character Generator</h2>
<p style="text-align: justify;">            Functional block diagram of Character printer is presented below:</p>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture7.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-189" alt="cg_functional_diagram" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture7.png" width="625" height="497" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture7.png 625w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture7-300x238.png 300w" sizes="auto, (max-width: 625px) 100vw, 625px" /></a></p>
<p>Picture 3. Functional Block Diagram of Character printer module.</p>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture8.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-190" alt="cg_sequence" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture8.png" width="624" height="353" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture8.png 624w, http://www.computer-vision-software.com/blog/wp-content/uploads/2013/02/picture8-300x169.png 300w" sizes="auto, (max-width: 624px) 100vw, 624px" /></a></p>
<p>Picture 4. Sequence diagram of Character printer module.</p>
<p style="text-align: justify;">Character generator is a main module where almost all work is done. Character generator works as follows:</p>
<ul>
<li>The pixel counter uses the line valid signal to detect when a new line starts and counts line pixels (the on-screen <i>Xs</i> coordinate), <i>0 ≤ Xs &lt; ScreenWidth</i>;</li>
<li>The line counter uses the frame valid and line valid signals to detect when a new frame and a new line start and counts lines in the current frame (the on-screen <i>Ys</i> coordinate), <i>0 ≤ Ys &lt; ScreenHeight</i>.</li>
<li>The character printer uses the font size parameter, which is a scale factor from 0 to 4. Characters at scale 0 are 8&#215;16 pixels, scale 1 gives 16&#215;32, scale 2 gives 32&#215;64 pixels, etc. Using the current on-screen coordinates (<i>Xs, Ys</i>), the character generator calculates text coordinates (<i>Xt, Yt</i>), <i>0 ≤ Xt &lt; TextColumns, 0 ≤ Yt &lt; TextRows.</i></li>
<li>The text coordinates are fed to the position converter, which generates indices for the string serializer (explained in detail below) to get the character code that should be printed at the current position.</li>
<li>The character generator also converts screen coordinates to local character coordinates (<i>Xc, Yc</i>), <i>0 ≤ Xc &lt; 8, 0 ≤ Yc &lt; 16.</i></li>
<li>The character-local coordinates <i>Xc</i> and <i>Yc</i> and the input character from the string serializer (with a data valid signal) are then fed to the Font ROM module.</li>
<li>The Font ROM, based on the given data (character-local coordinates (<i>Xc, Yc</i>), the character’s ASCII code, and character valid), returns “1” when the pixel at the given position belongs to a printed character, “0” otherwise.</li>
<li>If a pixel is outside the printed text window, it goes to the output as is. Otherwise, if it does not belong to a printed character, its brightness is decreased according to the transparency parameter. If it belongs to a printed character, it is replaced with the text color setting.</li>
</ul>
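<p>Because the base glyph is 8&#215;16 and the scale is a power of two, the coordinate arithmetic above reduces to shifts and masks. A minimal Python sketch of the mapping (the text window origin <i>x0, y0</i> is an assumed parameter not named in the text; the hardware uses the same shift/mask structure, not software arithmetic):</p>

```python
def screen_to_text_coords(xs, ys, x0=0, y0=0, scale=0):
    """Map the on-screen pixel (Xs, Ys) to text coordinates (Xt, Yt)
    and local character coordinates (Xc, Yc) for an 8x16 base font
    scaled by 2**scale.  x0/y0 (the text window origin) are assumed
    parameters for illustration."""
    dx, dy = xs - x0, ys - y0
    xt = dx >> (3 + scale)      # character column: cell width is 8 << scale
    yt = dy >> (4 + scale)      # character row: cell height is 16 << scale
    xc = (dx >> scale) & 0x7    # pixel X inside the 8-pixel-wide glyph
    yc = (dy >> scale) & 0xF    # pixel Y inside the 16-pixel-tall glyph
    return xt, yt, xc, yc
```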
<h3>Position converter</h3>
<p style="text-align: justify;">The main task of the position converter is to convert on-screen text coordinates (<i>Xt,</i> <i>Yt</i>) into text coordinates (indices) inside the text window (<i>Xw</i>, <i>Yw</i>), which go to the String serializer module to fetch the corresponding character from video memory. The position converter is a very simple module made of combinational logic only.</p>
<h2>Serializer</h2>
<h3>String serializer</h3>
<p style="text-align: justify;">The string serializer contains the video memory (RAM or ROM) and returns a character for the requested position (<i>Xw, Yw</i>). If the position is out of the window range, it de-asserts the data valid signal. In a simple text-only implementation with no placeholders for numbers, the string serializer works similarly to the serializer module from the hex printer, but instead of returning a fixed number of bits for an input index, it returns the ASCII code of the character requested by the index.</p>
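<p>A behavioral sketch of this lookup in Python (the stored message is a hypothetical example; in hardware the text lives in RAM/ROM and the valid flag is a wire):</p>

```python
# Hypothetical video-memory contents; the real text lives in RAM/ROM.
MESSAGE = ["TEMP: 25C",
           "FPS:  30"]

def string_serializer(xw, yw):
    """Return (ascii_code, data_valid) for window position (Xw, Yw).
    data_valid is de-asserted when the position is out of range."""
    if 0 <= yw < len(MESSAGE) and 0 <= xw < len(MESSAGE[yw]):
        return ord(MESSAGE[yw][xw]), True
    return 0, False
```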
<h3>Number serializer</h3>
<p style="text-align: justify;">In a more complex design, the message ROM would contain special placeholders for the digits of one or more displayed numbers, like “<i>%d</i>” in the C-language “<i>printf</i>()” function. In this case, one more module would be instantiated: the number serializer, or “str”. The main function of “str” is similar to the serializer module from the hex printer, which used an input index to extract the requested 4-bit digit from a longer (n-bit) input binary integer, but instead of returning a binary 4-bit digit, “str” returns the corresponding 8-bit ASCII code of this digit. Following the example from the hex printer module, for the given 11-bit input number “110’0111’1000” (0x678), str will return 0x36 (the ASCII code of character “6”) for input index “0”, 0x37 (the ASCII code of “7”) for index “1” and 0x38 (the ASCII code of “8”) for index “2”.</p>
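<p>The digit-to-ASCII step above can be sketched in a few lines of Python (the function and parameter names are illustrative, not taken from the design):</p>

```python
def number_serializer(value, index, num_digits=3):
    """Return the 8-bit ASCII code of the 4-bit digit at `index`
    (0 = most significant), like the "str" module described above."""
    nibble = (value >> (4 * (num_digits - 1 - index))) & 0xF
    # 0-9 -> '0'-'9', 10-15 -> 'A'-'F'
    return ord('0') + nibble if nibble < 10 else ord('A') + nibble - 10
```

For the 11-bit example 0x678, indices 0, 1 and 2 yield 0x36, 0x37 and 0x38, matching the text.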
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2013/02/character-generator-for-lattice-hdr-60-fpga-board/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>FPGA implementation of myAudience-Count. Overview and details.</title>
		<link>http://www.computer-vision-software.com/blog/2012/12/fpga-implementation-of-myaudience-count-overview-and-details/</link>
					<comments>http://www.computer-vision-software.com/blog/2012/12/fpga-implementation-of-myaudience-count-overview-and-details/#respond</comments>
		
		<dc:creator><![CDATA[Sergey Koulik]]></dc:creator>
		<pubDate>Fri, 21 Dec 2012 06:17:39 +0000</pubDate>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[Background Model]]></category>
		<category><![CDATA[HW]]></category>
		<category><![CDATA[Motion]]></category>
		<category><![CDATA[myAudience-Count]]></category>
		<category><![CDATA[Optical Flow]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=175</guid>

					<description><![CDATA[Recently, Rhonda Software took yet another step towards more power, area and cost effective solutions targeting broad range of embedded devices. In an effort to make one of our leading solutions myAudience-Count embedded-friendly, different possibilities were considered. Here is where FPGA technology came at hand. With Video Analytic target in mind, after extensive market research [&#8230;]]]></description>
					<content:encoded><![CDATA[<p>Recently, <a href="http://www.rhondasoftware.com/" title="Rhonda Software" target="_blank">Rhonda Software</a> took yet another step towards more power-, area- and cost-effective solutions targeting a broad range of embedded devices. In an effort to make one of our leading solutions, <a title="myAudience-Count" href="http://www.myaudience.com/count/overview" target="_blank">myAudience-Count</a>, embedded-friendly, different possibilities were considered. This is where <b>FPGA</b> technology came in handy.</p>
<p><span id="more-175"></span></p>
<p>With a <b>Video Analytics</b> target in mind, after extensive market research it was decided to use the <a href="http://www.latticesemi.com/products/applicationcutsheets/hdr60cutsheet.cfm" title="Lattice HDR-60 development kit" target="_blank">Lattice HDR-60 development kit</a> as the base platform for our <b>Embedded Count</b> solution. The selected kit is a good choice for several reasons, among which are: a mounted 1280&#215;960 camera sensor, Ethernet PHY, DDR2 memory, 2 USB ports and, of course, the main decision driver &#8211; the <a href="http://www.latticesemi.com/products/fpga/ecp3" title="Lattice ECP3 FPGA" target="_blank">Lattice ECP3 FPGA</a> device with 70K LUTs, 150KB of embedded EBR memory blocks, 256 DSP multipliers and other useful ASIC components. All of the above come packaged on a rather compact base board, accompanied by a development toolchain ready to use.</p>
<p>It is now time to look at what’s inside the <b>Embedded Count</b> product and unveil some of its core algorithms and approaches.</p>
<p>
As with the <a title="myAudience-Count" href="http://www.myaudience.com/count/overview" target="_blank">PC version</a>, at the heart of the system there is an <b>Optical Flow</b> estimator, which is basically a motion tracker capable of calculating, for each pixel, its position relative to the position of the same pixel in the previous and next frames of the video sequence. In general, if there is movement in some part of the video frame, the algorithm has to find its position and direction. For unmoved areas the algorithm has to yield nothing.
</p>
<p>For those who are interested, here come some technical details. Simply speaking, any change of a pixel from frame to frame can be explained either by spatial motion or by a change of its brightness over time; the latter can be considered temporal motion. More strictly, here is the commonly used differential equation relating a pixel’s brightness change to its movement, called the <b>Optic Flow Constraint</b>:</p>
<p style="text-align: center">∂I/∂x*Vx + ∂I/∂y*Vy + ∂I/∂t = 0,</p>
<p>where Vx and Vy are the components of the pixel’s speed in the spatial directions, and ∂I/∂x, ∂I/∂y, ∂I/∂t are the spatial and temporal partial derivatives. As one can see, the equation is under-determined, calling for regularization. The most common way of transforming the problem into a well-posed one is adding a Smoothness Constraint of some form, which basically postulates that adjacent pixels tend to move in the same or a similar direction. This constraint, which usually comes in the form of a Laplacian or some other mixture of second derivatives, effectively transforms the original equation into, in general, an over-determined system of linear equations which can be solved approximately with the use of convolutions only. Interested readers may refer to a <a title="FPGA-based Real-time Optical Flow Algorithm Design and Implementation" href="http://ojs.academypublisher.com/index.php/jmm/article/view/02053845" target="_blank">nice article</a> by Zhaoyi Wei et al., where the idea is cleanly explained without too much analytic overhead.</p>
<p>So, the building blocks of the <b>Optical Flow</b> algorithm are spatial and temporal <b>derivatives</b>, which call for intermediate frame buffers, and <b>convolutions</b>, which require buffers, multiplications, summations and divisions. The hardware implementation itself dictates some constraints, the major one being that, in order to run in real time, the algorithm has to be non-iterative and fully pipelined. As a new pixel from the sensor is produced on each clock cycle, it has to be pushed into the processing pipeline at once, before the next pixel becomes ready. The amount of memory for storing intermediate results is strictly limited to 150 Kb in total. And of course, there is no such luxury as floating-point calculations.</p>
<p>Fortunately, both convolutions and spatial derivatives require only a limited number of frame lines at a time, equal to the size of the convolution kernel. Even better, with the use of a <b>shift-register</b> structure with taps, they can be easily pipelined so that one output pixel is produced on every clock cycle while a new pixel is being pushed into the pipeline.</p>
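<p>A behavioral Python model of such a tapped shift register (on the FPGA this is a chain of registers/EBR lines with one push per pixel clock; the class and parameter names here are illustrative):</p>

```python
from collections import deque

class LineBuffer:
    """Tapped shift-register line buffer: push one pixel per clock and
    get back the pixels of the last `taps` lines at the same column,
    ready to feed a (taps x N) convolution window."""
    def __init__(self, width, taps=3):
        self.width = width
        self.taps = taps
        # one flat shift register holding `taps` full lines
        self.buf = deque([0] * (width * taps), maxlen=width * taps)

    def push(self, pixel):
        self.buf.append(pixel)
        # same column, consecutive lines: current pixel, one line back, ...
        return [self.buf[-1 - i * self.width] for i in range(self.taps)]
```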
<p>Multiplications and divisions are much tougher on an <b>FPGA</b>. There is a limited number of fixed-point DSP multipliers within the chip, and there are no dividers. Implementing either of them in LUT logic would eat up all of the available resources before long. To overcome the lack of multipliers and dividers, approximate convolution kernels for both smoothing and differentiation were carefully designed. The coefficients, as well as their sum, were chosen to be powers of two, so only summations and shifts were required. Shifts on an <b>FPGA</b> are resource-free, because they produce no additional logic or interconnect. To reduce resource consumption even further, the separability of the kernels in the spatial directions was heavily exploited, which made it possible to transform a 2D sub-problem into a 1D one.</p>
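<p>For example, a 1-D binomial smoothing kernel [1, 2, 1] with sum 4 needs no multipliers or dividers at all, only adds and shifts (a Python sketch of the idea; the actual kernels used in the design are not disclosed in this post):</p>

```python
def smooth_121(row):
    """Smooth a line with the kernel [1, 2, 1] / 4 using only
    additions and bit shifts (x << 1 == x * 2, x >> 2 == x // 4)."""
    out = list(row)
    for i in range(1, len(row) - 1):
        out[i] = (row[i - 1] + (row[i] << 1) + row[i + 1]) >> 2
    return out
```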
<p>The next logical step towards simplification was rewriting the <b>Optical Flow Constraint</b> equation, eliminating the ∂I/∂x*Vx term and leaving only one spatial and one temporal dimension. This could be done for this particular problem because the original task of counting people who cross a virtual line considers only motion in the direction orthogonal to this line and pays no attention to parallel movements. At a small quality penalty, this greatly reduced the amount of required calculation and freed a lot of <b>FPGA</b> resources.</p>
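<p>A floating-point NumPy sketch of this reduced constraint, solving (∂I/∂y)*Vy + ∂I/∂t = 0 per pixel (the hardware pipeline is fixed-point and non-iterative; the <code>eps</code> regularizer is an assumption added here to avoid division by zero):</p>

```python
import numpy as np

def vertical_flow(prev, curr, eps=1e-6):
    """Per-pixel estimate of the vertical speed Vy from the reduced
    optic-flow constraint (dI/dy) * Vy + dI/dt = 0.
    A software sketch only, not the FPGA implementation."""
    iy = np.gradient(curr.astype(float), axis=0)   # spatial derivative dI/dy
    it = curr.astype(float) - prev.astype(float)   # temporal derivative dI/dt
    return -it * iy / (iy * iy + eps)
```

Shifting a vertical brightness ramp down by one pixel between frames yields Vy ≈ 1 everywhere, as expected.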
<p>Additionally, as a highly detailed motion map was not required for solving the problem, frame downscaling was implemented, which both lessened the intermediate buffer requirements and reduced the working clock frequency and power consumption.</p>
<p>The result of the <b>Optical Flow</b> calculation is further combined with the result of the <b>Background Model</b> module, also implemented in hardware and working in parallel with the <b>Optical Flow</b> estimator. What follows is the reduction of the combined field onto a line and the extraction of line segments which correspond to the persons being counted. The result of the reduction is then transferred to a CPU, implemented as a soft core on the same chip, for final post-processing and transmission to the <a href="https://portal.myaudience.com/login.php" title="myAudience portal" target="_blank">myAudience portal</a> over Ethernet.</p>
<p>Among the other important hardware modules of the system are: the <b>Background Model</b> (mentioned above), the <b>Debayer</b> (responsible for converting the <b>Bayer</b> pattern coming from the sensor to RGB), the <b>Tone-mapper</b> (for compressing the <b>tonal range</b> of input pixels from 12 to 8 bits), the <b>JPEG encoder</b> (for streaming preview frames to the calibration web UI), the Ethernet MAC, the DDR2 controller, the <b>LM32 CPU</b> + <b>Embedded Linux</b> (for running the Ethernet stack, transmitting <b>People Count</b> results to the <a href="https://portal.myaudience.com/login.php" title="myAudience portal" target="_blank">myAudience portal</a>, and running the web server and JPEG preview streamer), the I2C master (for programming the sensor’s registers), a UART and others. All of them were successfully fitted into a single 70K LUT <b>FPGA</b>, consuming about 85% of the available chip resources in both LUTs and memory blocks and forming a finished, production-ready <b>People Count</b> solution targeting embedded systems.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2012/12/fpga-implementation-of-myaudience-count-overview-and-details/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>CETW participation announcement</title>
		<link>http://www.computer-vision-software.com/blog/2011/04/cetw-participation-announcement/</link>
					<comments>http://www.computer-vision-software.com/blog/2011/04/cetw-participation-announcement/#respond</comments>
		
		<dc:creator><![CDATA[Alexander Gavrik]]></dc:creator>
		<pubDate>Mon, 04 Apr 2011 00:44:03 +0000</pubDate>
				<category><![CDATA[Demo video]]></category>
		<category><![CDATA[YouTube]]></category>
		<category><![CDATA[announcement]]></category>
		<category><![CDATA[exhibition]]></category>
		<category><![CDATA[myAudience]]></category>
		<category><![CDATA[video]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=165</guid>

					<description><![CDATA[myAudience promo video Rhonda Software will participate in Customer Engagement Technology World (CET World) in San Francisco, April 27-28, 2011. We are pleased to invite you to visit our booth #235. We’ll be glad to have this chance to introduce you our innovative system myAudience &#8211; tool for automated audience measurement for digital signage, kiosks, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><object width="425" height="355"><param name="movie" value="http://www.youtube.com/v/L5NHfrFKDEA&amp;rel=0&amp;color1=0xd6d6d6&amp;color2=0xf0f0f0"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/L5NHfrFKDEA&amp;rel=0&amp;color1=0xd6d6d6&amp;color2=0xf0f0f0" type="application/x-shockwave-flash" wmode="transparent" width="425" height="355"></object></p>
<p><a href="http://www.youtube.com/watch?v=L5NHfrFKDEA">myAudience promo video</a></p>
<p>Rhonda Software will participate in Customer Engagement Technology World (CET World) in San Francisco, April 27-28, 2011. We are pleased to invite you to visit our booth #235.</p>
<p>We’ll be glad to have this chance to introduce to you our innovative system <strong>myAudience</strong> &#8211; a tool for automated audience measurement for digital signage, kiosks, showcases and many others. You can click this special <a title="link" href="https://www.xpressreg.net/register/cetw041/start.asp?p=PAS4GST" target="_blank">link</a> to register.</p>
<p>Your special PRIORITY CODE will automatically appear with your registration, giving you a FREE exhibits-only pass. For up-to-date information about Customer Engagement Technology World, please visit <a title="www.CETworld.com" href="http://www.CETworld.com" target="_blank">www.CETworld.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2011/04/cetw-participation-announcement/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Fine tuning of compiler options to increase application performance</title>
		<link>http://www.computer-vision-software.com/blog/2011/03/fine-tuning-of-compiler-options-to-increase-application-performance/</link>
					<comments>http://www.computer-vision-software.com/blog/2011/03/fine-tuning-of-compiler-options-to-increase-application-performance/#comments</comments>
		
		<dc:creator><![CDATA[Alexander Permyakov]]></dc:creator>
		<pubDate>Mon, 21 Mar 2011 01:27:57 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=163</guid>

					<description><![CDATA[Performance is essential for video analytic applications since algorithms are usually computationally heavy and such systems are supposed to work almost in real time. From one side it can be increased by improving &#38; changing algorithms. This is a major way since it allows to increase performance dramatically. From another side performance can be increased [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Performance is essential for video analytics applications, since the algorithms are usually computationally heavy and such systems are supposed to work almost in real time. On one hand, performance can be increased by improving and changing the algorithms. This is the major approach, since it can improve performance dramatically. On the other hand, performance can be increased a little further in a relatively simple way: by using a good compiler and tuning the compile options. Let’s see how this can be done in real programs.</p>
<p><span id="more-163"></span></p>
<p><strong>For the first example</strong> I used the LAME encoder (<a href="http://lame.sourceforge.net/">http://lame.sourceforge.net/</a>). Why LAME? First of all, because it is open source, so I can recompile it with different compilers and options. Second, performance is simple to measure: it is the time required to re-encode an mp3 file. Third, it shows well-determined (repeatable) results, which allows a better understanding of how different compile options affect speed.</p>
<p>The testing was performed on computers with different CPUs under the Windows operating system:</p>
<p>Intel Pentium 4 3GHz<br />
Intel Core 2 Duo 2.8 GHz<br />
AMD Athlon2x4 (635) 2.9 GHz overclocked to 3.3 GHz<br />
Intel Core i5 (2500) 3.3 GHz</p>
<p>Compilation was done with Visual Studio 9 and GCC 4.5.1 (using MinGW).</p>
<p>Encoding time was measured 10 times and the average value is placed in the tables.</p>
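<p>This kind of measurement is easy to script; a hypothetical Python harness in the same spirit (the actual LAME command line used for the tests is not reproduced here):</p>

```python
import statistics
import subprocess
import time

def average_run_time(cmd, runs=10):
    """Run `cmd` several times and return the mean wall-clock time,
    the way the encoding times in the tables were averaged.
    The command itself is an illustrative placeholder."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```

For instance, <code>average_run_time(["lame", "in.wav", "out.mp3"])</code> would time an encoder binary built with a given set of options.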
<p>As the 0.00% baseline I used safe options (-O3 -march=prescott -fomit-frame-pointer -mfpmath=sse) that will work on most modern AMD and Intel CPUs. The option -march=core2 may use ssse3 instructions, and therefore the code may fail to work on AMD and Intel Pentium 4 family CPUs.</p>
<p><strong>Intel Pentium 4 3GHz</strong></p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>Compiler</td>
<td>Compiler options</td>
<td>Average Time</td>
<td>%</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>13.206153 sec</td>
<td>-7.83 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>13.537400 sec</td>
<td>-5.52 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>13.999892 sec</td>
<td>-2.29 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse</td>
<td>14.328020 sec</td>
<td>0.00 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -fomit-frame-pointer -mfpmath=sse</td>
<td>14.621770 sec</td>
<td>2.05 %</td>
</tr>
<tr>
<td>Visual_Studio_9</td>
<td>/GS- /fp:fast /O2</td>
<td>14.646769 sec</td>
<td>2.22 %</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<ol>
<li>Optimization for the prescott architecture gives a 2% speed increase.</li>
<li>-ffast-math gives 2% more.</li>
<li>Profile-guided optimization gives a 5% speed increase.</li>
</ol>
<p>&nbsp;</p>
<p><strong>Intel Core 2 Duo 2.8 GHz</strong></p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>Compiler</td>
<td>Compiler options</td>
<td>Average Time</td>
<td>%</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>7.818235 sec</td>
<td>-6.65 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>7.824039 sec</td>
<td>-6.58 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>7.893243 sec</td>
<td>-5.75 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>7.976644 sec</td>
<td>-4.75 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>8.234858 sec</td>
<td>-1.67 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse</td>
<td>8.374867 sec</td>
<td>0.00 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>8.415269 sec</td>
<td>0.48 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse</td>
<td>8.423270 sec</td>
<td>0.58 %</td>
</tr>
<tr>
<td>Visual_Studio_9</td>
<td>/GS- /fp:fast /O2</td>
<td>8.814092 sec</td>
<td>5.24 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -fomit-frame-pointer -mfpmath=sse</td>
<td>9.224519 sec</td>
<td>10.15 %</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<ol>
<li>Optimization for the core2 architecture gives a 10% speed increase.</li>
<li>-ffast-math gives only a 1% increase.</li>
<li>Profile-guided optimization gives a 6% increase.</li>
</ol>
<p>&nbsp;</p>
<p><strong>Intel Core i5 </strong><strong>(2500) </strong><strong>3.3 GHz</strong></p>
<p>There is no special -march option for Core i3/i5/i7 CPUs in GCC 4.5.1; the option -march=core2 can be used for them.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>Compiler</td>
<td>Compiler options</td>
<td>Average Time</td>
<td>%</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>4.059390 sec</td>
<td>-8.52 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>4.093767 sec</td>
<td>-7.75 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>4.156268 sec</td>
<td>-6.34 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>4.200015 sec</td>
<td>-5.35 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>4.253143 sec</td>
<td>-4.15 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>4.321892 sec</td>
<td>-2.61 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse</td>
<td>4.437519 sec</td>
<td>0.00 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse</td>
<td>4.468770 sec</td>
<td>0.70 %</td>
</tr>
<tr>
<td>Visual_Studio_9</td>
<td>/GS- /fp:fast /O2</td>
<td>4.737522 sec</td>
<td>6.76 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -fomit-frame-pointer -mfpmath=sse</td>
<td>4.815647 sec</td>
<td>8.52 %</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<ol>
<li>Optimization for the core2 architecture gives a 9% speed increase.</li>
<li>-ffast-math gives a 6% increase.</li>
<li>Profile-guided optimization gives a 4% increase.</li>
</ol>
<p>&nbsp;</p>
<p><strong>AMD Athlon2x4 (635) 2.9 GHz overclocked to 3.3 GHz</strong></p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>Compiler</td>
<td>Compiler options</td>
<td>Average Time</td>
<td>%</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>6.078386 sec</td>
<td>-6.14 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>6.170114 sec</td>
<td>-4.73 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>6.308954 sec</td>
<td>-2.58 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>6.388826 sec</td>
<td>-1.35 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=amdfam10 -fomit-frame-pointer   -mfpmath=sse</td>
<td>6.476186 sec</td>
<td>0.00 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>6.527979 sec</td>
<td>0.80 %</td>
</tr>
<tr>
<td>Visual_Studio_9</td>
<td>/GS- /fp:fast /O2</td>
<td>6.942938 sec</td>
<td>7.21 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>7.293316 sec</td>
<td>12.62 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse</td>
<td>7.372564 sec</td>
<td>13.84 %</td>
</tr>
<tr>
<td>GCC4.5.1</td>
<td>-O3 -fomit-frame-pointer -mfpmath=sse</td>
<td>7.661477 sec</td>
<td>18.30 %</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<ol>
<li>Optimization for the amdfam10 architecture gives an 18% speed increase.</li>
<li>-ffast-math gives 5%.</li>
<li>Profile-guided optimization gives only 1%.</li>
</ol>
<p>&nbsp;</p>
<p><strong>Total results</strong></p>
<p>Optimization for a particular architecture plus profile-guided optimization may give up to a 20% speed increase.</p>
<p>&nbsp;</p>
<p>As I already said, LAME is a simple example. Let’s see how performance options affect a real video analytics application.</p>
<p>&nbsp;</p>
<p><strong>For the second example</strong> I used a critical part of a real video analytics application (myAudience). It uses the boost, opencv and ffmpeg libraries, and it runs in several threads. Compared with the LAME encoder, performance measurement for this application was not so simple. Moreover, because of the inaccuracy of measurements in a multithreaded, dynamic environment, the results were not so well determined. So I have prepared just one table, which shows the results in general as I understand them.</p>
<p>Compilation was done with GCC 4.5.2 and GCC 4.1.2 on CentOS 5.5.</p>
<p>&nbsp;</p>
<p><strong>Intel Core i5 </strong><strong>(2500) </strong><strong>3.3 GHz</strong></p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>Compiler</td>
<td>Compiler options</td>
<td>Average Time</td>
<td>%</td>
</tr>
<tr>
<td>GCC4.5.2</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math -profile-use</td>
<td>19591.16</td>
<td>-4.90 %</td>
</tr>
<tr>
<td>GCC4.5.2</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -profile-use</td>
<td>19873.74</td>
<td>-3.53 %</td>
</tr>
<tr>
<td>GCC4.5.2</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>20010.55</td>
<td>-2.86 %</td>
</tr>
<tr>
<td>GCC4.5.2</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse</td>
<td>20410.36</td>
<td>-2.09 %</td>
</tr>
<tr>
<td>GCC4.5.2</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>20410.36</td>
<td>-0.92 %</td>
</tr>
<tr>
<td>GCC4.1.2</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>20532.91</td>
<td>-0.33 %</td>
</tr>
<tr>
<td>GCC4.5.2</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse</td>
<td>20600.55</td>
<td>0.00 %</td>
</tr>
<tr>
<td>GCC4.1.2</td>
<td>-O3 -march=core2 -fomit-frame-pointer   -mfpmath=sse</td>
<td>20816.26</td>
<td>1.05 %</td>
</tr>
<tr>
<td>GCC4.1.2</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse -ffast-math</td>
<td>21962.44</td>
<td>6.61 %</td>
</tr>
<tr>
<td>GCC4.1.2</td>
<td>-O3 -march=prescott -fomit-frame-pointer   -mfpmath=sse</td>
<td>22221.88</td>
<td>7.87 %</td>
</tr>
</tbody>
</table>
<p>What can we conclude from this? A few things:</p>
<ol>
<li>GCC 4.5.2 is a little faster than GCC 4.1.2; in addition, it allows the use of profile-guided optimization and the “amdfam10” and “atom” architecture options.</li>
<li>Profile-guided optimization gives about a 4% speed increase.</li>
<li>-ffast-math gives about a 2% speed increase.</li>
</ol>
<p>&nbsp;</p>
<p>As you can see, tuning compiler options gives a real improvement in performance. It is sometimes not huge, but it is almost free, so it should be kept in mind.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2011/03/fine-tuning-of-compiler-options-to-increase-application-performance/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Testing video analytic algorithms</title>
		<link>http://www.computer-vision-software.com/blog/2011/01/testing-video-analytic-algorithms/</link>
					<comments>http://www.computer-vision-software.com/blog/2011/01/testing-video-analytic-algorithms/#respond</comments>
		
		<dc:creator><![CDATA[Yuri Vashchenko]]></dc:creator>
		<pubDate>Wed, 19 Jan 2011 09:09:05 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=153</guid>

					<description><![CDATA[A modern video analytic system depending on business/customer requirements should work in different situations/conditions. Complex, noisy background with many different objects/textures, changing lighting conditions, shadows, lack of light, weather conditions (for outdoor system installations) like rain, snow, fog and others, motion blur, camera movements, cameral sensor quality, camera resolution, camera focus issues, camera internal optimizations, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p style="text-align: justify;">A modern video analytics system, depending on business/customer requirements, should work in many different situations and conditions. A complex, noisy background with many different objects/textures, changing lighting conditions, shadows, lack of light, weather conditions (for outdoor installations) like rain, snow and fog, motion blur, camera movements, camera sensor quality, camera resolution, camera focus issues, camera internal optimizations, color temperature, and many other factors make the development of good object recognition software a challenging, almost impossible task. In addition, the usual requirement is that the system should work in real time, which makes this task even more difficult.<span id="more-153"></span><br />
So, even having current high-performance hardware, developers have to find a balance between the algorithm quality and speed (performance). Fixing a small quality issue sometimes causes significant performance degradation.<br />
To keep this under control, a consistent unit testing should be performed with every algorithm change. To do this, Rhonda Software uses the unit test approach as described below:</p>
<ol>
<li>A set of metrics is prepared. The metric definitions describe the “ground rules”, i.e. how the actual logs from the system under test are interpreted and what is or is not considered correct. For example, for the people counting metric the following definition may be used:<br />
PEOPLE COUNTING<br />
&lt;visitors_number&gt; total number of visitors in the test frame range = Correct + Missing + Unexpected + False<br />
Real visitors (all found and not found visitors besides False) = Correct + Missing + Unexpected<br />
Correct: Log visitors associated with test visitors, even if more than one log visitor is associated with one test visitor (only if the test visitor had Hard detection status between the frames where the previous and next log visitors were correlated).<br />
Missing: Not found in the log<br />
Unexpected: A log visitor is associated with an already associated test visitor, but there is no Hard visitor detection status between these log visitors.<br />
False: A log visitor is not associated with any test visitor</li>
<li>For each metric a set of KPIs (Key performance indicators) is defined. For example, for people counting metric the following KPIs may be defined:<br />
# of Total visitors<br />
# of Real visitors<br />
# of False visitors<br />
# of Missing visitors<br />
False visitor rate (percent)<br />
Missing visitor rate (percent)<br />
Counting error rate (percent)</li>
<li>Some KPIs may have a goal. For example, we may want the counting error rate to be less than 3%.</li>
<li>A set of test videos is created. Typically, there are dozens of videos prepared for the project to cover as many different situations/conditions as possible.</li>
<li>A project-specific marker tool is used to create special “markup”, i.e. meta information describing the objects located in each input video. Usually it is an XML file with the same name as the input video file. A specially trained engineer uses this tool to open a video file, go through selected frames and mark objects on them. For instance, for demography detection software, an operator may mark all persons found in a frame, specify the coordinates of their faces and add special attributes to every face, such as gender, ethnicity or age category. Some objects may be marked as “hard examples”. This indicates that the object is hard to detect/recognize for various reasons, such as motion blur or partial occlusion by another object. In other words, a hard example is an object that could potentially be detected/recognized by the software, but it is not very likely, so it is OK if the software does not detect/recognize it. Markup files are stored together with the test video files on a dedicated server for easy access.</li>
<li>The software under test is executed in a special logging mode. This mode tells the software to log everything it detects/recognizes into a special (usually XML) log file. The specially prepared set of video files is given to the software version under test as input. For each input video file a corresponding XML log is created.</li>
<li>Another project-specific tool, the comparer, is then used to compute the KPI values. The tool uses the prepared markup (see step 5) and the actual data values from the log (see step 6). As a result, a set of KPI values is generated.</li>
</ol>
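<p style="text-align: justify;">To make the KPI arithmetic above concrete, here is a minimal Python sketch. The counts, the counting-error formula and the goal check are illustrative assumptions for this post, not our actual comparer tool:</p>

```python
# Sketch of the people-counting KPI arithmetic from the metric definition
# above.  The counts here are hypothetical; the real comparer derives them
# by associating log visitors with markup visitors.

def people_counting_kpis(correct, missing, unexpected, false):
    total = correct + missing + unexpected + false   # <visitors_number>
    real = correct + missing + unexpected            # everything except False
    return {
        "total_visitors": total,
        "real_visitors": real,
        "false_visitors": false,
        "missing_visitors": missing,
        "false_rate_pct": 100.0 * false / total,
        "missing_rate_pct": 100.0 * missing / real,
        # One plausible definition of the counting error; the project's
        # actual formula may differ.
        "counting_error_pct": 100.0 * (false + missing) / real,
    }

kpis = people_counting_kpis(correct=95, missing=2, unexpected=1, false=2)
# A KPI goal check, e.g. "counting error rate below 3%" (see step 3):
meets_goal = kpis["counting_error_pct"] < 3.0
```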
<p style="text-align: justify;">Metrics, KPIs and markup are created once per project/video. Testing and metrics calculation can then be performed for every release to see how that release changed in terms of quality and speed. This allows possible quality/speed degradations introduced in the release to be detected and fixed quickly. In addition, when testing is done periodically, project management can see how the system evolves over time in terms of quality/performance.<br />
While producing the test videos and their markup requires human effort, most of the other unit testing activities can be performed automatically, which helps keep quality and performance under control with little effort. All of the above allows us to develop state-of-the-art, highly competitive software.</p>
<p style="text-align: justify;"><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="350" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="src" value="http://www.youtube.com/v/0uJu8IEiDbI" /><embed type="application/x-shockwave-flash" width="425" height="350" src="http://www.youtube.com/v/0uJu8IEiDbI"></embed></object></p>
<p style="text-align: justify;">The video above explains the process of marking up a video file.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2011/01/testing-video-analytic-algorithms/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Face Recognition</title>
		<link>http://www.computer-vision-software.com/blog/2010/09/face-recognition/</link>
					<comments>http://www.computer-vision-software.com/blog/2010/09/face-recognition/#comments</comments>
		
		<dc:creator><![CDATA[Sergey Koulik]]></dc:creator>
		<pubDate>Fri, 10 Sep 2010 01:51:42 +0000</pubDate>
				<category><![CDATA[Face recognition]]></category>
		<category><![CDATA[Gabor wavelet]]></category>
		<category><![CDATA[Object Recognition]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=135</guid>

					<description><![CDATA[2D face recognition is an extensively studied, but still evolving subject of research. Various strategies including statistical approaches, hidden Markov models, neural networks, template based and feature based matching have been proposed. Here we briefly present our implementation which is based on past research and achieves state-of-the-art recognition performance on considerably low resolution input facial [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>2D face recognition is an extensively studied, but still evolving, subject of research. Various strategies have been proposed, including statistical approaches, hidden Markov models, neural networks, and template-based and feature-based matching. Here we briefly present our implementation, which is based on past research and achieves state-of-the-art recognition performance on comparatively low-resolution input facial images.<br />
Our approach can be divided into three independent phases: facial landmarks library construction (offline), building of the facial descriptor (once per novel image) and facial descriptor matching.</p>
<p style="text-align: center;"><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/09/result_11.png"><img loading="lazy" decoding="async" class="size-full wp-image-139 aligncenter" title="Face recognition result on live video sequence" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/09/result_11.png" alt="Face recognition result on live video sequence" width="640" height="141" srcset="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/09/result_11.png 640w, http://www.computer-vision-software.com/blog/wp-content/uploads/2010/09/result_11-300x66.png 300w" sizes="auto, (max-width: 640px) 100vw, 640px" /></a></p>
<p><span id="more-135"></span><br />
<strong>Facial landmarks library construction</strong><br />
A set of training images is marked by hand. The coordinates of important facial landmarks (such as the corners of the lips, the nose tip, etc.) are stored in a database for further processing.<br />
Using information about landmark positions, it is easy to geometrically transform and align the training images.<br />
Illumination correction is applied to the transformed images in order to get rid of shadows and glares and to normalize the overall exposure.<br />
Gabor jets are then extracted from the normalized images at every landmark location. The extracted jets are stored in the facial landmarks library for later use when processing novel images.</p>
<p><strong>Building of facial descriptor</strong><br />
Having received a novel facial image, we first try to locate the approximate eye positions using our hybrid method combining Viola-Jones and a Bayesian classifier. The eye coordinates are required for two reasons: they are used to geometrically warp and align the input image, and also to get an initial estimate of the coordinates of the other facial landmarks.<br />
Illumination correction and background clipping are then performed.<br />
Starting from the approximate eye positions and using samples from the facial landmarks library, we iteratively find the precise locations of all facial landmarks on the novel face.<br />
The found landmark positions are then used to extract Gabor jets and construct an informative facial descriptor. The original input image is no longer required after this step.</p>
<p><strong>Facial descriptors matching</strong><br />
The last and simplest step is matching two facial descriptors, which yields a similarity measure between two faces – a real number between 0 (nothing in common) and 1 (complete match). An experimentally found threshold is used to decide whether the faces belong to the same person or to different persons.<br />
Descriptor matching is several orders of magnitude faster than descriptor building, which makes it possible to match a new face against a database of known persons in reasonable time.</p>
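<p>As an illustration, descriptor matching of this kind might be sketched as follows. The normalized-dot-product jet similarity and the 0.85 threshold are assumptions for the sketch, not our exact formulas:</p>

```python
import numpy as np

# Sketch of facial descriptor matching.  A descriptor is assumed to be a
# list of Gabor jets (one magnitude vector per landmark).  Jet similarity
# is the normalized dot product; the face similarity is the mean over
# landmarks, which lands in [0, 1] for non-negative jet magnitudes.

def jet_similarity(j1, j2):
    return float(np.dot(j1, j2) / (np.linalg.norm(j1) * np.linalg.norm(j2)))

def face_similarity(desc1, desc2):
    return sum(jet_similarity(a, b) for a, b in zip(desc1, desc2)) / len(desc1)

def same_person(desc1, desc2, threshold=0.85):
    # The threshold would be found by experiment, as described above.
    return face_similarity(desc1, desc2) >= threshold
```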
<p><strong>Some technical details</strong><br />
Working face size: 45&#215;45 px<br />
Descriptor building time: 300 ms on a P4 1300 MHz (single core).<br />
Descriptor matching: 2 ms on a P4 1300 MHz (single core).<br />
Recognition rate: FERET fa/fb: 88%, Yale Faces: 86%, Faces in the Wild: 73%. Both false positives and false negatives considered.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2010/09/face-recognition/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		
			</item>
		<item>
		<title>Currency recognition using cortex-like model.</title>
		<link>http://www.computer-vision-software.com/blog/2010/08/currency-recognition-using-cortex-like-model/</link>
					<comments>http://www.computer-vision-software.com/blog/2010/08/currency-recognition-using-cortex-like-model/#respond</comments>
		
		<dc:creator><![CDATA[Igor Stepura]]></dc:creator>
		<pubDate>Mon, 09 Aug 2010 07:05:18 +0000</pubDate>
				<category><![CDATA[Currency recognition]]></category>
		<category><![CDATA[HMAX]]></category>
		<category><![CDATA[Object Recognition]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=125</guid>

					<description><![CDATA[Currency recognition seems to be one of the popular topics in &#8220;applied&#8221; computer vision. There are a lot of articles and blog entries describing different approaches to currency recognition. In this post I&#8217;ll share my experience of using the so-called HMAX model. Introduction HMAX aims to model hierarchical object recognition in cortex. I won&#8217;t get into details [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Currency recognition seems to be one of the popular topics in &#8220;applied&#8221; computer vision. There are a lot of articles and blog entries describing different approaches to currency recognition. In this post I&#8217;ll share my experience of using the so-called HMAX model.</p>
<p><span id="more-125"></span></p>
<p><strong>Introduction</strong></p>
<p>HMAX aims to model hierarchical object recognition in the cortex. I won&#8217;t get into the details of the HMAX model, just provide some useful links for curious readers:<br />
<a href="http://riesenhuberlab.neuro.georgetown.edu/hmax.html">http://riesenhuberlab.neuro.georgetown.edu/hmax.html</a></p>
<p><a href="http://cbcl.mit.edu/cbcl/publications/index-pubs.html">http://cbcl.mit.edu/cbcl/publications/index-pubs.html</a></p>
<p>My approach to using HMAX for currency recognition was pretty simple.</p>
<p>1. Generate a proper C1-feature dictionary<br />
2. Using the C1-feature dictionary, generate C2 vectors for the images in the training dataset<br />
3. Train a multi-class SVM classifier using the C2 vectors<br />
4. Test the classifier on the testing dataset<br />
5. PROFIT</p>
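<p>Steps 2 and 3 above can be sketched roughly as follows. This is a toy illustration, not the exact HMAX implementation I used; the Gaussian radial basis response is a common choice for the C2 stage but is an assumption here:</p>

```python
import numpy as np

# Toy sketch of the C2 stage: each C2 value is the best match between one
# dictionary patch and every same-sized window of a C1 map, max-pooled
# over all positions.  An exact patch occurrence yields a response of 1.0.

def c2_response(c1_map, patch, sigma=1.0):
    ph, pw = patch.shape
    h, w = c1_map.shape
    best = -np.inf
    for y in range(h - ph + 1):
        for x in range(w - pw + 1):
            window = c1_map[y:y + ph, x:x + pw]
            dist2 = float(np.sum((window - patch) ** 2))
            best = max(best, np.exp(-dist2 / (2.0 * sigma ** 2)))
    return best

def c2_vector(c1_map, dictionary):
    # One C2 value per dictionary patch -> the feature vector fed to the SVM.
    return np.array([c2_response(c1_map, p) for p in dictionary])
```

The resulting C2 vectors would then be written out in LibSVM format and trained exactly as in the shell commands below.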
<p><strong>C1 dictionary generation</strong><br />
My initial approach was to generate the dictionary automatically using an &#8220;interest point&#8221; detector and then generate C1 patches with sizes corresponding to the size of the detected image structure. The Hessian-Laplace method (see Mikolajczyk, K. and Schmid, C., 2004) seemed to suit this task well, since it detects the characteristic scale of interest points. My implementation of Hessian-Laplace worked reasonably well; however, the number of interest points it detected was large, so it was pretty hard for my C1 extractor to decide which of these points were really &#8220;interesting&#8221; for currency recognition.<br />
As a result, none of my automatically generated dictionaries (which contained about 1000 samples) produced suitable classification outputs.</p>
<p>So I decided to start from a smaller dictionary containing hand-picked, really descriptive patches from all kinds of bills. For each image in the C1 dataset I took 8 patches (see the example patches below). All images in the C1 dataset were resized to a width of 300 pixels. The size of the C1 dictionary thus became 88.</p>
<p><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/08/1_face_000.bmp"><img decoding="async" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/08/1_face_000.bmp" alt="" title="1_face_000" class="aligncenter size-full wp-image-126" /></a></p>
<p><strong>Classifier training</strong><br />
My training dataset consisted of 337 images of different dollar bills of all classes (1 to 100), plus a set of &#8220;background&#8221; images.</p>
<p>I used LibSVM for classifier training and testing, using RBF and Linear classifier kernels.</p>
<p>To train models with RBF kernels I used the script easy.py from LibSVM &#8211; this handy script automatically scales the training data and then searches for the best C/gamma parameters of the kernel.<br />
Linear classifiers were trained using a semi-automated approach &#8211; I scaled the training data first, then used grid.py to find the best value of the C parameter of the kernel.</p>
<p>Something like this:<br />
 <code>svm-scale -s newdict.range newdict.l > newdict.scale</p>
<p>./grid.py -log2g 1,1,1 -log2c -5,15,0.5 -t 0 newdict.scale</p>
<p>svm-train -t 0 -c 0.353553390593 newdict.scale newdict.model</code></p>
<p>And for testing:<br />
<code>svm-scale -r newdict.range newdict.t > newdict.t.scale</p>
<p>svm-predict newdict.t.scale newdict.model newdict.model.predict</code></p>
<p><strong>Experiment results</strong></p>
<p>The best classification accuracy I&#8217;ve reached so far is <strong>88.62%</strong>, using a linear classifier.</p>
<p><strong>Conclusions and future work</strong></p>
<p>The HMAX model has proved successful in classifying currency images and seems to have the potential for even better results.</p>
<p>Possible directions toward better classification could be:<br />
1. Brightness correction. The HMAX implementation I used seems to be pretty sensitive to brightness changes. I&#8217;ll need to investigate this more deeply and, if necessary, normalize the training/testing images to get rid of brightness-related issues.</p>
<p>2. Quality of the C1 dictionary. While my current dictionary proved good enough to recognize dollar bill classes, I suppose it could be improved. For example, there could be more patches per bill class, taken from different layers (bands) of the C1 &#8220;pyramid&#8221;. The size of the features may also vary to achieve better results.</p>
<p>3. HMAX model tuning. Perhaps modifying the model according to the approach of Jim Mutch and David G. Lowe (<a href="http://www.cs.ubc.ca/~lowe/papers/08mutch.pdf">PDF</a>) would give better recognition results.</p>
<p>4. Use AdaBoost in addition to SVM for better model training.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2010/08/currency-recognition-using-cortex-like-model/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Compiling OpenCV for Android using NDK 3</title>
		<link>http://www.computer-vision-software.com/blog/2010/04/android/</link>
					<comments>http://www.computer-vision-software.com/blog/2010/04/android/#comments</comments>
		
		<dc:creator><![CDATA[Alexander Permyakov]]></dc:creator>
		<pubDate>Thu, 22 Apr 2010 06:01:50 +0000</pubDate>
				<category><![CDATA[OpenCV]]></category>
		<category><![CDATA[Android]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[face detection]]></category>
		<category><![CDATA[NDK]]></category>
		<guid isPermaLink="false">http://www.computer-vision-software.com/blog/?p=116</guid>

					<description><![CDATA[Build platform: Ubuntu 9.10 Target platform: Android Download and prepare OpenCV library source code. 1. Download the latest version of OpenCV (http://sourceforge.net/project/showfiles.php?group_id=22870). 2. As build platform is Linux, select linux version (for example OpenCV-2.1.0.tar.bz2). 3. Unpack somewhere to home dir. Download and prepare cross-compiler 1. Download Android NDK 3 for Linux (http://developer.android.com/sdk/ndk/index.html) 2. Unpack it [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Build platform: Ubuntu 9.10<br />
Target platform: Android</p>
<p><strong>Download and prepare OpenCV library source code. </strong></p>
<p><strong>1. </strong>Download the latest version of OpenCV (<a href="http://sourceforge.net/project/showfiles.php?group_id=22870" target="_blank">http://sourceforge.net/project/showfiles.php?group_id=22870</a>).</p>
<p>2. Since the build platform is Linux, select the Linux version (for example <em>OpenCV-2.1.0.tar.bz2</em>).</p>
<p><span id="more-116"></span></p>
<p><strong>3. </strong>Unpack it somewhere in your home directory.</p>
<p><strong>Download and prepare cross-compiler</strong></p>
<p><strong>1. </strong>Download Android NDK 3 for Linux (<a href="http://developer.android.com/sdk/ndk/index.html" target="_blank">http://developer.android.com/sdk/ndk/index.html</a>)</p>
<p><strong>2. </strong>Unpack it to <em>~/android_ndk_3/</em></p>
<p><strong>3. </strong>Then run <em>~/android_ndk_3/build/host-setup.sh</em>, but first fix the error in line 119:</p>
<p>Change</p>
<p><code>if [ "$result" = "Pass" ] ; then</code></p>
<p>to</p>
<p><code>if [ "$result" == "Pass" ] ; then</code></p>
<p><strong>4. </strong>Do/install whatever is needed to let host-setup.sh complete successfully.</p>
<p><strong>Create NDK project/Modify Makefiles</strong></p>
<p>There is one big issue with the NDK toolchain. It has a trimmed C standard library that does not contain the STL. Because of that, some files (like <em>cvkdtree.cpp</em> in cv) cannot be compiled, since they use vector, list and other STL containers. The solution is to compile the STL from source code. In my OpenCV NDK project I used STL sources from uClibc (<a href="http://www.uclibc.org" target="_blank">http://www.uclibc.org</a>).</p>
<p>The simplest way to start your OpenCV NDK project is to update the <em>hello-jni</em> project with the OpenCV source files.</p>
<p>The <em>~/android_ndk_3/apps/hello-jni/project/jni</em> folder of the hello-jni project may look like this:</p>
<ol>cv<br />
&#8211; hdr<br />
&#8211; src<br />
cvaux<br />
&#8211; hdr<br />
&#8211; src<br />
cxcore<br />
&#8211; hdr<br />
&#8211; src<br />
stl<br />
&#8211; hdr<br />
&#8211; src<br />
Android.mk<br />
hello-jni.c</ol>
<p>The <em>~/android_ndk_3/apps/hello-jni/project/jni/Android.mk</em> may look like this:</p>
<ol>APPS_PATH := $(call my-dir)<br />
############################<br />
# stl<br />
############################<br />
include $(CLEAR_VARS)<br />
LOCAL_PATH := $(APPS_PATH)/stl/src<br />
LOCAL_C_INCLUDES := $(APPS_PATH)/stl/hdr<br />
LOCAL_MODULE := stl<br />
LOCAL_SRC_FILES := string.cpp algorithm.cpp char_traits.cpp iterator.cpp limits.cpp list.cpp vector.cpp<br />
include $(BUILD_STATIC_LIBRARY)<br />
############################<br />
# cxcore<br />
############################<br />
include $(CLEAR_VARS)<br />
LOCAL_PATH := $(APPS_PATH)/cxcore/src<br />
LOCAL_C_INCLUDES := $(APPS_PATH)/cxcore/hdr<br />
LOCAL_CXXFLAGS := -DHAVE_CONFIG_H<br />
LOCAL_MODULE := cxcore<br />
LOCAL_SRC_FILES := cxalloc.cpp cxarithm.cpp cxarray.cpp cxcmp.cpp cxconvert.cpp cxcopy.cpp cxdatastructs.cpp cxdrawing.cpp cxdxt.cpp cxerror.cpp cximage.cpp cxjacobieigens.cpp cxlogic.cpp cxlut.cpp cxmathfuncs.cpp cxmatmul.cpp cxmatrix.cpp cxmean.cpp cxmeansdv.cpp cxminmaxloc.cpp cxnorm.cpp cxouttext.cpp cxpersistence.cpp cxprecomp.cpp cxrand.cpp cxsumpixels.cpp cxsvd.cpp cxswitcher.cpp cxtables.cpp cxutils.cpp dummy.cpp<br />
include $(BUILD_STATIC_LIBRARY)<br />
############################<br />
# cv<br />
############################<br />
include $(CLEAR_VARS)</p>
<p>LOCAL_PATH := $(APPS_PATH)/cv/src<br />
LOCAL_C_INCLUDES := $(APPS_PATH)/cv/hdr $(APPS_PATH)/cxcore/hdr $(APPS_PATH)/stl/hdr</p>
<p>LOCAL_MODULE := cv<br />
LOCAL_SRC_FILES := cvkdtree.cpp cvaccum.cpp cvadapthresh.cpp cvapprox.cpp cvcalccontrasthistogram.cpp cvcalcimagehomography.cpp cvcalibinit.cpp cvcalibration.cpp cvcamshift.cpp cvcanny.cpp cvcolor.cpp cvcondens.cpp cvcontours.cpp cvcontourtree.cpp cvconvhull.cpp cvcorner.cpp cvcornersubpix.cpp cvderiv.cpp cvdistransform.cpp cvdominants.cpp cvemd.cpp cvfeatureselect.cpp cvfilter.cpp cvfloodfill.cpp cvfundam.cpp cvgeometry.cpp cvhaar.cpp cvhistogram.cpp cvhough.cpp cvimgwarp.cpp cvinpaint.cpp cvkalman.cpp cvlinefit.cpp cvlkpyramid.cpp cvmatchcontours.cpp cvmoments.cpp cvmorph.cpp cvmotempl.cpp cvoptflowbm.cpp cvoptflowhs.cpp cvoptflowlk.cpp cvpgh.cpp cvposit.cpp cvprecomp.cpp cvpyramids.cpp cvpyrsegmentation.cpp cvrotcalipers.cpp cvsamplers.cpp cvsegmentation.cpp cvshapedescr.cpp cvsmooth.cpp cvsnakes.cpp cvstereobm.cpp cvstereogc.cpp cvsubdivision2d.cpp cvsumpixels.cpp cvsurf.cpp cvswitcher.cpp cvtables.cpp cvtemplmatch.cpp cvthresh.cpp cvundistort.cpp cvutils.cpp dummy.cpp</p>
<p>LOCAL_STATIC_LIBRARIES := cxcore stl</p>
<p>include $(BUILD_STATIC_LIBRARY)</p>
<p>############################<br />
# cvaux<br />
############################<br />
include $(CLEAR_VARS)</p>
<p>LOCAL_PATH := $(APPS_PATH)/cvaux/src<br />
LOCAL_C_INCLUDES := $(APPS_PATH)/cvaux/hdr $(APPS_PATH)/cv/hdr $(APPS_PATH)/cv/src $(APPS_PATH)/cxcore/hdr $(APPS_PATH)/stl/hdr</p>
<p>LOCAL_MODULE := cvaux<br />
LOCAL_SRC_FILES := camshift.cpp cvaux.cpp cvauxutils.cpp cvbgfg_acmmm2003.cpp cvbgfg_codebook.cpp cvbgfg_common.cpp cvbgfg_gaussmix.cpp cvcalibfilter.cpp cvclique.cpp cvcorrespond.cpp cvcorrimages.cpp cvcreatehandmask.cpp cvdpstereo.cpp cveigenobjects.cpp cvepilines.cpp cvface.cpp cvfacedetection.cpp cvfacetemplate.cpp cvfindface.cpp cvfindhandregion.cpp cvhmm.cpp cvhmm1d.cpp cvhmmobs.cpp cvlcm.cpp cvlee.cpp cvlevmar.cpp cvlevmarprojbandle.cpp cvlevmartrif.cpp cvlines.cpp cvlmeds.cpp cvmat.cpp cvmorphcontours.cpp cvmorphing.cpp cvprewarp.cpp cvscanlines.cpp cvsegment.cpp cvsubdiv2.cpp cvtexture.cpp cvtrifocal.cpp cvvecfacetracking.cpp cvvideo.cpp decomppoly.cpp dummy.cpp enmin.cpp extendededges.cpp precomp.cpp vs/bgfg_estimation.cpp vs/blobtrackanalysis.cpp vs/blobtrackanalysishist.cpp vs/blobtrackanalysisior.cpp vs/blobtrackanalysistrackdist.cpp vs/blobtrackgen1.cpp vs/blobtrackgenyml.cpp vs/blobtrackingauto.cpp vs/blobtrackingcc.cpp vs/blobtrackingccwithcr.cpp vs/blobtrackingkalman.cpp vs/blobtrackinglist.cpp vs/blobtrackingmsfg.cpp vs/blobtrackingmsfgs.cpp vs/blobtrackpostprockalman.cpp vs/blobtrackpostproclinear.cpp vs/blobtrackpostproclist.cpp vs/enteringblobdetection.cpp vs/enteringblobdetectionreal.cpp vs/testseq.cpp</p>
<p># failed to compile<br />
#cv3dtracker.cpp</p>
<p>LOCAL_STATIC_LIBRARIES := cv cxcore stl</p>
<p>include $(BUILD_STATIC_LIBRARY)</ol>
<p>The <em>~/android_ndk_3/apps/hello-jni/Application.mk</em> file needs to be updated as follows</p>
<ol>APP_PROJECT_PATH := $(call my-dir)/project<br />
APP_MODULES      := stl cxcore cv cvaux hello-jni</ol>
<p>To build the project go to <em>~/android_ndk_3</em> and type</p>
<ol>make APP=hello-jni</ol>
<p>Of course, there will be compile issues. Understand and fix them. The easiest cases are related to syntax mismatches between different compilers. In more complicated cases some code has to be commented out. For example, the use of libraries with optimizations for Intel processors is not needed on ARM.</p>
<p>HighGui can also be built, but only partially. Simply remove the files causing problems from Android.mk. In my case the remaining files were enough to use the cvLoadImage function for BMP files.</p>
<p><strong>Running facedetect openCV example</strong></p>
<p>There is no way to run native C code as a separate application on Android. Instead, native C functions can be called from Java apps. Because of that, I made a native function, FaceDetect, based on the OpenCV example application facedetect.c.</p>
<p>The declaration looks like this</p>
<ol>void<br />
Java_com_example_hellojni_HelloJni_FaceDetect( JNIEnv* env, jobject thiz, jbyteArray jyuv_buff, jint w, jint h, jbyteArray jbgra_buff)<br />
{<br />
&#8230;<br />
}</ol>
<p>This function takes a YUV_NV21 buffer (the camera preview captured by the Java app), converts it to BGRA8888, searches for faces, draws circles around them and returns the updated BGRA8888 buffer back to the Java app, which can then draw it on the screen.</p>
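<p>For readers curious about the buffer handling outside of JNI, here is a hypothetical NumPy sketch of the NV21 unpacking step. The BT.601 coefficients and nearest-neighbour chroma upsampling are my assumptions for illustration; the real conversion lives in the native FaceDetect code:</p>

```python
import numpy as np

# Hypothetical sketch of NV21 -> BGRA8888 unpacking.  An Android YUV_NV21
# buffer is a full-size Y plane followed by an interleaved V/U plane at
# half resolution in each dimension.

def nv21_to_bgra(buf, w, h):
    y = buf[: w * h].reshape(h, w).astype(np.float32)
    vu = buf[w * h :].reshape(h // 2, w // 2, 2).astype(np.float32)
    # Upsample chroma to full resolution (nearest neighbour).
    v = vu[:, :, 0].repeat(2, axis=0).repeat(2, axis=1) - 128.0
    u = vu[:, :, 1].repeat(2, axis=0).repeat(2, axis=1) - 128.0
    # BT.601 YUV -> RGB conversion (assumed coefficients).
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    a = np.full_like(y, 255.0)  # opaque alpha channel
    bgra = np.stack([b, g, r, a], axis=-1)
    return np.clip(bgra, 0, 255).astype(np.uint8)
```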
<p style="text-align: center;"><a href="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/04/Picture-18.jpg"><img loading="lazy" decoding="async" class="aligncenter" title="Picture 18" src="http://www.computer-vision-software.com/blog/wp-content/uploads/2010/04/Picture-18-300x225.jpg" alt="" width="300" height="225" /></a></p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.computer-vision-software.com/blog/2010/04/android/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
	</channel>
</rss>
