Jussi Judin's weblog

afl-fuzz on different file systems

Sun, 10 Jun 2018 18:00:10 +0300

One day I was fuzzing around with american fuzzy lop and accidentally pointed the output directory for fuzzer findings to point onto a file system on a physical disk instead of the usual shared memory file system. I noticed my mistake and changed the output directory to the usual place. Consequently there was a clear difference on how many fuzzing iterations/second afl-fuzz program could do. This then lead into this comparison of different file systems and fuzzing modes on them with the workload that afl-fuzz has.

Hardware and software background

Fuzzing is a computationally intensive way to find bugs in programs, usually fully exercising the machine CPU. This and writing data to the disk can have adverse effects on the hardware even if we are working with software. Some risks are listed in the common-sense risks section at american fuzzy lop’s README.txt file that is a good read about the subject. In this article I’m mainly interested in what happens when you happen to point the output of afl-fuzz to a device with a specific file system.

afl-fuzz as part of american fuzzy lop

American fuzzy lop is a successful generic purpose fuzzer that finds bugs for you while you sleep. afl-fuzz is the executable program that does the hard work of generating new data, repeatedly running the target program (fuzz target), and analyzing the results that come up in these fuzz target executions. It has handcrafted heuristics and fuzzing strategies that in practice provide quite successful results without the need for tuning.

Figure 1: Interworkings of afl-fuzz, the fork server, and the fuzz target.

Figure 1 roughly visualizes the different parts that are part of a fuzzing session when using afl-fuzz. It spawns the program that we try to fuzz. This program then creates a fork server that is responsible for communicating with the main afl-fuzz executable over pipes and spawning new fuzz target instances with fork() call. The fork server actually a small shim that is part of the fuzz target. Also the new program instance that fork() call spawns still technically holds a reference to the fork server, but those parts are just active at different times, visualized as gray in the figure. If the fuzz target is in persistent mode, it doesn’t get restarted for every new input.

Data is passed to the fuzz target either over standard input or over a newly created file that the fuzz target reads. Then the fuzz target executes itself by writing an execution trace to a shared memory and finishes running. afl-fuzz then reads the trace left by the fuzz target from the shared memory and creates a new input by mutating old ones. What to mutate is controlled by the new and historical information that the instrumentation data from the fuzz target provides.

Solid state drives

Solid state drives are nowadays the most common physical storage medium in generic and special purpose computers. They are fast and relatively decently priced for the storage needs of a regular software developer. The decently priced solid state drives come with a price of somewhat limited write endurance due to the properties of the physical world and manufacturing compromises.

Currently (early 2018) the write endurance of a solid state drive promises to be around half in terabytes than what the drive capacity is in gigabytes on consumer level drives using (TLC NAND). This means that 200-250 gigabyte SSD has a write endurance around 100 terabytes. In practice, the write endurance will likely be more at least on a test from 2013-2015.

Such write endurance does not really pose any limitations in the usual consumer level and professional workloads. But when it comes to fuzzing, it’s not the most usual workload. The whole idea of a fuzzer is to generate new data all the time, make the program execute it, and then either discard or save it depending of the program behavior is interesting or not. So we can expect that a fuzzer will generate quite a lot of data while it’s running.

Quick back-of-the-envelope calculation when assuming 10 parallel fuzzers running 1000 executions/second with 1000 byte input/execution on average would result in over 20 terabytes of data generated every month. If all this is directly written to a drive, then we can expect to have around 5 months until we reach the promised write endurance limits of a typical solid state drive at the time of this writing. There is caching that works to prolong this, but then there are all the invisible file system data structures, minimum file system block sizes, and write amplification then fight against these savings.

Two data passing modes of afl-fuzz

afl-fuzz program has two ways to pass the new input to the fuzz target: passing the data over the standard input and creating a new file with the new data and having the fuzz target to read the data from the just created file file.

The most compact afl-fuzz command only includes the information for initial inputs, output directory and the command to execute the fuzz target without any special arguments:

$ afl-fuzz -i in/ -o out/ -- fuzz-target

The fuzz target gets the input data from its standard input file descriptor so it does not need to explicitly open any files. Technically the standard input of the child process is backed by a file that is always modified by the afl-fuzz process when new input it generated. This file is also .cur_input file at the fuzzer instance directory. The exact details of how this works are explained later.

The more explicit input file path information including commands look like this:

# "@@" indicates the name of the file that the program gets with new
# fuzz data. "-f" can be used to hard code the input file location
# that then can be passed to the fuzz target or not.
$ afl-fuzz -i in/ -o out/ -- fuzz-target @@
$ afl-fuzz -i in/ -o out/ -f target.file -- fuzz-target @@
$ afl-fuzz -i in/ -o out/ -f target.file -- fuzz-target

In the first command there is no explicit path to the input file specified, so the target program will get the path as the first argument, indicated by @@. The actual file is located at the fuzzer instance output directory (here it’s out/) as .cur_input file. Second two forms with -f target.file switch make it possible to define the location of the input file so that it can reside outside the output directory.

Details of input data writing

Now that we know that afl-fuzz provides two different ways for the data to end in the program, then let’s look at the extracted details of write_to_testcase() function to see how this data is passed to the child process.

When afl-fuzz writes data into the standard input stream of the fuzz target, the main parts of the code look like this in both afl-fuzz and fuzz target side:

// afl-fuzz part:
{
    // Make sure that reads done by fuzz target do not affect
    // new testcase writes:
    lseek(testcase_fd, 0, SEEK_SET);
    // Write new testcase data to the file:
    write(testcase_fd, testcase_data, size);
    // Make sure that old data in testcase file does not leak
    // into the new fuzzing iteration:
    ftruncate(testcase_fd, size);
    // Make read() act as it would read a fresh standard input
    // instance:
    lseek(testcase_fd, 0, SEEK_SET);
}
// Fuzz target part:
{
    // stdin file descriptor inside fuzz target is actually
    // identical to testcase_fd thanks to dup2(). So we just
    // read whatever happens to be there:
    ssize_t read_size = read(testcase_fd, buffer, sizeof(buffer));
    // Do some actual fuzzing:
    fuzz_one(buffer, read_size);
}

What happens in here that first the position in the file is changed to the beginning. A new data is written over the old one. The file length is changed to correspond to the new data length. The position is set to the beginning of the file again. Then the fuzz target reads from this file descriptor and runs the just written data through the actual program logic.

When passing data to the program through a named file following types of operations happen:

// afl-fuzz part:
{
    // Remove the old output file.
    unlink(out_file);
    // Create a new file with the same name.
    int out_fd = open(out_file, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (out_fd < 0) { abort(); }
    // Write the test data into the created file.
    write(out_fd, data, size);
    // Close the created file so that the data is available to other
    // processes.
    close(out_fd);
}
// Fuzz target part:
{
    // Open the file created by afl-fuzz.
    int in_fd = open(out_file, O_RDONLY);
    if (in_fd < 0) { abort(); }
    // Read enough data from the opened file descriptor for fuzzing.
    ssize_t read_size = read(in_fd, buffer, sizeof(buffer));
    // Close the opened file descriptor so that we don't leak
    // resources.
    close(in_fd);
    // Do some actual fuzzing:
    fuzz_one(buffer, read_size);
}

Basically old input file is removed and a new file is created, data is written to it and it’s closed. Then the fuzz target opens the just created file, reads the data out of it and closes the opened file descriptor closes it, and runs the data through the actual program logic.

You can already see that there are more file system related functions used when the fuzz data file is recreated. Then these functions are also more heavy than the ones used in the standard input version. When you call unlink() and open(), Linux kernel needs to do pathname lookup to figure out what exact file objects are accessed. When you only have the file descriptor to manipulate in the standard input case, you avoid these pathname lookups and hopefully manipulate purely numeric data. Also when you open a new file, it has to actually create the corresponding inodes and their data structures to the file system. This has a certain amount of overhead.

So looking at the called functions, it would feel like there is going to be a fuzzing overhead difference between these two input data creation approaches.

Benchmarking

Theoretical speculation is always nice, but the real hard data comes from the benchmarking. This section looks at the fuzzing overhead from both afl-fuzz perspective with low and high level filesystem access functions and from raw file system access perspective. Also the section about data writes tries to find an answer to the question that how damaging fuzzing can actually be to a solid state drive with a relatively limited write endurance.

I did these benchmarks with several major general purpose file systems found from Debian provided Linux kernels 4.14.13-1 and 4.15.11-1. These can show up as differences in numbers between tables, but numbers inside the same table are done with the same system. Also there is small variance between fuzzing sessions that I have tried to eliminate running the fuzzers for long enough that system maintenance and other short running tasks don’t have too big of an impact. But the relative mileage may vary between kernel versions, operating system libraries, and between the actual physical machines.

Executions/second

General purpose instrumentation guided fuzzers like american fuzzy lop and libFuzzer get their power from the fact that they can repeatedly execute the fuzz target in quick succession. Speed is one factor, but also the program stability from one execution to another is a second one. Having stable program ensures that the issues that fuzzing finds can be easily replicated. This basically leads into a compromise of selecting the appropriate fuzzer and fuzz target execution strategy.

It can be that the program has a global state that needs program restart or other tricks between runs. For these types of situations the default forkserver mode in afl-fuzz is appropriate. On the other hand if everything can be functionally wrapped inside a function that does not leak its state outside, we can use the much faster persistent mode in afl-fuzz. From this it should be actually quite easy to port the fuzz target to libFuzzer.

In this case libFuzzer shows 735 k executions/second with the sample target when the data is not passed between process boundaries. It is also possible to simulate in-memory file streams with fmemopen() function where libFuzzer achieved with this example program 550 k executions/second. This is 15-20 times lower fuzzing overhead than with afl-fuzz. But in this article we focus on american fuzzy lop’s persistent mode and leave the fuzzing engine selection for some other time.

A typical american fuzzy lop’s persistent mode fuzz target would generally have some generic configuration in the beginning and then the actual data dependent program execution would reside inside a while (__AFL_LOOP(40000)) { ... } loop. The number 40000 is just the number of iterations that the program does before it restarts again and can be lower or higher depending on how confident you are that the program does not badly misbehave.

Using the standard input with the persistent mode makes the data reading code look like following:

generic_configuration(argc, argv);
int input_fd = fileno(stdin);
while (__AFL_LOOP(40000)) {
    ssize_t read_size = read(input_fd, buffer, sizeof(buffer));
    if (read_size < 0) { abort(); }
    // Do the actual fuzzing.
    fuzz_one(buffer, read_size, &values);
}

Using named files with the persistent mode on the other hand requires opening a new file every iteration:

generic_configuration(argc, argv);
while (__AFL_LOOP(40000)) {
    int input_fd = open(argv[1], O_RDONLY);
    ssize_t read_size = read(input_fd, buffer, sizeof(buffer));
    if (read_size < 0) { abort(); }
    close(input_fd);
    // Do the actual fuzzing.
    fuzz_one(buffer, read_size, &values);
}

The example fuzz target

I directly used the code from my Taking a look at python-afl article and modified it to support the file and stdin reader variants. This code is available from target-simple.cpp for the prying eyes. Also the later on introduced C and C++ standard library variants are available as target-fread.cpp and target-ifstream.cpp.

File system dependent results

The benchmarking was done with a loopback device that fully resides in memory. Each of these file systems was created and mounted with their default options and the performance test was run for 2 minutes. Only a single instance of afl-fuzz was running. You can see from the table 1 the difference between the same workload on different file systems. The shared memory virtual file system (tmpfs) was the most efficient in both of these cases.

execs/second	stdin	file
btrfs	20.5 k	14.6 k
ext2	31.6 k	19.2 k
ext3	30.0 k	18.4 k
ext4	28.0 k	18.2 k
JFS	31.7 k	17.1 k
ReiserFS	28.0 k	14.5 k
tmpfs	34.6 k	25.5 k
XFS	21.1 k	16.8 k

Table 1: Average number of fuzz target executions/second over 2 minute measuring period with different file systems.

This result also shows an interesting history between ext2, ext3, and ext4 file systems. Their performance seems to decrease the newer the file system is, likely due to larger amount of data safety operations that the newer file systems do. There could be difference when parallel access to a file system is concerned.

Execution speed with different standard libraries

I also wanted to see how the use of more portable file system calls in both C and C++ standard libraries affects the fuzzing speed. Standard libraries like glibc, libstdc++, and libc++ provide a portable file system interface across operating systems. They also provide a certain amount of buffering so that every small read does not result in a system call and suffer from context switch overhead. These higher level library functions also aim to be thread safe so that concurrent manipulation of the same file descriptor would be a little bit less surprising.

When doing fuzzing, we are usually just reading small amount of data once and then either close the input file or expect the afl-fuzz to reset the input stream for us. Any buffering information must also be discarded and the stream position must be reset so that the input data for the fuzz target is the same what afl-fuzz expects it to be. If it’s not, it will show as a lower than 100 % stability value on AFL status screen.

Following is the C code used to benchmark how much overhead the standard C library functions bring into this. Standard input is put into unbuffered mode with setvbuf() function call, so buffer manipulation should not have a meaningful amount of overhead.

// C version for glibc benchmarking with standard file manipulation functions.

FILE* input_fd = stdin;
setvbuf(input_fd, NULL, _IONBF, 0);
while (__AFL_LOOP(40000)) {
    size_t read_size = fread(buffer, 1, sizeof(buffer), input_fd);
    fuzz_one(buffer, read_size, &values);
}

// We get the fuzz data filename from argv[1].
while (__AFL_LOOP(40000)) {
    FILE* input_fd = fopen(argv[1], "rb");
    size_t read_size = fread(buffer, 1, sizeof(buffer), input_fd);
    fclose(input_fd);
    fuzz_one(buffer, read_size, &values);
}

C++ streams don’t have a working unbuffered reads that would behave reliably. The closest equivalent to this is to call istream::seekg() and after that istream::clear() function. This basically is theoretically equivalent to what rewind() function does. Other solution of trying to set istream::rdbuf() to a zero sized one had no real effect on increasing stability.

C++ code for these fuzzing benchmarks looks like following:

// C++ version for libstdc++ and libc++ benchmarking with the standard
// C++11 file manipulation functions.

auto input_fd = std::cin;
while (__AFL_LOOP(40000)) {
    // Resetting stream state is 2 function calls.
    input_fd.seekg(0, input_fd.beg);
    input_fd.clear();
    input_fd.read(buffer, sizeof(buffer));
    size_t read_size = input_fd.gcount();
    fuzz_one(buffer, read_size, &values);
}

// We get the fuzz data filename from argv[1].
while (__AFL_LOOP(40000)) {
    std::ifstream input_fd(
        argv[1], std::ifstream::in | std::ifstream::binary);
    input_fd.read(buffer, sizeof(buffer));
    size_t read_size = input_fd.gcount();
    input_fd.close();
    fuzz_one(buffer, read_size, &values);
}

C++ code is a little bit more verbose than C code, as there are extra layers of abstraction that need to be removed so that afl-fuzz style data processing is possible. Table 2 then shows what is the speed difference of these different implementations and standard libraries.

execs/second	syscalls	glibc	libstdc++	libc++
stdin	33.5 k	32.4 k (98 %)	31.2 k (93 %)	30.7 k (92 %)*
file	24.2 k	22.8 k (94 %)	22.0 k (91 %)	20.7 k (86 %)

Table 2: Fuzz target executions/second over 2 minute period with different standard libraries on tmpfs.

* The stability of libc++ stdin session was not 100 %.

We can see that the pure combination of open() and read() functions is the fastest one. Portable C standard library implementation comes quite close with the unbuffered standard input implementation, but everywhere else there is a clear performance loss.

Stability is an important measurement of how reliably fuzzing can continue. Unfortunately in this tests everything was not 100 % stable. The standard input reading and resetting for libc++ does not fully work as expected and leads into reading old data.

The conclusion here is that try to stay with the low level unbuffered file manipulation functions or as close to their equivalents. More portable higher level functions for file access may provide speed benefits through buffering for regular file access patterns, but in this case they just add extra layers that slow down the data passing from afl-fuzz to the fuzz target. Unless the fuzz target itself relies on using those higher level functions.

Data writes

In the background section I described how it generally is a bad idea to write data to a physical disk, as modern solid state drives have a limited write durability. But one could think that the write caching done by Linux would prevent a situation from happening where data is constantly written to the disk if the lifetime of a file contents is really short.

The write caching towards block devices is handled by the page cache in the kernel. But as different file systems may write every new file or file part into a new location, this cache is not necessarily as effective as it could be. Even if the writes happen to the same file with very little data.

In this section I let afl-fuzz to run for 30 minutes and looked at the kilobytes written by iostat to a block device that was a file in memory that was mounted on a loopback device. This way the device should have looked like a real block device without the risk of shortening the lifespan of any real hardware. Estimated monthly values for disk writes for various file systems can be seen from table 3 for a single afl-fuzz instance.

writes/month	stdin	file
btrfs	3.052 TiB	0.020 TiB
ext2	0.004 TiB	0.005 TiB
ext3	0.020 TiB	0.028 TiB
ext4	0.035 TiB	0.021 TiB
JFS	9.74 TiB	69.8 TiB
ReiserFS	0.021 TiB	0.024 TiB
tmpfs	-	-
XFS	201 TiB	0.004 TiB

Table 3: Estimated monthly data writes for a single fuzzer on different file systems.

The numbers for the amount of data that would be written to the device is an extrapolation of the 30 minute run and show values are in terabytes (2⁴⁰ bytes). These numbers are for the persistent mode and for an extremely light algorithm. They still give some estimate on the amount of data writes to the disk even for algorithms that are 10-100 times slower. Nowadays a typical desktop computer can easily have 8-16 threads and the order of typically processed data per fuzzer iteration is closer to 1 kilobyte than to 128 bytes as in this case.

When looking at these numbers, btrfs, JFS, and XFS file systems are pretty dangerous ones to accidentally use, as the faster stdin mode where data is fed to always open file handle actually causes a meaningful amount of data to be written to a disk per month. Especially with XFS it’s 4 kilobytes of data written/fuzzing iteration. Taking into account that many solid state disk drives have write endurance only in hundreds of terabytes, accidentally running even one fuzzer for one month or more makes a significant dent in its lifetime for these file systems.

I also wanted to see what happens when you run multiple fuzzers in parallel on the same machine. The most trivial assumption for multiple fuzzers would be that the file system writes would increase about linearly with the number of fuzzers. The results on table 4 show that this is indeed the case for most of the file systems, except for ext2. Fortunately ext2 is not the worst data writer of the measured file systems and you really need to go extra mile to use it nowadays, as newer ext3 and ext4 file systems have replaced it by default.

writes/month	stdin x4	file x4
btrfs	9.53 TiB (78 %)	0.052 TiB (65 %)
ext2	0.072 TiB (450 %)	0.120 TiB (600 %)
ext3	0.075 TiB (94 %)	0.147 TiB (130 %)
ext4	0.090 TiB (64 %)	0.088 TiB (100 %)
JFS	44.4 TiB (110 %)	339 TiB (120 %)
ReiserFS	0.069 TiB (82 %)	0.072 TiB (75 %)
tmpfs	-	-
XFS	786 TiB (98 %)	0.035 TiB (220 %)

Table 4: Monthly data writes for a 4 parallel fuzzers on different file systems and the percentage to an extrapolated number based on one afl-fuzz instance.

The worrying part here is that two quite actively and widely used file systems, btrfs and XFS, suffer greatly from the writes with the standard input data generation pattern. As this is the faster mode of fuzzing from executions/second perspective, it’s quite nasty if you put the output directory for fuzzer findings on such file system.

Theoretical limits

I also wanted to see what would be the theoretical limits for the type of file system access pattern that afl-fuzz does. This is quite a synthetic benchmark, as it does not involve any context switches between processes, signal delivery, shared memory manipulation, file descriptor duplication, and other types of data processing that afl-fuzz does.

sequences/second	stdin	file
btrfs	87.5 k	40.7 k
ext2	446.1 k	66.5 k
ext3	305.4 k	60.3 k
ext4	211.5 k	56.4 k
JFS	737.7 k	59.5 k
ReiserFS	594.1 k	40.4 k
tmpfs	682.5 k	138.0 k
XFS	101.9 k	48.5 k

Table 5: Theoretical file system performance numbers with `afl-fuzz` like access pattern.

The difference between file systems in table 5 is many times larger for standard input based data generation than for afl-fuzz (table 1). When opening and closing files, the overhead is bigger and the difference between different file systems is also smaller.

If you also compare the overhead of pure file system accesses, it is higher than the overhead of libFuzzer’s 735 k executions/second basically for all cases. On top of this there is the overhead of everything else that happens inside afl-fuzz program.

End notes

You should use tmpfs that fully resides in memory for your fuzzing data generation with afl-fuzz. Usually this is available on /dev/shm/ or /run/shm/ globally and on systemd based systems also available under /run/user/$UID/. It’s faster and less straining to the hardware than trusting that the Linux block layer does not too often write to the physical device. Especially if you happen to use btrfs or XFS file systems.

In the end, this is mostly Linux specific information and for example OS X based systems need specific tricks to use memory based file systems. Same with FreeBSD where tmpfs based file system is not mounted by default.

Avatars, identicons, and hash visualization

Sun, 04 Feb 2018 18:00:09 +0200

Oftentimes we have a situation where we have a situation where we want an easy way to distinguish people of which we only know the name or a nickname. Probably the most common examples of these are distinguishing different people from each other on various discussion platforms, like chat rooms, forums, wikis, or issue tracking systems. These very often provide the possibility to use handles, often real names or nicknames, and profile images as an alternative. But as very often people don’t provide any profile image, many services use a programmatically generated images to give some uniqueness to default profile images.

In this article I’ll take a look at a few approaches to generate default profile images and other object identifiers. I also take a look at the process of creating some specific hash visualization algorithms that you may have encountered on the web. The general name for such programmatically generated images is hash visualization.

Use cases and background

A starting point to visualize arbitrary data is often hashing. Hashing in this context is to make a fixed length representation of an arbitrary value. This fixed length representation is just a big number that can be then used as a starting point for distinguishing different things from each other.

Hash visualization aims to generate a visual identifier for easy differentiation of different objects that takes an advantage of preattentive processing of human visual system. There are various categories that human visual system uses to do preattentive processing, like form, color, motion, and spatial position. Different hash visualization schemes knowingly or unknowingly take advantage of these.

Simple methods use specific colors as a visualization method, but at the same time limit the values that can be preattentively distinguished to maybe around 10-20 (around 4 bits of information) per colored area. Using additional graphical elements provides a possibility to create more distinct unique elements that are easy to separate from each other. One thesis tries hash visualization with 60 bits of data with varying success. So the amount of visualized data that can be easily distinguished from each other by graphical means is somewhere between 4 and 60 bits.

Avatars

Perhaps the most common way to distinguish different users of an Internet based service is to use an user name and an avatar image. Avatars often go with custom avatar images that users themselves upload to the service. In case the service itself does not want to host avatars, it can link to services like Gravatar that provides avatar hosting as a service. This way user can use the same avatar in multiple services just by providing their email address.

But what about a situation where the user has not uploaded or does not want to upload an avatar image to the service? There usually is a placeholder image that indicates that the user has not set their custom avatar.

There seem to be three major approaches to generate this placeholder image:

A generic dummy placeholder image. These often come in a shape of a generic human profile.
An image from predefined collection of images. These are usually related to a theme that the site wants to present.
A partially or fully algorithmically generated image. These usually make it possible to get the most variety for placeholders with the least amount of work.

Algorithmically generated images vary greatly and usually come in three varieties:

Simple images that mainly use a color and a letter to distinguish users.
Images that consist of a set of a predefined parts and color variations.
More complex shapes usually created fully algorithmically.

I have taken a closer look at different algorithmic avatar creation methods in form of WordPress identicon, GitHub identicon, and MonsterID. These provide an overview on how algorithmic avatar generation can work. There is also a chapter of using a character on a colored background to give an example of a simple algorithmic method for default avatar generation.

I want to focus here on visualizing the few methods that I have encountered rather than having a all-encompassing overview of avatar generation. For that, there is Avatar article on Wikipedia that gives a better textual overview of these and many more methods than I could give.

Character on a colored background

Figure 1: Avatar examples having a colored background with the first letter of an user name.

Services like Google Hangouts, Telegram and Signal, and software with people collaboration functionality like Jira generate a default avatar for an user by using the first letter of user’s name with a colored background. This is simple method where you need to select a font, a shape that you want to use to surround the letter, and size of the avatar. Some example avatars generated by this method are visible from figure 1.

WordPress identicon

Figure 2: Example WordPress identicons.

WordPress identicon has been a well known algorithmically generated avatar for quite long time on web forums. It has been used to distinguish between users based on their IP addresses and email addresses. Example WordPress identicons can be seen from figure 2 where you should be able to notice that there is some symmetry in these images.

Figure 3: 44 parts from which a WordPress identicon is generated.

Looking at the code how these avatars are generated reveals 44 patterns, shown in figure 3 that the generators uses to generate these images. They are symmetrically placed around the center and rotated 90 degrees when they reach the next quadrant. This process is illustrated in figure 4 that shows how these parts are placed and rotated around the center. In case the requested identicon has an even number of parts, these parts will get duplicated. So 4x4 WordPress identicon will only have 3 distinct parts in it. This is the same number as for 3x3 WordPress identicon.

Figure 4: 5x5 WordPress identicon generation animated.

GitHub identicon

Figure 5: Example GitHub style identicons.

GitHub is a great collaborative code repository sharing service where users can have identifying images. If user, however, does not have any profile image, GitHub will generate an identicon that forms a 5x5 pixel sprite. Figure 5 examples how such identicons can look like. The exact algorithm that GitHub uses to generate these identicons is not public, so these example images just imitating the style of GitHub identicons.

Figure 6: GitHub style identicon creation process animated. Visible pixels are horizontally mirrored.

Figure 6 shows the process of creating such identicon. Basically the given data is hashed and from that hash a specific color is selected as the value for a visible pixel. Then other part of the hash (15 bits) is used to form an image that is horizontally mirrored. I have used a modified ruby_identicon to generate these images and animations. This library also enables the generation of other sizes than 5x5 sprites.

MonsterID

Figure 7: Example artistic MonsterIDs.

WordPress MonsterID is a hash visualization method where hashes are converted to various types of monsters. It’s a one type of hash visualization method where lifelike creature is created from predefined body parts. WordPress MonsterID actually offers two types of monsters, the default ones and artistic ones. Here I’m showing some example artistic monsters in figure 7 just because they look better than the default ones.

Figure 8: Creating an artistic MonsterID image by selecting and colorizing an appropriate part from each category.

Figure 8 shows the MonsterID creation process. First there is a lightly colored background on top of which we start forming the monster. We select a body part in order of legs, hair, arms, body, eyes, and mouth. These parts are show in figure 9. Then assign a color to the selected part, unless it’s one of the special cases. Then just overlay it on top of the image formed from previous parts. There are special cases that what parts should have a predefined color or be left uncolored, but that does not change the general monster creation process.

Figure 9: Parts from which a MonsterID is created divided into categories (hair, eyes, mouth, arms, body, and legs).

In addition to WordPress MonsterID plugin, Gravatar also supports old style MonsterIDs. These unfortunately do not as good as the artistic monsters in the WordPress MonsterID plugin and there is no option to select an alternative form for these avatars.

OpenSSH visual host key

Hash visualization has also its place in cryptography. OpenSSH client has an option to display so called visual host keys.

When a SSH client connects to a remote server that it has not ever seen before, it needs to accept the public key of this server so that it can communicate with it securely. To prevent man-in-the-middle attacks, user initiating the connection is supposed to verify that the host key indeed matches. An example message about host key verification request is shown in figure 10. The idea is, that either you have seen the this key beforehand or you have some other means to get this key fingerprint and verify that it indeed belongs to the server you are trying to connect to. How this works in practice or not, is a completely different discussion.

$ ssh example.com
The authenticity of host 'example.com (127.1.2.3)' can't be established.
RSA key fingerprint is SHA256:Cy86563vH6bGaY/jcsIsikpYOHvxZe/MVLJTcEQA3IU.
Are you sure you want to continue connecting (yes/no)?

Figure 10: The default SSH server key verification request on the first connection.

As you can see, the key fingerprint is not easy to remember and also the comparison of two fingerprints can be quite hairy. When you enable visual host key support, as show in figure 11, the idea is that you can quickly glance to figure out if the key of a server is different than expected. This should not be treated as a method to see if two keys are equal, as different keys can produce the same image.

$ ssh -o VisualHostKey=true example.com
The authenticity of host 'example.com (127.1.2.3)' can't be established.
RSA key fingerprint is SHA256:Cy86563vH6bGaY/jcsIsikpYOHvxZe/MVLJTcEQA3IU.
+---[RSA 8192]----+
|     ..o.=+      |
|      . E.       |
|        . .      |
| .       o       |
|o o   + S o      |
|.+ o o + *       |
|o.. .o..B.o      |
|.o  o.*BB= .     |
|+ ...=o@@+o      |
+----[SHA-256]----+
Are you sure you want to continue connecting (yes/no)?

Figure 11: Visual host key enabled SSH server key verification request on the first connection.

The algorithm to generate these visual host keys is visualized at figure 12. It’s also described in detail in an article that analyzes the algorithmic security of this visual host key generation: The drunken bishop: An analysis of the OpenSSH fingerprint visualization algorithm.

OpenSSH’s visual host key is formed in such way that the formation starts from the center of the image. Then 2 bits of the host key fingerprint are selected and the location is moved in a diagonal direction. The visual element on the plane where this movement happens is changed based on how many times a certain location is visited. This can be better seen from the animation where each visual host key generation step is visualized as its own frame and where the bits related to the movement are highlighted.

Figure 12: OpenSSH visual host key generation animated.

The exact details how the key is generated can be seen from OpenSSH’s source code. This source code also refers to a paper called Hash Visualization: a New Technique to improve Real-World Security. But the fact that OpenSSH is usually limited to terminals with text interface makes it quite hard to use the more advanced Random Art methods described in the paper. But the idea for host key verification by using images has at least taken one form in the OpenSSH world.

Conclusions

I presented here some methods that I have encountered over the years on various sites and programs. I can not possibly cover all of them, as there likely are as many algorithms to create avatars as there are people inventing them. The biggest question with any of these algorithms is that how many bits they should try to visualize and what types of graphical elements they can use in the visualization without just creating indistinguishable clutter.

Taking a look at python-afl

Sun, 07 Jan 2018 18:00:08 +0200

I have been using american fuzzy lop to fuzz various C and C++ programs and libraries. It is a wonderfully fast fuzzer that is generally easy to get started with and it usually finds new bugs in programs that have not been fuzzed previously. Support for american fuzzy lop instrumentation has also been added for other languages and I decided to try out how it works with Python. More specifically with the reference CPython implementation of it.

Fuzzing Python programs with american fuzzy lop

American fuzzy lop generally works by running a program that is compiled with american fuzzy lop instrumentation built in. It executes the program with afl-fuzz command that modifies the input data that is fed to the program, monitors how the program behaves, and registers everything that causes abnormal program behavior. This works well for natively compiled programs, but causes various issues with interpreted programs.

Python is by default an interpreted language, so to execute Python programs, you need to start a Python interpreter before executing your code. This means that if you would instrument the Python interpreter with american fuzzy lop instrumentation and run the interpreter with afl-fuzz, it would mostly fuzz the inner workings of the interpreter, not the actual Python program.

Fortunately there is python-afl module that enables american fuzzy lop instrumentation for just the Python code instead of instrumenting the Python interpreter. In native programs american fuzzy lop compiler wrapper (afl-gcc, afl-clang, afl-clang-fast) adds the necessary instrumentation and the connection to afl-fuzz. Python-afl is, however, designed in such way that it doesn’t try to wrap the whole program, but requires you to create a wrapper module that initializes fuzzing.

The most simple way to wrap a Python program with python-afl is to initialize python-afl and then run the program (assuming that main() function exists):

import afl

afl.init()
import fuzzable_module
fuzzable_module.main()

This script, saved to fuzz-wrapper.py, can be then run with py-afl-fuzz command that wraps afl-fuzz for Python programs:

py-afl-fuzz -m 400 -i initial-inputs/ -o fuzzing-results/ -- \
    python fuzz-wrapper.py @@

More details about these command line switches can be found from AFL readme file. This then brings out the famous american fuzzy lop status screen, but now for Python programs:

Figure 1: afl-fuzz status screen with python-afl. Yes, I use white background.

Next sections will explain in more details how to make fuzzing these programs more efficient and what pitfalls there could be in Python programs from fuzzing efficiency point of view.

Afl-fuzz modes and their python-afl equivalents

Generally afl-fuzz provides 4 fuzzing modes that differ in how the program execution between different fuzzing inputs behaves:

Dumb mode that just executes the program by doing fork() and execv(). This is the slowest mode that does not rely on any fancy tricks to speed up program execution and also does not provide any insights how the program behaves with different inputs.
Basic fork server mode where the fuzzed binary does all the initialization steps that happen before calling the main() function and then program is repeatedly forked from that point on. This also includes instrumentation that is compiled in to the program so there already is some insight on what is happening inside the program when a specific input is processed. There exists QEMU mode for afl-fuzz that technically enables fork server mode for uninstrumented binaries, but with some performance penalty.
Deferred instrumentation that works in similar fashion as the basic fork server mode. Instead forking just before calling main() function, this enables to move the fork point further down the line and enables heavy program initialization steps to be avoided if they can be executed independently of the input.
Persistent mode where the fuzzable part of the program is repeatedly executed without resetting the program memory every time the program is called. This only works in practice if the program does not have a modifiable global state that can not be reset to the previous state.

Afl-fuzz generates new inputs and analyzes the program execution results roughly at the same speed regardless of the mode. So these modes are in the order of efficiency in a sense that how much overhead there is for fuzzing one input. They are also in the order of complexity on how easy they are to integrate into an existing program that has not been made to be fuzzed. Especially as the fastest modes require clang to be available as a compiler and the fuzzable program needs to be able to be compiled and linked with it.

Python-afl, fortunately, provides equivalent modes without having to use special tools. These are also very fast to try out, as you don’t need to compile Python programs from scratch.

The dumb mode would be just equivalent of running Python interpreter directly with afl-fuzz without any instrumentation, so we will skip over it. The more interesting part is to use the deferred instrumentation. The code in the introductory section called afl.init() before the fuzzable module was imported. This is the most safe approach, as the fuzz target might do something with the input at import time. But more realistically, Python programs generally only call import statements, possibly conditionally, during the start-up and don’t handle any user provided data yet. So in this case, we can do imports first and move the afl.init() function just before where the actual work happens:

import afl, fuzzable_module

afl.init()
fuzzable_module.main()

We can gain some speed-ups with this by calling os._exit() function instead of letting Python to exit in the usual fashion where all the destructors and other functions that are called at exit:

import afl, fuzzable_module, os

afl.init()
fuzzable_module.main()
os._exit(0)

Previous examples assume that the input file generated by the fuzzer comes as the first parameter on the command line. This is quite a good assumption, as many data processing modules for Python include a command line interface where they read and process files given on the command line. But if we can directly call the data processing function, we can instead use the standard input to feed the data:

import afl, fuzzable_module, os, sys

afl.init()
fuzzable_module.process_data(sys.stdin)
os._exit(0)

With Python 3 comes additional complexity. Python 3 processes the standard input using the encoding specified in the environment. Often in Unix environments it is UTF-8. As afl-fuzz mostly does bit manipulation, input is going to end up with broken UTF-8 data and results in exception when reading from the standard input file object. To work around this, you can use sys.stdin.buffer instead of sys.stdin in Python 3 based programs. Or create a shim that always results in raw bytes:

import afl, fuzzable_module, os, sys

try:
    # Python 3:
    stdin_compat = sys.stdin.buffer
except AttributeError:
    # There is no buffer attribute in Python 2:
    stdin_compat = sys.stdin

afl.init()
fuzzable_module.process_data(stdin_compat)
os._exit(0)

The fastest persistent mode requires that the program should not have a global state where the previous program execution affects the next one. There unfortunately is a surprising amount of global state in Python programs. It is not that uncommon to initialize some specific variables only during the program execution and then re-use the results later. This usually is harmless, but negatively affects the program stability that afl-fuzz is going to show in its status screen.

Persistent mode code for a fuzzable program could look like following including the learnings from the deferred instrumentation and its speedups:

import afl, fuzzable_module, os

try:
    # Python 3:
    stdin_compat = sys.stdin.buffer
except AttributeError:
    # There is no buffer attribute in Python 2:
    stdin_compat = sys.stdin

while afl.loop(10000):
    fuzzable_module.process_data(stdin_compat)
os._exit(0)

Persistent mode stability

This section has been added 6 months after writing the original article when Jakub Wilk pointed out potential stability issues with the original persistent mode code example on OS X and FreeBSD.

Some systems where Python runs implement file manipulation by using buffered functions. The most prominent operating system where afl-fuzz runs that uses buffered I/O in Python is OS X/macOS and using the persistent mode example in the previous section does not work in Python 2 on these systems. It shows up as a low stability percentage in afl-fuzz user interface that means that the same input from afl-fuzz perspective leads program taking different paths between subsequent executions, as the input read by the program is not the same as what afl-fuzz has provided.

Buffered reads from a file work by calling low level file reading system calls with a moderately sized input buffer. This input buffer then makes sure that if we do a lot of small file system reads, like reading frame based data formats frame by frame, they will not result in as many system calls as they would result without buffering. This generally increases program performance, as programmers can rely on file system functions to work relatively fast even with non-optimal access patterns.

You need to set the stream position to the beginning when using persistent mode with afl-fuzz on systems that use buffered I/O and you are reading from the standard input. All these examples read data from the standard input, as it is generally more efficient than opening a new file for each fuzzing iteration. Rewinding the stream position to the beginning by using file.seek(0) method.

Unfortunately not all streams support seeking and with afl-fuzz the situation is more tricky. If you execute the program normally and read from sys.stdin, by default the standard input does not support seeking. But when you run it through afl-fuzz, the standard input is in reality a file descriptor that is backed up by a file on a file system. And this file supports seeking. So you need some extra code to detect if the standard input supports seeking or not:

import afl, fuzzable_module, os

try:
    # Python 3:
    stdin_compat = sys.stdin.buffer
except AttributeError:
    # There is no buffer attribute in Python 2:
    stdin_compat = sys.stdin

try:
    # Figure out if the standard input supports seeking or not:
    stdin_compat.seek(0)
    def rewind(stream): stream.seek(0)
# In Python 2 and 3 seek failure exception differs:
except (IOError, OSError):
    # Do nothing if we can not seek the stream:
    def rewind(stream): pass

while afl.loop(10000):
    fuzzable_module.process_data(stdin_compat)
    rewind(stdin_compat)
os._exit(0)

The effect of adding standard input rewinding more than doubles the fuzzing overhead for the persistent mode on the Debian GNU/Linux system that I have used to do these benchmarks. On OS X the effect is less than 10 % overhead increase for Python 3 and is required for Python 2 to even work with afl-fuzz on OS X.

Benchmarking different afl-fuzz modes

I wanted to measure how these different afl-fuzz modes behave with Python. So I created a small fuzz target whose main algorithm does some conditional computation based on the input data and prints out the result:

def fuzz_one(stdin, values):
    data = stdin.read(128)
    total = 0
    for key in data:
        # This only includes lowercase ASCII letters:
        if key not in values:
            continue
        value = values[key]
        if value % 5 == 0:
            total += value * 5
            total += ord(key)
        elif value % 3 == 0:
            total += value * 3
            total += ord(key)
        elif value % 2 == 0:
            total += value * 2
            total += ord(key)
        else:
            total += value + ord(key)
    print(total)

This is just to exercise the fuzzer a little bit more than a trivial function that does nothing would do. I also created an equivalent fuzz target in C++ to give numbers to compare what kind of an overhead different fuzzing modes incur for both Python and native applications. The approximate results are summarized in table 1. The actual scripts used to generate this data are available from following links: measure-times.sh, target-simple.template.py, and target-simple.cpp.

	Python 2	Python 3	Native
dumb mode	110/s	47/s	1200/s
pre-init	130/s	46/s	5800/s
deferred	560/s	260/s	6800/s
quick exit	2700/s	2100/s	8700/s
rewinding persistent mode	5800/s	5900/s	-
persistent mode	17000/s	15000/s	44000/s

Table 1: afl-fuzz benchmarks for various fuzzing modes for Python 2, Python 3, and for C++ versions of the example fuzz target.

What these results show, it is possible to make fuzzable program in Python in such way that the fuzzing start-up overhead is only three to four times larger than for a native one with afl-fuzz. This is an excellent result considering that generally algorithms implemented in Python can be considered 10-100 times slower than ones implemented in C family languages. But if you want to use Python for performance critical tasks, you are anyways using Cython or write performance critical parts in C.

There is a clear performance difference between Python 2.7.14 and Python 3.6.4 especially when all start-up and exit optimization tricks are not in use. This difference is also visible in Python start-up benchmarks at speed.python.org. This difference gets smaller when the persistent mode is used as Python executable is not shut down immediately after processing one input. What can also help Python 3 in the persistent fuzzing mode is the fact that the tracing function that python-afl sets with sys.settrace() is called only half as often with Python 3 is it is called with Python 2 for this fuzz target.

More speed for repeated Python executions

Python enables imports and other type of code loading at any phase of the program execution. This makes it quite possible that program has not fully loaded before it is executed for the first time. This is usually used as a configuration mechanism to configure the program based on the runtime environment and other type of configuration. It also provides the possibility to write plugins, similarly to what dynamically loaded libraries provide.

You can see from table 1 how the fuzzing overhead decreases a lot when the fuzzing start point moved after import statements for this simple example program. So when I was fuzzing an old version of flake8, I tried to see if it had any hidden configuration statements that would execute and cache their results for repeated calls. And it did!

Initially I used following type of fuzzing wrapper for flake8:

import afl, sys, flake8.run
afl.init()
flake8.run.check_code(sys.stdin.read())

It is basically a simple wrapper that imports all what it needs and then fuzzes what is fed from the standard input to the program. But the performance of this was horrible, around 15 executions/second. So I tried to see what happens when I change the code a little bit by calling the flake8.run.check_code() function with an empty string before setting the fuzzing starting point:

import afl, sys, flake8.run
# Make sure that the hidden runtime configuration is executed:
flake8.run.check_code("")
afl.init()
flake8.run.check_code(sys.stdin.read())

This doubled the execution speed to around 30 executions/second. It is still quite slow for the small inputs that afl-fuzz initially creates, but an improvement nonetheless. I looked at what flake8 does when it is executed and a following line popped up:

from pkg_resources import iter_entry_points

There basically is a hidden import statement in the execution path whose result is cached after the first time it is encountered. Also this pkg_resources.iter_entry_points() function is used to configure the program at runtime and that also adds some extra overhead to the process.

Flake8 also by default tries to execute checks in parallel with multiprocessing module. This might be a good idea when you have multiple files to verify at once, but during fuzzing it just adds unneeded overhead. Also the fact that it starts a new process makes the fuzzer lose all information about what is happening in the subprocess. Fortunately in flake8 it was possible to override the detected multiprocessing support by just setting one variable into False and then flake8 will act as there is no multiprocessing support. This increased the average fuzzing speed of flake8 by threefold.

The final speedup with flake8 came when looking at how flake8 is constructed. It is basically a wrapper around mccabe, pycodestyle, and pyflakes packages. So rather than fuzz flake8, it is much more productive to create a fuzz target for each one of those packages individually. I did this for pycodestyle and ended up finally executing it with around 420 executions/second for trivial data and around 200 executions/second for more realistic. So basic recommendations on how to fuzz native programs also apply for Python.

Monkey patching around fuzzing barriers

As american fuzzy lop does not know almost anything about the input, it can encounter various impediments (see section 13 from afl’s README.txt) when the file format includes any values that depend on the previously encountered data. This is especially problematic with checksums and is also an issue with other mutation based fuzzers. To work around this the C world, you can generally use C preprocessor to turn off checksum verification when such thing is encountered. This also applies for all other types of barriers that might skew fuzzing results, like random number usage.

Unfortunately Python does not have preprocessor support by default, so this type of conditional compiling is out of the question. Fortunately Python provides the possibility to do monkey patching where you can replace functions or methods at runtime. So to make a library more fuzzer friendly, you can monkey patch all functions related to data verification to always return True, or return some constant value when checksums are considered.

I used this approach to fuzz python-evxt 0.6.1 library. Python-evxt is a library to parse Windows event log files and is one of the first hits when you search Python Package Index with “pure python parser” keyword. The file format includes CRC32 checksum that will prevent the fuzzer from being able to meaningfully modify the input file, as almost all modifications will create an incorrect checksum in the file.

To monkey patch around this issue, I searched the source code for all functions that potentially have anything to do with checksum generation and made them always to return a constant value:

import Evtx.Evtx

checksum_patches = (
    (Evtx.Evtx.ChunkHeader, "calculate_header_checksum"),
    (Evtx.Evtx.ChunkHeader, "header_checksum"),
    (Evtx.Evtx.ChunkHeader, "calculate_data_checksum"),
    (Evtx.Evtx.ChunkHeader, "data_checksum"),
    (Evtx.Evtx.FileHeader, "checksum"),
    (Evtx.Evtx.FileHeader, "calculate_checksum"),
)

for class_obj, method_name in checksum_patches:
    setattr(class_obj, method_name, lambda *args, **kw: 0)

The checksum_patches variable holds all the functions that you need to overwrite to ignore checksums. You can use setattr() to overwrite these methods on class level with an anonymous function lambda *args, **kw: 0 that always returns 0 and takes any arguments. Taking any number of arguments in any fashion is enabled with *args and **kw and this syntax is explained in keyword arguments section on Python’s control flow tools manual.

Takeaways

I was impressed on how low fuzzing overhead Python programs can have when compared to C family languages. When looking at how python-afl is implemented, it becomes quite clear that the deferred mode of afl-fuzz has a great part in this where the whole Python environment and program initialization is skipped between different fuzzing runs.

Making a Python module fuzzable was also more easy than in C family languages. Although american fuzzy lop is already the easiest fuzzing engine to get started with, the fact that it uses clang to do the more advanced tricks often gives headaches when trying to get software to compile. The fact that I did not have to use any modified Python interpreter to get started but only had to import afl module made me to realize how many steps I skipped that I normally have to do when using american fuzzy lop on new systems.

Thanks to Python’s dynamic execution and monkey patching capabilities I could also try out fuzz target creation with external libraries without having to actually modify the original code. Especially selecting some specific functions to fuzz and overriding checksum generation would generally require nasty software patches with C family languages. Especially if there is main() function to override.

It was also nice to realize that the os._exit(0) equivalent in standard C library function call _Exit(0) could help make light native fuzz targets even faster. The process cleanup in this relatively trivial C++ program adds almost 30% overhead for repeated program execution. Although it will likely break any sanitizer that does any verification work during the exit, like searching for dynamically allocated memory that was not freed.

Nanorepositories

Sun, 19 Feb 2017 18:00:07 +0200

I recently encountered a microservice antipattern called nanoservice that is described in following manner:

Nanoservice is an Anti-pattern where a service is too fine grained. Nanoservice is a service whose overhead (communications, maintenance etc.) out-weights its utility.

I have encountered a lot of similar situations with source code repositories where different parts of a program or a system that was shipped as a single entity were divided into smaller repositories. The code in these repositories was not used by anything else than the product (the unit of release) that the code was part of.

In the most extreme case there were thousands of smaller repositories. Many of those repositories were just holding less than 10 files in a deep directory hierarchy implementing some tiny functionality of the whole. Plus some boilerplate functionality for building and repository management that could just have been a couple of extra lines in a build system for a larger entity.

In more common cases, one product consists of dozens of smaller repositories where one or two repositories get over 90% of the whole weekly commit traffic and other repositories just get one commit here and there. Or that there are multiple interlinked repositories (see an example in How many Git repos article) that depend on each other and very often all of them need to go through the same interface changes.

Sometimes there also is a situation that all the work is done in smaller repositories and then there is one superproject[1, 2, 3] that is automatically updated when a commit happens in any of its child projects. So basically you have one big repository that just consists of pointers to smaller repositories and has one unneeded layer of indirection. Also instead of making one commit that would reveal integration problems immediately, you now need to have multiple commits to reveal these issues. With some extra unneeded delay.

I would suggest calling these types of repositories nanorepositories. A nanorepository is a repository that holds a subsystem that is too limited to stand on its own and needs a bigger system to be part of. This bigger system usually is also the only system that is using this nanorepository. The nanorepository is also owned by the same organization as the larger entity it’s part of. Therefore it doesn’t give any advantages, for example, in access control. Nanorepositories can hold just couple of files, but they can also be relatively large applications that are, however, tightly coupled with the system they are part of.

Downsides of premature repository division

Nanorepositories are a case of premature optimization for code sharing where there is no real need for it. There are articles (Advantages of monolithic version control, On Monolithic Repositories) and presentations (Why Google Stores Billions of Lines of Code in a Single Repository, F8 2015 - Big Code: Developer Infrastructure at Facebook’s Scale) talking about the advantages of monolithic repositories, but those advantages can be hard to grasp without knowing the disadvantages of the other end.

I’ll list some issues that I have encountered when working with independent repositories. These all lead to a situation where developers need to spend extra time in repository management that could be avoided by grouping all software components that form the final product into one repository.

Expensive interface changes

Interface changes between components become expensive. You also need cumbersome interface deprecation policies and a way to support the old and new interface versions between repositories until the change has propagated everywhere. It can take months or years to ensure that all interface users have done the intended interface change. And if you don’t have a good search at your disposal, you still can’t be sure about it before you really remove the old interface.

With separate repositories it’s often the case that you can’t easily search where the interface you are deprecating is used at. This means that you don’t beforehand know what is actually depending on the interface. There naturally are search engines that span over multiple repositories, but they very rarely beat a simple grep -r (or git grep) command whose output you can further filter with simple command line tools. Especially if there are hundreds of small repositories that you need to include in your search.

Ignore file duplication

Often you need to add ignore files (.gitignore, .hgignore, etc…) to prevent junk going in to the repository by accident. Situations that can generate junk next to your code can be for example:

In-source build (versus separate build directories) generated files (*.a, *.o, *.exe, *.class, *.jar…).
Using any editor that creates backup and other files next to the file you are editing (*~, *.swp, *.bak…).
Using interpreted languages, like Python, whose default implementation byte compile the scripts for faster start-up (*.pyc, *.pyo…).
Using integrated development environments that require their own project directories.

All these generic ignore rules need to be included in every project in addition to project specific ignores. Other possibility is forcing these ignore rules on developers themselves instead of taking care of them centrally. In case of nanorepositories there likely is just one or two languages used per repository, so the amount of ignore rules likely depends on the development environment that the developers work with. But it’s still needless duplication when you could get by without.

Re-inventing inefficient build system rules

Small repositories lead into having to reinvent build system rules for every repository from scratch if you want to test and build your component in isolation. Or doing a lot of code duplication or including references to a repository including common build rules. This also applies for test runners, as different levels of testing for different languages usually have their own test runners. Sometimes multiple test runners per language, that all have some non-default options that provide various advantages in test result reporting.

Modern build systems like, ninja and Bazel, usually work on knowing the whole build graph of the system that they are trying to build. This makes it more easy to discover dependencies that only rebuild parts that are necessary to rebuild. Building every repository independently from each other leads into recursive build systems that treat their inputs as black boxes (Bitbake, npm, Maven, Make…). Changes in these black boxes are either communicated with version number changes or always rebuilding the component in question. This leads into a wasteful process and resource usage when compared to trunk based development of monolithic repositories.

Overly complicated continuous integration machinery

One of the defining principles of modern software development is having a working continuous integration system in place. This ensures that independent changes also work when they leave developer’s machine and also work when they are integrated with the rest of the product that the change is part of. And this is done multiple times every day. This, combined with trunk based development, keeps integration issues short (minutes to days) and avoids many-month release freezes compared to branched or forked development methods.

Nanorepositories likely end up in a repository specific checks in the continuous integration machinery that only verify that the component in the repository itself works well. And if this continuous integration machinery has an automatic per repository check job generation, it likely needs to have an entry point (like make test or test.sh script) to execute those tests. And the same applies for compilation. Not to mention the extra work when trying to compile and test against different systems and runtime instrumentation (like AddressSanitizer).

When the component finally gets integrated with everything else and the system breaks, figuring out the exact commit where the breakage happens (besides the integrating one) can be really painful. This is because it is easily possible to have dozens to thousands of commits between component releases. See a physical world example where components work perfectly together, but fail when integrated. And its hotfix.

A case for small repositories

Nanorepositories should not be confused with small independent repositories, as not everything needs to aim to be a part of a bigger product. A very common reason for small repositories is the combination of ownership management with shareable components. Most open source projects are such that they are owned by a certain people, or an organization, and it’s just not a good case for them to be part of anything bigger. Especially if they provide independent components that really are used by multiple external entities. Same downsides, however, generally apply to a collection of open source projects as to products consisting of multiple repositories.

Static code analysis and compiler warnings

Sun, 08 May 2016 19:00:06 +0300

Compiler generated warnings are one form of static code analysis that provides a codified form of certain types of beneficial programming practices. Nowadays modern compilers used to compile C family languages (C, C++, and Objective-C) provide hundreds of different warnings whose usefulness varies depending on project and its aims.

In this article I will examine what level of issues compiler warnings can find, what is the cost of enabling warnings and analyze compiler warning flag lists for both clang and GCC compilers.

Levels of static code analysis

Compiling C family languages usually involves preprocessor, compiler, assembler, and a linker. This also leads to situation that static code analysis can be done in various phases of program construction. These generally are:

Analysis on plain source files.
Analysis on preprocesses source files.
Analysis on compilation unit level.
Link-time analysis.

This multi-stage program construction results in difficulties for tools that are not called with the exact same arguments providing information about preprocessor definitions, and include and library directories. For example tools like splint, Cppcheck, and many editor front-ends work outside the build system and can result in false warnings because they can not see inside some macro definitions that were not included in the simple static analysis setup. This becomes an issue with larger projects that do not necessarily have the most straightforward build setups and the most trivial header file inclusion policies. This does not mean that such tools are useless, but they will result in false positive warnings that can be really annoying unless they are silenced or ignored in some way.

Analysis on preprocessed source files already provides pretty accurate picture of what kind of issues there can be in the program, but it necessarily is not enough. In the compilation phase compilers constantly transform the program into new, functionally equivalent, forms during optimization phases that can even result in unexpected code removal that is not necessarily trivial to notice. Compilation phase also gives more opportunities for target platform specific static code analysis. For example pipeline stalls or value overflows due to incorrect assumptions on data type sizes can usually be noticed only after the target platform is known.

Final phase in program construction, that provides options for static analysis, is the linking phase. In the linking phase linker takes care that all the functions and global variables that the program calls come from somewhere and that there are no conflicting duplicate names defined. This should also enable some automatic detection capabilities for memory leaks and such that come from calling functions defined in different compilation units. I’m not sure if any freely available static analyzer does this.

Compiler warning flags

Compiler warning flags are one way to do static code analysis that cover all possible phases of program construction. This assumes that the compiler is involved in all phases of program construction. And they usually are, as in all phases from preprocessing to linking compiler front-end is used as a wrapper to all the tools that do the actual hard work.

Warning flags and compilation time

Using static code analysis in form of compiler warnings incurs some penalty, as they need to execute some extra code in addition to normal code related to compilation. To measure the penalty and to contrast it with some more advanced static analysis tools,

I did some benchmarks by compiling Cppcheck 1.73 and FFTW 3.3.4 with clang 3.8, GCC 6.1, and Infer 0.8.1 by using -O3 optimization level. Cppcheck is a program mainly written in C++ and FFTW is mainly written in C. Infer has some experimental checks for C++ enabled with --cxx command line option, so I ran Infer twice for Cppcheck, with and without C++ checks. Clang had all warnings enabled -Weverything and GCC had all warning options that did not require any special values. This resulted in following minimum execution times of 3 runs:

Compiler	Program	No warnings	All warnings
clang	Cppcheck	59.3 s	1 min 1.1 s (+ 3.0 %)
GCC	Cppcheck	1 min 32.7 s	1 min 38.8 s (+ 6.6 %)
Infer	Cppcheck	-	17 min 50 s (18x slower)
Infer `--cxx`	Cppcheck	-	1 h 36 min (97x slower)
clang	FFTW	40.5 s	40.9 s (+ 1 %)
GCC	FFTW	42.7 s	58.1 s (+ 36 %)
Infer	FFTW	-	4 min 43 s (10x slower)

We can see that for clang and GCC the extra processing time added even by all warnings flags is pretty small compared to all the other compilation and optimization steps for a C++ application (Cppcheck). But for mostly C based application (FFTW) GCC gets surprisingly heavy, although build times still remain within the same order of magnitude.

If we then compare the time that a more heavy static code analyzer takes, these compiler warnings are extremely cheap way to add static code analysis. They may not catch all the same bugs as these more advanced methods do, but they do offer a cheap way to avoid the basic mistakes.

Warning flag lists

I have created a project that can automatically parse compiler warning flags from command line option definition files in clang and GCC. This came partially from a necessity and partially from curiosity to examine what kind of options clang and GCC provide in easy to digest format. Although both compiler provide some kind of lists of warning flags as part of their documentation, they are pretty cumbersome to go through when the main interest is first figure what there is available and then just look at the details.

Warning options and deprecation

Different compilers have different policies about backwards compatibility and deprecation. When looking at how warning options have evolved, GCC has not removed between versions 3.4 and 6.1 a single switch, it has just switched them to do nothing (-Wimport, -Wunreachable-code, and -Wmudflap switches). Clang on the other hand has removed multiple switches between versions and for example there is no references to -Wcxx98-cxx11-compat in the current codebase even if clang 3.3 had such switch.

Examining differences visually

Generating large purely textual differences between different files becomes quite cumbersome quite soon if you want to do anything more complicated than a simple difference of unique command line options between two subsequent versions. For example if we look at figure 1 that shows what other warnings -Wall flag enables in GCC 6 when compared to GCC 5. We can see that there are quite many extra warnings added to -Wall switch so newer compiler versions provide extra analysis capabilities even without adding all the new options individually.

Figure 1: Meld showing differences what flags -Wall enables between GCC 5 and 6.

From figure 2 we can also see that GCC 6 uses -Wc++11-compat as the default warning flag indicating differences between ISO C++ 1998 and ISO C++ 2011 for constructs that have the same name instead of -Wc++0x-compat, that refers to a draft standard. So GCC has basically deprecated -Wc++0x-compat switch in favor of a switch that refers to the actual standard.

Figure 2: -Wc++0x-compat is an alias of -Wc++11-compat in GCC 6 instead the other way around.

Suggestions for usable warning options

I won’t be giving any specific suggestions here for warning flags, as there seem to be new options for each subsequent compiler release. A good place to start is NASA’s JPL Institutional Coding Standard for the C Programming Language that includes a very short list of rudimentary warning flags for GCC. It also includes a short list of coding standards of which each one would have prevented a mission failure for NASA. SEI CERT coding standards for secure coding also provide various automatically generated lists for clang warning flags and GCC warning flags based on the issues that these standards take into account.

And finally, check out the warning flag lists for clang and GCC and make your own combinations that bring the most benefit for whatever you are working with. Not all of them are appropriate for your project and some of them may be even working against the useful development patterns that you have.

Cautionary tales about compiler warnings flags

Even though it might sound like a good idea to rush and fix all the issues that these new compiler warning flags uncover, it might actually cause some new bugs to pop up. Specifically SQLite database engine has had its own take on compiler warnings and their fixing and they have concluded that fixing compiler warnings actually has produced some extra bugs that would not have come into light if there would have not been tries to fix compiler warnings.

I have also had my own take on compiler warning fixes and sometimes I have screwed up and messed up with a perfectly working code while fixing a misleading warning. But generally my own experience has lead to more fixes than what there have been bugs. And the coolest thing is, that having these warnings enabled as the standard development process prevent some bugs from ever creeping up to the application in the first place.

Being a good CPU neighbor

Sun, 28 Feb 2016 18:00:05 +0200

Very often computational tasks are roughly divided into real-time and batch processing tasks. Sometimes you might want to run some tasks that take a large amount of computation resources, get the result as fast as possible, but you don’t really care exactly when you get the result out.

In larger organizations there are shared computers that usually have a lot of CPU power and memory available and that are mostly idle. But they are also used by other users that will be quite annoyed if they suffer from long delays to key presses or from similar issues because your batch processing tasks take all the resources that are available away from the system. Fortunately *nix systems provide different mechanisms available to non-root users to avoid these kind of issues.

These issue can also raise in situations where you have a system that can only allocate static resource requirements to tasks, like in Jenkins. As an example, you might have some test jobs that need to finish within certain time limit mixed with compilation jobs that should finish as soon as possible, but that can yield some resources to these higher priority test jobs, as they don’t have as strict time limitation requirements. And you usually don’t want to limit the resources that compilation jobs can use, if they are run on otherwise idle machine.

Here I’m specifically focusing on process priority settings and other CPU scheduling settings provided by Linux, as it currently happens to be probably the most used operating system kernel for multi-user systems. These settings affect how much CPU time a process gets relative to other processes and are especially useful on shared overloaded systems where there are more processes running than what there are CPU cores available.

Different process scheduling priorities in Linux

Linux and POSIX interfaces provide different scheduling policies that define how Linux scheduler allocates CPU time to a process. There are two main scheduling policies for processes: real-time priority policies and normal scheduling policies. Real-time policies are usually accessible only to root user and are not the point of interest of here, as they can not usually be used by normal users on shared systems.

At the time of the writing this article there are three normal scheduling policies available to normal users for process priorities:

SCHED_OTHER: the default scheduling policy in Linux with the default dynamic priority of 0.
SCHED_BATCH: policy for scheduling batch processes. This schedules processes in similar fashion as SCHED_OTHER and is affected by the dynamic priority of a process. This makes the scheduler assume that the process is CPU intensive and makes it harder to wake up from a sleep.
SCHED_IDLE: the lowest priority scheduling policy. Not affected by dynamic priorities. This equals to nice level priority of 20.

The standard *nix way to change process priorities is to use nice command. By default in Linux, process can get nice values of -20–19 where the default nice value for a process is 0. When process is started with nice command, the nice value will be

Values 0–19 are available to a normal user and -20–-1 are available to the root user. The lowest priority nice value of 19 gives process 5% of CPU time with the default scheduler in Linux.

chrt command can be used to give a process access to SCHED_BATCH and SCHED_IDLE scheduling policies. chrt can be used with combination of nice command to lower the priority of SCHED_BATCH scheduling policy. And using SCHED_IDLE (= nice level 20) policy should give around 80% of the CPU time that nice level of 19 has, as nice level weights increase the process priority by 1.25 compared to lower priority level.

Benchmarking with different competing workloads

I wanted to benchmark the effect of different scheduling policies on a work done by a specific benchmark program. I used a system with Linux 3.17.0 kernel with Intel Core i7-3770K CPU at 3.5 GHz processor with hyperthreading and frequency scaling turned on and 32 gigabytes of RAM. I didn’t manage to make the CPU frequency constant, so results vary a little between runs. I used John the Ripper’s bcrypt benchmark started with following command as the test program for useful work:

$ /usr/sbin/john --format=bcrypt -test

I benchmarked 1 and 7 instances of John the Ripper for 9 iterations (9 * 5 = 45 seconds) with various amounts of non-benchmarking processes running at the same time. 1 benchmarking process should by default not have any cache contention resulting from other benchmarking processes happening and 7 benchmarking processes saturate all but 1 logical CPU core and can have cache and computational unit contention with each other.

Non-bechmarking processes were divided into different categories and there were 0–7 times CPU cores instances of them started to make them fight for CPU time with benchmarking processes. The results of running different amounts of non-benchmarking processes with different scheduling policies can be found from tables 1, 2, 3, and 4. Different scheduling policies visible in those tables are:

default: default scheduling policy. Command started without any wrappers. Corresponds to SCHED_OTHER with nice value of 0.
chrt-batch: the default SCHED_BATCH scheduling policy. Command started with chrt --batch 0 prefix.
nice 10: SCHED_OTHER scheduling policy with the default nice value of 10. Command started with nice prefix.
nice 19: SCHED_OTHER scheduling policy with nice value of 19. Command started with nice -n 19 prefix. This should theoretically take around 5 % of CPU time out from the useful work.
nice 19 batch: SCHED_BATCH scheduling policy with nice value of 19. Command started with nice -n 19 chrt --batch 0 prefix.
sched idle: SCHED_IDLE scheduling policy. Command started with chrt --idle 0 prefix. This should theoretically take around 80 % of CPU time compared to nice level 19 out from useful work.

The results in those tables reflect the relative average percentage of work done compared to the situation when there are no additional processes disturbing the benchmark. These only show the average value and do not show, for example, variance that could be useful to determine how close those values are of each other. You can download get the raw benchmarking data and the source code to generate and analyze this data from following links:

CPU looper

CPU looper applications consists purely of an application that causes CPU load. Its main purpose is just to test the scheduling policies without forcing CPU cache misses. Of course there will be context switches that can replace data cache entries related to process state and instruction cache entries for the process with another, but there won’t be any extra intentional cache flushing happening here.

The actual CPU looper application consists code shown in figure 1 compiled without any optimizations:

int main() {
    while (1) {}
    return 0;
}

Figure 1: source code for CPU looper program.

The results for starting 0–7 times CPU cores of CPU looper processes with the default priority and then running benchmark with 1 or 7 cores can be found from tables 1 and 2.

1 worker	0	1	2	3	4	5	6	7
default	100	64.6	36.4	25.5	18.2	15.4	12.5	11.0
nice 0 vs. sched batch	100	66.5	37.0	24.0	16.3	14.8	12.2	10.9
nice 0 vs. nice 10	100	93.3	90.2	82.7	75.4	75.2	75.0	74.9
nice 0 vs. nice 19	100	92.9	90.4	91.0	91.8	89.8	89.4	90.9
nice 0 vs. nice 19 batch	100	94.0	89.2	91.4	89.8	86.4	89.4	90.7
nice 0 vs. sched idle	100	95.0	91.3	92.2	92.0	92.6	91.0	92.4
—
nice 10 vs. nice 19	100	80.0	74.7	68.0	67.7	67.5	64.5	60.7
nice 10 vs. nice 19 batch	100	92.5	83.8	75.9	75.0	74.3	74.1	69.8
nice 10 vs. sched idle	100	89.8	89.7	90.9	89.0	87.8	87.4	87.2
—
nice 19 vs. sched idle	100	77.9	69.7	66.9	66.3	64.6	63.4	58.1

Table 1: the relative percentage of average work done for one benchmarking worker when there are 0–7 times the logical CPU cores of non-benchmarking CPU hogging jobs running with different scheduling policies.

7 workers	0	1	2	3	4	5	6	7
default	100	49.8	33.6	24.6	19.4	16.4	13.8	12.2
nice 0 vs. sched batch	100	51.1	34.0	25.1	20.0	16.6	14.2	12.4
nice 0 vs. nice 10	100	92.9	87.2	80.1	75.0	71.0	66.3	61.1
nice 0 vs. nice 19	100	94.1	94.9	95.4	95.2	96.2	95.7	96.0
nice 0 vs. nice 19 batch	100	95.7	95.2	95.7	95.5	95.2	94.6	95.4
nice 0 vs. sched idle	100	96.3	95.7	96.2	95.6	96.7	95.4	96.7
—
nice 10 vs. nice 19	100	93.6	84.9	78.0	71.4	65.7	60.4	56.2
nice 10 vs. nice 19 batch	100	92.4	84.1	77.6	69.9	65.6	60.7	56.4
nice 10 vs. sched idle	100	95.5	94.9	94.5	93.9	94.3	93.7	91.9
—
nice 19 vs. sched idle	100	91.4	80.9	71.4	64.3	58.3	52.5	48.5

Table 2: the relative percentage of average work done for seven benchmarking workers when there are 0–7 times the logical CPU cores of non-benchmarking CPU hogging jobs running with different scheduling policies.

Tables 1 and 2 show that the effect is not consistent between different scheduling policies and when there is 1 benchmarking worker running it suffers more from the lower priority processes than what happens when there are 7 benchmarking workers running. But with higher priority scheduling policies for background processes the average amount of work done for 1 process remains higher for light loads than with 7 worker processes. These discrepancies can be probably explained by how hyperthreading works by sharing the same physical CPU cores and by caching issues.

Memory looper

Processors nowadays have multiple levels of cache and no cache isolation and memory access from one core can wreak havoc for other cores with just moving data from and to memory. So I wanted to see what happens with different scheduling policies when running multiple instances of a simple program that run on the background when John the Ripper is trying to do some real work.

#include <stdlib.h>

int main() {
    // 2 * 20 megabytes should be enough to spill all caches.
    const size_t data_size = 20000000;
    const size_t items = data_size / sizeof(long);
    long* area_source = calloc(items, sizeof(long));
    long* area_target = calloc(items, sizeof(long));
    while (1) {
        // Just do some memory operations that access the whole memory area.
        for (size_t i = 0; i < items; i++) {
            area_source[i] = area_target[i] + 1;
        }
        for (size_t i = 0; i < items; i++) {
            area_target[i] ^= area_source[i];
        }
    }
    return 0;
}

Figure 2: source code for memory bandwidth hogging program.

Program shown in figure 2 is compiled without any optimizations and it basically reads one word from memory, adds 1 to it and stores the result to some other memory area and then XORs the read memory area with the written memory area. So it basically does not do anything useful, but reads and writes a lot of data into memory every time the program gets some execution time.

Tables 3 and 4 show the relative effect on John the Ripper benchmarking program when there are various amounts of the program shown in figure 2 running at the same time. If you compare these numbers to values shown in tables 1 and 2 a program that only uses CPU cycles is running, the numbers for useful work can be in some cases around 10 percentage points lower. So there is apparently some cache contention ongoing with this benchmarking program and the effect of lower priority scheduling policies is not the same that could be theoretically expected just from the allocated time slices.

1 worker	0	1	2	3	4	5	6	7
default priority	100	61.7	32.6	21.8	16.6	13.3	11.2	9.8
nice 0 vs. sched batch	100	60.2	32.6	22.7	16.2	12.9	10.9	10.2
nice 0 vs. nice 10	100	85.2	82.1	75.8	70.8	69.6	70.5	68.1
nice 0 vs. nice 19	100	85.4	83.0	79.7	81.7	78.2	81.9	84.7
nice 0 vs. nice 19 batch	100	83.8	81.2	80.5	83.4	80.4	79.4	84.3
nice 0 vs. sched idle	100	82.9	80.9	81.2	82.1	80.6	82.3	81.8
—
nice 10 vs. nice 19	100	80.0	74.7	68.0	67.7	67.5	64.5	60.7
nice 10 vs. nice 19 batch	100	80.8	74.1	67.6	67.1	67.9	65.5	63.3
nice 10 vs. sched idle	100	83.9	81.6	79.2	77.0	75.9	76.8	77.1
—
nice 19 vs. sched idle	100	77.9	69.7	66.9	66.3	64.6	63.4	58.1

Table 3: the relative percentage of average work done for one benchmarking worker when there are 0–7 times the logical CPU cores of non-benchmarking CPU and memory bandwidth hogging jobs running with different scheduling policies.

7 workers	0	1	2	3	4	5	6	7
default	100	48.4	32.6	24.5	19.7	16.3	14.2	12.3
nice 0 vs. sched batch	100	48.6	32.2	23.7	19.4	16.2	14.0	12.6
nice 0 vs. nice 10	100	89.3	81.5	74.6	69.2	64.6	60.4	56.3
nice 0 vs. nice 19	100	92.0	90.5	90.4	91.2	91.3	90.6	90.5
nice 0 vs. nice 19 batch	100	91.5	91.8	92.5	91.8	91.9	92.1	91.9
nice 0 vs. sched idle	100	92.1	91.9	91.9	91.5	92.0	92.4	92.1
—
nice 10 vs. nice 19	100	85.4	77.3	70.4	63.3	58.3	54.2	50.3
nice 10 vs. nice 19 batch	100	86.8	77.7	70.0	63.4	58.8	54.4	50.6
nice 10 vs. sched idle	100	90.6	89.0	88.9	88.9	88.0	87.4	86.3
—
nice 19 vs. sched idle	100	82.8	72.9	62.9	57.3	52.2	48.2	44.1

Table 4: the relative percentage of average work done for seven benchmarking workers when there are 0–7 times the logical CPU cores of non-benchmarking CPU and memory bandwidth hogging jobs running with different scheduling policies.

Conclusions

These are just results of one specific benchmark with two specific workloads on one specific machine with 4 hyperthreaded CPU cores. They should anyways give you some kind of an idea how different CPU scheduling policies under Linux affect the load that you do not want to disturb when your machine is more or less overloaded. Clearly when you are reading and writing data from and to memory, the lower priority background process has bigger impact on the actual worker process than what the allocated time slices would predict. But, unsurprisingly, a single user can have quite a large impact on how much their long running CPU hogging processes affect rest of the machine.

I did not do any investigation how much work those processes that are supposed to disturb the benchmarking get done. In this case SCHED_BATCH with nice value of 19 could probably be the best scheduling policy if we want to get the most work done and at the same time avoid disturbing other users. Otherwise, it looks like the SCHED_IDLE policy, taken into use with chrt --idle 0 command, that is promised to have the lowest impact on other processes has the lowest impact. Especially when considering processes started with lower nice values than the default one.

C, for(), and signed integer overflow

Sun, 21 Feb 2016 18:00:04 +0200

Signed integer overflow is undefined behavior in C language. This enables compilers to do all kinds of optimization tricks that can lead to surprising and unexpected behavior between different compilers, compiler versions, and optimization levels.

I encountered something interesting when I tried different versions of GCC and clang with following piece of code:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int iterations = 0;
    for (int i = INT_MAX - 2; i > 0; i++) {
        iterations++;
    }
    printf("%d\n", iterations);
    return 0;
}

When compiling this with GCC 4.7, GCC 4.9, and clang 3.4 and compilation options for x86_64 architecture, we get following output variations:

$ gcc-4.9 --std=c11 -O0 tst.c && ./a.out
3
$ gcc-4.9 --std=c11 -O3 tst.c && ./a.out
2
$ clang-3.4 --std=c11 -O3 tst.c && ./a.out
4
$ gcc-4.7 --std=c11 -O3 tst.c && ./a.out asdf
<infinite loop...></code>

So let’s analyze these results. The loop for (int i = INT_MAX - 2; i > 0; i++) { should only iterate 3 times until variable i wraps around to negative value due to two’s complement arithmetic. Therefore we would expect iterations variable to be 3. But this only happens when optimizations are turned off. So let’s see why these outputs result in these different values. Is there some clever loop unrolling that completely breaks here or what?

GCC 4.9 -O0

There is nothing really interesting going in the assembly code produced by GCC without optimizations. It looks like what you would expect from this kind of loop for() on the lower level. So the main interest is in optimizations that different compilers do.

GCC 4.9 -O3 and clang 3.4 -O3

GCC 4.9 and clang 3.4 both seem to result in similar machine code. They both optimize the for() loop out and replace the result of it with a (different) constant. GCC 4.9 is just creating code that assigns value 2 to the second parameter for printf() function (see System V Application Binary Interface AMD64 Architecture Processor Supplement section 3.2.3 about parameter passing). Similarly clang 3.4 assigns value 4 to the second parameter for printf().

GCC 4.9 -O3 assembler code for main()

Here we can see how value 2 gets passed as iterations variable:

Dump of assembler code for function main:
5 {
0x00000000004003f0 <+0>: sub $0x8,%rsp

6     int iterations = 0;
7     for (int i = INT_MAX - 2; i > 0; i++) {
8         iterations++;
9     }
10     printf("%d\n", iterations);
0x00000000004003f4 <+4>: mov $0x2,%esi
0x00000000004003f9 <+9>: mov $0x400594,%edi
0x00000000004003fe <+14>: xor %eax,%eax
0x0000000000400400 <+16>: callq 0x4003c0 <printf@plt>

11     return 0;
12 }
0x0000000000400405 <+21>: xor %eax,%eax
0x0000000000400407 <+23>: add $0x8,%rsp
0x000000000040040b <+27>: retq

clang 3.4 -O3 assembler code for main()

Here we can see how value 4 gets passed as iterations variable:

Dump of assembler code for function main:
5 {

6     int iterations = 0;
7     for (int i = INT_MAX - 2; i > 0; i++) {
8         iterations++;
9     }
10     printf("%d\n", iterations);
0x00000000004004e0 <+0>: push %rax
0x00000000004004e1 <+1>: mov $0x400584,%edi
0x00000000004004e6 <+6>: mov $0x4,%esi
0x00000000004004eb <+11>: xor %eax,%eax
0x00000000004004ed <+13>: callq 0x4003b0 <printf@plt>
0x00000000004004f2 <+18>: xor %eax,%eax

11     return 0;
0x00000000004004f4 <+20>: pop %rdx
0x00000000004004f5 <+21>: retq

GCC 4.7 -O3

GCC 4.7 has an interesting result from optimizer that is completely different that could be expected to happen. It basically replaces main() function with just one instruction that does nothing else than jumps at the same address that it resides in and thus creates an infinite loop:

Dump of assembler code for function main:
5 {
0x0000000000400400 <+0>: jmp 0x400400 <main>

Conclusion

This is an excellent example of nasal demons happening when there is undefined behavior ongoing and compilers have free hands to do anything they want. Especially as GCC 4.7 produces really harmful machine code that instead of producing unexpected value as a result, it just makes the program unusable. Other compilers and other architectures could result in different behavior, but let that be something for the future investigation.

Shell script patterns for bash

Sun, 14 Feb 2016 18:00:03 +0200

I write a lot of shell scripts using bash to glue together existing programs and their data. I have noticed that some patterns have emerged that I very often use when writing these scripts nowadays. Let the following script to demonstrate:

#!/bin/bash

set -u  # Exit if undefined variable is used.
set -e  # Exit after first command failure.
set -o pipefail  # Exit if any part of the pipe fails.

ITERATIONS=$1  # Notice that there are no quotes here.
shift
COMMAND=("$@")  # Create arrays from commands.

DIR=$(dirname "$(readlink -f "$0")")  # Get the script directory.
echo "Running '${COMMAND[*]}' in $DIR for $ITERATIONS iterations"

WORKDIR=$(mktemp --directory)
function cleanup()
{
    rm -rf "$WORKDIR"
}
trap cleanup EXIT  # Make sure that we clean up in any situation.

for _ in $(seq 1 "$ITERATIONS"); do
    (  # Avoid the effect of "cd" command from propagating.
        cd "$WORKDIR"
        "${COMMAND[@]}" \  # Execute command from an array.
            | cat
    )
done

Fail fast in case of errors

Lines 3–5 are related to handling erroneous situations. If any of the situations happen that these settings have effect on, they will exit the shell script and prevent it running further. They can be combined to set -ueo pipefail as a shorthand notation.

A very common mistake is to have an undefined variable, by just forgetting to set the value, or mistyping a variable name. set -u at line 3prevents these kind of issues from happening. Other very common situation in shell scripts is that some program is used incorrectly, it fails, and then rest of the shell script just continues execution. This then causes hard to notice failures later on. set -e at line 4 prevents these issues from happening. And because life is not easy in bash, we need to ensure that such program execution failures that are part of pipelines are also caught. This is done by set -o pipefail on line 5.

Clean up after yourself even in unexpected situations

It’s always a good idea to use unique temporary files or directories for anything that can be created during runtime that is not a specific target of the script. This makes sure that even if many instances of the script are executed in parallel, there won’t be any race condition issues. mktemp command on line 14 shows how to create such temporary files.

Creating temporary files and directories poses a problem where they may not be deleted if the script is exited too early or if you forget to do a manual cleanup. cleanup() function shown at lines 15–18 includes code that makes sure that the script cleans up after itself and trap cleanup EXIT makes sure that cleanup() is implicitly called in any situation where this script may exit.

Executing commands

Something that is often seen in shell scripts that when you need to create a variable that has a command with arguments, you then execute the variable as it is ($COMMAND) without using any safety mechanisms against input that can lead into unexpected results in the shell script. Lines 9 COMMAND=("$@") and 24 "${COMMAND[@]}" show a way to assign a command into an array and run such array as command and its parameters. Line 25 then includes a pipe just to demonstrate that set -o pipefail works if you pass a command that fails.

Script directory reference

Sometimes it’s very useful to access other files in the same directory where the executed shell script resides in. Line 11 DIR=$(dirname "$(readlink -f "$0")") basically gives the absolute path to the current shell script directory. The order of readlink and dirname commands can be changed depending if you want to access the directory where the command to execute this script points to or where the script actually resides in. These can be different locations if there is a symbolic link to the script. The order shown above points to the directory where the script is physically located at.

Quotes are not always needed

Normally in bash if you reference variables without enclosing them into quotes, you risk of shell injection attacks. But when assigning variables in bash, you don’t need to have quotes in the assignment as long as you don’t use spaces. The same applies to command substitution (some caveats with variable declarations apply). This style is demonstrated on lines 7 ITERATIONS=$1and 11 DIR=$(...).

Shellcheck your scripts

One cool program that will make you a better shell script writer is ShellCheck. It basically has a list of issues in shell scripts that it detects and can notify you whenever it notices issues that should be fixed. If you are writing shell scripts as part of some version controlled project, you should add an automatic ShellCheck verification step for all shell scripts that you put into your repository.

A succesful Git branching model considered harmful

Sun, 07 Feb 2016 18:00:01 +0200

Update 2018-07. The branching model described here is called trunk based development. I and other people who I collaborated with did not know about the articles that used this name. Nowadays there are excellent web resources about the subject, like trunkbaseddevelopment.com. They have a lot of material about this and other key subjects revolving around the area of efficient software development.

When people start to use git and get introduced to branches and to the ease of branching, they may do couple of Google searches and very often end up on a blog post about A successful Git branching model. The biggest issue with this article is that it comes up as one of the first ones in many git branching related searches when it should serve as a warning how not to use branches in software development.

What is wrong with “A successful Git branching model”?

To put it bluntly, this type of development approach where you use shared remote branches for everything and merge them back as they are is much more complicated than it should be. The basic principle in making usable systems is to have sane defaults. This branching model makes that mistake from the very beginning by not using the master branch for something that a developer who clones the repository would expect it to be used, development.

Using individual (long lived) branches for features also make it harder to ensure that everything works together when changes are merged back together. This is especially pronounced in today’s world where continuous integration should be the default practice of software development regardless how big the project is. By integrating all changes together regularly you’ll avoid big integration issues that waste a lot of time to resolve, especially for bigger projects with hundreds or thousands of developers. This type of development practice where every feature is developed in its own shared remote branch drives the process naturally towards big integration issues instead of avoiding them.

Also in “A successful Git branching model” merge commits are encouraged as the main method for integrating changes. I will explain next why merge commits are bad and what you will lose by using them.

What is wrong with merge commits?

“A successful Git branching model” talks how non-fast-forward merge commits can be thought as a way to keep all commits related to a certain feature nicely in one group. Then if you decide that a feature is not for you, you can just revert that one commit and have the whole feature removed. I would argue that this is a really rare situation that you revert a feature or that you even get it done completely right on the first try.

Merges in git very often create additional commits that begin with the message that looks like following: “Merge branch 'some-branch' of git://git.some.domain/repository/”. That does not provide any value when you want to see what has actually changed. You need go to the commit message and read what happens there, probably in the second paragraph. Not to mention going back in history to the branch and trying to see what happens in that branch.

Having non-linear history also makes git bisect harder to do when issues are only revealed during integration. You may have both of the branches good individually but then the merge commit fails because your changes don’t conflict. This is not even that hard to encounter when one developer changes some internal interface and other developer builds something new based on the old interface definition. These kind of can be easy or hard to figure out, but having the history linear without any merge commits could immediately point out the commit that causes issues.

Something more simple

Figure 1: The cactus model

Let me show a much more simple alternative, that we can call the cactus model. It gets the name from the fact that all branches branch out from a wide trunk (master branch) and never get merged back. Cactus model should reflect much better the way that comes up naturally when working with git and making sure that continuous integration principles are used.

In figure 1 you can see the principle how the cactus branching model works and following sections explain the reasoning behind it. Some principles shown here may need Gerrit or similar integrated code review and repository management system to be fully usable.

All development happens on the master branch

master branch is the default that is checked out after git clone. So why not also have all development also happen there? No need to guess or needlessly document the development branch when it is the default one. This only applies to the central repository that is cloned and kept up to date by everyone. Individual developers are encouraged to use local branches for development but avoid shared remote branches.

Developers should git rebase their changes regularly so that their local branches would follow the latest origin/master. This is to make sure that we do not develop on an outdated baseline.

Using local branches

Cactus model does not to discourage using branches when they are useful. Especially an individual developer should use short lived feature branches in their local repository and integrate them with the origin/master whenever there is something that can be shared with everyone else. Local branches are just to make it more easy to move between features while commits are tested or under code review.

Figures 2 and 3 show the basic principle of local branches and rebases by visualizing a tree state. In figure 2 we have a situation with two active local development branches (fuchsia circles) and one branch that is under code review (blue circles) and ready to be integrated to origin/master (yellow circles). In figure 3 we have updated the origin/master with two new commits (yellow-blue circles) and submitted two commits for code review (blue circle) and consider them to be ready for integration. As branches don’t automatically disappear from the repository, the integrated commits are still in the local repository (gray circles), but hopefully forgotten and ready to be garbage collected.

Figure 2: Local development branches

Figure 3: Local branches after integration and rebase

Shared remote branches

As a main principle, shared remote branches should be avoided. All changes should be made available on origin/master and other developers should build their changes on top of that by continuously updating their working copies. This ensures that we do not end up in integration hell that will happen when many feature branches need to be combined at once.

If you use staged code review system, like Gerrit or github, then you can just git fetch the commit chain and build on top of that. Then git push your changes to your own repository to some specific branch, that you have hopefully rebased on top of the origin/master before pushing.

Releases are branched out from origin/master

Releases get their own tags or branches that are branched out from origin/master. In case we need a hotfix, just add that to the release branch and cherry-pick it to the master branch, if applicable. By using some specific tagging and branching naming scheme should enable for automatic releases but this should be completely invisible to developers in their daily work.

There are only fast-forward merges

git merge is not used. Changes go to origin/master by using git rebase or git cherry-pick. This avoids cluttering the repository with merge commits that do not really provide any real value and avoids christmas tree look on the repository. Rebasing also makes the history linear, so that git bisect is really easy to use for finding regressions.

If you are using Gerrit, you can also use cherry-pick submit strategy. This simply enables putting a collection of commits to origin/master at any desired order instead of having to settle for the order decided when commits were first put for a code review.

Concluding remarks

Git is really cool as a version control system. You can do all kinds of nifty stuff with it really easily that was hard or impossible to do. Branches are just pointers to certain commits and that way you can create a branch really cheaply from anything. Also you can do all kinds of fancy merges and this makes using and mixing branches very easy. But as with all tools, branches should be used appropriately to make it more easy for developers to their daily development tasks, not harder by default.

I have also seen these kind of scary development practices to be used in projects with hundreds of developers when moving to git from some other version control systems. Some organizations even take this as far as they fully remove the master branch from the central repository and create all kinds of questions and obstacles by not having a sane default that is expected by anyone who has used git anywhere else.

Starting a software development blog

Sun, 31 Jan 2016 18:00:00 +0200

Isn’t it a nice way to start a software development related blog by talking about how to start a software development related blog? Or at least it goes one up in the meta blogging level where there is an article about what needs taken care of when creating a software development related blog that starts with how meta this is.

Aren’t there tons of articles regarding this already?

There are tons of articles about how to establish a blog. Some of them are very good at giving generic advice on topics and blogging platform selection, and you should use your favorite search engine to search for those. I’m just going to explain my reasoning for choices I have made when establishing this blog with my current understanding and background research.

Why blog at all?

Why should I blog at all? Blogging takes time and effort and I will probably lose interest in it after some weeks/months/years. The biggest reason for me to start blogging is that I find myself very often explaining things through email discussions, wikis, or chats, sometimes the same things multiple times. Also sometimes I just do benchmarks and other types of digging and comparison, and I would like to have a place to present these findings where I can easily link to.

What I’m not after here is being a part of some specific blogging community. The second thing that I try to avoid is discussing topics that are not software development related, or just share some links with a very short comment on them. Those things I leave for Facebook and Internet relay chat type of communication platforms.

Coming up with ideas

I have collected ideas and drafts that I could write about for almost a year now. And there are over a dozen different smaller or larger article ideas in my backlog. So at least in the beginning I have some material available. The biggest question is that do I find this interesting enough to continue doing and how often do I come up with new ideas that I could blog about.

Deciding the language I write articles in

I have decided to write my blog in English. It’s not my first, or even my second language, so why write a blog in English? The reason for this is that software development topics come up naturally for me in English. Also people who may be interested in these topics basically have to know English to have any competence in this industry and I’m not aiming at complete beginners.

Selecting a blogging platform

I had a couple of requirements for a blogging platform that it should satisfy so it would fit my needs:

Hosted by a third party. I have hosted my own web server and it requires somewhat constant maintenance and I want to avoid that. Especially I don’t want to have constant security issues.
To be able to write articles locally on my computer and then easily export the result to the blog. Also as articles generally take quite a few trials and errors, they should be easy to modify.
Support source code listings and syntax coloring.
Be able to upload files of any type to the platform and link to them. I know that I would like to produce source code, so why not give out larger programs directly as files without having to list all the source code in the article?
- These files should be easy to modify as, as mentioned before, articles take some iterations to be complete.

First I was thinking about using Blogger as my blogging platform, as quite a few blogs that I follow use it as their platform. But I wasn’t convinced that the transfer of files from local machine to Blogger service would be the most convenient, especially with images and other types of files. Then I remembered that some people have pages on GitHub and stumbled upon Jekyll that creates static files and is aimed specifically for this kind of content creation. So that’s what I decided to use for now.

Jekyll does require some work to be set up and probably to maintain, but at least it frees me to do the content in any way I like to. Many of the things that I also need to currently do manually would probably have been taken care of by some plugins on third party blogging services. But that’s the price of doing this in a more customizable way.

Coming up with some basic information

Should I add a personal biography to the blog, or should people who have not met me know me just by the name and the content that I write? I have decided to write a little bit about myself that would hopefully shed a light on who I am. But I have decided not to use the author page as my curriculum vitae, as it would get too long and boring.

Deciding how this blog should look and feel

As a general rule when making personal web pages, I try to make them load fast and avoid images. And with this blog, I just took the default Jekyll theme and started customizing it in certain aspects where I want to have more information visible. As new devices with different resolutions come into market and browsers get updated, I may need to update this to something that better supports the reactive nature of the website that I hope to have.

One interesting direction where web development is going into regarding the pages optimized for mobile devices is to have accelerated mobile pages of the content. Currently I have decided to write as much content as possible in Markdown that is processed by kramdown Markdown processor. It doesn’t natively support accelerated mobile pages yet, so I have opted out from using accelerated mobile pages for now. In the future it would be really nice to have lazy image loading and support for responsive images with different sizes and device support.

Favicon

Favicon is a small image that should make a site stand out from rest when you have multiple tabs open in a browser, bookmarks, and when creating icons of websites/applications to a device or operating system. I had made a favicon for my web site ages ago. I still have the original high resolution version, so now I decided to see what kind of different formats and sizes there are for these icons. Fortunately I didn’t have to look too much, as I found a website called Real Favicon Generator.net that would take care of converting this icon to different formats for me.

I was a little blown away by how many different targets there are for these kinds of icons. The generated file collection included 27 different images and 2 metadata files in addition to code to add to a website header. A current list of them looks like this:

Favicon for website in different sizes and formats.
Safari’s pinned tabs.
Windows 8 and Windows 10 pinned website tiles
Android home screen icon
iOS home screen icon

In a couple of years we probably have new systems and resolutions to support. So I need to regenerate these favicons again to support those new systems.

Responsive images

Some images that I produce are originally made with Inkscape in vector graphics format. Unfortunately the browser support for vector graphics formats is appalling, the next best option is to use responsive images that provide different resolution versions of the image based on the display device capabilities. Every vector graphic image has 6 versions (100%, 125%, 150%, 200%, 300%, 400%, and PDF) of it generated, which can then be used in srcset image attribute which enables devices with higher density screens to get more crisp images. Higher density images are also used in modern browsers when the page is zoomed. PDF versions are just used in clickable links, as PDF files as images don’t really have that good of a browser support yet.

Optimize everything!

When writing content or doing anything with the layout, I naturally want to have them as easy to edit as possible. That also means that if I don’t minify HTML, CSS, images and other often rendered resources on the page, I will lose on page load time. I would like the Jekyll plugins to optimize everything that is produced, but unfortunately only Sass/SCSS assets provide easy minifying options. So page HTML is not minified, even though in the front page’s case it could reduce its size by some percents.

Every PNG image on this blog is optimized by using ZopfliPNG. It produces around 10% smaller images than OptiPNG, that is the standard open source PNG optimizer. ZopfliPNG takes its time, especially on larger images, but as the optimization only needs to be done just before publishing, it’s worth doing, as it can easily save 10–50% in image size. Also if PNG images are not meant to be as high quality representations as possible of something, they have their color space reduced by pngquant if the color space reduction does not result in a significant quality deterioration. This can result in savings excess of 50% while still keeping the image format appropriate for line drawings and other high frequency image data.

Every article has its own identicon and that would result in an extra request if I were to add its as a simple image. Instead I embed it to the page source by using base64 encoded data URI scheme for the around 200 bytes that these page identifier images result in. I could shave off some bytes by selecting between image formats that take the least space to encode the same information. I don’t think that I need to go that far, as I usually have far more text on a page than what images usually take.

Another place that I can optimize is to move all the content that is not required for initial page rendering at the end of the page. This it also possible to load some content asynchronously after the page has initially loaded itself. This mostly means that I have moved all Javascript this page has to the end of the page. Also Google’s PageSpeed Insights gives good hints on what else to optimize and one place to optimize is to either inline or load CSS asynchronously. Unfortunately Jekyll has not made it easy to include the generated CSS file as a part of the document so instead of fighting it, I have opted to have it as a separate file. The second option to load that CSS file asynchronously unfortunately creates an annoying effect when the page is loaded and then the style sheets are applied due to the hamburger button, so I left the main style sheet to be synchronously loaded in the header.

One place to still optimize would be the style sheets that these pages have, but as this layout should be lightweight enough already, I hope that I can survive without having to optimize the things that affect layout rendering.

The front page of this blog has 5 articles with full content visible. I have opted for that option simply because I have noticed that I like reading blogs where I don’t need to open any pages to get to the content. Having less articles per page would make the front page load faster and more articles would help in avoiding page switches, so I hope that this is a decent compromise.

All size optimizations are, in the end, pretty much overwhelmed by just adding Disqus as a commenting platform, as it loads much more data than an average article and also includes dozens of different files that a browser needs to fetch.

Two-way communication

One of the powerful features of any blogging platform is that they enable context related two-way communication between the author and the readers. This means that some blogs have comments. As I’m using static pages, the easiest solution is just to leave the commenting options out. But I decided to opt in for Disqus as a commenting system for now, especially because there is a nice post explaining what you need to do with Jekyll to use Disqus on GitHub pages.

Disqus has its haters and issues, but I think that it’s the best bet for now. If I spend most of my time deleting spammy comments, I will then probably just abandon commenting features altogether and let people comment on Twitter or by using some other means, if they feel that’s necessary and are interested enough.

Linking to myself

As I have some presence on multiple social media sites and some personal project resources, I have decided to include the most current ones in the footer of the page.

Is there actually anyone who reads my blog?

As I have opted to use a third party web hosting service with this, I don’t have access to their logs and therefore can’t really know how many page loads I get. So I take the second option and use full blown Google Analytics service for this blog. It also provides much better data analysis features than any of the web service log analyzers that I have used, so at least by using it I should have some kind of an idea how many people actually visit this blog.

Also using FeedBurner should provide me statistics on how many people have subscribed to my blog.

Article feeds

For me article feeds have two purposes: to provide the subscribers of this blog an easy access to new articles and for me some kind of an idea how many people are regularly reading this blog. Displaying the full article content in the feed is the approach that I like when reading my feed subscriptions, as I can read the whole article in the feed reader. I decided to use a third party feed service, FeedBurner to get feed subscriber statistics. FeedBurner also provides services I can use to further modify my feed, like making it more compatible with different devices and feed readers.

Adding a search

One thing with a website with a lot of content is that it should be possible to find the content. One way is to add a search to the website, but it’s not that easy to do if the website is a collection of static pages. I have now decided to opt out of adding a separate search widget and hope that the article list is enough. If there are any specific articles which interest people, web search engines would bring those up.

Adding share buttons for social media services seems like a no-brainer today. I was already exploring AddToAny service to add such buttons to my site, but then I decided to do a little bit of background research and think about the effects of such buttons.

A quick search returns articles that suggest avoiding social media share buttons (1, 2, 3) and question their actual value for the website. I checked that AddToAny creates 3 extra requests on a page load. Although such requests are probably cached in the desktop browser, mobile browsers don’t necessarily have such luck with caching and they will then suffer quite a large extra delay while loading those resources.

Also the share buttons would needlessly clutter my otherwise minimalistic layout, even if I’m not a graphically talented designer to give a professional opinion about this. And also I have had horrible experiences browsing web on a phone where these kinds of social media sharing buttons actually take real estate on the screen and contribute more to already slow mobile experience.

These things taken into account, I won’t be adding social media sharing buttons here, as their downsides seem to be bigger than the upsides.

Adding metadata

Adding metadata to pages should make it more easy to share these articles around, as different services are able to bring out the relevant content on those pages. Major metadata collections for blog posts used all over the web seem to be Facebook’s Open Graph objects, Twitter Cards, and Rich Snippets for Articles that Google uses. To use all of these, you need to have the following information available:

Information about the author:
- Name
- Web page
- Twitter handle
Information about the organization that produces this content:
- Name
- Logo
- Twitter handle
Information about the article:
- Title
- Publishing date
- Modified date
- Unique per article image, different sizes
- Article URL
- Description/excerpt
- Tags/keywords

Not all of this information is used in all metadata collections, but you need to have that much data available if you want to use all these three services. Fortunately these services provide metadata validators:

Unique article images

The thing that resulted in the most issues was to get an unique per article image done. That might make sense for news articles, but when there are software development related articles about abstract matters, it’s pretty hard to come up with images that do not look out of place or are just completely made up. So I decided to completely make up images that are generated automatically, if there is nothing better specified.

I’m using identicon type of approach by creating such 4 icons from SHA-1 of article’s path (“/2016/01/software-development-blog” for this one). Then these icons are either combined horizontally or to a 2x2 square grid and that way they are used to create article images that are appropriate for different media. It’s a crude trick, but at least with a very high probability it generates a unique per article image that can be visually identified. For example this article has these identifier images:

400x400 square identicon for this article.

800x200 horizontal identicon for this article.

Monetizing blogging

There are various ways how one can monetize blogging, of which ads are usually the most visible one. As I’m just starting and I don’t have any established readership, it would be a waste of my time to add any advertisements on these pages. If I get to 1000 readers monthly, then I will probably add some very lightweight ads, as that way I could use the money for one extra chocolate bar per month.

How timeless is this?

Probably in 5 years, I have lost interest in blogging. Most of the pages that I’m now linking into are gone. Many of the subjects will be outdated either by technology going forwards or development methods going forward. Jekyll will be either in maintenance mode or a completely abandoned project. If I continue blogging, I might even have started using some real blogging service or some other blogging platform.

Conclusion

I have not actively been involved in web development during the last couple of years and this provided me an excellent opportunity to see that what kind of web development related advances and changes there have been. The biggest surprise comes from the metadata usage, and how many different systems expect to have icons in different sizes and formats.

I generally try to put usability and speed before looking fancy. In this blog, however the need for speed is not always satisfied, as having the commenting system hosted outside adds a lot of extra data to transfer and makes these pages more heavy for low power devices.