Thomas Andreas Jung's Blog

Rust - parallel topfew on ARM

2020-06-02T08:20:00.000+02:00

I was trying to improve the Rust implementation for Tim Bray's topfew tool lately. The tool is so small that it is easy to grasp but still interesting enough to study.

You can do this using Unix tools but for fun you can try to do an optimized implementation. The Unix command line implementation is this:

awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -12

Luckily, Dirkjan already did a Rust implementation, so I could use that without translating Tim's implementation into Rust.

The work topfew is doing is:

read lines of a file
regex split the lines into fields
count the most frequent values.
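
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Rust implementation discussed here; the name top_few and its parameters are made up for this example.

```python
import re
from collections import Counter

# Hypothetical helper, not part of topfew: split every line into fields,
# count the values of one field and return the `few` most frequent ones.
SEPARATOR = re.compile(r"[\t ]")

def top_few(lines, field=0, few=12):
    counts = Counter()
    for line in lines:
        fields = SEPARATOR.split(line.rstrip("\n"))
        if field < len(fields):
            counts[fields[field]] += 1
    return counts.most_common(few)

log = [
    "10.0.0.1 GET /index.html\n",
    "10.0.0.2 GET /about.html\n",
    "10.0.0.1 GET /contact.html\n",
]
print(top_few(log, few=2))  # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```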

I could do some improvements for the Rust implementation:

Tim's implementation used \s+, but now the expression is [\t ]. This is faster to match and correct for reading the TSV file format.
There are multiple ways to avoid memory allocation: a) while reading and matching input lines, b) while creating the compound key.
The counting uses a hashmap. The ahash hashmap performs better than the Rust standard library implementation for this case.

My improvement ideas came from looking at memory allocation (and guessing) and from running cargo flamegraph. If you're familiar with sampling profilers, flamegraph is super simple to use on Linux.

Topfew's work consists of two phases: 1) count the values and 2) find the top values. Rayon can run the first phase in parallel over chunks of the input file. The second phase reduces the chunk results into the total count and returns the most frequent values.
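
The shape of the two phases can be sketched in Python. This is an assumption-laden illustration, not the Rayon code: the names count_chunk and parallel_top_few are invented, the chunks are slices of an in-memory list instead of file ranges, and Python threads will not give the speed-up that Rayon gives on many cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(lines):
    # Phase 1: count the first field of every line in one chunk.
    return Counter(line.split(" ", 1)[0] for line in lines)

def parallel_top_few(lines, few=12, chunk_size=2):
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_chunk, chunks))
    # Phase 2: reduce the per-chunk counts into the total count.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total.most_common(few)

lines = ["a x", "b y", "a z", "c w", "a v", "b u"]
print(parallel_top_few(lines, few=2))  # [('a', 3), ('b', 2)]
```

The important property is that the per-chunk counters can be merged in any order, which is what makes the first phase safe to run in parallel.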

Flamegraph still works after introducing parallelism. This is an added benefit of sampling profilers: they don't care how you run your workload as long as they get a stack trace.

I ran the implementation on a 64-core Graviton ARM m6g.16xlarge EC2 instance and a 96-core Intel Xeon r5dn.24xlarge EC2 instance. The performance for both is comparable at roughly 12 GB/sec. This is with a 20 GB input file read from tmpfs. For comparison, that's roughly 5x faster than the fastest available SSD today. A 16-core Graviton machine would be more than capable of saturating an SSD.

The interesting bit is that while the Graviton ARM CPU is $3/hr, the Xeon is $9.5/hr. If you have similar performance for a workload and a 3x price difference, things become interesting. It looks like, given all the competition (server ARM, AMD and GPUs), Intel will be in trouble unless they can innovate quickly.

Leaving Flickr

2019-12-26T08:02:00.002+01:00

It is the time of the year to renew my 2-year Flickr Pro account. The sticker price increased by 100% for long-term Flickr accounts. This is a good time to reflect on whether I want to stay on Flickr.

I’ve been a paying Flickr user for a long time with more than seven thousand uploaded photos. Not very surprisingly, I use the site to discover photos and to share photos with friends and family.

I get the greatest joy from:

a) going through my old photos. Then I’m reminiscing where I was in my life when I took that photo. As public online photos are carefully curated, this is a 99% positive experience.

b) looking at other people’s photos. After leaving South Africa I did this. I’ve seen a lot of historic Cape Town and African wilderness photos. Flickr is a good quiet place to do that, with no additional social network noise and distractions.

c) knowing that I can download my photos in case I lost some or all of them.

There are also negative aspects:

a) There is limited trust with Flickr as a company. It changed hands multiple times. How good is the data protection in the long-term? Somebody will monetize my data in obscene ways in the future.

b) Flickr’s owner will go bankrupt one day. Then I have to leave. Now I can choose whether to leave or not.

c) When I joined Flickr it had an active community. Now most people have left and gone somewhere else. There isn’t much of a network effect. It’s low interaction and traffic.

There are basic questions to answer:

a) Somebody will be the missing marginal Flickr user that pushes the company running Flickr to close shop. If users leave then Flickr dies. This would be sad as Flickr is a de facto photo archive for the Internet. Do I care enough that this archive stays up?

b) Opportunity cost, pretend I’ve exactly these $50 to use. Is this the best way to spend it?

c) Do I have to see this as a pure business transaction? One company bought Flickr from another company. It was sheer luck that Yahoo didn’t close shop already. Companies have their own motivations to sell/run/close a site. Why should I pay money to make this business transaction a success?

My answers:

a) Yes, but I expect the Internet Archive would take over. It makes more sense to donate to the Internet Archive than to a for-profit company.

b) No, there are more interesting things to spend money on. I could buy 10 magazines, 10 movies or x GB of photo storage for the same amount.

c) Yes, this is a business transaction. At the end of the day they’ll do anything to make ends meet for Flickr. Including decisions that will finally destroy Flickr.

The also-good-fallacy

Institutions, including companies, claim that we should support them because they do good. If we follow this thinking then we are the victim of a fallacy. Any large enough organization will do good. This is unavoidable. A large organization makes a lot of decisions. By chance the outcomes of some of those decisions have to overlap with the general good.

The fallacy is to think that this is the reason the institution exists, that this is a significant portion of its actions and that this makes it a rational decision to support the institution. All the other decisions are very selfish and not aligned with the common good. Indeed, to state it differently: “I support an organization because it’s mostly selfish and doesn’t align its actions with the general good.” Changing the point of view to look at all actions shows how ludicrous the argument is.

To be also good is as profound a statement as to exist. As long as anything exists it will do positive and negative deeds.

It’s sad to stop my Flickr subscription. Somehow the optimism of the internet went away. Openness gave way to monetization in walled gardens. At least now I’m going to have my own walled garden where all the weed I want can grow. I’ll also give $50 to Wikipedia in 2020.

You can do basic math in Python, right?

2019-10-14T06:30:00.000+02:00

Let's do a small quiz. I've picked an example of basic math. I would like to demonstrate how limited our knowledge about the types we use is. If you are a floating point expert you can feel good about scoring 100%. The other 99.99% of developers can enjoy the ride.

On the way I point out heuristics I find important to consider while developing code. I think those make success more likely.

def div(a, b):

    return a / b

assert div(1,1) == 1

This is the code I would like to test. Looks pretty simple: we divide a by b. We also have a test.

Test branch coverage says we're done.

All looks good, or did we miss something?

Yes, there is more than 1 float value.

How big is the problem space?

1<<(64 + 64) # I think, haven't read the IEEE standard to give the correct number

Heuristic: Reading documentation is better than not reading documentation.

If we want to tackle a problem we have to know how big it is. If the surface area is small it's easier to know all valid states. For example testing a function that takes a boolean value is easier than a function that takes an integer value.

How bad is that?

from datetime import timedelta

pflop = 1<<50  # comparison: IBM Summit supercomputer, Nov 2018, 200 PF with 9.7 MW

problem = 1<<(64+64)

billion_years_with_1_pflop = problem / (timedelta(days=365).total_seconds() * pflop) / (10**9)

billion_years_with_1_pflop

#Pretty bad

9583696.5659455

Heuristic: Use brute-force if possible.

In this instance we cannot test all values with brute-force. Testing exhaustively should always be the first approach. If this doesn't work we have to use our knowledge of floating point arithmetic to produce software that is good enough for the context it's used in.

Domain Knowledge

If you are doing anything other than scientific computation, float is the incorrect type.

For other applications a decimal type or other types (date, complex, etc.) are appropriate.

Heuristic: Know your domain.

Depending on your domain, you're already doomed.

If you are doing anything other than scientific computations, float is the incorrect type. It won't work for your currency, interest rate and bookkeeping calculations. For those you would use a decimal type.
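
A small sketch of why that matters, using Python's standard decimal module. The amounts are made-up example values.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1, 0.2 or 0.3 exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)   # False
print(float_sum)          # 0.30000000000000004

# Decimal stores base-10 digits, so the same sum is exact.
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(decimal_sum == Decimal("0.30"))   # True
print(decimal_sum)                      # 0.30
```

Note that the Decimal values are created from strings. Creating them from floats would carry the float's rounding error along.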

Now let's start the quiz to see how good our float knowledge is.

This should be true, right?

In the quiz we test the div function from before with the assert_div function. This test succeeds if dividing a by b and then multiplying the result by b gives back a.

def assert_div(a, b):

    assert div(a,b) * b == a

assert_div(1,1)

assert_div(2,3)

This is true for a lot of float values but not all.

Can you come up with 1-2 examples where this is false?

The Quiz

It's not such a hard quiz really. I'll just give you examples and we try to explain why it's not working.

assert_div(1,0) #Let's start simple

---------------------------------------------------------------------------

ZeroDivisionError                         Traceback (most recent call last)

 in ()

----> 1 assert_div(1,0) #Let's start simple

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

 in div(a, b)

      1 def div(a, b):

----> 2     return a / b

      4 assert div(1,1) == 1

ZeroDivisionError: division by zero

This example is simple: division by zero. This is the same as in normal math and not specific to floating point.

Not a number is fun

Floating point introduces NaN (not a number) to represent results that cannot be represented by any other value.

NaN is pretty special. You cannot treat it like any other value. Almost every arithmetic operation with NaN returns NaN.

assert_div(float(1), float("nan"))

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

----> 1 assert_div(float(1), float("nan"))

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

float(1) / float("nan")

nan

Not a number is even funnier

assert_div(float("nan"), 1.0)

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

----> 1 assert_div(float("nan"), 1.0)

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

Given what we saw in the last test, this test looks like it could work. It doesn't because NaN is not equal to itself.

n = float("nan")

(n == n, n is n)

(False, True)

This is actually pretty odd. The Python language reference (https://docs.python.org/3.7/reference/expressions.html) says:

Equality comparison should be reflexive. In other words, identical objects should compare equal: x is y implies x == y

By flipping the arguments to div we get the result NaN. NaN is a weird value: even though both operands are the same object at the same memory address, they do not compare equal.
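
Because of this, a check for NaN cannot use ==. A short sketch of what works instead, using math.isnan from the standard library, plus one place where identity and equality visibly disagree:

```python
import math

nan = float("nan")

print(nan == nan)        # False
print(nan != nan)        # True
print(math.isnan(nan))   # True

# List membership checks identity before equality, so the result depends
# on whether it is the very same NaN object.
values = [1.0, nan, 2.0]
print(nan in values)            # True
print(float("nan") in values)   # False
```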

Infinity is also fun

Infinity and negative infinity represent values larger and smaller than any regular float value.

assert_div(float("inf"), float("inf"))

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

----> 1 assert_div(float("inf"), float("inf"))

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

float("inf")/float("inf")

nan

Losing it

assert_div(3e-5, 7)

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

----> 1 assert_div(3e-5, 7)

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

Here we lose precision, which again leads to the input value not matching the result.

3e-5 / 7 * 7

2.9999999999999997e-05
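
If a small rounding error is acceptable in your context, the comparison can use a tolerance instead of ==. A minimal sketch with math.isclose; the function assert_div_close and the tolerance of 1e-9 are assumptions for this example, not a recommendation for every domain.

```python
import math

def assert_div_close(a, b, rel_tol=1e-9):
    # Hypothetical variant of assert_div that accepts small rounding errors.
    result = a / b * b
    assert math.isclose(result, a, rel_tol=rel_tol), (result, a)

assert_div_close(3e-5, 7)                 # passes
print(3e-5 / 7 * 7 == 3e-5)               # False
print(math.isclose(3e-5 / 7 * 7, 3e-5))   # True
```

This does not help with NaN or infinity. math.isclose returns False whenever a NaN is involved.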

Big and small

import sys

assert_div(sys.float_info.max, sys.float_info.min)

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

      1 import sys

----> 2 assert_div(sys.float_info.max, sys.float_info.min)

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

In this instance the division produces infinity and there's no way to go back from infinity to a.

sys.float_info.max / sys.float_info.min

inf

Small and big

assert_div(sys.float_info.min, sys.float_info.max)

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

----> 1 assert_div(sys.float_info.min, sys.float_info.max)

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

Something similar happens if we flip the input values. The division leads to 0 and we cannot go back to a.

sys.float_info.min / sys.float_info.max

0.0

Precisely wrong

assert_div((1<<64) - 1, 1)

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

 in ()

----> 1 assert_div((1<<64) - 1, 1)

 in assert_div(a, b)

      1 def assert_div(a, b):

----> 2     assert div(a,b) * b == a

AssertionError:

This is a more interesting problem. Float's internal representation cannot represent all 1<<64 integer values in a 64-bit float value. In this example, (1<<64) - 1 converted to a float is the same as 1<<64.

((1<<64) - 1, int(float((1<<64) - 1)))

(18446744073709551615, 18446744073709551616)

_[1]-_[0]
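
The limit is much lower than 64 bits. A 64-bit float has a 53-bit significand, so gaps between representable integers start right after 2**53. A short sketch, assuming the usual IEEE 754 double that CPython uses on common platforms:

```python
# 52 stored bits plus one implicit bit give 53 bits of integer precision.
limit = 2 ** 53

print(float(limit) == limit)           # True
print(float(limit + 1) == limit + 1)   # False: limit + 1 rounds to 2**53
print(int(float(limit + 1)))           # 9007199254740992

# The difference for the value used above.
print(int(float((1 << 64) - 1)) - ((1 << 64) - 1))   # 1
```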

Bonus round of weirdness

As if this wasn't bad enough, there is additional weirdness that is implementation specific. You won't find this by reading the documentation.

Heuristic: Test small values exhaustively. Test with random input.

ma = sys.float_info.max * 1.1

mi = sys.float_info.min / 10

(ma, mi)

(inf, 2.225073858507203e-309)

mi < sys.float_info.min

True

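The upper and lower bounds are not treated the same. You can have a value that is smaller than the minimum returned by float_info, because sys.float_info.min is the smallest positive normalized float and smaller subnormal values exist below it.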
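
How far down this goes can be checked directly. A sketch, assuming IEEE 754 doubles: multiplying the smallest normalized value by the machine epsilon gives the smallest positive subnormal value, and halving that finally underflows to zero.

```python
import sys

smallest_normal = sys.float_info.min
smallest_subnormal = smallest_normal * sys.float_info.epsilon

print(smallest_normal)          # 2.2250738585072014e-308
print(smallest_subnormal)       # 5e-324
print(smallest_subnormal / 2)   # 0.0
print(0 < smallest_subnormal < smallest_normal)   # True
```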

Wrap up

How did you do in the quiz?

If you can use an int, don't use a float. You will avoid a whole class of problems.

Heuristic: Simpler software is better than complex software.

How did you do? I'm pretty sure most developers struggle with this quiz. I'm one of them. We work casually with complexity. We use concepts without thoroughly understanding them. This is true for nearly everything in software as we're building layer over layer of software.

In general this is a good thing, it's the only way we know of how to use the hardware we have. On the other hand, we should be conservative in estimating our software quality. It's probably way worse than we think. The only way out is to make the implementation as simple as possible.

Thinking about edge cases

2019-10-05T09:40:00.005+02:00

Software developers use the term edge case liberally in conversations. I have the impression it's used without much thought and mostly means: "I don't want to care about this now". Even when it is used more thoughtfully, the term edge case comes with some assumptions that do not hold. We are better off without the term.

What does edge case mean? Let's start with the "case" part: it's some state your software can be in. Okay, that was easy. Now to the "edge" part: this is a bit more fuzzy as software generally doesn't come with edges. I think, it means two things: something at the boundary of the input type domain and a thing of low probability.

If you have a 1-dimensional type with a finite range of values [x, z], then you can expect more problems at "the edges", that is close to x and z. An example could be [0, max integer].

This all makes sense. The problem is that there aren't many of those in your software; even 1-dimensional types have more boundaries. Even the lowly type int has a lot of edges to look at: [min int, -1, 0, 1, max int]. Once you convert from int32 to int64 you get the union of the edges of both types, and this is only lowly int.
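
To make the union concrete, here is a small sketch. The helper edges is made up, and adding the first values just outside the int32 range is my own assumption about what a conversion should be tested with.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def edges(lo, hi):
    # Hypothetical helper: the boundary values listed above for one int type.
    return {lo, -1, 0, 1, hi}

int32_edges = edges(INT32_MIN, INT32_MAX)
int64_edges = edges(INT64_MIN, INT64_MAX)

# A conversion has to deal with the edges of both types, plus (assumption)
# the first values that no longer fit into an int32.
combined = int32_edges | int64_edges | {INT32_MIN - 1, INT32_MAX + 1}
print(len(int32_edges), len(int64_edges), len(combined))   # 5 5 9
```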

If you have a decimal type then edges are harder to define: there are no min/max values (for most implementations) but you get CPU caching related "edges". Your number could be spread over multiple pages. Functionally this is the same behavior but now you could run into performance problems.

This again was only for lowly integer types. The next level of complexity is floating point types: resolution of numbers, min, max, NAN, INF. A whole new set of problems.

Okay, what's the meaning of all this? Typically, you don't know the types you’re programming with well enough. So you also cannot know the interesting cases.

The probability meaning is even trickier. In the general case you don't know the input value distribution so you have to assume that all values are used. This is the safe and easy assumption. Pragmatically, you can assume that some values aren't actually used in your context as you also control the caller. In this case you can redefine the domain of input values. This allows you to lower your software development cost.

What doesn't hold is that "edge cases" are different and allow anything other than to answer: “This is either correct or incorrect”. This would mean we also allow undefined results which are either undefined correct or undefined incorrect.

What is your caller supposed to do if the result is undefined and the range of input values that produces undefined results is also undefined? Accept occasional disaster. If you want to avoid that, you have to avoid undefined behavior.

Okay, what can I do?

Step 0: Is your software important enough? Correctness won't come for free. Define how much you want to invest into correctness. If the cost of an invalid result is lower than a correct implementation stop.

Step 1: Define the allowed input values. If you do not give an answer for a given input you cannot make an error.

Step 2a: As a testing strategy: start with what you know and write explicit tests.

If you're lucky you know your types really well. This is halfway realistic for integers, but I wouldn't bet that I can write correct code for floats, strings or timestamps.

Step 2b: For all other types try to test exhaustively. Drop the concept of an edge case. If I don't know the type I'm not able to define the interesting cases. Look into something like SmallCheck for exhaustively testing "small" values and QuickCheck and fuzzing for randomized testing.

Step 3: Never use the word edge case again.
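
The testing strategy from step 2b can be sketched with the standard library alone. This is not SmallCheck or QuickCheck, only the same idea in a few lines; the property roundtrips, the ranges and the seed are assumptions chosen for the example.

```python
import itertools
import random

def roundtrips(a, b):
    # Property under test: dividing by b and multiplying by b gives back a.
    return a / b * b == a

# Exhaustive: every pair from a small domain, in the spirit of SmallCheck.
small = range(-50, 51)
small_failures = [(a, b) for a, b in itertools.product(small, small)
                  if b != 0 and not roundtrips(a, b)]

# Random: sample a much larger domain with a fixed seed, like QuickCheck.
rng = random.Random(0)
random_failures = []
for _ in range(10_000):
    a, b = rng.uniform(-1e6, 1e6), rng.uniform(-1e6, 1e6)
    if b != 0 and not roundtrips(a, b):
        random_failures.append((a, b))

print(len(small_failures), len(random_failures))
print((1, 49) in small_failures)   # True: 1 / 49 * 49 is not 1
```

Neither loop needed any idea of which inputs are "interesting". The failing pairs are the ones to look at and learn from.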

The positive result of this humble world view is that you'll learn. As an exercise you can write a simple function that works with 2 float values, 2 dates with different time zones or a string. Look at all the problems once you start testing values you haven't thought about. The floating point, Unicode and calendar implementations have enough juice to make your life interesting. If you dare to look.

Career advice

2017-05-09T20:10:00.001+02:00

I got an unsolicited email asking for career advice. Today is one of these days to help somebody or an email bot. Here’s what I wrote:

Hi P,

I think you need two basic skills:

1. You have to be a generalist. Learn enough to be able to solve a problem (a mobile app, a website, a weather station, whatever) end to end. You don't have to understand everything and cutting corners is fine to get to a working solution. The alternative is that you try to understand everything and that is just too much.

2. The other thing is to be an expert in one thing. Try to learn exhaustively about this one small little thing. Read all the books, articles, source code, marketing material, studies, etc. you can find about it. I think it's a valuable experience to try to learn everything about a small corner of the world. All things have more depth, subtlety and history once you start looking. You'll learn to see the trade-offs. The decisions that worked and those that didn't.

You should remember that you're in a long game. Progress is measured in months and years. Hope that helps.

Cheers

Thomas

How to untangle a commit

2016-05-01T11:10:00.001+02:00

To untangle a commit you have to incrementally 1) commit, 2) stash save, 3) test, 4) commit fix (optional), 5) revert fix (optional) and 6) stash apply.

Every now and then you have a commit that is 100% working and following your quality standards, but it is simply too big to be consumed in one code review. What you need to do is to break it up into multiple commits.

You can do just that with git in a safe way and with reasonable effort using some git yoga. The process is pretty much like swapping a carpet with a sofa on it.

Let's start a session with three files a, b and c.

$ cd $(mktemp -d)
$ git init
Initialized empty Git repository in /tmp/tmp.a1UbQEZsn2/.git/
$ git checkout -b initial
Switched to a new branch 'initial'
$ git commit --allow-empty -m"Initial" 
$ git checkout -b big_one
$ echo "a" > a
$ echo "b" > b
$ echo "c" > c
$ git add a b c
$ git commit -m"the big one"
[big_one 35beda3] the big one
 3 files changed, 3 insertions(+)
 create mode 100644 a
 create mode 100644 b
 create mode 100644 c

We added a, b and c. This is the state we consider too large for one code review. Let's go back to the initial state with files a, b and c unstaged:

$ git checkout initial
$ git cherry-pick --no-commit big_one
$ git reset HEAD
$ git status
On branch initial
Untracked files:
  (use "git add file..." to include in what will be committed)

 a
 b
 c

We want to add a, b and c in three separate commits we can send out as three independent code reviews.

The overall process is simple for the happy case. Add a file and commit it. We use stash to clean the workspace. Then we check if the latest commit was successful. The example uses bash conditions as a check. This feedback would normally come from your test suite.

$ git checkout -b a
Switched to a new branch 'a'
$ git add a
$ git commit -m"a"
[a b73878a] a
 1 file changed, 1 insertion(+)
 create mode 100644 a
$ git stash save -u stash_b_c
Saved working directory and index state On a: stash_b_c
HEAD is now at b73878a a
$ [ -f a ] && [ ! -f b ] && [ ! -f c ] && echo Well done # we run a test
Well done


$ git stash apply
On branch a
Untracked files:
  (use "git add file..." to include in what will be committed)

 b
 c
$ git checkout -b b
Switched to a new branch 'b'
$ git add b
$ git commit -m"b"
[b 64fac68] b
 1 file changed, 1 insertion(+)
 create mode 100644 b
$ git stash save -u stash_c
Saved working directory and index state On b: stash_c
HEAD is now at 64fac68 b
$ [ -f a ] && [ -f b ] && [ ! -f c ] && echo Well done # we run a test
Well done

$ git stash apply
On branch b
Untracked files:
  (use "git add file..." to include in what will be committed)
 c
$ git checkout -b c
Switched to a new branch 'c'
$ git add c
$ git commit -m"c"
[c 2f66c1b] c
 1 file changed, 1 insertion(+)
 create mode 100644 c
$ [ -f a ] && [ -f b ] && [ -f c ] && echo Well done # we run a test
Well done

When we diff with big_one they have the same content:

$ git diff --exit-code big_one && echo Passed
Passed

The stash is the trail of changes we made, when we "removed" files from the working directory temporarily.

$ git stash list
stash@{0}: On b: stash_c
stash@{1}: On a: stash_b_c

For now we only used stash to clean the working directory to be able to commit and to restore those files after commit. By using git stash apply we keep the trail of changes.

If you're trying this with real code it won't be so simple. Things will go wrong. We forgot to add files to a commit and our tests might start failing. We have to be able to backtrack and restore previous state. We never want to lose work.

Using stash we can do all this safely and methodically. Let's go back to the initial state with the big_one branch containing all changes and the files a, b, and c as untracked changes.

$ cd $(mktemp -d)
$ git init
$ git checkout -b initial
$ git commit --allow-empty -m"Initial" 
$ git checkout -b big_one
$ echo "a" > a
$ echo "b" > b
$ echo "c" > c
$ git add a b c
$ git commit -m"the big one"
$ git checkout initial
$ git cherry-pick --no-commit big_one
$ git reset HEAD

$ git status 
On branch initial
Untracked files:
  (use "git add file..." to include in what will be committed)

 a
 b
 c

In this session we make the error of adding a and b in the first commit, when we actually only should add a.

$ git checkout -b a
Switched to a new branch 'a'
$ git add a 
$ git add b                                  # b - problem to fix 
$ git commit -m"Adding a (actually also b)"
[a c604de5] Adding a (actually also b)
 2 files changed, 2 insertions(+)
 create mode 100644 a
 create mode 100644 b
$ git stash save -u stash_c
Saved working directory and index state On a: stash_c
HEAD is now at 5d031e6 Adding a (actually also b)
$ [ -f a ] && [ ! -f b ] && [ ! -f c ] && echo Well done || echo Error # we run a test
Error

The test fails now. We have to remove b, then the test passes.

$ git rm b                                               # remove b
$ git commit -m"Remove b"
[a 043f3a7] Remove b
 1 file changed, 1 deletion(-)
 delete mode 100644 b
$ [ -f a ] && [ ! -f b ] && [ ! -f c ] && echo Well done || echo Error # we run a test
Well done

The branch a has the right state now - only the file a, but the stashed change only contains c. We lost b.

How can we recover from this state if the desired state is a subset of changes from the branch and the last stashed changes? We need some git yoga to get where we want.

We added a and b in the first commit and removed b in the second commit. Together both commits leave only a and pass the test. This is the valid state we want for branch a.

The stash contains only c, so we have to restore b. We do this by reverting to the state before we removed b. This is the state of the working directory when the change was stashed. Then we apply the latest change from the stash, which is c. The result is that we have a, b and c ((a + b) - b + b + c = a + b + c) in the working directory. The file a is committed. The files b and c are uncommitted changes.

 
$ git revert --no-commit HEAD                            # restore b
$ git stash apply                                        # b & c
On branch a

Changes to be committed:
  (use "git reset HEAD file..." to unstage)

 new file:   b

Untracked files:
  (use "git add file..." to include in what will be committed)

 c

When we look at the history of branch a, it's not exactly the result we wanted. The branch a now contains two commits. We squash those to one commit.

$ git stash
Saved working directory and index state WIP on a: 043f3a7 Remove b
HEAD is now at 043f3a7 Remove b
$ git reset --soft HEAD~
$ git commit --amend -m"Add a for real"
[a bd012df] Add a for real
 1 file changed, 1 insertion(+)
 create mode 100644 a
$ git stash apply
On branch a
Changes to be committed:
  (use "git reset HEAD file..." to unstage)

 new file:   b

Untracked files:
  (use "git add file..." to include in what will be committed)

 c

The squashing is optional and can be done later using an interactive rebase (git rebase -i initial) for the branch a. From here we can commit b and c separately just as in the happy case before. This gives us the same result:

$ git stash apply
$ git checkout -b b
$ git add b
$ git commit -m"b"
$ git stash save -u stash_c
$ [ -f a ] && [ -f b ] && [ ! -f c ] && echo Well done # we run a test
Well done

$ git stash apply
$ git checkout -b c
$ git add c
$ git commit -m"c"
$ [ -f a ] && [ -f b ] && [ -f c ] && echo Well done # we run a test
Well done

$ git diff --exit-code big_one && echo Passed
Passed

With stash we can keep safe-points of our changes, clean and restore the working directory. We also use it together with git revert to undo changes.

This process is also applicable to save results from explorative coding. If I hack together a solution end-to-end to see that it actually works, I can keep those results. I then rebuild it cleanly from the ground up and at the same time make sure that the overall end-to-end solution works. This gives you the quick feedback of the exploration, end-to-end testing and the benefit of actually building it incrementally with the quality you want. It's the best of both worlds.

In a real world scenario the untangling isn't that rigid with strictly adding a, b and c in single commits, such that the sum of all incremental commits is strictly the same as the big initial commit. Most of the time you will see small problems on the way that you want to address in the process. It's strictly speaking less safe because now diffing isn't a simple binary result any more, but it allows a more fluid workflow. If you're uncomfortable with that you can also split the results in n commits first, keep notes of the necessary changes, and add the fixes in separate commits later (on the branches a, b and c).

In the given session the result is the branches initial, a, b, c that build on each other in this order. We can still improve on that. If you observe that a, b, c are actually orthogonal then you can rebase b, c on the initial branch. This makes the review work parallelizable. Not all features are orthogonal. If you have two different ways to cut a commit you should choose the one that makes the changes as independent as possible. This also makes rebasing and merging easier.

$ git log --pretty=oneline --patch a --
bd012df01c7889182c6295725faf942f90fa251e Add a for real
diff --git a/a b/a
new file mode 100644
index 0000000..7898192
--- /dev/null
+++ b/a
@@ -0,0 +1 @@
+a
7a358aa4a6c6d4cb52cc403c3cc7d3b59c208c70 Initial

$ git checkout initial
$ git checkout -b b_orthogonal
$ git cherry-pick b
$ git log --pretty=oneline --patch b_orthogonal
006022af90003f57bde0ad2700201aee73163b95 b
diff --git a/b b/b
new file mode 100644
index 0000000..6178079
--- /dev/null
+++ b/b
@@ -0,0 +1 @@
+b
7a358aa4a6c6d4cb52cc403c3cc7d3b59c208c70 Initial

$ git checkout initial
$ git checkout -b c_orthogonal
$ git cherry-pick c

$ git log --pretty=oneline --patch c_orthogonal
43a58337baf1e337f0ff1fae72c2c5b955874027 c
diff --git a/c b/c
new file mode 100644
index 0000000..f2ad6c7
--- /dev/null
+++ b/c
@@ -0,0 +1 @@
+c
7a358aa4a6c6d4cb52cc403c3cc7d3b59c208c70 Initial

The overall process isn't trivial, but once you know the necessary bits of git (stash, revert, checkout, commit, branch) it should be fairly easy to remember that the only steps necessary are to 1) commit, 2) stash save, 3) test, 4) commit fix (optional), 5) revert fix (optional) and 6) stash apply. If you can keep that in your head you know one solution to this problem.

Revert 2016 edition

2016-04-30T08:20:00.002+02:00

Five years ago I wrote about how reverting can speed you up. The main argument is that trying to go from a new broken state to a new good state can be incredibly hard. The problem is that you did too many steps and you don't know which of the steps actually broke your system.

My experience is that it is hard for people to let go and try a new start. I always see more of a positive edge. You still know all the things you learnt in the process. Making smaller steps will give you predictable progress. Carrying on might lead to nothing. It is high risk.

I'm reverting more on the enjoyable work days. I did something where I didn't know yet how it would work out. I learnt something: it doesn't work this way and the problem is a bit more challenging than I thought initially. Failures are the most interesting bits of information. I remember my failures way more vividly than the stuff that actually worked. Digesting failures fully gives you lots of information and an opportunity to grow.

I don't like to arrive on a development battle field with corpses all over the place. There are twenty different changes you made from the last good state - all potentially breaking - and now I should find the one line change that broke it all. I want to see how we got from the pristine state of goodness to this. This is the safe approach. Every developer should be able to minimize the breaking change. It's fine to not know the solution for this minimal problem. This is where the investigation starts.

The easiest way to go from a broken state to a good state is to undo all your changes. Go to the known good state and make smaller steps from there.

With git this all became cheaper and safer. The main tools are git add and git stash.

If you're making progress and everything works like it should you can do git add -A between all good states to stage them. Using staged changes is a lightweight way to save your work. It's the right tool if the change is not big enough to be worth a commit.

In the example I'll use bash conditional expressions as a stand-in for a test suite. They test the presence or absence of a file, or its content.

We start with an initial commit that we know works:

$ cd $(mktemp -d)
$ git init
Initialized empty Git repository
$ echo a > a
$ [ -f a ] && echo works || echo broken # run test
works
$ git add -A
$ git commit -m'working'
[master (root-commit) 263f908] working
 1 file changed, 1 insertion(+)
 create mode 100644 a

Now we can change the source, run our test and use add to stage the change.

$ echo -n b > a
$ [ "$(cat a)" == "b" ] && echo works || echo broken #run test
works
$ git add -A

Then we can do another change, find out that it was breaking the test and revert back by checking out the change we staged.

$ echo -n c > a
$ [ "$(cat a)" == "b" ] && echo works || echo broken #run test
broken
$ git checkout a
$ [ "$(cat a)" == "b" ] && echo works || echo broken #run test
works

We can repeat git add until we are ready to commit. This process adds minimum overhead.

Using git stash we make safe points on the way. This will help recover if something went wrong in the process, for example if we recovered to the wrong intermediate state.

The same session with some safe points added.

$ cd $(mktemp -d)
$ git init
$ echo a > a
$ [ -f a ] && echo works || echo broken # run test
$ git add -A
$ git commit -m'working'

$ echo -n b > a
$ [ "$(cat a)" == "b" ] && echo works || echo broken #run test
works
$ git add -A

$ echo -n c > a
$ [ "$(cat a)" == "b" ] && echo works || echo broken #run test
broken
$ git stash
Saved working directory and index state WIP on master: df9056f working
HEAD is now at df9056f working

$ # Oh, no the test was actually wrong!
$ [ "$(cat a)" == "c" ] && echo works || echo broken #run test
broken

$ git stash apply
On branch master
Changes not staged for commit:
  (use "git add file..." to update what will be committed)
  (use "git checkout -- file..." to discard changes in working directory)

 modified:   a

no changes added to commit (use "git add" and/or "git commit -a")

$ [ "$(cat a)" == "c" ] && echo works || echo broken #run test
works
$ git add -A
$ git commit -m"working again"
[master 3faee04] working again
 1 file changed, 1 insertion(+), 1 deletion(-)

I'd advise against using stash pop. Apply leaves the safe points in place in case you applied the stash incorrectly. You can always recover from all your safe points starting from the last commit and you will never lose work. They can be safely discarded once you have reached a git commit.
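
The difference between apply and pop can be pictured with a toy model. This is only an illustration in Python and not how git works internally: the ToyRepo class is made up for this example, the "working copy" is a single string, and the index and merge conflicts are ignored.

```python
class ToyRepo:
    """Toy model: one file's content, the last commit and a list of stashes."""

    def __init__(self, committed):
        self.head = committed      # content of the last commit
        self.working = committed   # content of the working copy
        self.stashes = []          # newest stash first, like stash@{0}

    def stash(self):
        # Save the working copy as a safe point and go back to the last commit.
        self.stashes.insert(0, self.working)
        self.working = self.head

    def stash_apply(self, index=0):
        # Restore a safe point but keep it in the list.
        self.working = self.stashes[index]

    def stash_pop(self):
        # Restore the newest safe point and drop it from the list.
        self.working = self.stashes.pop(0)


repo = ToyRepo("a")
repo.working = "c"
repo.stash()          # working copy is "a" again, "c" is kept as a safe point
repo.stash_apply()    # working copy is "c", the safe point is still available
```

After stash_apply the list of safe points is unchanged, so a wrong apply can be repeated. After stash_pop the safe point is gone.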

A process with more overhead but similar results is to use git add, git commit, git revert and git rebase -i. You commit regularly between states and you revert bad states. Finally you squash with git rebase -i to have a clean history. Depending on your preference this process or stash is the better choice. Try both and you'll see which one is appropriate for your situation.

Coming back to the old post about reverting, git stash is the cheapest way to clean a workspace. You can fire away git stash left and right. If you were wrong you can always go back and scavenge the bits of the changes that were important after all. Given these tools the overall process got massively better.

Speed-up your project build without the tmpfs hassle

2013-05-27T21:19:00.000+02:00

Running all tests in my current project takes some time. With 26 GB of free memory, why not use it for something useful? tmpfs is one way to speed up the test execution by keeping a complete file system in memory.

The problem with tmpfs is that it's only kept in memory. You have to set up the scripts yourself to flush content back to disk. These scripts had better work perfectly, otherwise you'll lose parts of your work.

A common approach is to work directly in a tmpfs folder and back up your work to a folder on disk. When your machine is booting you restore the tmpfs folder from this backup folder. Once booted, cron is used to sync the tmpfs folder and the disk folder.

I found this setup a bit complicated and error prone. I never really trusted my own setup on boot time and with cron. Now I use a much simpler setup that is not using cron at all.

The performance on my machine running a single test, using the IDE and deploying in a web server was always reasonable. Just running all tests takes too much time.

The sweet spot I found is to set up a workspace on disk, sync to tmpfs under /dev/shm and run all tests there. This keeps my setup more or less unchanged and removes the possibility of losing work just because I'm too dumb to set things up correctly.

The resulting performance increase is reasonable:

$ nosetests && run_tests.py
........................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 248 tests in 107.070s

OK
........................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 248 tests in 19.423s

OK

It's now roughly 5.5 times faster than before.

With Python the setup is quite simple:

#!/bin/bash -e

WORK=src/py
LOG=$(pwd)/test.log
TARGET=$(hg root)
SHADOW=/dev/shm/shadow/$TARGET

date > $LOG
mkdir -p $SHADOW

cd $SHADOW
rsync --update --delete --exclude=".*" --exclude=ENV --archive $TARGET ./..

if [ ! -d ENV ]
then
    virtualenv ENV
fi
. ENV/bin/activate

cd $WORK
python setup.py develop >> $LOG
nosetests "$@" | tee -a $LOG
exit ${PIPESTATUS[0]}

I'm just resyncing into a /dev/shm folder, setting up the test environment there (virtualenv and python setup.py) and running the tests (nosetests).

It's still possible to run single tests from the command line on the tmpfs folder. It's also possible to kick this off from your IDE but you'll lose your test runner and debugging capabilities. As I said earlier I don't need these right now.

I hope my little twist with tmpfs helps with setting up a faster development environment without all the scripting hassle.

How to use the Mercurial Rebase extension to collapse and move change sets

2013-05-01T12:19:00.000+02:00

Every now and then I like to combine several change sets into one change set. You can do this with the rebase extension. I will show you how rebase works with some examples.

The other major use case for the rebase extension is to keep the history of your Mercurial repository linear when working with a team of developers. The extension gets its name from this use case: changing the parent change set of your changes to another change set (often the tip). As long as every developer uses pull, rebase, push the history is free of merge change sets.

I set up a repository with three change sets.

hg init repo
cd repo

echo a > a
hg commit -A -m"add a"
ID_A=$(hg id -n)

echo b > b
hg commit -A -m"add b"
ID_B=$(hg id -n)

echo c > c
hg commit -A -m"add c"
ID_C=$(hg id -n)

hg version | head -n 1 ; echo
hg glog --patch


Mercurial Distributed SCM (version 2.5.4)

@  changeset:   2:aa77f078beed
|  tag:         tip
|  user:        blob79
|  date:        Wed May 01 12:14:27 2013 +0200
|  summary:     add c
|
|  diff -r a692ab61aad8 -r aa77f078beed c
|  --- /dev/null        Thu Jan 01 00:00:00 1970 +0000
|  +++ b/c      Wed May 01 12:14:27 2013 +0200
|  @@ -0,0 +1,1 @@
|  +c
|
o  changeset:   1:a692ab61aad8
|  user:        blob79
|  date:        Wed May 01 12:14:27 2013 +0200
|  summary:     add b
|
|  diff -r e31426a230b0 -r a692ab61aad8 b
|  --- /dev/null        Thu Jan 01 00:00:00 1970 +0000
|  +++ b/b      Wed May 01 12:14:27 2013 +0200
|  @@ -0,0 +1,1 @@
|  +b
|
o  changeset:   0:e31426a230b0
   user:        blob79
   date:        Wed May 01 12:14:27 2013 +0200
   summary:     add a

   diff -r 000000000000 -r e31426a230b0 a
   --- /dev/null        Thu Jan 01 00:00:00 1970 +0000
   +++ b/a      Wed May 01 12:14:27 2013 +0200
   @@ -0,0 +1,1 @@
   +a

After collapsing the change sets $ID_B and $ID_C a new change set is in the history containing the changes of both. This change set can then be pulled or imported as one unit, making the history of the repository a bit easier to read.

hg rebase --collapse --source $ID_B --dest $ID_A -m"collapse change sets"
hg glog --patch


saved backup bundle to /home/thomas/Desktop/rebaseblob/repo/.hg/strip-backup/a692ab61aad8-backup.hg
@  changeset:   1:bb2e0cde315f
|  tag:         tip
|  user:        blob79
|  date:        Wed May 01 12:14:28 2013 +0200
|  summary:     collapse change sets
|
|  diff -r e31426a230b0 -r bb2e0cde315f b
|  --- /dev/null        Thu Jan 01 00:00:00 1970 +0000
|  +++ b/b      Wed May 01 12:14:28 2013 +0200
|  @@ -0,0 +1,1 @@
|  +b
|  diff -r e31426a230b0 -r bb2e0cde315f c
|  --- /dev/null        Thu Jan 01 00:00:00 1970 +0000
|  +++ b/c      Wed May 01 12:14:28 2013 +0200
|  @@ -0,0 +1,1 @@
|  +c
|
o  changeset:   0:e31426a230b0
   user:        blob79
   date:        Wed May 01 12:14:27 2013 +0200
   summary:     add a

   diff -r 000000000000 -r e31426a230b0 a
   --- /dev/null        Thu Jan 01 00:00:00 1970 +0000
   +++ b/a      Wed May 01 12:14:27 2013 +0200
   @@ -0,0 +1,1 @@
   +a

If you have to keep the change sets because they were already published, you can also keep the old change sets adding the new collapsed change set as a separate branch.

hg rebase --keep --collapse --source $ID_B --dest $ID_A -m"collapse change sets"
hg glog


@  changeset:   3:0fa31b1ebcf1
|  tag:         tip
|  parent:      0:448200734313
|  user:        blob79
|  date:        Wed May 01 12:14:28 2013 +0200
|  summary:     collapse change sets
|
| o  changeset:   2:0a6400bebe21
| |  user:        blob79
| |  date:        Wed May 01 12:14:28 2013 +0200
| |  summary:     add c
| |
| o  changeset:   1:a93228700dc1
|/   user:        blob79
|    date:        Wed May 01 12:14:28 2013 +0200
|    summary:     add b
|
o  changeset:   0:448200734313
   user:        blob79
   date:        Wed May 01 12:14:28 2013 +0200
   summary:     add a

hg log -r "::tip" --patch


changeset:   0:448200734313
user:        blob79
date:        Wed May 01 12:14:28 2013 +0200
summary:     add a

diff -r 000000000000 -r 448200734313 a
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/a Wed May 01 12:14:28 2013 +0200
@@ -0,0 +1,1 @@
+a

changeset:   3:0fa31b1ebcf1
tag:         tip
parent:      0:448200734313
user:        blob79
date:        Wed May 01 12:14:28 2013 +0200
summary:     collapse change sets

diff -r 448200734313 -r 0fa31b1ebcf1 b
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/b Wed May 01 12:14:28 2013 +0200
@@ -0,0 +1,1 @@
+b
diff -r 448200734313 -r 0fa31b1ebcf1 c
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/c Wed May 01 12:14:28 2013 +0200
@@ -0,0 +1,1 @@
+c

As I already mentioned you can use the rebase extension to move change sets around. A small exercise is to change the order of two change sets. With the same setup of three change sets as in the collapse example before, we move the changes of change set $ID_C before the changes made in change set $ID_B.

hg rebase --source $ID_C --dest $ID_A
hg rebase --source $ID_B --dest tip


@  changeset:   2:673889b80c51
|  tag:         tip
|  user:        blob79
|  date:        Wed May 01 12:14:29 2013 +0200
|  files:       b
|  description:
|  add b
|
|
o  changeset:   1:175e0515be8b
|  user:        blob79
|  date:        Wed May 01 12:14:29 2013 +0200
|  files:       c
|  description:
|  add c
|
|
o  changeset:   0:16ef477c54cb
   user:        blob79
   date:        Wed May 01 12:14:29 2013 +0200
   files:       a
   description:
   add a

A nice exercise for the reader is to extend the example with a fourth change set. This change set is the tip of the repository as a child of change set $ID_C. After the change set move it should be the child of the change set $ID_B.

In real world repositories changes are not fully independent, so while rebasing you have to resolve conflicts, but this is another blog post for another day.

Mini-Quickcheck for Python

2013-04-23T20:10:00.001+02:00

I wanted an implementation of a mini-Quickcheck in Python. This is the API I came up with. It is also a good way to see what’s at the heart of Quickcheck: generators.

I cut every corner I could. Some methods are not random, but this can be easily fixed.

There is a runner decorator (dependent on the decorator library) that runs test methods repeatedly.

import datetime
import random

from decorator import decorator

def oneof(*values):
    return random.choice(values)

def optional(param):
    return oneof(None, param)

def boolean():
    return oneof(True, False)

def integer(min=0,max=1<<30):
    return random.randint(min,max)

def char():
    return chr(integer(min=2,max=ord('z')))

def string(min=0):
    return "".join([char() for _ in xrange(integer(min=min, max=10))])

def nonempty_string():
    return string(min=1)

def substring(string):
    if not string:
        return string
    start = integer(0, len(string) - 1)
    # the length is at most what is left of the string after start
    end = start + integer(max=len(string) - start)
    return string[start:end]

def date():
    return datetime.date.today()

def subset(*vs):
    return [e for e in vs if boolean()]

def list(gen):
    return [gen() for _ in xrange(integer(max=5))]
    
@decorator
def runner(func, *args, **kwargs):
    for r in xrange(4):
        test_instance = args[0]
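
To show how generators and a runner play together, here is a minimal self-contained sketch in Python 3. It is not the code from above: the runner is a made-up replacement built on functools.wraps instead of the decorator library, the runs parameter is an assumption, and only a few generators are repeated (char is restricted to lower case letters here).

```python
import functools
import random


def integer(min=0, max=1 << 30):
    return random.randint(min, max)


def char():
    return chr(integer(min=ord("a"), max=ord("z")))


def string(min=0, max=10):
    return "".join(char() for _ in range(integer(min=min, max=max)))


def substring(s):
    if not s:
        return s
    start = integer(0, len(s) - 1)
    end = start + integer(max=len(s) - start)
    return s[start:end]


def runner(runs=100):
    """Hypothetical runner: call the wrapped test `runs` times."""
    def decorate(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            for _ in range(runs):
                test(*args, **kwargs)
        return wrapper
    return decorate


@runner(runs=200)
def test_substring_is_contained():
    # Property: every generated substring is part of the original string.
    s = string()
    assert substring(s) in s


test_substring_is_contained()
```

The test body reads like a test for a single value. The runner turns it into a test over many generated values.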

About

2013-04-23T20:09:00.000+02:00

My blog did not have an about page for years. Now that blogs are going out of fashion I can get one as well, as the chances that someone actually reads it diminish.

Having an about page is nice for everybody involved. It adds a nice personal touch to a site full of otherwise dry articles. This is also the space where a blogger can do some navel-gazing and bragging: a nice photo from a place I have been and you have not, I can tell you what a great company I work for and what an amazingly smart guy I am.

The problem is: I like to stay at home most of the time, I work for a normal company and I'm definitely not smart. (My best trait is productive laziness.)

If you read this far the one thing you should consider is becoming an organ donor. You won't mind. You may help somebody in a lot of trouble. At least I tried! Maybe you like to read a bit about organ donation...

(The fact that I wrote this after reading an about page is a mere coincidence. You should not ask somebody who they are. It's probably the worst question possible.)

Analogous function testing

2012-12-19T18:21:00.000+01:00

For a long time I wanted to write about a testing technique I call analogous function testing. Some time ago I already wrote about the inverse function test pattern and the analogous function testing pattern is mentioned in my piece about functional hash functions.

Instead of describing some synthetic examples I can now present the pattern with a real world example. I wrote tests for a Java port of a small C library. The phonet library returns the same string for words with the same pronunciation in German. In this way it’s similar to the Soundex and Metaphone algorithms in English.

The test, I’m going to present, found bugs in the Java port. More surprisingly it also found bugs in the C library (that’s more than 10 years old) and the JNI-bridge code we used for years.

If you have a look at the C code you’ll find out that it is not trivial. Well, that’s the most polite way to say that it’s devilishly hard to read. It’s beyond my means to read code that initializes 10 static variables and has 26 automatic variables in the main function. It’s not obvious how you can port this code to Java. Luckily, there was already a Java port that is more or less analogous to the C code - it’s not very idiomatic Java.

My goal was to fix the port. I ignored performance and maintainability issues.

How would you figure out that the quite complex function behaves the same way in the original and the ported version?

The trivial test approach of explicitly checking values is costly and it’s hard to get good code coverage.

assertEquals("a", phonet("a"));

This way you need a lot of test code before you get some coverage. Line and branch coverage might be easy to attain, but the phonet library has 200+ rules that are evaluated by precedence. Now are we going to write x * 200 tests to check that the right values are returned? If we check all rules there are still all characters left that have to be processed consistently. When we are done with that phonet rules could change and we have to redo some of them. Not very exciting.

The trivial test approach is simply too expensive (and boring). Let’s try another approach that reduces our work to one line of code: write the test as an analogous function test.

The analogous test ensures that f(x) = g(x) for all x. f(x) is the C version and g(x) is the Java version. The only thing left is to generate x and see if f(x) == g(x) holds.

That’s the single line of code we have to write to get the test coverage. Okay, we have to write one line to check that f(x) = g(x) and some code to generate x. It all boils down to comparing the output of the C and the Java versions.

assertEquals(in, cPhonet.phonet(in), jphonet.code(in));

Unfortunately, we cannot create all values with brute force. The challenge is to find ways to generate the right input values and the quality of the test depends on the generated input values. I wrote the same test with multiple approaches each and every one useful for a facet of input values:

There’s a Quickcheck-based random generator test for large input
There’s a test with German dictionaries of known values
There’s an exhaustive test that generates all possible values up to a certain (small) size

As an example I’ll show you the exhaustive test for small strings. It uses Permutation.kPermutationWithRepetition(all, STRING_ADD, size) to create an Iterable that returns all permutations of size k from a set of input values with repetition of values (i.e. for the input AB and the size k = 2 it generates AA, AB, BA, BB).

@Test public void phonetDefinedCharacters() {
    Coder jphonet = createCoder();
    CPhonet cPhonet = new CPhonet();

    for (int size = 1; size <= 2; size++) {
        for (String in : kPermutationWithRepetition(
                CharEncoding.allChars(), STRING_ADD, size)) {
            assertEquals(in, cPhonet.phonet(in, rules()), jphonet.code(in));
        }
    }
}

You can have a look at the full implementation in the CPhonetVsPhonetikTest class.
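
The same exhaustive idea can be sketched in Python with itertools.product. This is an illustration only: the two upper-casing functions are made-up stand-ins for the C version and the Java port, and k_permutations_with_repetition is my name for the helper, not the API of the Java library.

```python
import itertools


def k_permutations_with_repetition(values, k):
    """Yield every string of length k built from `values`, reuse allowed."""
    for combination in itertools.product(values, repeat=k):
        yield "".join(combination)


def reference_upper(s):
    # Stand-in for the trusted original f(x).
    return s.upper()


def ported_upper(s):
    # Stand-in for the port g(x): hand-written ASCII upper-casing.
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def check_all_small_inputs(alphabet, max_size):
    # Compare f(x) and g(x) for every input up to max_size characters.
    for size in range(1, max_size + 1):
        for x in k_permutations_with_repetition(alphabet, size):
            assert reference_upper(x) == ported_upper(x), x


check_all_small_inputs("abzAZ1 ", 3)
```

The number of inputs grows with n^k, so this only works for a small alphabet and small sizes. Adding a character like "ß" to the alphabet makes the check fail, which is exactly the kind of difference such a test is meant to find.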

As you can see from the test code the idea is simple. Following this approach will not guarantee 100% test coverage as long as you cannot generate all values. The hard part is to come up with the right test data. A pure Quickcheck-based approach is a good start, but might need to run too long to find critical bugs and additional tests are necessary.

Porting is only one class of tests to use the analogous function testing for. Another class are tests where you already have a simple implementation. You are able to verify the correctness of the simple code easily. The simple implementation is then used to test the harder implementation (with preferable non-functional characteristics). It might be a good approach if you do not trust a piece of code, but you’re unable to read all of the source code. Just check it against a simple implementation.
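
A small sketch of this second class of tests, in Python. The example is made up for illustration: a bit count that is obviously correct checks a faster one that is harder to read, and check_analogous is a hypothetical helper name.

```python
import random


def simple_bit_count(n):
    # Obviously correct: count the "1" characters of the binary representation.
    return bin(n).count("1")


def fast_bit_count(n):
    # Harder to read: clear the lowest set bit until nothing is left.
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count


def check_analogous(f, g, generate, runs=1000):
    """Check f(x) == g(x) for `runs` generated values of x."""
    for _ in range(runs):
        x = generate()
        assert f(x) == g(x), x


check_analogous(simple_bit_count, fast_bit_count,
                lambda: random.randint(0, 1 << 64))
```

The test itself stays one line. All the effort goes into the generator for x.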

A blind spot of an analogous function test is that bugs in f(x) and g(x) can mask each other. If you implement the same bug in both versions the test will not indicate the problem.

The moral from this story is that black-box tests with generated values can be a valuable tool besides classical tests and code reviews - but as everything in life it’s not perfect. It’s only one tool. Quality is still expensive.

Testing: Start with the result

2012-12-12T18:38:00.000+01:00

If you write tests based on generated values, it’s incredible how often it is easier to start from the result value and work your way backwards to the input value (that results in the return value).

One part of the story is that you should not repeat the algorithm in the test. The only thing you’re testing is that you can write the same code in two places. Every error in your reasoning is repeated. It’s natural to start with a known return value and create input values that lead to this result.

I’ll demonstrate the principle with a functionality I recently used: k-permutation with repetition. (k defines the size of the output values. Permutation indicates that we care about order. Repetition says that output values can be created by reusing input values.) For example the k-permutation with repetition for the input set {A B} and k = 3 is {AAA, AAB, ABA, ABB, BAA, BAB, BBA, BBB}. We know that the size of the permutation is n^k. (n is the size of the input set.)

@Test public void kPermutationWithRepetitions() {
    int maxSize = 6; 
    
    for(String expected : toIterable(strings(1, maxSize))) {
        Set<Character> allowed =
            newHashSet(Chars.asList(expected.toCharArray()));
        Set<Character> additional =
            anySet(characters(), 0, maxSize - allowed.size());
        allowed.addAll(additional);
    
    
        int k = expected.length();
        Iterable<String> actual =
            kPermutationWithRepetition(allowed, STRING_ADD, k);
        
        assertEquals(
            Math.pow(allowed.size(), k), Iterables.size(actual), 0);
        assertTrue(Iterables.contains(actual, expected));
    }
}

The test uses Quickcheck and Guava. It starts by generating one expected result value. The interesting input value is the set of allowed characters. This set contains all characters of the result string plus some additional characters. We have to take care that the input set is small enough, otherwise the result size is too large to run the test in practice. With this in place we can call kPermutationWithRepetition that generates the actual result. The test checks that the expected value is in the result and that the size of the permutation is correct.
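
For readers who prefer Python, here is a rough translation of the same idea using only the standard library. It is a sketch under assumptions: k_permutations_with_repetition is implemented with itertools.product instead of the Java code under test, and the helper names are mine.

```python
import itertools
import random
import string


def k_permutations_with_repetition(values, k):
    return ("".join(p) for p in itertools.product(sorted(values), repeat=k))


def check_k_permutation(max_size=4):
    # 1) Start with the result: one string that has to show up in the output.
    k = random.randint(1, max_size)
    expected = "".join(random.choice(string.ascii_uppercase) for _ in range(k))

    # 2) Work backwards to the input: the characters used plus a few extra ones.
    allowed = set(expected)
    extra = random.randint(0, max_size - len(allowed))
    allowed.update(random.sample(string.ascii_uppercase, extra))

    # 3) Run the code under test, then check the size and the membership.
    actual = list(k_permutations_with_repetition(allowed, k))
    assert len(actual) == len(allowed) ** k
    assert expected in actual
    return expected, allowed


for _ in range(50):
    check_k_permutation()
```

The input set is capped at max_size characters, so the result never has more than max_size^max_size entries.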

This example demonstrates that it can be much easier to start with an expected return value. Depending on the situation it’s appropriate to create the entire result or just aspects of it. If you can generate at least one result value, the probability of the generated values is evenly distributed over all results and you know the total result size, your test is sufficient.

My little stupid Raspberry Pi project: 1 LED and 1 Switch

2012-11-28T18:38:00.001+01:00

I recently bought a Raspberry Pi. My goal is to build a living room mp3 player. As I’m a software guy the programming part isn’t a problem, but the mp3 player has to be able to respond to some buttons with semantics like: “Hey, next song please!”. So I have to do some simple hardware. Sadly, my knowledge about electronics has faded. This is the journey of an electronics ignoramus in hardware land.

As a first step, I put two examples from the Raspbian user guide together. Pressing a switch toggles an LED, enabling and disabling it. Right now I’ve no idea how to calculate the pull-up resistor for the switch and the resistor for the LED. I was happy to see that it simply worked.

Parts: Breadboard, 10k Ohm, 150 Ohm, Switch, 1 Led, Jumper Wires

What I learnt so far

There are four and five band resistor color codings. One half of the resistors I use are four band coded and the other half five band coded. Man, this is like software world with downward compatibility. Can’t they just use five bands?

The world outside of a computer has x,y,z axis. There is a difference between nice, short and too f***ing short cables. Because they had only male jumper wires at my store, things are a bit improvised anyway. I had to fit an end connector pin on the male wire. The guy at the shop said I should solder it on. Yeah right, the first thing I’ll do is soldering; adding some shrink tubing is also a good idea. The x,y,z problem also applies to switches. They actually have to fit into the Breadboard.

Software-wise the project is very simple. I use the raspberry pi GPIO python library. I read the state of the switch from port 12 and enable the LED on the first and disable the LED on the second state change.

import RPi.GPIO as GPIO
import time

GPIO.setmode(GPIO.BOARD)
GPIO.setup(11, GPIO.OUT)  # LED
GPIO.setup(12, GPIO.IN)   # switch
state = False

def wait():
    time.sleep(0.1)

while True:
    wait() #don't burn the CPU
    if not GPIO.input(12):
        print state
        state = not state
        GPIO.output(11, state)
        # wait until the switch is released again
        while not GPIO.input(12):
            wait()

Not very surprisingly it’s a good idea to not poll the input port constantly. Otherwise the python process would use 100% of the available CPU time. I don’t know yet how to replace the polling with GPIO interrupts.
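
The toggle logic itself can be tried out without any hardware. The following sketch is an assumption-laden model of the loop above: led_states is a made-up function that takes a list of sampled input values instead of reading the pin, with True meaning the switch is released (pull-up) and False meaning it is pressed.

```python
def led_states(samples, state=False):
    """Return the LED state after each sample of the switch input."""
    states = []
    previous = True  # assume the switch starts released
    for current in samples:
        if previous and not current:
            # Falling edge: the switch was just pressed, so toggle the LED.
            state = not state
        previous = current
        states.append(state)
    return states


# press, hold, release, press again
print(led_states([True, False, False, True, False]))
```

Holding the switch down does not toggle the LED again. Only the change from released to pressed counts.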

I think I won the first round of the hardware game:
a) It works.
b) My Raspberry is still alive.
c) It’s quite clear I have to learn and practice a lot.

How to restore the subversion history of a file

2012-11-08T06:11:00.001+01:00

Every now and then the history of a file in a subversion repository is lost. I’ll describe how you can restore the history.

If you do a svn copy the file history of the new location shares the history of the old location. This is the way Subversion supports svn move and branches. Restoring a file’s history is only a matter of copying the right file. Given the file $A_FILE to restore, the new file location $B_FILE and the last revision $A_REV of $A_FILE (before $A_FILE was removed) the operations to restore the history of $A_FILE are:

svn delete $B_FILE
svn cp $URL/$A_FILE@$A_REV $B_FILE
svn cat $URL/$B_FILE > $B_FILE

After you restored $A_FILE’s history all changes directly to $B_FILE are lost. Only the content of the file $B_FILE survives.

Here’s a commented script that demonstrates the problem and shows how to fix it:

#!/bin/bash -ex

#do everything in a working directory that's removed after the script ran
mkdir -p work
WORK=$(readlink -f work)
cd "$WORK"
trap 'rm -rf "$WORK"' EXIT SIGINT SIGTERM

#setup a repository
REPOSITORY=$(readlink -f repo)
svnadmin create "$REPOSITORY"
URL=file://$REPOSITORY

#checkout the local workspace
svn co $URL workspace
cd workspace

A_FILE=a.txt
B_FILE=b.txt

#create the file to restore
echo $A_FILE content > $A_FILE
svn add $A_FILE
svn commit -m"initial version of $A_FILE"
svn update
A_REV=$(svnversion)

#history of the file is lost
svn delete $A_FILE
svn commit -m"deleted $A_FILE"

#the new file
echo $B_FILE content > $B_FILE
svn add $B_FILE
svn commit -m"initial version of $B_FILE"
svn update
B_REV=$(svnversion)

#restore the file history and keep the file content
svn delete $B_FILE
svn cp $URL/$A_FILE@$A_REV $B_FILE
svn cat $URL/$B_FILE > $B_FILE
svn status
svn commit -m"history from $A_FILE in new path $B_FILE"

#check the result
svn update && svn info
svn log --diff $URL/$B_FILE
svn cat $URL/$B_FILE
svn diff $URL/$B_FILE $URL/$B_FILE@$B_REV

My best Python HTTP test server so far

2012-10-10T19:36:00.001+02:00

I’ve implemented a bunch of test HTTP servers in Python and now I think I’ve got the implementation right:

Test client code doesn’t have to care about the specific HTTP port. Any free port can be used by the test without interference with other running processes.
The HTTP server is up and running when the test is ready.
Resource handling uses the with statement. The HTTP server is shut down at the end of the test.
The concrete request urls (host and port) are transparent for the test.
Test can be run fully in memory. The only resource allocated is the HTTP socket.

The actual test is brief. We call the http_server with a HTTP handler and the function url is returned. The test can use the url function to create the request url. This is handy as the allocated port of the HTTP server is not fixed. In the test below we check that the returned content matches.

with http_server(Handler) as url:
    assert list(urllib2.urlopen(url("/resource"))) == [content]

To run this test we need an HTTP handler implementation. You could use the SimpleHTTPServer.SimpleHTTPRequestHandler that comes with Python and work with files served from a directory. This is a good point to start but setting up a test folder with the necessary content is cumbersome and inflexible.

This handler runs in memory without any additional setup. It always returns the 200 response code and writes the content into the response.

code, content = 200, "Ok"
class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(code)
        self.end_headers()  # blank line between headers and body
        self.wfile.write(content)

The http_server implementation starts a thread, opens a socket and yields the url function. The HTTP request handler runs in the spawned thread.

@contextlib.contextmanager
def http_server(handler):
    def url(port, path):
        return 'http://%s:%s%s' % (socket.gethostname(), port, path)
    # port 0: let the operating system pick any free port
    httpd = SocketServer.TCPServer(("", 0), handler)
    t = threading.Thread(target=httpd.serve_forever)
    t.setDaemon(True)
    t.start()
    port = httpd.server_address[1]
    try:
        yield functools.partial(url, port)
    finally:
        # shut down even if the test body raised
        httpd.shutdown()

I leave it as an exercise to you to write an implementation that reuses the HTTP server in multiple tests. This could be necessary if the overhead of allocating ports dominates the test running time.
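
For reference, here is how the same pattern could look with the Python 3 standard library. It is a sketch and not a drop-in replacement for the code above: the module names changed between Python 2 and 3, I bind to 127.0.0.1 instead of using the host name, and fetch is a small made-up helper that bypasses any configured proxy.

```python
import contextlib
import functools
import http.server
import threading
import urllib.request

CONTENT = b"Ok"


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(CONTENT)))
        self.end_headers()
        self.wfile.write(CONTENT)

    def log_message(self, format, *args):
        pass  # keep the test output quiet


@contextlib.contextmanager
def http_server(handler):
    def url(port, path):
        return "http://127.0.0.1:%s%s" % (port, path)

    # Port 0 lets the operating system pick any free port.
    httpd = http.server.HTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    try:
        yield functools.partial(url, httpd.server_address[1])
    finally:
        httpd.shutdown()
        httpd.server_close()
        thread.join()


def fetch(url):
    # An empty ProxyHandler makes sure the request goes directly to the server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        return response.read()
```

A test then looks the same as before: open the server in a with block and request url("/resource") with fetch.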

Convert videos to wav and mp3 audio files

2012-10-08T20:53:00.000+02:00

On the one hand, I listen almost daily to a fluctuating list of podcasts - mostly while commuting and jogging. I find it very relaxing. On the other hand, I have an ever increasing list of presentation videos I would like to watch, but I don’t actually do it. For some talks and most interviews the video is nice but not necessary. The audio is sufficient to get the information and it’s handy to be able to listen to interesting stuff on the run.

The following bash script downloads the video from YouTube and encodes the audio into a low quality mp3 file.

#!/bin/bash -e
cd $(mktemp -d youtubedownXXXX)
youtube-dl -f worst -t $1
name=$(echo *)
ffmpeg -i $name -vn -acodec pcm_s16le -ar 44100 -ac 2 $name.wav
lame -m m -q 0 --vbr-new -B 64 $name.wav $name.mp3

The script depends on youtube-dl, ffmpeg and lame. It creates a temporary folder where it puts the downloaded video, wav and mp3 files.

Archiving twitter messages without shortened URLs with Python Twitter Tools

2012-08-05T16:44:00.001+02:00

Version 1.9 of the Python Twitter Tools includes the new feature to follow redirects of tweeted URLs. Mike Verdone, the developer behind the Twitter Tools, accepted my patch.

The goal is to archive readable URLs in the tweet archive by replacing all URLs from shortening services. You can run it with twitter-archiver -f ThomasAJung to follow all links or twitter-archiver -r bit.ly,t.co,goo.gl ThomasAJung to follow the links of selected hosts.

The archived output changes from:
228771182638419968 2012-07-27 10:37:54 CEST Believe it or not there are people who see interop as a bad thing. It interferes with their business model, ... http://t.co/AhAg41x9

to:
228771182638419968 2012-07-27 10:37:54 CEST Believe it or not there are people who see interop as a bad thing. It interferes with their business model, ... http://scripting.com/stories/2012/07/26/oauth1IsFine.html
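
The replacement step can be sketched roughly as follows. This is not the code from the patch, just an illustration under stated assumptions: expand_urls and follow_redirects are hypothetical names, URLs are found with a simple regular expression, and the resolver is passed in as a parameter so the logic can be tried without network access.

```python
import re
import urllib.parse
import urllib.request

# Simplified assumption: a URL is everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def follow_redirects(url):
    """Return the final URL after following HTTP redirects (needs network)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.geturl()


def expand_urls(text, hosts=None, resolve=follow_redirects):
    """Replace URLs in text with their redirect targets.

    hosts: set of host names whose links are followed; None follows all links.
    """
    def replace(match):
        url = match.group(0)
        host = urllib.parse.urlparse(url).hostname
        if hosts is not None and host not in hosts:
            return url
        try:
            return resolve(url)
        except OSError:
            return url  # keep the short URL if it cannot be resolved

    return URL_PATTERN.sub(replace, text)
```

Passing hosts=None corresponds to following all links, and a set like {"bit.ly", "t.co", "goo.gl"} to following only the links of selected hosts.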

You can read the Shortcomings section of the Wikipedia article about URL shortening if you are interested in why you should replace the shortened URLs.

Using Bash Shell Parameter Expansions

2012-05-23T19:59:00.001+02:00

Knowing shell parameter expansions is quite handy for manipulating parameters in Bash. You can remove and replace characters, select substrings, check if a variable is present, find variables and convert strings to upper and lower case.

A simple example is to emulate the behavior of the coreutils commands dirname and basename. After setting a variable FILE=/tmp/a.txt the output of dirname $FILE and echo ${FILE%/*} is /tmp. We can also create a parameter expansion that emulates basename $FILE that returns a.txt: echo ${FILE##*/}.

As you can see from these two examples parameter expansions are powerful. It’s also obvious that the parameter expansions are quite cryptic. They do not use a syntax that is used elsewhere and are hard to remember. The advantage of parameter expansions is that you can use them directly for ad hoc scripting in the shell.

The remainder of this blog demonstrates all parameter expansions with examples. The Bash manual contains a complete description of parameter expansions.

Examples

Lines starting with $ contain the command line input and lines starting with > contain the output of commands. Each section contains a shell session - all variables have the scope of this section.

Get the value of a variable

$ A=value
$ echo $A
> value
$ echo ${A}
> value

Assign variable if null

$ echo $F
>
$ echo ${F:=A}
> A
$ echo ${F:=B}
> A
$ echo $F
> A

Substitute variable if null

$ echo $F
>
$ echo ${F:-A}
> A
$ echo $F
>
$ G=B
$ echo ${G:-A}
> B

Exit shell if variable is null

$ bash -c "echo ${A:?died}"
> bash: A: died
$ echo $?  #return code
> 1
$ A=ok
$ bash -c "echo ${A:?died}"
> ok
$ echo $?  #return code
> 0

Substitute variable if not null

$ echo ${F:+A}
>
$ F=G
$ echo ${F:+A}
> A

Substrings from index

$ A=abcdefg
$ echo ${A:1}
> bcdefg
$ echo ${A:1:3} #with optional length parameter
> bcd

Length of a string

$ A=abc
$ echo ${#A}
> 3

Delete from left

$ A=abcdefabc

$ echo ${A#a}
> bcdefabc

$ echo ${A#ab}
> cdefabc
$ echo ${A#b} #no match
> abcdefabc
$ echo ${A#?b}
> cdefabc
$ echo ${A#?c} #no match
> abcdefabc
$ echo ${A#*c}
> defabc

$ echo ${A#*b} #shortest match
> cdefabc
$ echo ${A##*b} #longest match
> c

Delete from right

$ A=abcdefabc

$ echo ${A%c}
> abcdefab

$ echo ${A%b*} #shortest match
> abcdefa
$ echo ${A%%b*} #longest match
> a

Replace variable value

$ A=abcdefabc

$ echo ${A/a/x}
> xbcdefabc
$ echo ${A/b/x}
> axcdefabc
$ echo ${A/#a/x} #from left
> xbcdefabc
$ echo ${A/#b/x}
> abcdefabc
$ echo ${A/%c/x} #from right
> abcdefabx
$ echo ${A/%b/x}
> abcdefabc
$ echo ${A/b*b/x}
> axc
$ echo ${A//b/x} #replace all
> axcdefaxc
$ echo ${A//b}
> acdefac

Case conversion

$ A=abc
$ echo ${A^} #upper case first
> Abc
$ echo ${A^^} #upper case all
> ABC
$ B=DEF
$ echo ${B,} #lower case first
> dEF
$ echo ${B,,} #lower case all
> def

Find variable name with prefix

$ echo ${!US*}
> USER

Get the value of a parameter defined by the value of another parameter

$ A=B
$ B=b_value
$ echo ${!A} #Indirect expansion
> b_value

You should always keep in mind that the shell parameter expansions can be used with nested parameter expansion, tilde expansions, command substitutions and arithmetic expansions.

Used together with parameter expansions

$ A=abc
$ B=a
$ echo ${A#$B}
> bc
$ echo ${A#$B?}
> c

Used together with tilde expansion

$ A=./a
$ echo ${A/#./~}
> /home/user/a

Used together with command substitution

$ A=./a
$ cd /usr
$ echo ${A/#./$(pwd)}
> /usr/a

Used together with arithmetic expansion

$ A=./a
$ echo ${A/#./$((1 + 10))}
> 11/a

Having come this far in the examples, I suggest you also have a look at the other shell expansions supported by Bash.

Why I Created Daily Feed Recycler

2011-12-13T20:04:00.000+01:00

The motivation behind Daily Feed Recycler is that there is too much content on the Internet: good and bad. Once I have found a good source, I do not have enough time to actually read everything. I just can’t read 5k word articles in a row. Even if I had the time to do it, I could not do it mentally. It’s not fun.

Feeds are a wonderful tool for authors and readers. They allow you to stay informed about the changes on a page. Feed aggregators are part of this reading experience. They let you manage feeds: read articles, mark as read, subscribe and aggregate feeds. Applications creating feeds get this for free and there is a multitude of feed aggregators available.

Feeds help with the time problem to some extent, but they - wonderful as they are - have a dark side. Instead of solving the problem of organizing content in a way that you can read really good stuff, they organize content in a way that you can read the latest stuff. Yes, the new content is cool, but the classics are here to stay. You will not get a tweet from Goethe. With Daily Feed Recycler the good stuff is on an equal footing with the latest stuff. The content for every day is presented as new stuff in your feed aggregator.

Content has to be presented in a digestible manner. Deep reading, not just skimming to get an overview. Time and mental energy have to be organized in a way that allows reading. The Daily Feed Recycler is thought of as a way to break down content into smaller parts and add a reminder that there’s still good content to read. Using your feed aggregator you can decide when you read it, if you ignore it or if you read it at all. Feed aggregators are quite good nowadays and their flexibility is useful here.

You can now create your own channel of daily content: the full bash reference, the major Linux man pages or the list of all decision biases from Wikipedia. Everything you like. I’m often looking for daily feeds but there are not that many of them out there, because it’s a lot of work to do and depends on a certain level of expertise. Daily Feed Recycler is not a competitor for the existing curated daily feeds. A curated feed can have a much better quality through an expert selection and logical ordering of content.

Daily Feed Recycler follows the “Release early, Release often” philosophy. There are bugs, missing features and rough edges. I hope you find it useful nonetheless.

Find cruft with a new Mercurial extension

2011-07-15T07:10:00.001+02:00

After some fun with the quick and untested shell script that finds the oldest code in a Subversion repository, the next step is to write a Mercurial extension. The simple Mercurial extension cruft does basically the same job as the shell scripts for Subversion. Being an extension, it’s nicely integrated into Mercurial like the other extensions.

Python and Mercurial are relatively easy to get into. Mercurial provides the Developer Info page, which is really good. Additionally, there’s a guide on how to write a Mercurial extension. The guide is a good start for Mercurial development. The rest can be easily picked up by reading the code of other commands and extensions. The code is readable and there are no big hurdles.

The only thing I missed while writing the extension is type information in method signatures. As much as I like Python, it’s ridiculous to write the type information in the pydoc and let the developer figure out the types. This is one of the trade-offs you have to live with.

Testing Mercurial extensions

To test the extension itself, it suffices to understand the integration test tool Mercurial uses. There’s some documentation for this as well. The basic idea behind Cram is to start a process and check its output against the expected output.

The integration test tool defines a small language. All lines that have no indentation are comments. Indented lines starting with $ are executed and all other lines are the expected output. For example a test looks like this:

init

 $ hg init

 $ cat <<EOF >>a
 > c1
 > c2
 > EOF
 $ hg ci -A -m "commit 0"
 adding a

cruft

 $ hg cruft
 0 a c1
 0 a c2

First a repository is initialized: a file called a with the content (c1,c2) is committed and then Mercurial is started with the cruft command. Without options the cruft extension prints all lines with the newest lines first. The expected output is (0 a c1, 0 a c2), which means revision 0, file a and line c1; revision 0, file a and line c2.

It’s fairly easy to get started with this tool. The only downside in my tests is that they reuse the same test fixture and do not reset the fixture for each test. They are not executed in isolation, which has a whole range of problems - redundancy and readability for example - but I didn’t feel that it was worth the effort to structure the tests otherwise.

Installing the extension

The easiest way to install the extension is to download the cruft.py to a local folder and add a link to the extension file in the .hgrc file.

[extensions]
cruft=~/.hgext/cruft.py

Using the extension

After the installation you can execute pretty much the same commands as with the shell script version.

hg help cruft

hg cruft

(no help text available)

options:

-l --limit VALUE   oldest lines taken into account
-c --changes       biggest change sets
-f --files         biggest changes per file
-X --filter VALUE  filter lines that match the regular expression
    --mq            operate on patch repository

use "hg -v help cruft" to show global options

Here I use the Quickcheck source code to show some sample output.

hg cruft -l 5 -X "^(\s*}\s*|\s*/.*|\s*[*].*|\s*|\s*@Override\s*|.*class.*|import.*|package.*)$" quickcheck-core/src/main

This finds the oldest 5 lines using the Java source code specific exclusion pattern (closing braces, comments, imports, class definitions etc.) for the quickcheck-core/src/main folder. The output contains the revision number, source file and source code line.

5 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/TupleGenerator.java public Object[] next() {
5 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/TupleGenerator.java ArrayList<Object> next = new ArrayList<Object>(generators.length);
5 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/TupleGenerator.java for (Generator<?> gen : generators) {
5 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/TupleGenerator.java next.add(gen.next());
5 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/TupleGenerator.java return next.toArray();

You can also find the biggest change sets for the last 500 lines.

hg cruft -X "^(\s*}\s*|\s*/.*|\s*[*].*|\s*|\s*@Override\s*|.*class.*|import.*|package.*)$" -l 500 -c quickcheck-core/src/main

This prints the revision number, number of changed lines and commit comment of the change set.

49 41 removed getClassification method from Property interface
moved Classification into quickcheck.property package
177 43 MutationGenerator, CloningMutationGenerator and CloningGenerator added
139 50 fixed generic var arg array problems
5 53 initial check in

Finally, you can find the files with the most lines changed by a single change set (again with the filter and for the 500 oldest lines).

hg cruft -X "^(\s*}\s*|\s*/.*|\s*[*].*|\s*|\s*@Override\s*|.*class.*|import.*|package.*)$" -l 500 -f quickcheck-core/src/main

This prints the revision number, file name, number of changes and change set commit comment.

176 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/AbstractTreeGenerator.java 27 added tree generator
177 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/CloningGenerator.java 28 MutationGenerator, CloningMutationGenerator and CloningGenerator added
139 quickcheck-core/src/main/java/net/java/quickcheck/generator/support/DefaultFrequencyGenerator.java 36 fixed generic var arg array problems
49 quickcheck-core/src/main/java/net/java/quickcheck/characteristic/Classification.java 41 removed getClassification method from Property interface
moved Classification into quickcheck.property package

Conclusion

Developing a Mercurial extension is relatively easy given Python, the good Mercurial documentation, the good readability of the code and the integration test tool. If you’re using Mercurial you should give Mercurial extension development a try. I’ve only recently read into Python again so this is the Python beginner’s version of a Mercurial extension. Help to improve the implementation is always appreciated.

Learning Python and seeing how things are implemented there is fun. Looking at the PEPs and the associated process, they feel much more accessible and open than JSRs. The PEPs are also a track record of the advances the language makes and problems it tries to solve one after the other. There’s stuff in Python that you’ll probably never see in Java like the generator expressions. Everyone who had to replace an internal loop with an iterator will understand that this is not a toy. The language features seem to sum up quite nicely and result in a productive environment. As always, some things are unfamiliar or missing but there’s no perfect platform.

Find cruft in your source code repository

2011-05-30T09:35:00.002+02:00

Michael Feathers wrote in his blog post “The Carrying-Cost of Code: Taking Lean Seriously” that it is necessary to remove old code from your product to be able to add new features. His argument is that you get a better understanding of your production code this way. Rewriting your code constantly leads to more readable and compact code.

"There are many places in the industry where existing mountains of code are a drag on progress.
[..]
Younger organizations without as much software infrastructure often have a competitive advantage provided they can ramp up to a base feature set quickly and provide value that more encumbered software-based companies can't. It's a scenario that plays out over and over again, but people don't really talk about it.
[..]
I'd like to have code base where every line of code written disappears exactly three months after it is written.
[..]
I have the suspicion that a company could actually do better over the long term doing that, and the reason is because the costs of carrying code are real, but no one accounts for them."

His reasoning goes so far as to ask product owners to remove features that are not needed. Software size seems to increase strictly monotonically. This makes maintenance harder and more costly. I’m not sure if you have to follow the advice strictly to improve your situation. Before you start arguing with your boss about removing features, it is a good idea to look for low-hanging fruit first: the oldest lines.

Metric

The heuristic comes from the observation that a) software has bugs and that b) if the software is actually used, bugs will be found and fixed. Fixing the bugs leads to new code, as do changes in coding style, new APIs etc. Old unchanged code is either bug-free, feature-complete and state-of-the-art or something nobody cares about. I’d say the metric is not too bad to find some victims. (A metric like this one should be a tool to find problems, not an absolute measurement. Metrics should not be taken too seriously and nobody should be tempted to cheat.)

To put the idea into practice I’ve hacked some scripts to find suspects in a Subversion repository. The scripts are:

find the oldest lines in your repository
find biggest change sets in your repository considering the oldest lines
find files that are changed the most by a change set considering the n oldest lines

To get some data I’ll use the legacy Subversion repository of the Quickcheck project. Quickcheck moved to Mercurial some time ago. It’s a test to see if something significant can be found with this metric.

The readme contains instructions on how you can run the scripts with your Subversion repository. The scripts are based on a local repository mirror to speed up analysis. The analysis can be executed on any subtree of the repository.

Oldest lines

Finding the oldest lines is quite simple: first get all file names with svn list and then use svn blame to get the revision for every line. This output is sorted by the revision (descending) of each line.

The output of the oldest_lines.sh is unfiltered. To extract useful information it has to be filtered. The filter.sh does this for Java source code: removing empty lines, single closing braces, package declaration, imports and comments.
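
The sorting and filtering can be sketched in a few lines of Python. This is an illustration, not a port of the shell scripts: parse_blame and oldest_lines are hypothetical names, the assumed blame line layout is revision, author and source text, and the NOISE pattern is only a simplified stand-in for filter.sh.

```python
import re

# Assumed layout of a plain `svn blame` line: revision, author, source text.
BLAME_LINE = re.compile(r"^\s*(\d+)\s+(\S+) ?(.*)$")

# Simplified stand-in for filter.sh: blank lines, lone closing braces,
# package declarations, imports and comments.
NOISE = re.compile(r"^\s*(|\}|package\b.*|import\b.*|//.*|/\*.*|\*.*)\s*$")


def parse_blame(file_name, blame_output):
    """Turn blame output into (revision, author, text, file, line) tuples."""
    records = []
    for number, line in enumerate(blame_output.splitlines(), start=1):
        match = BLAME_LINE.match(line)
        if match:
            revision, author, text = match.groups()
            records.append(
                (int(revision), author, text.strip(), file_name, number))
    return records


def oldest_lines(records):
    """Drop noise and sort by revision, newest first, so the oldest come last."""
    kept = [record for record in records if not NOISE.match(record[2])]
    return sorted(kept, key=lambda record: record[0], reverse=True)
```

With the newest revisions first, the oldest lines end up at the bottom of the output, which is why the examples here look at the tail.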

These are the last lines of the filtered output for Quickcheck:

$ ./filter.sh | tail -5

6     blob79    public int compare(Pair<Object, Double> o1, Pair<Object, Double> o2) { File: characteristic/Classification.java Line: 162
6     blob79    next.add(gen.next()); File: generator/support/TupleGenerator.java Line: 34
6     blob79    ArrayList<Pair<Object, Double>> toSort) { File: characteristic/Classification.java Line: 150
6     blob79    @SuppressWarnings("unchecked") File: generator/CombinedGenerators.java Line: 126
6     blob79     Object[] next = generator.next(); File: generator/CombinedGenerators.java Line: 128

A potential victim here is the Classification class. It’s a rudiment of the original Quickcheck implementation but was never used heavily. It’s a nice idea to do statistical testing but Classification could be removed from Quickcheck without losing a significant feature.

Biggest change sets

The second script top_change_sets.sh finds the biggest change sets considering only the n oldest lines. This results in an interesting output for the code base (oldest 1500 lines, top 5 change sets):

$ ./top_change_sets.sh 1500 5

r182 | blob79 | 2007-12-19 19:15:24 +0100 (Wed, 19 Dec 2007) | 1 line
basic failed test instances serialization feature implementation
136 changes

r270 | blob79 | 2009-06-03 18:52:52 +0200 (Wed, 03 Jun 2009) | 1 line
added pojo (a.k.a object) generator for interfaces
104 changes

r6 | blob79 | 2007-07-07 07:29:14 +0200 (Sat, 07 Jul 2007) | 1 line
initial check in
68 changes

r204 | blob79 | 2008-03-23 19:29:28 +0100 (Sun, 23 Mar 2008) | 1 line
added svn keyword id
52 changes

r198 | blob79 | 2008-03-23 18:29:26 +0100 (Sun, 23 Mar 2008) | 3 lines
fixed logging for serializing and deserializing runner
mandate a user set characteristic name (for serialization of test values)
added system property for number of runs
48 changes

Revisions 182 and 198 were commits related to the obscure test data serialization and deserialization scheme, something I’ve already removed in the latest release. The two changes resulted in 184 lines still present in the current source.

Revision 270 is no less obscure. It’s a declarative POJO object generator. The revision is so high in the list because it forced a lot of changes. This is not a good sign: obscure feature and lots of changes. That’s something worth investigating.

Revision 6 is the initial check in. So this should be okay.

The last open issue, revision 204, is the attack of the code formatters. They should be used with prudence as long as the down-stream tools can’t handle the changes properly. (Source control systems should understand the AST of the source language.)
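
The counting behind such a ranking is small enough to sketch. This is an illustration under assumptions, not the code of top_change_sets.sh: top_change_sets is a hypothetical function that takes one revision number per surviving source line.

```python
from collections import Counter


def top_change_sets(line_revisions, oldest, top):
    """Rank revisions by how many of the `oldest` oldest lines they own.

    line_revisions: one revision number per line of the current source.
    """
    oldest_revisions = sorted(line_revisions)[:oldest]
    return Counter(oldest_revisions).most_common(top)
```

For example, with three lines from revision 6, five from revision 182, two from revision 270 and four from revision 300, the two biggest change sets among the ten oldest lines are revision 182 with five lines and revision 6 with three lines.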

File changes

Now we can take a look at the files with most changes from a single revision. If you execute the query top_changes_in_file.sh (500 oldest lines, top 5) for the Quickcheck source code you’ll see:

$ ./top_changes_in_file.sh 500 5
r182 | blob79 | 2007-12-19 19:15:24 +0100 (Wed, 19 Dec 2007) | 1 line
basic failed test instances serialization feature implementation
34 changes | file: RunnerImpl.java

r180 | blob79 | 2007-12-07 18:59:59 +0100 (Fri, 07 Dec 2007) | 3 lines
MutationGenerator, CloningMutationGenerator and CloningGenerator added
26 changes | file: generator/support/CloningGenerator.java

r182 | blob79 | 2007-12-19 19:15:24 +0100 (Wed, 19 Dec 2007) | 1 line
basic failed test instances serialization feature implementation
24 changes | file: SerializingRunnerDecorator.java

r6 | blob79 | 2007-07-07 07:29:14 +0200 (Sat, 07 Jul 2007) | 1 line
initial check in
22 changes | file: characteristic/Classification.java

r179 | blob79 | 2007-10-13 09:14:30 +0200 (Sat, 13 Oct 2007) | 1 line
added tree generator
22 changes | file: generator/support/AbstractTreeGenerator.java

Besides the usual suspects, serialization support and the Classification class, two new suspects emerge: the mutation generator and the tree generator. In their favor, they might be useful, but they aren’t widely used, so this is something worth looking at.

Conclusion

The metric found multiple source files that are worth investigating: one feature that is already removed (serialization support), one likely victim (classification) and multiple places that are worth checking (mutation generator, tree generator, declarative POJO generator). The metric seems to find unloved children in the code that are good candidates for removal or implementation improvements.

I always like to remove code. Fewer lines of code means fewer spots where problems may emerge. Nobody can argue that it’s better to keep useless code when you can remove it - even if it’s tested and production-quality. That’s something like a reverse YAGNI. If you really care: the code will never disappear. You can find it in your source code management system. You should be okay with the fact that the old code will lose its relevance due to changes to the production system implementation. It can be an inspiration for how it could be done if the world hadn’t changed. The burden of these changes is also the reason why it’s better to remove the code in the first place. Dragging it with you without any gain is plain waste.

Quickcheck 0.6 Release

2011-05-09T18:55:00.001+02:00

The version 0.6 of Quickcheck is ready to use. The main features are support for deterministic execution of generators and improvements on generators. The JUnit runner support was removed in this release.

You can read a detailed description of deterministic execution in this blog post.

The 0.6 release adds the following generators:
- map generator maps(Generator<K> keys, Generator<V> values)
- subset generator sets(Set<T>)
- submap generator maps(Map<K, V>)
- unique generator using a Comparator<T> to decide if two values are considered equivalent: uniqueValues(Generator<T>, Comparator<? super T>)
- excluding generator based on a collection of input and a collection of excluded values: excludeValues(Collection<T> values, Collection<T> excluded)
- content generator type parameter are now co-variant for lists, iterators, sets and object arrays to allow creation of super type container generators (like Generator<List<Object>> = lists(integers()))
- PrimitiveGenerator added generators:
- generator for java.lang.Object instances

The dropped JUnit runner support means that the @ForAll annotation is no longer supported. Until lambda expressions are supported in Java (hopefully) the Iterable adapter is a good workaround that allows you to execute tests without too much boiler-plate. If you need all features the inner class will work just fine. Inner classes will become much better with the SAM-type conversion in Java 8 which is part of the language changes in Project Lambda.

The general development direction and the main theme I’ve been working on besides the release was the support for generator expressions. This is a good way to implement tests for equals methods where an equals method should return false when one of the significant attributes of an object is not equal. Right now you have to write a lot of boiler-plate to test a simple statement like: “This is not equal if one of the attributes is not equal”. With generator expressions this should become much easier.

It’s quite tricky to create a nice API for expressions. One cul-de-sac was to implement it as a builder with a fluent interface. Method chaining is not adequate. The chaining forces you to linearize the definition of the expression. This does not fit well into the world of generators where delegation and nesting are natural concepts. I burned some time before I got that the underlying problem cannot be fixed with a clever API. I hope the current approach terminates and the expression support is something you can work with in the 0.7 release.

Using deterministic generators with Quickcheck

2011-03-22T20:48:00.001+01:00

The 0.6 release of Quickcheck supports deterministic generators. The goal is to be able to make the generation of values reproducible. This is useful when you are working with a bug from your favourite continuous integration server or when you would like to run a piece of code in the debugger repeatedly with the same results.

A non-goal of the support is to remove the random nature of Quickcheck. Values are still random to allow a good coverage but reproducibility is supported when needed. This way you have the best of both worlds.

Quickcheck internally uses the linear congruential random number generator (RNG) implemented in Java’s Random class. The interesting property of the RNG in the context of reproducible values is stated in the javadoc.

If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers.

You can configure the seed used by Quickcheck with the RandomConfiguration class. It’s important to set the seed for every individual test method, otherwise the RNG’s return values depend on the execution order of the test methods. If you run different tests, add new tests or execute the tests in a different order, other values will be generated.
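
The effect is not specific to Java. The following sketch shows it with Python's random module instead of Quickcheck's RandomConfiguration; draw and run_tests are hypothetical helpers and the seeds 42 and 7 are arbitrary.

```python
import random


def draw(rng, count):
    """Take `count` values from the given random number generator."""
    return [rng.randint(0, 999) for _ in range(count)]


def run_tests(order, seed_each_test):
    """Simulate test methods that each draw two values.

    With seed_each_test every test gets a freshly seeded RNG; without it
    all tests consume values from one shared RNG, one after the other.
    """
    shared = random.Random(42)
    results = {}
    for name in order:
        rng = random.Random(7) if seed_each_test else shared
        results[name] = draw(rng, 2)
    return results


# Same seed and same sequence of calls: identical values.
assert draw(random.Random(42), 3) == draw(random.Random(42), 3)

# Seeding per test: the values of a test do not depend on execution order.
assert run_tests(["a", "b"], True) == run_tests(["b", "a"], True)
```

With the shared generator, whichever test runs first gets the first values of the sequence, so reordering the tests changes the values each test sees.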

The seed is generated randomly for the normal execution. This is the result of the RandomConfiguration.initSeed method call. This way Quickcheck still produces random values. Use the setSeed method to set the seed for a test method.

Instead of using the RandomConfiguration directly you should use the SeedInfo JUnit method rule that will run with every test method. Additionally, it adds the seed information, that is needed to reproduce the problem, into the AssertionError thrown.

The SeedInfo can be used like every other JUnit method rule. It’s added as a member of the test class. The example generates values in a way that the assertion always fails.

@Rule public SeedInfo seed = new SeedInfo();

@Test public void run(){
  Generator<Integer> unique = uniqueValues(integers());
  assertEquals(unique.next(), unique.next());
}

An example error message is:

java.lang.AssertionError: expected:<243172514> but was:<-917691317> (Seed was 3084746326687106280L.)

You can also use the SeedInfo instance to set the seed for a test method to reproduce the problem from the AssertionError.

@Rule public SeedInfo seed = new SeedInfo();

@Test public void restore(){
  seed.restore(3084746326687106280L);
  Generator<Integer> unique = uniqueValues(integers());
  assertEquals(unique.next(), unique.next());
}

Instead of setting the seed for individual tests you can also set the initial seed once for the random generator used by the JVM. If you run the test example from above (without the SeedInfo method rule member) and the configuration -Dnet.java.quickcheck.seed=42:

@Test public void run(){
   Generator<Integer> unique = uniqueValues(integers());
   assertEquals(unique.next(), unique.next());
}

You should get the result:

java.lang.AssertionError: expected:<977378563> but was:<786938819>

The configuration of seed values replaces the serialization and deserialization support of earlier Quickcheck versions. Setting the seed is a much simpler way to reproduce values over multiple JVM executions.

Revert - Sometimes going back is the way forward

2011-03-21T18:38:00.000+01:00

Revert is the reverse gear of your version control software. It removes all local changes and brings the local workspace back to the clean state of a committed revision. It is an important tool in the revision control software tool box. Once in a while there is no way forward so you have to go backward to make progress.

This may sound unintuitive. We are trying to make a change, not reverse it. Why should the tool that destroys all this hard work be the best option in some circumstances? Firstly, you do not lose everything. Even if you revert everything you gain some knowledge. At least that this exact way does not work. This is a good data point. Secondly and more obviously, revert lets you start with a fresh state. More often than not we are able to reach a working state again. Removing everything is the fastest way to get there.

I see mainly two scenarios for the revert command: planned mode and the accidental mode.

Planned mode revert

Starting with a working state of your committed source code you can do some exploratory work. Find out what you were looking for and revert.

Now you can start the work in an informed way from a working state. The artifacts of the exploration are removed. After reverting you do know that the state you are starting from works. To verify that a workspace state works you do need tools to catch the problems: a decent test coverage and other quality assurance measures.

A corollary is that because you are planning to revert anyway you can change your workspace in every way you need for the exploration.

Accidental mode revert

The first scenario was a bit too idyllic: you started your work with an exploratory mind set, found the precious information and cleaned up after yourself. Everything is planned, clean and controlled. This scenario is valid. You can do the exploratory work voluntarily. More often it is the case that you have dug yourself in. You need to find a way out.

Is this a hole or the basement?

The first issue is to know when you’re in a hole and there is little chance to get out.

Say you commit roughly every hour. Now you did not commit for four hours. Your change set becomes bigger and bigger. You see no way to get your tests running again. Different tests are broken after multiple trials to fix everything. You’re in a hole.

You made a change and it resulted in absolutely unexpected problems. Your tests are broken. You do not know why. There are red lights all over the place. You’re in a hole.

You made small, controlled, incremental changes for some time without committing. You did not bother to commit because everything was so simple. Now the changes have become bigger and you would like to commit, but you can’t because you can’t bring the whole system to run again. You are in a hole.

The commonality of the three examples is that you’re not in control of the process. The world you created determines your next steps. This happens to everyone. It’s normal. It happens all the time. Otherwise our work would be predictable day in and out - how boring. (I would go so far as to say that in other circumstances it’s a good sign that you can follow the inherent conclusions of your system. This way it’s productive that you are determined by the conclusions of your system because it is consistent.)

If there is such a thing as experience in hole digging it’s to see the problem coming and to stop early. If it happened often enough to you, you should know the signs. You’ll know that knee deep holes are deep enough to stop and that it’s not necessary to disappear completely.

Ways out

Now that you have found out that you have a problem, all energy should be put into it. Don’t try to be too smart. Solve this one problem. You have two options to get out of the hole: fixing the current state or revert.

Fixing the current state can work. You find enough information to fix the problem. You’ll lose some time but nothing of your work. Once the current state works it’s a good idea to commit right away. This creates a save point. If there are more problems lurking down the road you can always come back to this state. The problem is that you might not find the problem. Finding a way out now is hard. Your change set adds to the complexity of the underlying problem. Your changes obfuscate the problem and make it harder to analyze. Everything you do will increase the change set complexity further.

When fixing the current state is too hard, you have to revert your work to keep up the pace. Now you have the problem that you have already sunk so much time into it, and the next step is to roll everything back to the state you started from. This does not feel pleasant. The upside is that even though you reverted the code, not everything is lost. You still have more knowledge about the problem. This knowledge can be used in the second and hopefully last attack. Make notes if you need them to remember the information you gathered.

The first attempt was in the wrong direction and/or too big. It is a good idea to make smaller steps with interim commits to create save points you can revert to. This creates a safety net if you bump into the problems again. You can revert repeatedly to chop off smaller portions of the problem until it is solved. You decrease the size of the changes until you can understand the problem. Once in a while strange things happen and a single-line change has crazy effects. After removing such roadblocks you can make bigger steps again.

There is of course a middle way: trying to revert only partially. Without creating and applying patches you have only one direction to go (revert) and you'll swiftly have to revert everything (because your change history is lost). I’ll come back later to an approach that uses diff and patch to do partial reverts in a controlled way.

Bringing the costs of reverts down

The problem with reverts is that they are expensive. Work you've already done is removed from the source tree. Not something we are especially proud of.

The problem is only as big as the change set that is flushed down the toilet. You should commit as often as your infrastructure allows: the execution time of tests and the integration costs are the main factors here. (You can move some of the cost into the continuous integration environment you're using.) As always this is a trade-off between the work lost and the overhead created by frequent commits. Committing every hour is probably a good idea. Just do whatever fits your needs best.

The other factor is the right attitude to the revert operation. If you have already spent a lot of time on a problem and could not find a fix, it’s likely you won’t find it in this direction and a fresh approach is needed. You can actually save a lot of effort by aborting this failed attempt. This will also bring down the total cost of an inevitable later revert.

Conclusion

Failed attempts are not the problem. We have to learn from our failures. They are just too frequent and too valuable to lose. Failing is okay. Samuel Beckett put it nicely:

Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.
