Code crumbs, by Clément Pit-Claudel

The cereal box problem: How many boxes does it take to find all prizes?

Clément — Fri, 20 Dec 2013 17:48:29 +0000

Children’s cereal manufacturers often attract the attention of young clients by including small prizes and toys in every box; sometimes all prizes are identical, but most often individual prizes are part of a collection, and kids are encouraged to collect them and try to complete a full collection. How long does it take ?

Simple probabilistic modeling shows that on average $n (1 + 1/2 + \ldots + 1/n)$ boxes are required to complete a full set of $n$ prizes: for example, it takes on average $14.7$ boxes to complete a full set of six prizes.

Numerical and experimental simulation

The expected number of boxes to buy before finding all $n$ prices is shown below:

This histogram shows the calculated average number of purchases required to accumulate at least one copy of each prize, for various sizes $n$ of prizes sets: for example, finding all prizes in a series of twelve requires on average 37.24 purchases. Note that the curve is somewhat steeper than linear, due to a multiplicative logarithmic factor $\left(1 + \frac12 + \frac13 + \ldots + \frac1{n}\right) \sim \log(n)$

And here is an experimental simulation for $n = 6$: the graph below shows how many purchases were required to obtain six prizes out of six, over 1,000,000 simulations.

This histogram shows the experimental probability distribution of $T_6$; that is, of the number of cereal boxes that one needs to purchase before finding all prizes in a series of six. Interestingly, despite the peak at $t = 11$, the distribution’s long tail offsets the average to $14.65$ in this experiment — close to the theoretical prediction of $14.7$.

Interested in how these plots were obtained? Read on to find out, or skip directly to the comments to share your thoughts!

Calculations and simulation code

Theoretical average

Assume there are $n$ different prizes, all of which are equally distributed in cereal boxes. Model buyer behavior as a sequence of purchases, and call $T_k$ the time at which they find the $k$^th prize : clearly $T_0 = 0$ (no purchases are needed to find 0 prizes) and $T_1 = 1$ (one finds their first prize immediately after buying their first cereal box). Naturally, $T_n$ denotes the time required to find all prizes.

To calculate $\newcommand{\E}[1]{\mathrm{\mathbf{E}}\left[#1\right]}\E{T_n}$, call $\Delta_k$ the elapsed time between the $k$^th and $k+1$^th prizes, i.e. $\Delta_k = T_{k+1} – T_k$. Summing over $k$ yields $\sum_{k=0}^{n-1} \Delta_k = T_n – T_0$ which, given the linearity of the expectation operator, yields $$
\sum_{k=0}^{n-1} \E{\Delta_k} = \E{T_n}
$$

All that remains to be done now is to calculate $\E{\Delta_k}$ for all $k \in \{ 0, \ldots, n-1 \}$. By definition $$
\newcommand{\P}[1]{\mathrm{\mathbf{P}}\left(#1\right)}\E{\Delta_k} = \sum_{\delta = 1}^\infty \P{T_{k+1} = T_{k} + \delta} · \delta
$$

$\P{T_{k+1} = T_{k} + \delta}$ is the probability of finding the $k+1$^th exactly $\delta$ purchases after obtaining the $k$^th; that is, the probability of completing $\delta – 1$ unproductive purchases, then a successful one. Now the probability of hitting a sequence of $\delta – 1$ unproductive purchases is exactly $(k/n)^{\delta-1}$; hence $$
\P{T_{k+1} = T_{k} + \delta} = \left(\frac{k}{n}\right)^{\delta-1}\left(1 – \frac{k}{n}\right)
$$

Summing over positive values of $\delta$ yields $$
\begin{align*}
\E{\Delta_k} &= \sum_{\delta = 1}^\infty \P{T_k = T_{k-1} + \delta} · \delta = \sum_{\delta = 1}^\infty (k/n)^{\delta-1} · (1 – k/n) · \delta\\
&= (1 – k/n) \sum_{\delta = 1}^\infty \delta (k/n)^{\delta-1} = (1 – k/n) \frac{1}{(1 – k/n)^2}\\
&= \frac{1}{1 – k/n} = \frac{n}{n-k}
\end{align*}
$$

And summing over $k$ yields $$
\begin{align*}
\E{T_n} &= \sum_{k=0}^{n-1} \E{\Delta_k} = \sum_{k=0}^{n-1} \frac{n}{n-k} = n \sum_{k=1}^{n} \frac{1}{k}\\
&= n \left(1 + \frac12 + \frac13 + \ldots + \frac1{n}\right)
\end{align*}
$$

Which, for large $n$, is $n \left(\log(n) + \gamma + \frac1{2n} + o\left(\frac1n\right)\right)$ (where $\gamma \approx 0.57722$ denotes the Euler–Mascheroni constant).

Note that, as expected, $\E{\Delta_k}$ quickly increases as $k$ grows larger: the more prizes have already been found, the harder it is to get a new one, to the point that finding the last prize requires $n$ purchases on average.

Numerical simulations

The experimental data above was simulated using the following code:

	repeat = 1000*1000
	time_to = defaultdict(Counter)
	
	for _ in range(repeat):
		purchased = 0
		collected_prizes = set()
		
		while len(collected_prizes) < n: # Until all prizes have been found
			purchased += 1 # Purchase one more box
			prize = random.randint(1, n) # Open it
			if prize not in collected_prizes: # Is it a new prize ?
				time_to[len(collected_prizes) + 1][purchased]  += 1 # Yes, count it !
				collected_prizes.add(prize)	# And add it to list of already found prizes

Conclusion

One might want to find some intuitive way to comprehend the formula derived above, $\E{T_n} = n\left(1 + 1/2 + \ldots + 1/n\right)$. A relatively easy way to get such an intuition is to summarize the calculation above, realizing that each new packet bought after finding the $k$^th prize brings an extra prize with probability $(n-k)/n$ ; since each purchase is independent, it takes on average $n / (n-k)$ purchases to find a new prize; then summing over $k$ yields the final formula.

Have you ever completed a full set of cereal box prizes? How long did it take?

An experimental estimation of the entropy of English, in 50 lines of Python code

Clément — Sat, 07 Dec 2013 23:49:04 +0000

“Th_ onl_ wa_ to ge_ ri_ of a tempta____ is to yie__ to it. Resi__ it, an_ you_ soul gro__ sic_ wi__ longi__ fo_ th_ thin__ it ha_ forbi____ to itse__.”

(Osc__ Wil__, The Picture __ ______ ____)

Thanks to the verbosity of the English language, proficient English speakers generally find it relatively easy to decipher the above passage despite the numerous omissions.

How does one quantify this redundancy? This article introduces the notions of Shannon entropy and information rate, and experimentally estimates the information rate of written English by training a Markov model on a large corpus of English texts. This model is finally used to generate gibberish that presents all the statistical properties of written English. Best of all, the entire source code fits in 50 lines of elegant Python code.

Introduction and theoretical background

Entropy, a notion formalized in the context of information theory by Claude Shannon in his 1948 article “A Mathematical Theory of Communication“, is a measure of how much information an observer can gain by learning the result of a certain random process.

Similarly, the entropy of an information source, also known as its information rate, or its source entropy, is defined as the limit of the conditional entropy of each consecutive letter given the previous ones; in other words, the entropy of an information source quantifies how much extra information is gained by reading one extra letter of a text.

As the example above demonstrates, not every letter of a text is needed to decipher it properly; indeed, humans rely on a huge amount of statistical knowledge to speed up their reading: for example, they know that ‘q’ is almost always followed by ‘u’, and that ‘wh’ is most often followed by ‘o’ (‘who’), ‘y’ (‘why’), ‘at’ (‘what’), ‘en’ (‘when’), ‘ere’ (‘where’), or ‘ich’ (‘which’). ((This statistical knowledge extends far beyond character sequences: indeed, anyone can easily fill in the blanks in the following quote: “To be or ___ __ __, ____ __ ___ ________”.)) Source entropy yields a simple, practical measurement of this intrinsic predictability, and of the amount of information carried by each character in a language.

Definitions

Given a random variable $X$, Shannon define its entropy as
$$ H(X) = – \sum_{x} \P{X=x} \lg(\P{X=x}) $$

Generalizing, the entropy of $X$ conditioned on $Y$ is defined as the expected entropy of $X$, conditioned on the values of $Y$:
$$
\begin{align*}
H(X|Y) &= \sum_{y} \P{Y = y} H(X|Y=y)\\
&= – \sum_{y, x} \P{Y = y} \P{X=x|Y=y} \lg(\P{X=x|Y=y})\\
&= – \sum_{y, x} \P{X = x, Y = y} \lg(\P{X=x|Y=y})
\end{align*}
$$

Finally, the entropy rate of an information source is the limit of the entropy of the $n+1$^th symbol conditioned on the $n$ previous ones (if that sequence converges):
$$ h(X) = \lim_{n \to \infty} H(X_{n+1} | X_1, \ldots, X_n) $$

Experimental estimation of the entropy rate of English

This section presents an algorithm to estimate the entropy of English text, using a large corpus of texts. As entropy fundamentally deals with streams, the code presented below handles textual samples as streams of tokens (either characters or words, depending on whether we’re estimating the entropy per character of the entropy per word).

Note: this algorithm assumes that the distribution of the Markov process being modelled is stationary; that is, that any infinite stream of English text satisfies the property that
$$ \forall k, \forall n, \P{X_{k+n+1}|X_{k+1}, \ldots, X_{k+n}} = \P{X_{n+1}|X_{1}, \ldots, X_{n}} $$

First, define a utility function that reads the content a file and turns them into a continuous stream of tokens (characters, words, etc.). tokenize takes a path file_path, and a function tokenizer that splits a line of text into a list of tokens.

	def tokenize(file_path, tokenizer):
		with open(file_path, mode="r", encoding="utf-8") as file:
			for line in file:
				for token in tokenizer(line.lower().strip()):
					yield token

Two typical tokenizers are presented below (the words function will be useful in the next section, when we start generating random English text):

	def chars(file_path):
		return tokenize(file_path, lambda s: s + " ")
		
	import re
	def words(file_path):
		return tokenize(file_path, lambda s: re.findall(r"[a-zA-Z']+", s))

Then, define a function to build a statistical model of a stream. Since computing the limit $h(X)$ presented above is not possible in practice, we approximate it by computing the sum for large enough values of $n$. The model_order parameter below denotes $n-1$. This function, called markov_model, returns two objects: stats and model

stats is a Counter that matches each key in model to its total number of occurrences. For example:
```
	stats = {	
		('w', 'h'): 53, 
		('t', 'h'): 67,
		...
	}
```
model is a dictionary mapping $(n-1)$-character prefixes to a Counter; that Counter maps each possible $n$^th character to the number of times this character followed the $(n-1)$-character prefix. For example, model could be
```
	model = {
		('w', 'h'): {'y':25, 'o':12, 'a':16, ...}, 
		('t', 'h'): {'i':15, 'a':18, 'e':34, ...},
		...
	}
```
indicating that ‘why’ appeared 25 times, ‘who’ 12 times, ‘wha’ 16 times, ‘thi’ 15 times, ‘tha’ 18 times, ‘the’ 34 times, and so on.

To construct these objects, iterate over the stream of tokens, keeping the last $n-1$ tokens in a buffer and progressively building stats and model:

	from collections import defaultdict, deque, Counter
	def markov_model(stream, model_order):
		model, stats = defaultdict(Counter), Counter()
		circular_buffer = deque(maxlen = model_order)
		
		for token in stream:
			prefix = tuple(circular_buffer)
			circular_buffer.append(token)
			if len(prefix) == model_order:
				stats[prefix] += 1
				model[prefix][token] += 1
			
		return model, stats

Finally, implement the calculation of the entropy:

	import math
	def entropy(stats, normalization_factor):
		return -sum(proba / normalization_factor * math.log2(proba / normalization_factor) for proba in stats.values())

	def entropy_rate(model, stats):
		return sum(stats[prefix] * entropy(model[prefix], stats[prefix]) for prefix in stats) / sum(stats.values())

And pack everything together using (for a model of order 3)

	model, stats = markov_model(chars("sample.txt"), 3)
	print("Entropy rate:", entropy_rate(model, stats))

…all in less than 40 lines of neat python code!

Experimental results

Running this algorithm on the entire Open American National Corpus (about 95 million characters) yields the following results:

Estimated entropy rates using increasingly complex Markov models. As expected, the estimated entropy rate decreases as the order of the model (i.e. the number of characters being taken into account) increases.

Note that, because the Markov model being constructed is based on a necessarily finite corpus of textual samples, estimating the entropy rate for large values of $n$ will yield artificially low information rates. Indeed, increasing the order of the model will decrease the number of samples per $n$-letters prefix, to the point that for very large values of $n$, knowledge the first $n-1$ letters of a passage uniquely identifies the passage in question, deterministically yielding the $n$^th letter.

In the graph presented above, it is therefore likely that the estimates for $n$ over 5 or 6 are already plagued by a significant amount of sampling error; indeed, for any given $n$ there are in total $27^{n}$ possible $n$-grams consisting only of the 26 alphabetic letters and a space, which exceeds the size of the corpus used in this experiment as soon as $n > 5$.

Extrapolating these results to large values of $n$ is particularly tricky, as in the general case nothing is known about the shape of this sequence of values except for the fact that it’s positive decreasing: deriving the limit entropy rate from this set of measurements requires building a model of the consecutive estimates, and fitting this model to the measured estimates.

As a rough example, call this sequence of values $F_k$ and assume that it verifies the recurrence equation $F_{k+1} – F_k = \alpha (F_{n} – F_{n-1})$. Then the $\alpha$ that yields the best approximation (taking the two initial values for granted since they are less likely to suffer from sampling errors) is $\alpha \approx 0.68$ ($\mathcal{L}^2$ error: $6.7 · 10^{-3}$), and the corresponding entropy rate is $h \approx 1.14$ bits/character.

Extrapolated entropy rate values for $\alpha \approx 0.68$. In this heuristic model, the limit entropy rate is $h \approx 1.14$ bits/character.

Generating fake English text

The statistical model of English built in the previous section can be used to generate fake English text : to this end, we introduce a pick function that, given a Counter (that is, a mapping from elements to numbers of occurrences), picks an element at random, respecting the distribution described by that Counter object.

	import random
	def pick(counter):
		sample, accumulator = None, 0
		for key, count in counter.items():
			accumulator += count
			if random.randint(0, accumulator - 1) < count:
				sample = key
		return sample

A detailed explanation of this function will be the subject of a later post: efficiently sampling from an unordered, non-uniformly distributed sequence without pre-processing it is a non-trivial problem.

Given a length-$n$ prefix, the generation algorithm obtains a new token by sampling from the distribution of tokens directly following this prefix. The token thus obtained is then appended to the prefix and the first token of the prefix is dropped, thus yielding a new prefix. The procedure is then repeated as many times as the number of characters that we want to obtain.

The following snippet implements this logic: model is a model obtained using the previously listed functions, state is the initial prefix to use, and length is the number of tokens that we want to generate.

	def generate(model, state, length):
		for token_id in range(0, length):
			yield state[0]
			state = state[1:] + (pick(model[state]), )

All that remains to be done is to pick a random initial prefix, using pick once again (the fill function wraps text at 80 columns to make the display prettier):

	prefix = pick(stats)
	gibberish = "".join(generate(model, prefix, 100)
	
	import textwrap 
	print(textwrap.fill(gibberish))

Results: a collection of random Shakespeare samples

Feeding the whole of Shakespeare’s works into the model yields the following samples. Note how increasing the order of the model yields increasingly realistic results.

$n=2$: second-order approximation

me frevall mon; gown good why min to but lone’s and the i wit, gresh, a wo. blace. belf, to thouch theen. anow mass comst shou in moster. she so, hemy gat it, hat our onsch the have puck.

$n=3$: third-order approximation

glouces: let is thingland to come, lord him, for secress windeepskirrah, and if him why does, we think, this bottle be a croyal false we’ll his bein your be of thou in for sleepdamned who cond!

$n=5$: fifth-order approximation

a gentlewomen, to violence, i know no matter extinct. before moment in pauncheon my mouth to put you in this image, my little join’d, as weak them still the gods! she hath been katharine; and sea and from argies, in the sun to be a strawberries ‘god servant here is ceremonious and heard.

$n=8$: eigth-order approximation

gloucester, what he was like thee by my lord, we were cast away, away! mark that seeth not a jar o’ the tiger; take it on him as i have, and without control the whole inherit of this.

Words as tokens

The code above defines tokens as characters, and builds a model of sequences of $n$ consecutive characters. Yet, to generate more realistic-looking text, we can also split the training corpus by words, using

	model, stats = markov_model(words('sample.txt', encoding='utf-8'), 3)
	print(textwrap.fill(" ".join(generate(model, pick(stats), 100))))

Here’s an example of a statistically sound sequence of English words, generated using a Markov model of order 3 of Shakespeare’s writings:

Third-order word approximation

he rouseth up himself and makes a pause while she the picture of an angry chafing boar under whose sharp fangs on his back doth lie an image like thyself all stain’d with gore whose blood upon the fresh flowers being shed doth make them droop with grief and extreme age.

Conclusion

Shannon’s 1948 article laid the foundations of a whole new science, information theory. The original paper is an easy yet extremely interesting read; it extends the notions presented here to derive theorems that link source entropy to the maximal possible compression of a source, and goes on to describe a mathematical framework for error correction.

Shannon himself also worked on the estimation of the information rate of English. However, since large-scale statistical experiments were impractical then, Shannon instead constructed an experiment involving humans: subjects were asked to guess the letters of a text one by one, and incorrect guesses per character were recorded to estimate the source entropy of sentences; in this experimental setting Shannon obtained a an estimate of the information rate of English between 0.6 and 1.3 bits per letter, in line with the results presented in this article.

Want to experiment by yourself? Download the full source code (also available for python 2.7)!

Did you find the initial quote hard to decipher? Does that pseudo-Shakespeare text look realistic? Do you know other beautiful applications of information theory? Let us know in the comments!

Linear time probabilistic pattern matching and the Rabin-Karp algorithm

Clément — Sat, 30 Nov 2013 21:41:14 +0000

Most linear-time string searching algorithms are tricky to implement, and require heavy preprocessing of the pattern before running the search. This article presents the Rabin-Karp algorithm, a simple probabilistic string searching algorithm based on hashing and polynomial equality testing, along with a Python implementation. A streaming variant of the algorithm and a generalization to searching for multiple patterns in one pass over the input are also described, and performance aspects are discussed.

The algorithm is probabilistic in that it doesn’t always return correct results; more precisely, it returns all valid matches and (with reasonably small probability) a few incorrect matches (algorithms such as this one that tend to be over-optimistic in reporting their results are usually said to be true-biased).

Algorithm

Given a string s of length $n$, and a pattern of length $k$, the algorithm computes rolling checksums for both the pattern (“needle”) and consecutive substrings of length $k$ of the searched string (“haystack”). Every time a substring’s checksum equals that of the pattern, a match is reported.

matches = {}

pattern_checksum = checksum(pattern)
substring_checksum = checksum(s[0 : k[)

for position from 0 to n - k - 1 do
	update substring_checksum to checksum(s[position : position + k[)
	if substring_checksum = pattern_checksum then
		add position to matches

return matches

Checksum functions

Formally, checksum functions map fixed-length strings to numbers, with good checksum functions satisfying two properties: first, collisions (two different strings being mapped to the same checksum) must be rare — indeed, the higher the collision rate, the more likely incorrect matches are to turn up. Second, checksums must be computed in a way that allows for efficient updating: that is, the value of the checksum of s[i+1:i+k+1[ should be quickly derivable from that of s[i:i+k[.

Polynomial hash functions

One easily implemented such hash function is based on polynomials. Given a list of $k$ numbers $a_0, \ldots, a_{k-1}$, we define a polynomial ((Using $\sum_{j < k} a_j X^{(k-1)-j}$ instead of $\sum_{j < k} a_j X^j$ yields a cleaner rolling checksum update formula in the next section.))
$$ H(a)(X) = a_0 X^{k-1} + a_1 X^{k-2} + \ldots + a_{k-2} X + a_{k-1} = \sum_{j < k} a_j X^{(k-1)-j} $$
Assuming that each character of a string $s$ can be matched to numbers, we then define the checksum of a substring s[i:i+k[ of length $k$ of $s$ as $H(s[i:i+k[)(X) = \sum_{j < k} s_{i + j} X^{(k-1)-j}$. ((The weakness of this approach is that it may require a normalization pass to successfully process some strings using extended character sets; indeed, some Unicode characters can be represented in multiple ways: the capital A with diaeresis character Ä, for example, can be represented either as U+00C4 (Ä), or U+0041 (A) + U+0308 (¨) yielding Ä. Similarly, Unicode includes support for ligatures — that is one can encode the pair “fi” as either U+0066 U+0069 “fi” or the ligature U+FB01 “ﬁ”. MSDN and Wikipedia have more details on this.))

Polynomial identity testing

As it stands, the mapping from s[i:i+k[ to $H(s_{[i:i+k[})$ is bijective, and the problem of checking whether s[i:i+k[ matches a pattern p can thus be reformulated as checking whether $H(s_{[i:i+k[})(X) = H(p)(X)$. This problem is usually called polynomial identity testing, and can be deterministically solved in time $O(k)$ by comparing polynomials coefficient-wise.

Allowing for infrequent errors, however, takes the complexity of this process down to probabilistic $O(1)$: instead of comparing two polynomials $P$ and $Q$, one can pick a random value $a$ and compare the values of $P(a)$ and $Q(a)$, accepting if and only if $P(a) = Q(a)$ (since values of $P(a)$ and $Q(a)$ may be too large to fit in standard-sized integers, computations are generally made modulo a certain large prime modulus $m$). The reasoning is that if indeed $P=Q$ then the test will always accept $P$ and $Q$ as equal, while if $P \not= Q$ then the probability of hitting a bad witness and incorrectly accepting $P$ and $Q$ as equal is low.

More precisely, if $P$ and $Q$ are polynomials of degree $≤ k$, and $m$ is a prime number larger than the largest coefficient of $P – Q$, then the probability of hitting a root of $P-Q$ is lower than $k/m$ if $P$ and $Q$ differ ((Given a prime number $m$, $\mathbb{Z}/(m\mathbb{Z})$ is a finite field and a $k$-polynomial whose coefficients fall in the range $0 \ldots m-1$ has at most $k$ roots in $\mathbb{Z}/(m\mathbb{Z})$.)).

Picking a large enough random prime modulus $m$ and a random seed $a \in \{ 0 \ldots m – 1 \} $ therefore allows for the safe definition of $$h_i = H(s_{[i:i+k[})(a) \mod m = \sum_{0 ≤ j < k} s_{i + j} a^{(k-1)-j} \mod m$$ as the checksum of s[i:i+k[.

Rolling checksum update

This definition allows for the checksum $h_{i+1}$ of s[i+1:i+k+1[ to be derived from that of s[i:i+k[ ($h_i$) in constant time (provided the value of $a^k$ is pre-computed once and for all before running the algorithm), using the formula $h_{i+1} = a · h_i – s_i a^k + s_{i+k}$. Indeed,
$$
\begin{align}
h_i = H(s_{[i+1:i+k+1[})(a) &= \sum_{0 ≤ j < k} s_{i + 1 + j} a^{(k-1)-j}\\
&= \sum_{1 ≤ j < k + 1} s_{i + j} a^{(k-1)-(j-1)}\\
&= \sum_{0 ≤ j < k} s_{i + j} a^{(k-1)-j+1} – s_i a^k + s_{i+k}\\
&= a · H(s_{[i:i+k[})(a) – s_i a^k + s_{i+k}\\
&= a · h_i – s_i a^k + s_{i+k}
\end{align}
$$

Python implementation

import random
# See the implementation notes below for details. "(→ n)" indicates a footnote. 

def checksum(string, seed, modulus, length):
	"""Computes the checksum of string[0 : length+1]"""
	checksum = 0
	for pos in range(0, length):
		checksum = (checksum * seed + ord(string[pos])) % modulus
	return checksum

def string_search(haystack, needle):
	"""Yields the list of positions p such that 
	   haystack[p : p + len(needle) - 1] == needle"""
	
	LARGEST_CODEPOINT = 0x10FFFF # (→ 1)
	
	# Sanity check
	
	if len(haystack) < len(needle): 
		return
	
	# Initialization
	
	needle_length = len(needle)
	haystack_length = len(haystack)
	modulus = 0xB504F32D # (→ 2)
	
	seed = random.randint(0, modulus - 1)
	seed_k = pow(seed, needle_length, modulus) # Pre-compute a^k
	
	needle_checksum = checksum(needle,   seed, modulus, needle_length)
	substr_checksum = checksum(haystack, seed, modulus, needle_length) # (→ 3)
	
	# String processing
	
	for pos in range(0, haystack_length - needle_length + 1):
		if needle_checksum == substr_checksum: # (→ 4)
			yield pos
			
		if pos < haystack_length - needle_length: # (→ 5)
			substr_checksum = (seed * substr_checksum
			                   - ord(haystack[pos]) * seed_k 
			                   + ord(haystack[pos + needle_length])) % modulus

Implementation notes

LARGEST_CODEPOINT = 0x10FFFF

Characters are mapped to polynomial coefficients using the python ord function, which returns the Unicode code point of any character. The largest such code point is 0x10FFFF.

modulus = 0xB504F32D

Checksum polynomials coefficients are smaller than LARGEST_CODEPOINT, and as the modulus must exceed all checksum polynomial coefficients it must exceed LARGEST_CODEPOINT ((Using a modulus smaller than LARGEST_CODEPOINT+1 would introduce spurious polynomial equalities; for example, $X^3 + 3X^2 + 1 \not= X^3 + 1 \mod 7$, but $X^3 + 3X^2 + 1 = X^3 + 1 \mod 3$.)). Further requiring that the modulus be as large as possible helps reduce the probability of checksum collisions, and hence of spurious matches. Finally, enforcing that it be prime bounds the number of integer roots of degree-k polynomials. 0xB504F32D was chosen as it is the largest value satisfying these requirements whose square fits in a 64-bit signed integer. ((This definition definition could be rewritten as follows, assuming the availability of a the availability of a random_prime(n) function to sample a random integer from the range $n, \ldots, 2n – 1$: modulus = random_prime( max(2**30.5, LARGEST_CODEPOINT + 1) )))

substr_checksum = checksum(haystack, seed, modulus, needle_length)

Before starting the rolling computation, substr_checksum is initialized to the checksum of $s[0:k[$.

if needle_checksum == substr_checksum:

To prevent checksum collisions from yielding incorrect matches, one could explicitly verify matches ((If such a verification is implemented, then this algorithm is exactly the Rabin-Karp string searching algorithm)) by defining

		def verify(haystack, pos, needle):
			return all(haystack[pos+k] == needle[k] for k in range(0, len(needle)))

and replacing the checksum test with

if needle_checksum == substr_checksum and verify(haystack, needle, pos):

This slows down the algorithm a bit, however, taking its worst-case performance to $O(nk)$ ((This worst-case bound is attained when both the haystack and the needle are composed of only one character, e.g. when searching for $aaa$ in $aaa \ldots aaa$)) — in general, however, the performance remains good as long as the pattern is not redundant. ((One could furthermore argue that the cost of this additional post-checking step is not significant, as string searching programs like grep generally display matches anyway, taking linear time in the length of the pattern for each match. Since verifications only occur on matches detected by checksum comparison, all that remains to be checked is that spurious matches are rare enough to not adversely affect execution time.))

if pos < haystack_length – needle_length:

This check prevents an out-of-bounds access after the last comparison.

32- vs 64-bits performance

The sample code above manipulates relatively large numbers (close to $2^{31.5}$), with intermediate calculation results reaching close to $2^{63}$. Such calculations are fast on 64 bit machines, but will trigger promotion to Python’s arbitrary-size integers on 32-bit machines, thus causing the code to perform more poorly. Redefining modulus as 0xB501 would do, at the cost of increasing the rate of erroneous matches.

Incorrect matches: theoretical bounds and experimental results.

As previously noted, the probability of a spurious match turning up at position $i \in \{0, \ldots, n-1\}$ is lower than $k/m$ ($n$ denotes the length of the haystack, $k$ that of the needle, and $m$ is the modulus). The probability of at least one incorrect match turning up at any position in the string, however, is higher — indeed, in the general case, it can only be shown to be lower than $n$ times probability of an error popping up at an arbitrary given position ; that is, $n · k/m$. An ideal implementation would choose $m$ to be as large as $n · k$ times a certain large constant $C$, thus taking the probability of one or more incorrect matches popping up down to $1/C$.

Real-life implementations, however, are constrained by the size of machine integers (avoiding automatic promotion to Python’s arbitrary-sized integers is crucial to ensuring acceptable performance). The implementation above therefore chooses the largest safe modulus value on a 64-bits architecture, 0xB504F32D ((Python standard integers can go up to $2^{63} – 1$ before being promoted to arbitrary-precision integers)). This takes the probability of obtaining one or more incorrect matches down to about $ n · k / 2^{32} $, which is below $0.5\%$ for typical string searching problems ((For example, searching for a 10-character string like “Petersburg” in the plain text version of Dostoyevsky’s Crime and punishment (~ 1.1 million characters long).)).

The culprit here is the checksum update calculation, and in particular the seed * substr_checksum multiplication, which yields values as big as the square of the modulus. Unfortunately there is no way (that I am aware of) to perform the multiplication and integer division that does not resort to large integers if the intermediary multiplication result exceeds the size of machine integers. And it gets even worse on 32-bit machines, where the modulus must be chosen smaller than $2^{15} – 1$.

Low-level optimizations, however, allow for the use of larger moduli:

In C, one bit can be saved by using unsigned integers, pushing the modulus up to 0xFFFFFFFB. In that case, special care must be taken in the checksum update code, so that the subtraction does not cause wrapping. The naïve implementation, shown below, performs incorrectly:
```
 substr_checksum = (seed * substr_checksum - outgoing * seed_k + incoming) % modulus 
```
Specifically, if the result of the seed * substr_checksum - outgoing * seed_k is below 0, then the value is wrapped modulo (1 << 64) - 1 ((Wrapping behaviour is defined in sub-clause 6.2.5 paragraph 9 of the C Standard: “A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type”.)). Instead, the correct calculation is as show below:
```
	uint64_t mul = (seed * substr_checksum) % modulus;
	uint64_t sub = ( (modulus - outgoing) * seed_k ) % modulus;
	substr_checksum = (mul + sub + incoming) % modulus;
```
In 64-bits assembler, using the unsigned multiplication MUL instruction to calculate the product above yields a 128-bits result stored in the EDX:EAX registers, whose remainder in the division by a 64-bits modulus can be calculated using the DIV instruction (MOD on HLA).

More importantly, this $n · k/m$ bound on the error rate is only theoretical, and the effective error rate on real-life texts will often be much lower. Experimental results are hard to obtain given the time it takes to search through large files, but searching 200 times for a 5000-characters random string in Project Gutenberg’s pg2554 (Crime and punishment), didn’t return a single false positive, though the theoretical bound that can be derived from the formula above regarding the probability of getting at least one erroneous match in such an experiment was above one.

In a second experiment, I searched for the string TATAAA (TATA box consensus sequence, length 6) in the genome of a female platypus (length 1 684 663 807). Again, no incorrect matches were found.

Streaming

One nice property of the algorithm above is that it does not require that the entire string be loaded into memory; instead, it just requires a buffer whose length equals that of the pattern being searches for. This makes it especially suitable for looking for relatively small patterns in huge strings, like gene patterns in DNA strings. In this particular case verifying each match is especially advisable, given that the input stream is discarded after being processed, making it impossible to verify matches after the algorithm has finished running. Plus, the enormous length of the input does increase the probability of an incorrect match popping up at some point.

The modifications required to allow for streaming are presented below:

import random, itertools

def verify(match, needle):
	return all(match[k] == needle[k] for k in range(0, len(needle)))

def checksum(input, seed, modulus):
	checksum = 0
	for byte in input: # (→ 1)
		checksum = (checksum * seed + byte) % modulus
	return checksum

def stream_search(stream, needle):
	"""Yields the list of positions p such that 
	   stream[p : p + len(needle) - 1] == needle"""
	
	# Initialization
	
	needle_length = len(needle)
	modulus = 0xB504F32D
	
	seed = random.randint(0, modulus - 1)
	seed_k = pow(seed, needle_length, modulus) # Pre-compute a^k
	
	prefix = list(itertools.islice(stream, needle_length - 1)) # (→ 2)
	needle_checksum = checksum(needle, seed, modulus, needle_length)
	substr_checksum = checksum(prefix, seed, modulus, needle_length - 1)
	
	# Stream processing
	
	pos = 0
	substr = collections.deque(prefix) # (→ 3)
	
	for incoming in stream:
		substr.append(added)
		outgoing = substr.popleft() if pos > 0 else 0 # (→ 4)
		substr_checksum = (seed * substr_checksum 
		                   - outgoing * seed_k 
						   + incoming) % modulus
		
		if needle_checksum == substr_checksum and verify(substr, needle):
			yield pos
		
		pos += 1

Streaming implementation notes

for byte in input:: We now process bytes objects instead of string objects, as we are mostly going to process huge byte streams loaded from files or from a network.
prefix = list(itertools.islice(stream, needle_length – 1)): The itertools.islice(stream, n) function consumes n values from a stream. We only take the needle_length - 1 first ones here, and the first round of update will complete the checksum calculation.
substr = collections.deque(prefix): Values corresponding to the substring whose checksum is being calculated are stored in a sliding window implemented as a double-ended queue. This window is required as the first and last values that it contains are needed to update the rolling checksum.
outgoing = substr.popleft() if pos > 0 else 0: This conditional check prevents popleft from being called on an incomplete window in the first rolling checksum update step. This is needed because on the first iteration of the loop the checksum is incomplete, and the sliding window is not full yet.

Performance

As it stands, the algorithm presented above performs rather poorly for two reasons:

Implementation weaknesses: Using iterators to loop over the input string introduces a heavy overhead in Python
Algorithm weaknesses: Extracting a modulus is a slow operation. Plus, the algorithm used processes all characters; CPython’s default implementation, in contrast, uses a deterministic algorithm based on two popular string searching algorithms, Boyer-Moore and Horsepool, that skips parts of the input.

Though it performs acceptably, the python implementation of this algorithm is therefore no match for the highly optimized built-in string.find CPython function. The first culprit pointed out above can easily be fixed by moving to another, more low-level language ((Indeed, a C implementation of this simple algorithm runs only about 20 times slower than Python’s heavily-optimized routines)); the second, which is directly related to the algorithm being used, is much harder to mitigate.

Using a non-prime modulus is sometimes advocated; in that case, the modulus most often chosen is generally $2^{64}$ (or whatever the size of machine integers is (+1), the idea being that C wrapping semantics guarantee that the modulus operation will be performed almost for free by the hardware; in that case the modulus operations can be removed altogether, but care must be taken to ensure a proper choice of seed, so as to minimize the probability of checksum collisions. This yields an experimental speedup of about 4 times in C; the resulting code is in the order of 5 to 10 times slower than the optimized CPython routine, but it is deterministic, and hence vulnerable to collision-based attacks.

Multi-pattern search

Where Rabin-Karp does shine is when searching for multiple, same-length patterns. With trivial modifications, the algorithm above can indeed be extended to match the current substring’s checksum against a list of pre-computed pattern checksums. Provided this lookup operation is implemented efficiently enough, Rabin-Karp becomes an efficient alternative to repeatedly running a more efficient single-pattern search algorithm.

The modified pseudo-code is presented below:

matches = {}

patterns_checksums = [checksum(pattern) for pattern in patterns]
substring_checksum = checksum(s[0 : k[)

for position from 0 to n - k - 1 do
	update substring_checksum
	if substring_checksum belongs to pattern_checksum then # (→ 1)
		add position to matches

return matches

All that remains is to find a proper implementation of the belongs to (→ 1) operation. Hash-tables, though they might seem appropriate, require a large amount of memory to not introduce extra collisions ((The streaming implementation presented above only requires keeping the last $k$ tokens in memory, and is hence rather memory-efficient)), and they introduce a slight performance overhead. Instead, sorting patterns checksums and binary-searching the resulting array does not require any additional memory, and introduces very low overhead for reasonable patterns count.

The following graph demonstrates the effects of using this modified algorithm to search for large numbers of patterns. Four implementations are compared:

Naive python, C implementations:

	for pattern in patterns:
		if haystack.find(pattern) > 0:
			return True

	for (int pattern_id = 0; pattern_id < patterns_count; pattern++)
		if (strstr(haystack, patterns[pattern_id]))
			return true;

Execution time grows linearly with the number of patterns. Useless for more than 50 patterns. Requires multiple passes on the input.

Regular-expressions–based python implementation:
```
	regexp = "|".join(patterns)
	if re.search(regexp, haystack):
		return True
```
Execution time also grows linearly with the number of patterns, but the growth is much slower, making this solution practical for patterns counts under 5000. ((Ideally, though, the regular expression being generated would be transformed into a deterministic automaton whose performance would only marginally depend on the number of patterns (incidentally, this is the basis of the Aho-Corasick algorithm). Unfortunately, Python doesn’t do that by default.)) Requires only one pass on the data. Requires sanitization of the input patterns for the regular expression to compile properly.
C++ multi-patterns Rabin-Karp implementations:
Based on the description above. Execution time grows logarithmically in the number of patterns. Much slower than other techniques for small patterns count, but reasonable for patterns count above 50, and optimal above ~2000/5000 depending on implementation-dependent details. Requires only one pass. Using a hardware-based modulus (wrapping at $2^{64}$) does yield a significant performance improvement.

The following graph shows the respective performance of these implementations.

Multi-pattern search performance. While naive implementations quickly become impractical, the regex-based implementation yields pretty good performance. It is eventually superseded, however, by Rabin-Karp–based implementations, notably by the hardware modulus implementation. (Experimental settings: searching for random strings of length 11 in pg2554.)

Conclusion & further reading

Although its performance if only middling on single-pattern string searching problems, the randomized algorithm presented above shines by its conceptual simplicity, its ease of implementation, and its performance on multi-string searching problems. One of its most successful applications is the implementation of similarity-detection tools, which search for a large number of known sentences in newly submitted texts (articles, exams, …).

The original paper by Rabin & Karp (1986) is a fascinating read; it includes much more details on the algorithm’s expected performance in various settings, and generalizations to a broader class of problems. Further results on the average-case complexity of this algorithm were presented by Gonnet & Baeza-Yates in 1990.

Another approach to the multi-string matching problem, that relies on the remarks made above on converting regular expressions to finite automaton, was presented by Aho & Corasick in 1975. It basically preprocesses the patterns input set to construct a finite deterministic automaton, thus achieving one-pass, linear-time matching of all patterns in the set. Its implementation, however, is much trickier than that of Rabin-Karp.

Have you ever had to use string searching functions other than that provided by the standard library of you programming language? Which algorithms did you use?

Six months of Windows Phone Development — Tips, tricks, and performance considerations

Clément — Sat, 09 Mar 2013 00:35:10 +0000

TL;DR: I’ve taken plenty of notes while developing my YiXue Chinese dictionary for Windows Phone. Topics covered in this article include performance tips, best practices, and plenty of code snippets for Windows Phone.

Introduction

This article presents notes and remarks that I gathered while working on a Chinese Dictionary App for Windows Phone, YiXue Chinese Dictionary: mistakes I made, fun tips I wrote down, and so on.
I initially didn’t really intend to create a full blog post out of these notes, but their increasing number, and my app recently placing second in Microsoft France’s App Awards contest, gave me enough motivation to share them with the community. Along with various tips and tricks and answers to often-asked (but seldom answered) questions, I will discuss a number of performance improvements that specifically apply to Windows Phone apps.

You can skip the DOs and DON’Ts and jump directly to the tips and tricks section, or even straight to the performance tips section, if you feel really confident with Windows Phone development.

App development tips: DOs and DON’Ts

DO keep the size of your XAP package (reasonably) small

XAP files are just ZIP files with a fancy file extension.Shown above, the contents of the YiXue Dictionary XAP package.

As of March 2013, the limit for users to be able to download your app over the air is 20MB. Above this limit, users will have to switch to connect to a WiFi network before they can download your app. Below 20MB, you’re safe.
That said, keeping an eye on your XAP size is definitely a good idea. PC World ran two studies in 2012, concluding that the average 3G download speed in the United States is about 2.5mbps. That’s about 3.2 seconds per MB in your app’s XAP — about 1 minute for a 20MB app download over 3G. In areas with less good coverage, this will probably climb up to 2 or 3 minutes (assuming 1mbps).

DO use the Silverlight Toolkit for Windows Phone

The Silverlight toolkit for Windows Phone is a library distributed by Microsoft, which includes a number of controls which didn’t make it into the official SDK. It’s frequently updated, it contains a number of gems that can widely improve your app, and you can download it using nuget straight from Visual studio (more information here).

Though I was initially reluctant to increase my download size by bundling a 500kB dll with my app, I quickly realized it can dramatically improve the user experience. I’ll focus on just two things here: The PerformanceProgressBar, and page transitions (the Windows Phone Geek website has covered in great details pretty much every control in the toolkit).

The `PerformanceProgressBar`

It’s been said pretty much everywhere, but using the default Windows Phone ProgressBar in Indeterminate mode is a bad idea. The animation is based on custom-themed sliders, which eat up a lot of processing power to display the animation (more on that below). The PerformanceProgressBar included in the Silverlight Toolkit fixes that.

Load the toolkit by adding the following to the phone:PhoneApplicationPage tag:

xmlns:tk="clr-namespace:Microsoft.Phone.Controls;assembly=Microsoft.Phone.Controls.Toolkit"

and create a progress bar using the following:

Page transitions (`toolkit:TransitionService.NavigationInTransition`)

It’s pretty standard in all system apps on Windows Phone; when you tap a button or a link, the current page will flip out, and a new page will flip in.

Apart from the eye candy, the main advantage of this animation is that it vastly improves the perceived speed of your app, by virtually supressing the delay that separates user actions (tap a button, click a link, pick a ListBox item) and the corresponding reaction; the animation starts immediately, and the constructor and OnNavigatedTo method of your new page are called while the animation runs.

This way, your users have direct feedback that their tap was taken into account. I initially implemented that in YiXue because the pretty complex layout of the Dictionary page would take a split second to render, and users would tap two or three times on the “Dictionary” button, thinking the tap hadn’t been taken into account.

To improve the visual consistency of your app, I recommend using the animation on all your pages. One neat thing is that this animation can be bundled in a style, and reused on all pages without repeating the same code on each. That’s something I didn’t know at first, and that was another reason I was reluctant to generalize the use of that animation to all my pages. But it’s actually rather simple:

First declare the toolkit namespace in your App.xaml using the following line :

xmlns:toolkit="clr-namespace:Microsoft.Phone.Controls;assembly=Microsoft.Phone.Controls.Toolkit"

Then declare the following style in your App.xaml file:

Finally, reuse your new style on all pages like this:

DON’T show a minimized application bar on a page with TextBoxes

Trying to hit the “change language” button (繁) of the keyboard when the application bar is minimized generally results in a lot of frustration.

That’s a usability tip which plenty of apps on the market place fail to follow, especially translation / dictionary apps. If have TextBoxes on a page showing an ApplicationBar, make sure it is not minimized. Otherwise, your international users will have the frustration of hitting the application bar instead of the “change language” button on the keyboard, thereby expanding the application bar, above the keybord. See the screenshot if you’re not sure what I mean by that.

The solution is pretty simple : either hide the application bar when your TextBox gets the focus, or just set the ApplicationBar‘s Mode to Default instead of Minimized.

DO:

DON’T:

DON’T use the `GestureService` from the Silverlight toolkit

However awesome it might be, the Silverlight toolkit still has a few rough edges. In particular, it doesn’t unregister it’s listener when you use a GestureService (GestureService allows you to catch touch events like Flick, Slide, and so on). This will bite you if you use an InkSurface for drawing at some point, because a GestureService loaded on another page will keep firing touch events although the page it initially came from should have been disposed. Using the GestureService, furthermore, will prevent your pages from being properly disposed.
Fortunately, there’s a pretty simple workaround: use the XNA TouchPanel instead.

First, enable the listener (can be done in yoru app initialization code):

	Microsoft.Xna.Framework.Input.Touch.TouchPanel.EnabledGestures = 
		Microsoft.Xna.Framework.Input.Touch.GestureType.Flick;

Then, register the ManipulationCompleted event on a control:

	private void Control_ManipulationCompleted(object sender, ManipulationCompletedEventArgs e) {
		GestureSample sample = TouchPanel.ReadGesture();
		int delta_x = gesture.Delta.X, delta_y = gesture.Delta.Y;
		
		if (gesture.GestureType == GestureType.Flick 
				&& Math.Abs(delta_x) > 1.5 * Math.Abs(delta_y))
					//User flicked horizontally (delta_x < 0: flicked left; otherwise right)
	}

That works well, but not perfectly. Can you guess why? — The catch is that the TouchPanel will record all gestures made on the page; even if only part of your page is touch-enabled using ManipulationCompleted, the call to TouchPanel.ReadGesture will also report gestures made outside of that area.

The problem, luckily, is easily fixed by discarding all gestures but the last when ManipulationCompleted fires; add the following snippet to the previous function:

	GestureSample? sample = null;
	while (TouchPanel.IsGestureAvailable) //Discard all gestures but the last
		sample = TouchPanel.ReadGesture();
	
	if (sample.HasValue)
		if (gesture.GestureType == GestureType.Flick 
				&& Math.Abs(delta_x) > 1.5 * Math.Abs(delta_y))
					//User flicked horizontally (delta_x < 0: flicked left; otherwise right)

Cool development tips and code snippets

Windows Phone users can choose between light and dark themes, which at the core are just white text on a dark background, or black text on a light background.

Override system-default colors and brushes if your app uses a custom background

If your app includes a background image (see my tip on that below), you’ll probably want to force one of the color themes (light / dark). If you do that I’d advise going for a rather dark background, since that will use much less battery ; in that case you will want to enfore the dark theme in your app. I did that in YiXue dictionary, including a piece of famous Chinese calligraphy in the background, and users told me they really liked it.

First, note that you can’t directly replace an existing value in the App.Current.Resources dictionary. Removing the key and adding it again, however, does work (took me a while to find out):

	private static void SetAppResource(object key, object value) {
		//Cannot replace an existing value:
		//using 'App.Current.Resources[foo] = bar' won't work.
		
		if (App.Current.Resources.Contains(key)) 
			App.Current.Resources.Remove(key);

		App.Current.Resources.Add(key, value);
	}

You can also override properties of already existing objects, like system brushes. For your convenience, here is a quick snippet setting most of the common color settings to their default Light Theme values.

//Code from http://pit-claudel.fr/clement/blog/ ; attribution much appreciated!
(App.Current.Resources["PhoneForegroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneBackgroundBrush"] as SolidColorBrush).Color = Colors.Black;

(App.Current.Resources["PhoneDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 255, 255, 255);
(App.Current.Resources["PhoneInactiveBrush"] as SolidColorBrush).Color = Color.FromArgb(51, 255, 255, 255);

(App.Current.Resources["PhoneContrastBackgroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneContrastForegroundBrush"] as SolidColorBrush).Color = Colors.Black;

(App.Current.Resources["PhoneTextBoxBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneTextBoxForegroundBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneTextBoxSelectionForegroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxEditBackgroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxEditBorderBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxReadOnlyBrush"] as SolidColorBrush).Color = Color.FromArgb(119, 0, 0, 0);

(App.Current.Resources["PhoneRadioCheckBoxBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneRadioCheckBoxCheckBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneRadioCheckBoxCheckDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 0, 0, 0);
(App.Current.Resources["PhoneRadioCheckBoxDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 255, 255, 255);
(App.Current.Resources["PhoneRadioCheckBoxPressedBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneRadioCheckBoxPressedBorderBrush"] as SolidColorBrush).Color = Colors.White;

(App.Current.Resources["PhoneBorderBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneSubtleBrush"] as SolidColorBrush).Color = Color.FromArgb(153, 255, 255, 255);

As a side note, remember that you can’t set the application bar’s color to pure white. Just set it to something like #EFEFEF instead, and it will look just as if it was all white.

DO:

shell:SystemTray.ForegroundColor="#FEFEFE"

DON’T:

shell:SystemTray.ForegroundColor="#FFFFFF"

Let users select and copy text

A TextBox styled like a TextBlock, with some text highlighted

Since text can’t be directly copied from TextBlocks, most websites advise that you use a ReadOnly TextBox instead, styled to look just like a TextBox. That’s pretty sweet, but it fails to mention that users need to give input focus to a TextBox before they can select text. Which means that they’ll have to tap on your modified TextBox twice: once to give it focus (but the caret won’t appear, since the TextBox is read-only), and a second time to actually select text. As a user, this drives me nut.

Luckily there’s a pretty simple work-around: if it’s a very small block of text, intercept the Tap event, and select all text immediately. Otherwise, use the GotFocus event. Users can then refine their selection if they want, but they can also copy all contents straight away if they want to.

	private void CopyableTextArea_Tap(object sender, GestureEventArgs e) {
		(sender as TextBox).SelectAll();
	}

Share a background on all your pages

Pages sharing a common background, with the Background property of their LayoutRoot set to Transparent.

If you want to set an app-level background image (perhaps the same as your splash screen, although I’d advise using something a bit darker to improve the legibility of text over your background), manually setting the background image on all your pages is not the best, especially if you use page-turning animations.
Indeed, if you set the background on a per-page level, you’ll quickly notice that the background flips with the rest of the page content when you change pages. It would be much better if it stayed in place, and only page contents flipped out when changing pages. It’s actually pretty easy to do if you know the trick (that, too, took me a while to find):

In App.xaml:

In App.xaml.cs

(App.Current as App).RootFrame.Background = App.Current.Resources["MainBackground"] as Brush;

Now the catch is, if some of your pages show an application bar, and other don’t, your background will be resized to fit only the available space: the application bar will force it to shrink — that’s pretty ugly. Instead, what I do is add a 72 pixels high row to the LayoutRoot grid, and set the ApplicationBar’s transparency to .99 (choosing a value other than 1 draws the app bar above your page instead of shrinking it). That way, the Application bar is drawn above your page, and nothing is drawn below it because there’s an empty row of just the right height below:

DO:
In your Grid‘s RowDefinitions:

Further below:

	
		
			(...)

Performance tips

Load your data files faster

Abandonning your ProgressBars in favour of text-only loading messages can yield substantial performance improvements.

Really that’s pretty universal advice. If you never tried, you’ll probably be surprised how much performance you can gain by simply disabling your “loading” progressbar and replacing it with a Loading... message, possible including a percentage at the end, as in Loading... 10%. In Yixue, where the initial loading of the dictionary has been an area that I’ve worked a lot on, disabling the Loading... progressbar has reduced the overall loading time from 3.2 seconds to 2.7s. That’s a sizable 0.5 seconds, or 15% percent performance improvement :)

Another common mistake when displaying progress is to use a timer running in the background and firing every x milliseconds to update your progressbar. I did this in YiXue, because the animation of the progress bar that I used back then would be smoother that way. The loading time was 3.8 second then, and the UI timer fired every 20ms. Increasing this delay to 100ms took the loading time down to 3.5 seconds. Removing the UI timer altogether and firing a ReportProgress event after each significant loading step took that down to 3.2 seconds… and getting rid of the progress bar I went down to 2.7s.

Load your pages faster

XAML complexity is, in my experience, the main cause for pages loading slowly. Complex bindings (or even simple ones, really, if you’re populating a list) are also very time-consuming. If you’re after the last bit of performance, consider populating your list manually, possibly be removing your listbox alltogether, and using a vertical StackPanel to which you directly add your items, in code-behind. I know that doesn’t really fit in the MVVM pattern, but when I had to display a grid of 214 elements, binding a ListBox using a WrapPanel as its ItemsPanel took close to 2 seconds, while populating a StackPanel only took 0.3 seconds.

To the best of my knowledge, the toolkit’s WrapPanel doesn’t do any kind of UI virtualization, making it useless for displaying moderately large amounts of information.

Don’t use the `PivotControl` for what it’s not made for

The PivotControl has to render all its pages before it’s displayed. That’s a huge performance killer when you use it to diplay multiple data sets (plus, it’s against the design guidelines: Pivots are used throughout the system to show multiple visualisations of the same data set (all mail, unread, starred, and so on, for example)).

In YiXue Dictionary I have study lists (roughly equivalent to stacks of flash cards, only better ;), and I initially displayed them using a Pivot control, one list per pivot page. When I tested that page with 12 study lists contining about 40 words each, not only did it take 2 seconds to render the page, but the memory profiler reported a 130MB memory usage (!). I’ve now replaced it with a single list, letting users flip between each page using gestures, which doesn’t use more than a few MBs.

Get better I/O performance

The WP emulator is really neat, but there’s one thing that it really doesn’t mimic realistically, namely IO performance. Although it’s much better on my Lumia than on my old HTC, I/O is still much slower on a phone than on your development computer. That’s really something to keep in mind when testing your app, especially if you don’t have a device to test on. Very roughly, my testing indicates that an I/O operation that takes 1 seconds on your computer will take 2 or 3 seconds on the Lumia 920, 3 to 5 seconds one Lumia 800, and 5 to 7 seconds on the Trophy. Take these timings with a pitch of salt since they are really gross approximations, but the general figures are about that. Similarly, garbage collections will be much slower on your phone than in the emulator; if your loading procedure involves a lot of text parsing, it’ll suffer a lot from that on your phone.

So how do you improve that loading performance? Turns out that there is a number of simple tricks that you can use. I won’t get into too much details here, because there’s so much to say, and because you obviously should run benchmarks for your micro-optimisations, but the following general principles will help :

Ditch String.Split

Screenshot from MDSN’s remarks section regarding String.Split.

String.Split is nice, but it’s a garbage collector nightmare. It allocates plenty of intermediate buffers (and the documentation is pretty straightforward about that, with a section called “Performance Considerations” well worth reading), and it’s too generic to be as fast as it could be. If you’re really after the last bit of performance, re-code the split function by hand. It’s really not hard, but here it is:

	int start = -1, end = -1;
	List parts = new List();
	
	while (end + 1 < string_to_split.Length) { //Assumes that the string to split ends with a '\t'
		start = ++end;
		while (string_to_split[end] != '\t')
			++end;

		parts.Add(string_to_split.Substring(start, end - start));
	}

In YiXue, that shaved an extra 30% off the loading time.

Set images as Content, not resource

Changing the build action of your images from Resource to Content will neatly improve the splash screen time, because your images will only be loaded when needed, instead of being loaded when your app starts.

Minify your XML files.

Left: efficient representation. Right: Human-readable version of the same data. Smaller is better.

Most apps store use data as XML, using an XMLSerializer. That’s really neat, but it can get slow if you don’t take a few precautions. There are two things that really helped me in YiXue: first one is disabling white space in the XML output; your files won’t be as legible by a human as they were before, but it easily saved me 20% off the size of my user data files, and the parser won’t see the difference.

	XmlWriterSettings settings = new XmlWriterSettings { Indent = false, NewLineHandling = NewLineHandling.None };

Use defaults in your XML serialization attributes

Second one is specifying defaults. In YiXue there can be plenty of info associated to each word in your study lists, but most words don’t have all that info filled in. You can get the Xml serializer to omit the empty properties by simply specifying the default value for the attribute this way:

	[DefaultValue(0)] public int CorrectAnswersGiven;

I’d also advise choosing very short aliases for your attributes and elements names; for example:

	[XmlAttribute(AttributeName = "f")] public string foo;

This will further shrink your data files, and will make it much easier to rename your variables should you want to.

Improve perceived performance by showing a message with actual content while the app loads

If you need time to load app resources (beyond the split-second), entertain your users while your app is loading by including tips and fun facts: they won’t notice the delay if they have something else to focus on — game developers know this, and always include tips in their loading screens. This principle is easily extended to other apps.
In YiXue, I show tips and fun facts about learning Chinese while the app loads.

Conclusion

That’s pretty much all for now. Hopefully that helps you create even better apps!

You can follow me on Twitter here (me) and here (my YiXue Dictionary app). Of course, links to http://pit-claudel.fr/clement/yixue/ and to this blog are most appreciated. And since I’m sure I forgot plenty:

DO: share your own tips and tricks in the comments!

Happy WP development!
Clément.

PS: If you liked this article, you can flattr this blog post using the link below, or even better buy the app!

Modeling and measuring string comparison performance in C, C++, C# and Python.

Clément — Sun, 30 Sep 2012 00:57:08 +0000

Comparing strings is often — erroneously — said to be a costly process. In this article I derive the theoretical asymptotic cost of comparing random strings of arbitrary length, and measure it in C, C++, C# and Python.

Theoretical amortized cost

Infinite strings

Suppose one draws two random strings $A = “a_0a_1a_2…”$ and $B = “b_0b_1b_2…”$ of infinite length, with $a_n$ and $b_n$ respectively denoting the $n$^th character of $A$ and $B$. We will assume that characters are independent, and uniformly drawn over the set $\{a, b, c, …, z\}$. For now, we will consider all strings to be of infinite length — we’ll study bounded-length strings in the next section.

Given that characters making up $A$ and $B$ are independent, the probability of $A$ differing from $B$ at index $n$ is $25/26$. Conversely, the probability of $A$ matching string $B$ at index $n$ is $1/26$.

Consequently, the probability of $A$ matching $B$ up to index $n – 1$, and differing at index $n$, is $(1/26)^n ⋅ (25/26)$. Therefore, given that comparing strings up to index $n$ included is $(n+1)$ the expected number of characters to compare to discriminate $A$ and $B$ is $$ C = \sum_{n=0}^{\infty} (n+1) ⋅ (1/26)^n ⋅ (25/26) = 26/25$$

Indeed, defining $C(x) = (1-x) ⋅ \sum_{n=0}^{\infty} (n+1) x^n$, the expected number of comparisons $C$ required to discriminate $A$ and $B$ is $C(1/26)$. Now $C(x) = (1-x) ⋅ \sum_{n=0}^{\infty} \frac{∂ x^n}{∂ x} = (1-x) ⋅ \frac{∂ \sum_{n=0}^{\infty} x^{n+1}}{∂ x} = (1-x) ⋅ (1-x)^{-2} = 1/(1-x)$. Plugging $x=1/26$ yields $C = 26/25 \approx 1.04$.

In other words, it takes an average of $1.04$ character comparisons to discriminate two random strings of infinite length. That’s rather cheap.

Finite strings

In this section we assume that $A$ and $B$ are bounded in length by a constant $L$. This implies that the probability of the comparison routine reaching any index greater than $(L-1)$ is $0$; hence the expected number of comparisons $C_L$ is $$C_L = \sum_{n=0}^{L-1} (n+1) ⋅ (1/26)^n ⋅ (25/26) = (26/25) ⋅ (1 – \frac{25L + 26}{26^{L+1}})$$
For any $L > 3$, this is virtually indistinguishable from the previously calculated value $C = 1.04$.

Conclusion

The theoretical amortized time required to compare two strings is constant — independent of their length. Of course, there may be a sizeable overhead associated to comparing strings instead of ints: experimental measurements of this overhead are presented below.

Experimental measurements

The following charts present the time it took to perform 100’000 string comparisons on my laptop, compared to a baseline of 100’000 integer comparisons, in 3 different programming languages and for string lengths varying from 1 to 500 characters. Despite large variations for small string lengths, measured times quickly converge to a fixed limit as strings lengthen ((If you want to experiment with the tests, you can download the full python, C++, or C# source code.)).

Time required to compare 100’000 pairs of random strings, as a function of string length, in Python (3.2.3)

Time required to compare 100’000 pairs of random strings, as a function of string length, in C# (csc 4.0.30319.17929)

Time required to compare 100’000 pairs of random strings, as a function of string length, in C++ (gcc 4.3.4).

Remarks

Though large strings (over ~200 characters) experimentally match our statistical model, small strings exhibit a much more complex behaviour. I can think of a two causes for such variations:

String interning: Python interns some strings — exactly which depends on the implementation. On my laptop, this turns small (< 64 characters) string comparisons into direct pointer comparisons, which are about as fast as integer comparisons.
Locality: In C++ in particular, under certain conditions, applying a function to a char* gets slower as the referred string lengthens, even if the function only access the first few elements. The effect is barely noticeable for strings over 200 characters.

Conclusion

Comparing strings, especially small strings, is very fast. The amortized time required to compare two strings is bounded by a constant, which is experimentally about twice the time required to compare two integers in high level languages like Python or C#, and 5 to 7 times higher in low-levels languages like C++.

Do you usually try to avoid comparing strings? Do you have a more detailed explanation of the irregularities observed for lengths < 200 characters? Tell us comments!

Generating uniformly random data from skewed input: biased coins, loaded dice, skew correction, and the Von Neumann extractor

Clément — Mon, 03 Sep 2012 21:38:48 +0000

In a famous article published 1951 ((Various techniques used in connection with random digits. NIST journal, Applied Math Series, 12:36-38, 1951. This article does not seem to be available online, though it is widely cited. It is reprinted in pages 768-770 of Von Neumann’s collected works, Vol. 5, Pergamon Press 1961)), John Von Neumann presented a way of skew-correcting a stream of random digits so as to ensure that 0s and 1s appeared with equal probability. This article introduces a simple and mentally workable generalization of his technique to random dice, so a loaded die can be used to uniformly draw numbers from the set $\{1, 2, 3, 4, 5, 6\}$, with reasonable success.

Von Neumann skew-correction algorithm

Biased coins

Von Neumann’s originally proposes the following technique for getting an unbiased result from a biased coin :

If independence of successive tosses is assumed, we can reconstruct a 50-50 chance out of even a badly biased coin by tossing twice. If we get heads-heads or tails-tails, we reject the tosses and try again. If we get heads-tails (or tails-heads), we accept the result as heads (or tails).

This algorithm is illustrated below:

Flowchart recapitulating Von Neumann’s bias-suppressing algorithm

This amazingly simple strategy does yield heads and tails with equal probabilities, because coin tosses, though biased, are assumed to be independent. Therefore, if heads come out with probability $p$, and tails with probability $1-p$, then both (head, tails) and (tails, heads) sequences occur with probability $p(1-p)$.

Removing the bias from a stream of random digits

Generalizing this technique to a stream of random digits is straightforward: pack digits two-by-to, discard any pair of two equal digits (00 or 11), and finally convert 01 to 0 and 10 to 1.

Illustration of Von Neumann’s biased-suppressing algorithm on a stream of random digits

Probabilistic remarks

Expected rejection rate

Continuing on previous calculations, since each coin toss is taken to be independent from the others, both (heads, tails) and (tails, heads) sequences will occur with probability $p(1-p)$, while (heads, heads) and (tails, tails) sequences will occur with respective probabilities $p^2$ and $(1-p)^2$.
Define the expected rejection rate $r$ to be the the probability of obtaining (heads, heads) or (tails, tails) (and thus of discarding the sequence). Then $r = p^2 + (1-p)^2 = 1 – 2 ⋅ p(1 – p) = (1/2) + 2 ⋅ (p – 1/2)^2$, which implies that $r ≥ 1/2$. In other words, in the best case (when the coin is perfectly balanced), half of the coin sequences will be discarded.

Rejection rate as a function of heads probability

Expected number of tosses

Since the rejection rate is $r = 1 – 2 ⋅ p(1 – p)$, the odds of discarding exactly $n$ two-coins sequences before obtaining an acceptable sequence are $r^n ⋅ (1-r)$, and the expected number of two-coins sequences to be discarded is $d = \sum_{k=0}^\infty k ⋅ r^k ⋅ (1-r) = \frac{r}{1-r}$.

Indeed, $$\sum_{k=0}^\infty k ⋅ r^k ⋅ (1-r) = r(1-r) ⋅ \sum_{k=0}^\infty k r^{k-1} = r(1-r) ⋅ \frac{∂}{∂r} \sum_{k=0}^\infty r^k = r(1-r) ⋅ \frac{∂}{∂r} \frac{1}{1-r} = \frac{r}{1-r}$$

Now, if $d$ two-coins sequences are discarded before an acceptable sequence is obtained, then the number of coin tosses required is $2 ⋅ (1 + d) = \frac{1}{p(1-p)} ≥ 4$. In other words, it will take in the best case an average of 4 single-coin tosses to obtain an acceptable sequence. Or, as Von Neumann puts it,

The resulting process is rigorously unbiased, although the amended process is at most 25 percent as efficient as ordinary coin-tossing.

Expected number of tosses before an acceptable two-coins sequence is found, plotted as a function of heads probability

A generalization to loaded dice

The general idea behind Von-Neumann style skew-correction techniques is to consider sequences of coin tosses instead of isolate ones, and pick a sequence length long enough that an even number of possible outcomes have equal probabilities. These outcomes can then be split in two groups, one representing heads and the other representing tails, while other outcomes are discarded.

Generalizing this strategy to weighted dice excludes picking a sequence length $n ≤ 2$, since no subset of possible outcomes of equal probability for $n=1$ or $n = 2$ has a cardinal divisible by six. For $n=3$, on the other hand, the outcomes of any dice (crooked or not) can be partitioned into six categories of equal probability according to the relative ordering of the successive numbers rolled, provided these numbers are all different — that is, sequences can be grouped according to whether the second number is greater (↑) or smaller (↓) than the first, and whether the third number is greater (↑↑↑, ↓↑↑) or smaller (↑↓↓, ↓↓↓) than both the first and the second, or between them (↑↑↓, ↓↓↑).

This strategy is illustrated in the diagram below, which provides a reference table for interpreting sequences of three rolls of a loaded dice to produce unbiased results. Should the same number occur twice, all three rolls should be discarded, and the process should be started over.

Decision table for suppressing dice bias. Given a sequence of three throws, chart the three throws so as to form an arrow pattern, and use this table to obtain the corresponding result. Examples are given in green for each pattern. Using this table, rolling $1 2 5$ would yield a $1$, while $1 5 2$ would yield a $2$, and $5 2 1$, a $4$.

All six possible orderings will occur with equal probability, because each of the sequences belonging to any of these orderings matches exactly one sequence of equal probability in every other ordering: for example, $1 2 4$ matches the ordering ↑↑↑, whereas $2 1 4$, which is equally likely, matches ↓↑↑.

Implementation

Given a sequence of three independent rolls of a possibly biased die, this function returns a fair dice roll using the previously exposed technique:

	def translate(*rolls):  
		# returns False if 'rolls' contains duplicates, and a fair dice roll otherwise
		
		# Possible relative ordering of the three rolls. 1 2 3 is ↑↑↑, 2 1 3 is ↓↑↑.
		deltas = [(True, True, True),   (True, True, False), 
				  (True, False, False), (False, False, False),
				  (False, False, True), (False, True, True)] # True stands for ↑
		
		(r1, r2, r3) = rolls;
		unique = len(set(rolls)) >= 3
		return unique and 1 + sequences.index((r2 > r1, r3 > r1, r3 > r2))

Here’s a test function. The bias parameters measures how unbalanced the die is: the higher the bias, the higher the probability to roll a 6.

	def unfair_roll(bias):
		return random.choice([1, 2, 3, 4, 5] + [6]*bias)

	def test(number_of_rolls, bias):
		count = [0]*7
		bias = max(bias, 1)
		
		for roll_id in range(0, number_of_rolls):
			rolls = unfair_roll(bias), unfair_roll(bias), unfair_roll(bias)
			count[translate(*rolls) or 0] += 1

		print("Rejection rate: {:.2f}%".format(count[0] / number_of_rolls * 100))
		print("Relative digit frequency:")
		for digit in range(1, 6 + 1):
			print("  {}: {:.2f}%".format(digit, count[digit] / number_of_rolls * 100))
		print("Before correction, 6 was {} times more likely to occur"
					" than other digits.".format(bias))

Here’s a single run for 50 000 sequences of three rolls of a die ten times more likely to give a 6:

	>>> test(50*1000, 10)
	Rejection rate: 80.52%
	Relative digit frequency:
		1: 3.24%
		2: 3.30%
		3: 3.26%
		4: 3.18%
		5: 3.17%
		6: 3.32%
	Before correction, 6 was 10 times more likely to occur than other digits.

Conclusion and further remarks

Though easy to work out mentally, this bias-correction technique for dice is far from optimal. Indeed, the minimal rejection rate (obtained for a perfectly balanced dice) is about 44%, and the rejection rate quickly increases as the bias gets stronger, reaching 64% when 6 is 5 times more likely to occur than other individual digits.

More subtle techniques, which I will discuss in a future article, take the rejection rate down to 1.5% for perfectly balanced dice, and keep it under 7% for dice with a 6 to 1 chance of getting a 6 over any other digit.

Do you know other strategies to even the odds when playing with a loaded dice? Sound off in the comments!

The cafeteria paradox: stop using the water dispenser while someone else does!

Clément — Wed, 22 Aug 2012 18:09:11 +0000

Most cafeteria water dispensers will let two (sometimes more) people fill a jug at the same time. This article uses simple maths to prove that it’s a waste of time. In other words, two people should never use the same water dispenser at the same time: I call this the cafeteria paradox.

An intuitive presentation

Let’s start with a little brain teaser:

Alice and Bob are at the cafeteria, seating at different tables. Alice stands up to refill her table’s water jug. Little after, Bob stands up with his own table’s jug, heading to the same water dispenser. That water dispenser is a perfectly standard one, with two taps, and Bob finds himself standing near Alice. After a small hesitation, Bob starts using the second tap to fill his own jug, thereby diverting part of the output previously devoted to Alice.

This grants him an exasperated and somewhat puzzled look from Alice. Why?

The reason is quite simple: though most people will act just like Bob, using the water dispenser while someone else does is — in most of the cases — a useless waste of time. Indeed, starting to fill his own jug before Alice is done will not save Bob any time, but it will waste some of Alice’s time. That’s the cafeteria paradox.

Fortunately, this paradox is readily explained: assuming both Alice and Bob are using similar jugs, and both start with an empty jug, Alice will finish before Bob does; when Bob leaves, therefore, the water dispenser will have dispatched precisely the countenance of two caferetia jugs.
Now, Since the combined output of the two taps equals that of only one, dispatching this amount of water will take a constant time (precisely twice the time that it would take to fill a single jug). Hence Bob will gain no time at all by starting to fill his jug right away, while some of Alice’s time will be wasted.

Not convinced? Imagine Bob and Alice are waiting for some pizza at a take-away restaurant. Given that Bob and Alice won’t start eating before they get to their respective homes, should Bob accaparate half of Alice’s pizza when it arrives, or let Alice get the full pizza first and wait for his own pizza? If Bob asks for half of Alice’s, both will wait for two pizzas to be ready; on the other hand, if Bob lets Alice go first, Alice will have only waited for one, while Bob will have waited for two. Clearly the second solution is better for Alice, and neutral for Bob; and the situation is exactly similar to that of the water dispenser.

Still not convinced? Read on for a full mathematical description of this situation. Otherwise, skip to the conclusion (and the sticker!).

A mathematically accurate description

Let $t_A$ and $t_B$ be the respective times when Alice and Bob reach the water dispenser, and $t_A’, t_B’$ the respective times when they leave. Let $t^*$ be the time when Bob starts using the water dispenser. The water dispenser has a flow rate $f$, and each jug has a capacity $c$.

From $t_A$ to $t^*$, Alice’s jug is filling at rate $f$; from $t^*$ to $t_A’$, at rate $f/2$. Hence Alice’s departure time $t_A’$ satisfies the following equation: $$ (t^* – t_A) ⋅ f + (t_A’ – t^*) ⋅ (f/2) = c$$
which yields $t_A’ = t^* + \frac{c}{f/2} – 2 ⋅ (t^* – t_A)$, so $$ t_A’ = 2 t_A – t^* + \frac{c}{f/2} $$

Similarly, Bob’s departure time $t_B’$ satisfies the following equation: $$ (t_A’ – t^*) ⋅ (f/2) + (t_B’ – t_A’) ⋅ f = c$$
which yields $t_B’ = t_A’ + \frac{c}{f} – (1/2) ⋅ (t_A’ – t^*) $, so $$ t_B’ = (1 / 2) ⋅ (t_A’ + t^*) + \frac{c}{f} $$

Finally, subtituting $2 t_A – t^* + \frac{c}{f/2}$ for $t_A’$ yields $t_B’ = t_A + (1 / 2) \cdot \frac{c}{f/2} + \frac{c}{f}$; that is $$ t_B’ = t_A + 2 ⋅ \frac{c}{f} $$

To summarize, Alice will leave at $t_A’ = 2 t_A – t^* + \frac{c}{f/2}$ (the later Bob starts, the earlier Alice will leave), while Bob will leave at $t_B’ = t_A + 2 ⋅ \frac{c}{f}$, which does not depend on $t_B$ or $t^*$.

Conclusion

The socially optimal solution is that Bob waits until Alice is done before starting to use the water dispenser; otherwise, some of Alice’s time will be wasted, though Bob will not benefit from it. The key to understanding this is that Bob will not leave the water dispenser before both jugs are full, and filling both jugs takes a constant time (the time needed for the water dispenser to deliver twice the volume of a jug, spanning from Alice’s arrival to Bob’s departure). Thus using the water dispenser while someone else does is not only selfish: it’s downright stupid.

Shocked? Leave a message in the comments! Otherwise join our campaing to stop multi-threading in water dispensers: stick the following message on your cafeteria’s water dispenser and post a picture of your sticker in the comments!

Willy Wonka’s golden tickets: certainly the most profitable marketing campaign of all times

Clément — Fri, 27 Jul 2012 00:15:28 +0000

There’s something that has troubled me since childhood. In Charlie and the chocolate factory, one of the lucky children (Veruca Salt) gets her golden ticket thanks to her father repruposing his peanut shelling factory in a chocolate-bar-unwrapping factory.

Now, there are only five golden tickets, and chocolate bars in the book seem quite popular. How many bars would one need to buy (and unwrap) just to have a seizable chance of finding one of the tickets? Most likely a lot. After doing the maths, I would estimate the number of chocolate bars that Mr. Salt had to buy to something between 12 and 40 million chocolate bars; which means this promotional campaign was most certainly one of the most profitable in history. Details below.

Background: Charlie and the chocolate factory

In Charlie and the Chocolate Factory, Roald Dahl pictured the archetype of the spoiled little child in the character of Veruca Salt (left). When Willy Wonka, an excentric, genius chocolate maker, announces that five golden tickets have been hidden underneath the wrapping of ordinary chocolate bars, granting their lucky discoverers a visit of his chocolate factory and a lifetime supply of sweets, the world goes into a chocolate-devouring frenzy. It’s only a few days before the first and second ticket are found, the later by a girl named Veruca Salt. Here is how her father explains the discovery

‘You see, boys,’ he had said, ‘as soon as my little girl told me that she simply had to have one of those Golden Tickets, I went out into the town and started buying up all the Wonka bars I could lay my hands on. Thousands of them, I must have bought. Hundreds of thousands!

So how many chocolate bars did Mr. Salt actually buy?

Assume Mr. Salt wants to have a reasonable chance of finding one of the tickets, perhaps 70 or 80%. Let $\alpha$ be the acceptable risk level, so that the probability of actually finding the ticket is $1 – \alpha$. Since televisions seem rather common in the book, and it was published in 1964, assume the action takes place in the early 1960’s. Say the children population of England at the time is $P$, and assume Mr. Wonka makes most of his business in this country, and only with children (including other countries, or adults, would further reduce Mr. Salt’s chances of success). There are $T = 5$ golden tickets in play, and we can estimate the number of bars sold per child to $b =$ three or more, given that Charlie Bucket, though extremely poor, buys three.

Excluding Mr. Salt compulsive buying, this generates a total static demand $D = P \times b$. Let $N$ be the number of chocolate bars sold in total, and $n$ the number of bars bought by the Salt family, so $N = D + n$ — the underlying assumption being that chocolate production is profitable at any level.

The chance of these $n$ bars yielding no golden ticket is $\binom{N-T}{n} / \binom{N}{n}$; conversely, the probability of at least one of these $n$ bars contains a golden ticket is $1 – \binom{N-T}{n} / \binom{N}{n}$. Assuming the family is ready to accept a probability of failure of $\alpha$, $n$ will be chosen so that $1 – \binom{N-T}{n} / \binom{N}{n} > 1 – \alpha$, or equivalently that $$ \binom{N-T}{n} < \alpha \binom{N}{n} $$ Developing this expression and substituing [latex]N - D[/latex] for [latex]n[/latex] yields [latex](N - T)! / (D - T)! < \alpha (N! / D!)[/latex], which is the same as $$ \frac{D!}{(D - T)!} < \alpha \frac{N!}{(N - T)!} $$ The children population of England in the 60's was about [latex]12.2[/latex] million kids ((Source: World dataBank, indicators SP.POP.0014.TO.ZS and SP.POP.TOTL; the population of the time was 52.4 million people, and the percentage of children aged less than 15 was 23.3%)). Assuming each bought 3 chocolate bars, the static demand was [latex]36.6[/latex] million bars. Finally, setting [latex]T = 5[/latex] and numerically solving the equation above yields [latex]N \approx 49.6[/latex] million chocolate bars, and thus [latex]n = 12.9[/latex] million chocolate bars. In other words, Mr. Salt would need to buy 12.9 million chocolate bars to reach a 70% chance of securing a golden ticket. Taking this probability up to 95% would imply buying an extra 28 million chocolate bars, bringing the grand total to about 41 million chocolate bars. ((If you include English adults in this funny little buying game, the static demand reaches 76.9 million chocolate bars, pushing the 70% to $N = 104$ million chocolate bars ($n = 27$ million), and the 95% probability threshold to $N = 162$ million chocolate bars ($n = 85.7$ million).))

Code

Here’s the Octave/Matlab code used to numerically solve the inequation above. The key is to notice that $\frac{D!}{(D – T)!}$ and $\frac{N!}{(N – T)!}$ are both values taken by the monotonically increasing polynomial $f(X) = \frac{X!}{(X-T)!} = \prod_{k=0}^{T-1} (X – k)$. Thus the inequation can be rewritten as $g(N) > 0$, where $g(N) = (f(N)/f(D)) – (1/\alpha)$.

function y = f(x, T)
  y = prod(x - (1:(T-1)));
endfunction

function y = g(N, D, T, alpha)
  y = f(N, T)/f(D, T) - 1/alpha;
endfunction

N = fzero(@(x) g(x, 36.7e6, 5, 0.3), [10e6, 10e20]);

Illustration by Quentin Blake. Title image from wikimedia.

How random is pseudo-random? Testing pseudo-random number generators and measuring randomness

Clément — Mon, 23 Jul 2012 16:20:18 +0000

After introducing true and pseudo-random number generators, and presenting the methods used to measure randomness, this article details a number of common statistical tests used to evaluate the quality of random number generators.

Introduction: true and pseudo-random number generators

Obtaining long sequences of random numbers, as required by some cryptographic algorithms, is a delicate problem. There are basically two types of random number generators: true random number generators, and pseudo-random number generators.

This article assumes no particular knowledge about the generation of random numbers, so if you already know about TRNGs and PRNGs, you may want to skip to the core of this article.

True random number generators (TRNGs)

True random number generators rely on a physical source of randomness, and measure a quantity that is either theoretically unpredictable (quantic), or practically unpredictable (chaotic — so hard to predict that it appears random).

I’ve compiled a list of sources of true randomness (capable of sufficiently fast output); please let me know in the comments if I forgot any.

Thermal noise, due to the thermal agitation of charge carriers in an electronic component, and used by some Intel RNGs
Imaging of random processes, used by LavaRnd
Atmospheric noise, mostly due to thunder storms happening around the globe, and used by random.org
Cosmic noise, due to distant stars and background radiation
Driver noise, often used by /dev/random in Unix-like operating systems ((See this post by Arnold Reinhold for a discussion on how to manually increase the entropy of /dev/random))
Transmission by a semi-transparent mirror, used by randomnumbers.info
Nuclear decay, used by HotBits
Coupled inverters, used by Intel’s new random generators.

The coupled inverters technique devised by Greg Taylor and George Cox at Intel is truly the most fascinating one : it relies on collapsing the wave function of two inverters put in a superposed state.

Pseudo-random number generators (PRNGs)

Pseudo-random number generators are very different: they act as a black box, which takes one number (called the seed ((The current system time is commonly used as a default seed, although for more security-sensitive applications user input is sometimes used (most often by asking the user to shake their mouse in a random fashion) )) and produces a sequence of bits; this sequence is said to be pseudo-random if it passes a number of statistical tests, and thus appears random. These tests are discussed in the following section; simple examples include measuring the frequency of bits and bit sequences, evaluating entropy by trying to compress the sequence, or generating random matrices and computing their rank.

Formal definition

A pseudo random number generator is formally defined by an initialization function, a state (a sequence of bits of bounded length), a transition function, and an output function:

The initalisation function takes a number (the seed), and puts the generator in its initial state.
The transition function transforms the state of the generator.
The output function transforms the current state to produce a fixed number of bit (a zero or a one).

A sequence of pseudo-random bits, called a run, is obtained by seeding the generator (that is, initializing it using a given seed to put it in its initial state), and repeatedly calling the transition function and the output function. ((By definition, since the number of possible states is bounded, this sequence will be periodic; in practice however, this period can be made as large as desired by increasing the number of bits in the state.)) This procedure is illustrated by the following diagram:

Schematic diagram of a pseudo-random number generator

In particular, this means that the sequence of bits produced by a PRNG is fully determined by the seed ((This property is very useful when using PRNGs to simulate experiments; seeds used during the simulation are recorded in case the results of one specific run need to be reproduced)). This has a few disagreeable consequences, among which the fact that a PRNG can only generate as many sequences as it can accept different seeds. Thus, since the range of values that the state can take is usually much wider than that of the seed, it is generally best to only seldom reset (re-seed) the generator ((except when the state risks being compromised — see the following section)).

Cryptographically secure pseudo-random number generators

Since PRNGs are commonly used for cryptographic purposes, it is sometimes asked that the transformation and output functions satisfy two additional properties :

Un-predictability: given a sequence of output bits, the preceding and following bits shouldn’t be predictable
Non-reversibility: given the current state of the PRNG, the previous states shouldn’t be computable.

Rule 1 ensures that eavesdropping the output of a PRNG doesn’t allow an attacker to learn more than the bits they overheard.
Rule 2 ensures that past communications wouldn’t be compromised should an attacker manage to gain knowledge of the current state of the PRNG (thereby compromising all future communications based on the same run).

When a PRNG satisfies these two properties ((Again, saying that a PRNG satisfies these properties means that either it can be proven that the rules cannot be broken, or that it is not practically possible to break them in a reasonable amount of time)), it is said to be cryptographically secure ((cryptographically secure pseudo-random number generators are often abbreviated as CSPRNGs)).

Testing pseudo-random number generators

There are a number of statistical tests devised to distinguish pseudo-random number generators from true ones; the more a PRNG passes, the closer it is considered to be from a true random source. This section presents general methodology, and studies a few example tests.

Testing methodology

Testing a single sequence

Almost all tests are built on the same structure: the pseudo-random stream of bits produced by a PRNG is transformed (bits are grouped or rearranged, arranged in matrices, some bits are removed, …), statistical measurements are made (number of zeros, of ones, matrix rank, …), and these measurements are compared to the values mathematically expected from a truly random sequence.

More precisely, assume that $f$ is a function taking any finite sequence of zeros and ones, and returning a non-negative real value. Then, given a sequence of independent and uniformly distributed random variables $X_n$, applying $f$ to the finite sequence of random variables $(X_1, …, X_n)$ yields a new random variable, $Y_n$. This new variable has a certain cumulative probability distribution $F_n(x) = \mathbb{P}(Y_n \leq x)$, which in some cases approaches a function $F$ as $n$ grows large. This limit function $F$ can be seen as the cumulative probability distribution of a new random variable, $Y$, and in these cases $Y_n$ is said to converge in distribution to $Y$.

Randomness tests are generally based on carefully selecting such a function ((In particular, the function is often chosen so that the limit random variable $Y$ has a monotonous (decreasing) density function)), and calculating the resulting limit probability distribution. Then, given a random sequence $\epsilon_1, \ldots, \epsilon_n$, it is possible to calculate the value of $y = f(\epsilon_1, \ldots, \epsilon_n)$, and assess how likely this value was by evaluating the probability $\mathbb{P}(Y \geq y)$ — that is, the probability that a single draw of the limit random variable $Y$ yields a result greater than or equal to $y$. If this probability is lower than $0.01$, the sequence $\epsilon_1, \ldots, \epsilon_n$ is said to be non-random with confidence 99%. Otherwise, the sequence is considered random.

An example: the bit frequency test

Let $(X_n)$ be an infinite sequence of independent, identically distributed random variables. Take $f$ to be $f(x_1, …, x_n) = \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{x_n – \mu}{\sigma}$ where $\mu$ is the average, and $\sigma$ the standard deviation, of any $X_k$. Define $Y_n = f(X_1, …, X_n)$. The central limit theorem states that the distribution of $Y_n$ tends to the standard normal distribution. This theorem is illustrated below.

Example of convergence in distribution to the standard normal distribution, after summing 1, 2, 3, 4, 5, 8, 10, and 50 independent random variables following the same distribution.

Consider the particular case where $X_n$ is uniformly distributed over the discrete set $\{-1,1\}$. The central limit theorem applies, and $Y_n = f(X_1, …, X_n)$ converges in distribution to the standard normal law. For practical reasons, however, we choose to consider $Y_n = |f(X_1, …, X_n)|$, which converges in distribution to a variable $Y_\infty$. $Y_\infty$ has the half-normal distribution (illustrated on the left), which has the nice property of being monotonically decreasing.

In plain colours, the half-normal distribution. In pale colours, the corresponding cumulative distribution function. If a pseudo-random sequence is picked in the red area, it is declared non-random with 99% certainty (the yellow and orange areas correspond to a certainty of 90% and 95% respectively).

Here is how the test proceeds : the sequence $(\epsilon_1, \ldots, \epsilon_n)$ given by the PRNG is converted to a sequence of elements of $\{-1, 1\}$ by changing each $\epsilon_i$ to $x_i = 2 \cdot \epsilon_i – 1$. This gives $x_1, \ldots, x_n$. Then, $|f(x_1, \ldots, x_n)| = \left|\sum_{k=1}^n \frac{x_k}{\sqrt{n}}\right|$ is numerically calculated, yielding $y$. Finally, the theoretical likeliness of a truly random sequence yielding a value equal to or greater than $y$ is evaluated using the cumulative distribution function of $Y_\infty$ (in pale colours).

If this probability is too low (red area, probability below $0.01$), the sequence is rejected as non-random, with 99% certainty. If it is greater than $0.01$, but less than $0.05$ (orange area) the sequence is sometimes ((Some papers advise to reject any sequence with a p-value $\leq 0.05$, but most choose $0.01$ as the threshold)) rejected as non-random with 95% certainty. Similarly, if it is greater than $0.05$ but less than $0.10$ (yellow area), the sequence is sometimes rejected as non-random, with 90% certainty.

High probabilities (green area, probabilities greater than $0.10$), finally, do not permit to distinguish the pseudo-random sequence from a truly random one ((Note that this generates false positives: a true random number generator will, in the long run, produce every possible sequence: sequences yielding measurements that fall in the red area will be rare, but they will occur, and will be discarded as non-random. This is very important to the PRNG-testing procedure, as explained in the next sub-section)).

Testing a pseudo-random number generator

To test whether a pseudo-random number generator is close to a true one, a sequence length is chosen, and $m$ pseudo-random sequences of that length are retreived from the PRNG, then analysed according to the previous methodology. It is expected, if the confidence level is 1%, that about 99% of the sequences pass, and 1% of the sequences fail; if the observed ratio significantly differs from 99 to 1, the PRNG is said to be non-random. More precisely, the confidence interval is generally chosen to be $p \pm 3\frac{\sqrt{p(1-p)}}{\sqrt{m}}$, where $p$ is the theoretical ratio (99% here); this takes the probability of incorrectly rejecting a good number generator down to 0.3%.

This confidence interval is obtained as follows. For large values of $m$, approximate the probability of rejecting $r$ sequences by $\binom{m}{r} p^{1-r} (1-p)^r$ (here $p = 99\%$). Write $\sigma_i$ the random variable denoting whether the $i$^th sequence was rejected. Then with the previous approximation all $\sigma_i$ are independent and take value $0$ with probability $p$, $1$ with probability $1-p$ (standard deviation: $\sqrt{p(1-p)}$). The central limit theorem states that with probability close to $\displaystyle \int_a^b \frac{e^{-x^2/2}}{\sqrt{2\pi}}$, the observed rejection frequency $\hat{p}$ lies in the interval $\left[p + b\frac{\sqrt{p(1-p)}}{\sqrt{m}}, p + a\frac{\sqrt{p(1-p)}}{\sqrt{m}}\right]$. Finally, setting $b = -a$ and adjusting $a$ so that this probability is $\approx 99.7\%$ yields $a \approx 3$.

Tests suites

A number of test suites have been proposed as standards. The state-of-the-art test suite was for a long time the DieHard test suite (designed by George Marsaglia), though it as eventually superseeded by the National Institute of Standards and Technology recommendations. Pierre l’Ecuyer and Richard Simard, from Montreal university, have recently published a new collection of tests, TestU01, gathering and implementing a impressive number of previously published tests and adding new ones. ((Ent is another implementation of a few tests.))

A subset of these tests is presented below.

A list of common statistical tests used to evaluate randomness

Equidistribution (Monobit frequency test, discussed above)

Purpose: Evaluate whether the sequence has a systematic bias towards 0 or 1 (real-time example on random.org)
Description: Verify that the arithmetic mean of the sequence approaches $0.5$ (based on the law of large numbers). Alternatively, verify that normalized partial sums $s_n = \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{2\cdot\epsilon_k – 1}{\sqrt{2}}$ approach a standard normal distribution (based on the central limit theorem).
Variants: Block frequency test (group bits in blocks of constant length and perfom the same test), Frequency test in a block (group bits in blocks of constant length and perform the test on every block).

Runs test

Purpose: Evaluate whether runs of ones or zeros are too long (or too short) (real-time example on random.org)
Description: Count the number of same-digits blocks in the sequence. This should follow a binomial distribution $\mathcal{B}(n, 1/2)$. By noting that the number of same-digits blocks is the sum of the $\sigma_i$ where $\sigma_i = 1$ if $\epsilon_i \not= \epsilon_{i+1}$ and $\sigma_i = 0$ otherwise, the previous method can be applied.
Variants: Longest run of ones (divide the sequence in blocks and measure the longest run of ones in each block).

Binary matrix rank

Purpose: Search for linear dependencies between successive blocks of the sequence
Description: Partition the sequence in blocks of length $n \times p$. Build one $n \times p$ binary matrix from each block, and compute its rank. The theoretical distribution of all ranks is however not easy to compute — this problem is discussed in details in a paper by Ian F. Blake and Chris Studholme from the university of Toronto.

Discrete Fourier transform

Purpose: Search for periodic components in the pseudo-random sequence
Description: Compute a discrete Fourier transform of the pseudo-random input $2\epsilon_1 – 1, \ldots, 2\epsilon_n – 1$, and take the modulus of each complex coefficient ((Note that, since the input sequence is real-valued, its DFT is symetrical; hence all the counting need only happen on the first half of the DFT sequence)). Compute the 95% peak height threshold value, defined to be the theoretical value that no more than 5% of the previously calculated moduli should exceed ($\sqrt{-n\log(0.05)}$ (NIST)), and count the actual number of moduli exceeding this threshold. Assess the likeliness of such a value.

Compressibility (Maurer’s universal statistical test)

Purpose: Determine whether the sequence can be compressed without loss of information (evaluate information density). A random sequence should have a high information density.
Description: Partition the sequence in blocks of length $L$, and separate these block into categories. Call the first $Q$ blocks the initialisation sequence, and the remaining $K$ blocks the test sequence. Then, give each block in the test sequence a score equal to the distance that separates it from the previous occurence of a block with the same contents, and sum the binary logarithm of all scores. Divide by the number of blocks to obtain the value of the test function, and verify that this value is close enough to the theoretically expected value. ((Since the binary logarithm is concave, this value gets smaller as the dispersion of the scores increase; informally, since the expected distance between to blocks of equal contents is approximately the number of possible different blocks ($2^L$ ) the expected value should be about $\log_2(2^L)$ — that is $L$. Computing the actual expected values yields a somewhat smaller number, which approaches $L – 0.83$ as $L \to \infty$ (Coron, Naccache) ))

Maximum distance to zero, Average flight time, Random excursions

Purpose: Verify that the sequence has some of the properties of a truly random walk
Description: Consider the pseudo-random sequence to represent a random walk starting from zero and moving up (down) by one unit every time a zero (a one) occurs in the sequence. Measure the maximum height reached by the walk, as well as the average time and the number of states visited. Also, for each possible state (integer), measure in how many cycle it appears. Evaluate the theoretical probability of obtaining these values.

The sources previously cited (in particular the NIST recommendation paper) present mathematical background about these tests, as well as lots of other tests.

Did I miss your favourite randomness test? Were you ever confronted to obvious imperfections in a pseudo-random number generator? Tell us in the comments!

Using abstract classes to simulate tagged unions (aka sum types)

Clément — Thu, 19 Jul 2012 22:41:36 +0000

Most functional languages offer support for tagged unions (also called sum types), a type of data structure capable of successively holding values of several fixed types. This article shows how to use abstract classes to emulate such behaviour in high-level object-oriented languages such as C#, Java, or VB.Net ((.Net languages have the [StructLayout(LayoutKind.Explicit)] attribute, which makes it possible to create structs which behave a lot like C++ unions. But that only works with primitive types.)).

This article features a rather long introduction to sum types in functional languages. If you already know about this, you can skip to the core of this article.

An introduction to sum types and pattern matching

A preliminary example: linked lists

Starting with a simple example, imagine you want to create a linked list structure. Most C#/Java programmers will write something like

  class MyList {
    T value;
    MyList next;

    MyList(T value, MyList next) {
      this.value = value;
      this.next = next;
    }
  }

  MyList example = new MyList(1, new MyList(2, null));

This relies on a rather ugly trick: null is used to represent the empty list. Functional languages prefer to clearly separate two cases : the empty list (usually called nil), and the concatenation of a value (called the head of the list) and another list (called the tail).

  (* 't is a type variable, just like T above *) 
  type 't MyList = Nil | Cons of 't * ('t MyList);; 
  let example = Cons(1, Cons(2, Nil))

  {- t is a type variable, just like T above -}
  data MyList t = Nil | Cons t (MyList t)
  example = Cons 1 (Cons 2 Nil)

This introduces the concept of a sum type; a type in which values belong to one of a number of sets (here, either to the singleton set {Nil}, or to the set of all values of type t — for example integers if t is int).

Here is another example, where a value of type Value can be either an integer, or a floating-point number, or a string, or a boolean.

  type Value = 
    | IntegerValue of int 
    | FloatValue of float
    | StringValue of string
    | BooleanValue of bool

  data Value = 
      IntegerValue Int
    | FloatValue Float
    | StringValue String
    | BooleanValue Bool

Quite naturally, functional languages include a special construct to handle values whose type is a sum type. This construct is called pattern matching; it is a bit like a switch statement, only the switch is on the object’s inner type (the subset of the sum type which the object belongs to). For example, to identify the contents of a variable whose type is Value, functional programmers write

  let identify = function
    | IntegerValue (val) -> "int!" 
    | FloatValue (val) -> "float!" 
    | StringValue (str) -> "string"
    | BooleanValue (b) -> "bool!"

  identify (IntegerValue val) = "int!" 
  identify (FloatValue val) = "float!" 
  identify (StringValue str) = "string"
  identify (BooleanValue b) = "bool!"

Here’s another example, on lists this time: if functional programmers want to count elements in a list, they write

  let rec count = function
    | Nil -> 0
    | Cons (head, tail) -> 1 + (count tail)
  ;;

  count Nil = 0
  count (Cons hd tl) = 1 + count tl

Let’s move on to our last example

XML node trees

In this section, we use a sum type to represent the structure of an XML document (some node types have been omitted). We first define an attribute to be a pair of two strings, and a node to be one of DocumentFragment (a list of XML nodes), Element (a name, attributes, and children elements), Commment (some comment text), or Text (plain text).

  type attr = 
    Attribute of string * string;;

  type xmlnode = 
    | DocumentFragment of xmlnode list
    | Element of string 
                  * (attr list)
                  * (xmlnode list)
    | Comment of string
    | Text of string
  ;;

  data XmlNode = 
    | DocumentFragment [XmlNode]
    | Attribute String String
    | Element String [Attribute]
                     [XmlNode]
    | Comment String
    | Text String

With such a type declaration, writing concise tree-traversal functions is easy: as an example, we define a serialize function, which generate XML code from an xmlnode object.

  let serialize_attribute (Attribute(name, value)) =
    Printf.printf " %s=\"%s\"" name value
  ;;

  let rec serialize = function 
    | DocumentFragment nodes ->
        List.iter serialize nodes (* applies serialize to every xmlnode in nodes *)
    | Element (label, attributes, nodes) ->
        Printf.printf "<%s" label;
          List.iter serialize_attribute attributes;
        Printf.printf ">\n";
          List.iter serialize nodes;
        Printf.printf "\n" label;
    | Comment (str) ->
        Printf.printf "" str;
    | Text (str) ->
        Printf.printf "%s\n" str;
  ;;

Here is a small example to illustrate the previous function:

  serialize (
    Element("a", [
      Attribute("href", "http://pit-claudel.fr/clement/blog");
      Attribute("title", "Code crumbs")
    ], [
      Text("Code crumbs -- Personal thoughts about programming")
    ])
  );;

This prints


Code crumbs -- Personal thoughts about programming

We can now move on to the core of this article:

Simulating sum types using abstract classes

The problem

Returning to a simpler example, suppose we need to represent a generic binary tree (a tree is defined as being either a branch, containing two sub-trees, or a leaf, containing a value). Functional programmers will write something like

  type 't tree = Branch of ('t tree * 't tree) | Leaf of 't
  let example = Branch (Branch (Leaf 1, Leaf 2), Leaf 3)

  data Tree t = Branch (Tree t) (Tree t) | Leaf t
  example = Branch (Branch (Leaf 1) (Leaf 2)) (Leaf 3)

On the other hand, (hurried) C#/Java programmers will often implement this as a product type — that is, they’ll pack both cases in a single class, and use an extra field to differentiate branches and leaves:

  enum TreeType = {BRANCH, LEAF};

  class BadTree {
    TreeType type;
    
    T leaf_value;
    BadTree left_branch, right_branch;

    BadTree(BadTree left_branch, BadTree right_branch) {
         this.type = BRANCH;
         this.left_branch = left_branch;
         this.right_branch = right_branch;
         // Memory waste: leaf_value is unused (!)
    }

    BadTree(T leaf_value) {
         this.type = LEAF;
         this.leaf_value = leaf_value;
         // Memory waste: left_branch and right_branch are unused (!)
    }
  }

  BadTree example = new BadTree(new BadTree(new BadTree(1), 
                                                 new BadTree(2)),
                                     new Tree(3));

This representation wastes a lot of memory and, though rather concise, quickly degenerates when more cases are added to the type definition (for example in the xmlnode case — for another real-world example, see the end of this post).

The solution

The idea is to encode the type disjunction (Branch or Leaf, DocumentFragment or Element or Comment or Text) in the inheritance hierarchy; practically speaking, the trick is to define an abstract Tree class, and make two new classes, Leaf and Branch, which both inherit from Tree:

  abstract class Tree {
    public static class Leaf : Tree {
      public T value;

      public Leaf(T value) {
        this.value = value;
      }
    }

    public static class Branch : Tree {
      public Tree left, right;

      public Branch(Tree left, Tree right) {
        this.left = left;
        this.right = right;
      }
    }
  }

  Tree example = new Tree::Branch(new Tree::Branch(new Tree::Leaf(1),
                                                        new Tree::Leaf(2)),
                                       new Tree::Leaf(3));

This allows for much cleaner code ; and the memory waste is gone!

Pattern matching on abstract classes

The BadTree class, which uses an enum to distinguish branches and nodes, makes it possible to performe pattern matchings over the Tree class, by switching on the Tree.type field. Though this approach directly translates to the Tree abstract class using the is keyword ((Java equivalent: instanceof)), it’s not ideal.

  Tree tree = (...);

  switch (tree.type) {
    case TreeType.BRANCH:
      (...); break;
    case TreeType.LEAF:
      (...); break;
  }

  Tree tree = (...);

  if (tree is Tree::Branch) {
    Tree::Branch branch = 
      (Tree::Branch) btree;
    (...);
  } else if (tree is Tree::Leaf) {
    Tree::Leaf leaf = 
      (Tree::Leaf) tree;
    (...);
  }

Though functional, this new version of the pattern-matching code is not really pretty. A much cleaner approach is to implement the pattern matching directly in the derived classes (Branch and Leaf), by declaring an abstract method in the parent class, Tree. That is, instead of having one big function doing some explicit pattern matching in the Tree class, we’ll have multiple specialized functions — one in each class inheriting Tree. Philosophically speaking, just like we implemented type disjunctions as different classes inheriting a common parent, we’ll implement logical disjunctions as different functions overriding a common parent.

For example, here is how we can write a Fork function, which returns a new Tree with every leaf split in two identical leaves:

  abstract class Tree {
    public abstract Tree Fork();

    public static class Leaf : Tree {
      public T value;

      // Constructor omitted

      public override Tree Fork() {
        return new Tree::Branch(new Tree::Leaf(value), new Tree::Leaf(value));
      }
    }

    public static class Branch : Tree {
      public Tree left, right;

      // Constructor omitted

      public override Tree Fork() {
        return new Tree::Branch(left.Fork(), right.Fork());
      }
    }
  }

Going further: a real-life example

The benefits of using this abstract class approach are particularly visible when the data types involved get more complex. Below is an example implementation of a data structure used to describe arithmetic expressions.

  abstract class ArithmeticExpression {
    // Evaluates an arithmetic expression. context is a dictionary mapping variable
    //  names to their values.
    public abstract float? Eval(Dictionary context);

    public class Variable : ArithmeticExpression {
      string label;

      public override float? Eval(Dictionary context) {
        float? value;
        context.TryGetValue(label, out value);
        return value;
      }
    }
    
    public abstract class BinaryOperation : ArithmeticExpression {
      protected ArithmeticExpression left_operand, right_operand;

      public BinaryOperation(ArithmeticExpression left_operand, ArithmeticExpression right_operand) {
        this.left_operand = left_operand;
        this.right_operand = right_operand;
      }

      protected abstract float? Apply(float left_value, float right_value);

      public override float? Eval(Dictionary context) {
        float? left_result = left_operand.Eval(context), right_result = right_operand.Eval(context);

        if (left_result.HasValue && right_result.HasValue)
          return Apply(left_result.Value, right_result.Value);
        else
          return null;
      }

      class Sum : BinaryOperation {
        public Sum(ArithmeticExpression left_operand, ArithmeticExpression right_operand) : base(left_operand, right_operand) { }

        protected override float? Apply(float left_value, float right_value) {
          return left_value + right_value;
        }
      }

      class Subtraction : BinaryOperation {
        public Subtraction(ArithmeticExpression left_operand, ArithmeticExpression right_operand) : base(left_operand, right_operand) { }

        protected override float? Apply(float left_value, float right_value) {
          return left_value - right_value;
        }
      }

      class Product : BinaryOperation {
        public Product(ArithmeticExpression left_operand, ArithmeticExpression right_operand) : base(left_operand, right_operand) { }

        protected override float? Apply(float left_value, float right_value) {
          return left_value * right_value;
        }
      }

      class Division : BinaryOperation {
        public Division(ArithmeticExpression left_operand, ArithmeticExpression right_operand) : base(left_operand, right_operand) { }

        protected override float? Apply(float left_value, float right_value) {
          return left_value / right_value;
        }
      }
    }
  }

Code crumbs, by Clément Pit-Claudel

The cereal box problem: How many boxes does it take to find all prizes?

Numerical and experimental simulation

Calculations and simulation code

Theoretical average

Numerical simulations

Conclusion

An experimental estimation of the entropy of English, in 50 lines of Python code

Introduction and theoretical background

Definitions

Experimental estimation of the entropy rate of English

Experimental results

Generating fake English text

Results: a collection of random Shakespeare samples

\(n=2\): second-order approximation

\(n=3\): third-order approximation

\(n=5\): fifth-order approximation

\(n=8\): eigth-order approximation

Words as tokens

Third-order word approximation

Conclusion

Linear time probabilistic pattern matching and the Rabin-Karp algorithm

Algorithm

Checksum functions

Polynomial hash functions

Polynomial identity testing

Rolling checksum update

Python implementation

Implementation notes

Incorrect matches: theoretical bounds and experimental results.

Streaming

Streaming implementation notes

Performance

Multi-pattern search

Conclusion & further reading

Six months of Windows Phone Development — Tips, tricks, and performance considerations

Introduction

App development tips: DOs and DON’Ts

DO keep the size of your XAP package (reasonably) small

DO use the Silverlight Toolkit for Windows Phone

The PerformanceProgressBar

Page transitions (toolkit:TransitionService.NavigationInTransition)

DON’T show a minimized application bar on a page with TextBoxes

DON’T use the GestureService from the Silverlight toolkit

Cool development tips and code snippets

Override system-default colors and brushes if your app uses a custom background

Let users select and copy text

Share a background on all your pages

Performance tips

Load your data files faster

Load your pages faster

Don’t use the PivotControl for what it’s not made for

Get better I/O performance

Ditch String.Split

Set images as Content, not resource

Minify your XML files.

Use defaults in your XML serialization attributes

Improve perceived performance by showing a message with actual content while the app loads

Conclusion

Modeling and measuring string comparison performance in C, C++, C# and Python.

Theoretical amortized cost

Infinite strings

Finite strings

Conclusion

Experimental measurements

Remarks

Conclusion

Generating uniformly random data from skewed input: biased coins, loaded dice, skew correction, and the Von Neumann extractor

Von Neumann skew-correction algorithm

Biased coins

Removing the bias from a stream of random digits

Probabilistic remarks

Expected rejection rate

Expected number of tosses

A generalization to loaded dice

Implementation

Conclusion and further remarks

The cafeteria paradox: stop using the water dispenser while someone else does!

An intuitive presentation

A mathematically accurate description

The `PerformanceProgressBar`

Page transitions (`toolkit:TransitionService.NavigationInTransition`)

DON’T use the `GestureService` from the Silverlight toolkit

Don’t use the `PivotControl` for what it’s not made for