<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
 <channel>
  <title>aj's junkcode</title>
  <description/>
   <link>http://junkcode.erisian.com.au/</link>
     
   <item>
    <link>http://junkcode.erisian.com.au/dir2tree</link>
    <title>dir2tree</title>
    <guid isPermaLink="false">dir2tree</guid>
	<pubDate>Wed, 17 Mar 2010 03:44:39 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>dir2tree</h1>

<p>A simple script to convert a list of paths (like the output of find, say) into a more tree-oriented list, so it's a little easier to see the underlying structure.</p>

<pre><code>#!/usr/bin/env python

import sys
</code></pre>

<p>The way this works is to keep track of the last full path displayed as a list of components with slashes at the ends of directory components (eg ["usr/", "bin/", "vi"]):</p>

<pre><code>path = []
</code></pre>

<p>We then read every path on stdin:</p>

<pre><code>for file in sys.stdin.readlines():
    file = file.rstrip()
    filepaths = file.split("/")
    filepaths = [ f+"/" for f in filepaths[:-1]] + [filepaths[-1]]
</code></pre>

<p>...and construct the corresponding list for that path.</p>

<p>We then match the path we just generated with the path so far, and see where they start to differ:</p>

<pre><code>    skip = len(path)
    for i,p in enumerate(path):
        if i &gt;= len(filepaths) or p != filepaths[i]:
            skip = i
            break
</code></pre>

<p>...cut off the now irrelevant parts:</p>

<pre><code>    path = path[:skip]
</code></pre>

<p>...and print out (and append) each of the additional path components, indented appropriately.</p>

<pre><code>    for x in filepaths[skip:]:
        print "\t"*len(path) + x
        path.append(x)
</code></pre>
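<p>Put together (and wrapped in a function with a hypothetical name, so it's easy to exercise), the whole thing behaves like this:</p>

```python
# The same algorithm as above, collected into one function; "dir2tree"
# is just a hypothetical name for this sketch.
def dir2tree(lines):
    path, out = [], []
    for file in lines:
        parts = file.rstrip().split("/")
        parts = [p + "/" for p in parts[:-1]] + [parts[-1]]
        # find where the new path diverges from the last one printed
        skip = len(path)
        for i, p in enumerate(path):
            if i >= len(parts) or p != parts[i]:
                skip = i
                break
        path = path[:skip]
        # print each remaining component, indented by depth
        for x in parts[skip:]:
            out.append("\t" * len(path) + x)
            path.append(x)
    return out

for line in dir2tree(["usr/bin/vi", "usr/bin/vim", "usr/lib/libc.so"]):
    print(line)
```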



]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/hackoff10-redpill</link>
    <title>hackoff10-redpill</title>
    <guid isPermaLink="false">hackoff10-redpill</guid>
	<pubDate>Thu, 04 Mar 2010 05:59:47 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>Hackoff 2010 Problem 5</h1>

<p>Problem is to extract the letters highlighted in red from an animated gif. We assume the frames of the gif have already been pulled apart into pnm files.</p>

<pre><code>#!/usr/bin/env python

import pygame, sys, time, numpy
</code></pre>

<p>Our character grid is 10 by 12 with a one pixel border that contains the highlight, so we note that down for reference and create a surface that we'll copy our character into to play with later:</p>

<pre><code># x 10 +12x
# y  8 +14y

target = pygame.Surface((10,12))
targetrect = target.get_rect()
</code></pre>

<p>Once we've got the character we'll want to "OCR" it and remember it for later; we'll use two arrays for that: </p>

<pre><code>found = []
redbits = []
</code></pre>

<p>The "found" array just keeps track of the different bitmaps for each character we've seen (so there'll be one bitmap for every "A" and one for every "4" etc), and the "redbits" tells us which bitmap was found. Later we'll associate a bitmap from found with an actual character, and can then convert redbits into a string which will contain the answer we need.</p>

<p>So to handle each frame we first load it, then iterate over every potential character, checking for a red border. If we find one, we copy the character onto its own surface, convert that into a simple python list representing the bitmap, and then either find it in found and note its index in redbits, or add it to found and note the new index in redbits.</p>

<pre><code>def handle_frame(name):
    frame = pygame.image.load(name)
    framerect = frame.get_rect()

    for x in range(10, framerect.width, 12):
        for y in range(8,framerect.height, 14):
            col = frame.get_at((x,y))
            if col == (255,0,0,255):
                target.blit(frame, targetrect, pygame.Rect(x+1,y+1,10,12))
                z = pygame.surfarray.array2d(target).tolist()
                for idx in range(len(found)):
                    if z == found[idx]:
                        redbits.append(idx)
                        break
                else:
                    redbits.append(len(found))
                    found.append(z)
</code></pre>
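<p>The found/redbits bookkeeping can be sketched without any pygame involved -- the "bitmaps" below are just stand-in strings:</p>

```python
# Deduplication bookkeeping: each distinct bitmap gets an index in
# `found`; `redbits` records which index was seen at each position.
def note_bitmap(z, found, redbits):
    for idx in range(len(found)):
        if z == found[idx]:
            redbits.append(idx)
            break
    else:
        # not seen before: it gets the next free index
        redbits.append(len(found))
        found.append(z)

found, redbits = [], []
for z in ["A", "B", "A", "C", "B"]:  # stand-ins for bitmap lists
    note_bitmap(z, found, redbits)
print(found)    # → ['A', 'B', 'C']
print(redbits)  # → [0, 1, 0, 2, 1]
```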

<p>Our initial job is then to load each frame:</p>

<pre><code>for i in range(1,1349):
    handle_frame("05-red-pill-blue-pill.gif.%04d.gif.pnm" % i)
</code></pre>

<p>And once we've done that, we need to OCR the characters. It turns out there's only 16 different ones, so there's no problem passing that off to the user:</p>

<pre><code>translate = []
for f in found:
    print numpy.array(f)
    translate.append(sys.stdin.readline().strip())
</code></pre>

<p>Well, there is a minor problem in that numpy.array ends up printing the bitmap transposed (pygame's surfarray is indexed x-then-y), but that's pretty easy to deal with.</p>

<p>As is getting the actual answer at that point:</p>

<pre><code>print "".join([translate[i] for i in redbits])
</code></pre>

<p>Et voila.</p>
]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/py-dir</link>
    <title>py-dir</title>
    <guid isPermaLink="false">py-dir</guid>
	<pubDate>Thu, 05 Nov 2009 14:13:09 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>Reimplementing dir()</h1>

<p>Python 2.6 introduces a special <code>__dir__</code> method for objects, which overrides the default behaviour of the dir() builtin -- the function that provides a list of useful attributes of an object.</p>

<p>Unfortunately dir() has a few special cases that probably should be handled if it's being overridden.</p>

<p>Here's a translation from the C source of Python's dir() builtin:</p>

<pre><code>import types

def default_dir(obj):
    d = {}

    def merge_class_dict(cls):
        if hasattr(cls, "__dict__"):
            d.update(cls.__dict__)
        for base in getattr(cls, "__bases__", []):
            merge_class_dict(base)
        return d

    if type(obj) == types.ModuleType:
        d.update(obj.__dict__)
    elif type(obj) == type or type(obj) == types.ClassType:
        merge_class_dict(obj)
    else:
        d.update( getattr(obj, "__dict__", {}) )
        d.update( (m, None) for m in getattr(obj, "__members__", []) )
        d.update( (m, None) for m in getattr(obj, "__methods__", []) )
        if hasattr(obj, "__class__"):
            merge_class_dict(obj.__class__)

    d = d.keys()
    d.sort()
    return d
</code></pre>

<p>This would be slightly simpler if object implemented <code>__dir__</code>, so inheritance could be used:</p>

<pre><code>class DirableObject():
    def __dir__(self):
        return list(set(self.__dict__.keys() + dir(self.__class__)))

class DirableType(DirableObject):
    def __dir__(self):
        d = set(self.__dict__.keys())
        for base in self.__bases__:
            d |= set(dir(base))
        return list(d)

class DirableModule(DirableObject):
    def __dir__(self):
        return self.__dict__.keys()        
</code></pre>

<p>And, of course, that in turn is completely usable right now! Huzzah!</p>
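<p>For instance, dir() already consults an overridden <code>__dir__</code> (the class and attribute names here are just for illustration):</p>

```python
# dir() calls __dir__ when an object defines it, and sorts the result.
class Point(object):
    def __init__(self):
        self.x, self.y = 1, 2

    def __dir__(self):
        # advertise a computed attribute alongside the real ones
        return ["x", "y", "norm"]

print(dir(Point()))  # → ['norm', 'x', 'y']
```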

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/key-deck</link>
    <title>key-deck</title>
    <guid isPermaLink="false">key-deck</guid>
	<pubDate>Thu, 17 Sep 2009 10:33:50 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>Key Deck</h1>

<p>More of a thought experiment than anything else. Inspired both by the fact that netbooks and cloud storage make it somewhat plausible to replace your computing platform cheaply, and by the fact that the prospect of having your computer pilfered, inspected or broken by immigration/security officials when traveling seems to be getting higher.</p>

<p>The solution? Keep your data online and encrypted, then when you get to your destination, buy a cheap netbook, and download all your stuff fresh. The only problem is transporting your secret key. Having it on an SD card in a camera might work okay, and you could even hide it steganographically in an image if you wanted. But if you don't want to carry anything technological with you, a standard deck of cards can store 225 bits in its ordering -- and if you've got an extra six cards in your deck (eg, a Knight card in each suit and two Jokers) you can handle 256 bits.</p>

<p>There are two possible ways to handle the conversion: you can either use a subset of the cards to specify your value (representing 0 by no cards, 1-52 by just one card, etc) or always use the entire deck (0 is the deck in its standard order, 1 is the first two cards swapped, etc). The former lets you represent a few more numbers, but in reality that works out to less than an extra bit or so, which makes it not terribly useful.</p>
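<p>The bit counts above are easy to sanity-check: a deck of n cards has n! orderings, so it stores log2(n!) bits.</p>

```python
# Sanity-checking the claimed capacities: 52 cards give a bit over
# 225 bits, 58 cards comfortably cover 256.
import math

bits52 = math.log(math.factorial(52), 2)
bits58 = math.log(math.factorial(58), 2)
print(int(bits52), int(bits58))  # → 225 260
```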

<p>So here's the code:</p>

<pre><code>#!/usr/bin/env python

# Copyright (c) 2009 Anthony Towns
# GPLv2
</code></pre>

<h2>Deck setup</h2>

<p>A standard 52 card deck, where "AS" is "ace of spades", "TH" is the "ten of hearts", etc:</p>

<pre><code>deck52 = [b+a for a in "CDHS" for b in "A23456789TJQK"]
</code></pre>

<p>A 58 card deck, where "ND" is "knight of diamonds", and Jokers are represented by "XX" and "YY":</p>

<pre><code>deck58 = [b+a for a in "CDHS" for b in "A23456789TJNQK"]+["XX","YY"]
</code></pre>

<p>You can, of course, use non-standard orders for these decks, for extra obscurity. The secrecy is intended to be encoded in the actual pack of cards, however.</p>

<h2>Converting a number to an ordering</h2>

<p>This is almost a base conversion function, using the cards as digits -- except that each time we use a card, our base reduces by one. Seen that way the code is fairly straightforward:</p>

<pre><code>def num_to_cards(n,deck):
    assert n &gt;= 0

    deck, base = deck[:], len(deck)
    res = []

    while deck != []:
        n, digit, base = n // base, n % base, base - 1
        res.append(deck.pop(digit))

    assert n == 0
    return res
</code></pre>
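<p>With a toy three-card deck the shrinking-base behaviour is easy to see: there are 3! = 6 orderings, one for each n from 0 to 5.</p>

```python
# Base conversion where the base shrinks by one per digit (same
# algorithm as above, exercised on a toy deck).
def num_to_cards(n, deck):
    assert n >= 0
    deck, base = deck[:], len(deck)
    res = []
    while deck != []:
        n, digit, base = n // base, n % base, base - 1
        res.append(deck.pop(digit))
    assert n == 0
    return res

for n in range(6):
    print(n, num_to_cards(n, ["A", "B", "C"]))
# 0 → ['A', 'B', 'C'] (standard order) ... 5 → ['C', 'B', 'A'] (reversed)
```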

<h2>Converting from an ordering to a number</h2>

<p>We're essentially doing the same thing in reverse here. Note that we start with the least significant digit, though.</p>

<pre><code>def cards_to_num(seq,deck):
    n, mul, base = 0, 1, len(deck)
    deck = deck[:]

    for s in seq:
        r = deck.index(s)
        deck.pop(r)
        n += r * mul
        mul *= base
        base -= 1

    return n
</code></pre>
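<p>A quick round trip (with the per-digit multiplier spelled out) confirms the two functions invert each other:</p>

```python
# Round-trip check: encode a number as a deck ordering, decode it back.
def num_to_cards(n, deck):
    deck, base = deck[:], len(deck)
    res = []
    while deck:
        n, digit, base = n // base, n % base, base - 1
        res.append(deck.pop(digit))
    assert n == 0
    return res

def cards_to_num(seq, deck):
    n, mul, base = 0, 1, len(deck)
    deck = deck[:]
    for s in seq:
        r = deck.index(s)
        deck.pop(r)
        n += r * mul
        mul *= base   # the place value grows by the shrinking base
        base -= 1
    return n

deck52 = [b + a for a in "CDHS" for b in "A23456789TJQK"]
for n in (0, 1, 12345, 2**128 - 1):
    assert cards_to_num(num_to_cards(n, deck52), deck52) == n
print("round trip ok")
```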

<p>And there you have it. An example of it in practice looks like:</p>

<pre><code>if __name__ == "__main__":
    o = num_to_cards(2**128-1, deck52)
    print ", ".join(o)
    # 9D, 5C, AC, TC, 6D, 8D, QD, 5S, 3D, TD, AH, 3C, 4S, 4C, 7D, KS,
    # JH, 3H, KH, TH, 6H, 6C, 3S, 2S, 7C, 2C, 8C, 9C, JC, QC, KC, AS,
    # 6S, 7S, 8S, 9S, TS, JS, QS, 2H, 4H, 5H, 7H, 8H, 9H, QH, AD, 2D,
    # 4D, 5D, JD, KD

    n = cards_to_num(o, deck52)
    print "n = %d = 2**128-%d" % (n, 2**128-n)
    # n = 340282366920938463463374607431768211455 = 2**128-1
</code></pre>

<p>Also note that the largest number representable by a deck is represented by reversing the deck, so:</p>

<pre><code>    big52 = cards_to_num(reversed(deck52), deck52)
    print "big52 == %.2f%% of 2**226" % (big52 * 100.0 / 2**226)
    # big52 == 74.79% of 2**226

    big58 = cards_to_num(reversed(deck58), deck58)
    print "big58 == %.2f%% of 2**261" % (big58 * 100.0 / 2**261)
    # big58 == 63.44% of 2**261
</code></pre>

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/multiple_files</link>
    <title>multiple_files</title>
    <guid isPermaLink="false">multiple_files</guid>
	<pubDate>Mon, 07 Sep 2009 07:51:32 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>multiple files per page</h1>

<p>So I think the solution here is something like this...</p>

<ol>
    <li>Add a markup indicator to associate code blocks within a page with a filename (this also allows the page "Foo" to generate "foo.c" without running afoul of sputnik's behaviour for ".")</li>
    <li>Add a sputnik action that gives the code for file X from page Y (eg, Foo.code?file=foo.c)</li>
    <li>Change the markup decoding to add a panel on the right-hand side listing the files within the page, and linking to the code action for each</li>
    <li>If there're "foo.ext" files in git, combine them into a faux "foo.mdwn" page.</li>
</ol>

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/py-struct</link>
    <title>py-struct</title>
    <guid isPermaLink="false">py-struct</guid>
	<pubDate>Tue, 01 Sep 2009 18:43:10 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>struct for Python</h1>

<p>C structs are great -- they let you structure a little chunk of memory so that an integer goes here and a float goes there, give names to each component, and use it all efficiently.</p>

<p>There's a bunch of ways to simulate that in Python, such as using a tuple (efficient, but loses the names), or a dictionary (mystruct["member"] is awkward, and not very efficient) or the <code>__dict__</code> member of an object (nicer referencing, but still inefficient and somewhat annoying to construct).</p>

<p>The <code>__slots__</code> feature provides a nice answer -- it lets you have a class-like interface with tuple-like efficiency; and it also gives you errors if you try adding new attributes at runtime, which is a sign you want more than a struct after all.</p>

<p>(An alternative way of looking at it is this provides mutable named tuples)</p>

<p>Here's how you do it. First, a base class, with methods to initialise an object and compare it in a tuple-like manner:</p>

<pre><code>class Struct(object):
    __slots__ = ()

    def __init__(self, *args, **kwargs):
        d = dict(zip(self.__slots__, args))
        d.update(kwargs)
        for x in self.__slots__:
            setattr(self, x, d.get(x,None))

    def __cmp__(self, other):
        if not hasattr(other,"__slots__"):
            return -1 # BUG -- should do something more intelligent :(
        if self.__slots__ != other.__slots__:
            return cmp(self.__slots__, other.__slots__)
        for s in self.__slots__:
            c = cmp(getattr(self,s), getattr(other,s))
            if c != 0: return c
        return 0
</code></pre>

<p>Usage is then to declare what fields the structure has:</p>

<pre><code>class Xyzzy(Struct):
    __slots__ = ("foo", "bar", "baz")
</code></pre>

<p>and use it:</p>

<pre><code>&gt;&gt;&gt; x1 = Xyzzy()
&gt;&gt;&gt; x2 = Xyzzy(foo=5, bar=6, baz=7)
&gt;&gt;&gt; x3 = Xyzzy(5, 6, 7)
&gt;&gt;&gt; x4 = Xyzzy(5, baz=7)
&gt;&gt;&gt; [a.foo for a in [x1,x2,x3,x4]]
[None, 5, 5, 5]
&gt;&gt;&gt; [a.bar for a in [x1,x2,x3,x4]]
[None, 6, 6, None]
&gt;&gt;&gt; [a.baz for a in [x1,x2,x3,x4]]
[None, 7, 7, 7]

&gt;&gt;&gt; Xyzzy(1,2,3)==Xyzzy(1,2,3)    # same value?
True
&gt;&gt;&gt; Xyzzy(1,2,3) is Xyzzy(1,2,3)  # same object?
False
&gt;&gt;&gt; cmp(Xyzzy(1,2,3),Xyzzy(1,3,3))
-1
&gt;&gt;&gt; cmp(Xyzzy(1,2,3),Xyzzy(1,1,3))
1
</code></pre>

<p>Neato.</p>
]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/redcat</link>
    <title>redcat</title>
    <guid isPermaLink="false">redcat</guid>
	<pubDate>Mon, 24 Aug 2009 19:52:44 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>redcat</h1>

<h2>Introduction</h2>

<p>Since about <a href="http://www.erisian.com.au/wordpress/2005/10/16/tiffani">four years ago</a> Debian has distributed patches to its indexes of packages available for download -- so that rather than having to download the full lists (currently about 30MB uncompressed or 6MB compressed) every time you want to get a package that's only just become available, you can just get the changes to the list (which after compression is only about 10kB every six hours).</p>

<p>See the link above for background, but the basic summary is that every time the archive is updated, a compressed "ed-style" diff is generated that lists each line that was changed in the original Packages or Sources index -- since those are text files, with stanzas sorted by package name, that works out very well.
An additional index of the ed-style diffs is also constructed so apt can determine what it needs to download and apply to update its old indexes, and that's pretty much it.</p>

<p>Unfortunately, this turns out to be a bit slow for many people -- as attested to by apt bugs <a href="http://bugs.debian.org/383881">#383881</a> and <a href="http://bugs.debian.org/372712">#372712</a>, probably among others. The problem is twofold. First, editing a 30MB file isn't always trivial: you need to read 30MB off the disk, write 30MB to the disk, and if you want to avoid doing that multiple times, you need enough memory to cache 60MB (mmaps of both the original file and the modified file); and if you've got multiple entries in your sources.list, you need to multiply that out for each of them. That might not be a major problem on a modern system. Second, it often becomes a problem when multiple pdiffs are required to bring your Packages file up to date, because apt simply applies each pdiff in turn (pausing to download and decompress the next pdiff each time).</p>

<p>It's possible, at least in theory, to do substantially better than this: downloading all the pdiff files necessary, manipulating them as necessary, and then applying them in a single pass over the original index should reduce the delays substantially. Unfortunately this isn't so easy in practice, as the manipulation necessary to apply a series of diffs in one pass isn't at all obvious.</p>

<p>This program aims to remedy that, providing a straightforward implementation of the necessary manipulations. It takes as input an ordered series of ed diffs, combines them internally, then either (a) outputs a single ed diff that has the same result as applying them all in sequence, or (b) applies the patches against stdin.</p>

<p>It's python based, so here's the boilerplate:</p>

<pre><code>#!/usr/bin/env python

# Copyright (c) 2009 Anthony Towns
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
</code></pre>

<h2>Dealing with diffs</h2>

<p>We're going to start with a basic class representing a set of modifications that apply to a text file. Only two sorts of modifications are permitted: inserting some lines, and deleting some lines. This matches what is handled by diff, and what is needed for this problem, though obviously these are nowhere near the only modifications that could potentially be made.</p>

<p>In order to manage the internal data structure a little more easily, we'll use the following "Struct" class which allows us to define classes with just a few fixed members and easily create and use them.</p>

<pre><code>class Struct(object):
    __slots__ = ()
    def __init__(self, *args, **kwargs):
        d = dict(zip(self.__slots__, args))
        d.update(kwargs)
        for x in self.__slots__:
            setattr(self, x, d.get(x,None))
</code></pre>

<p>The data structure we use is important. It's essentially a single ordered array of changes in the form of an offset (how many lines after the end of the last change does this change begin), how many lines to delete (possibly zero), and what lines to add (possibly none). </p>

<pre><code>class Change(Struct):
    __slots__ = ("offset", "del_cnt", "add_cnt", "add_lines")
</code></pre>
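<p>To make the representation concrete, here's a toy example reusing the Struct class above, with a hypothetical apply_changes helper: turning ["a", "b", "c", "d"] into ["a", "X", "Y", "d"] is a single Change that skips one line, deletes two, and adds two.</p>

```python
class Struct(object):
    __slots__ = ()
    def __init__(self, *args, **kwargs):
        d = dict(zip(self.__slots__, args))
        d.update(kwargs)
        for x in self.__slots__:
            setattr(self, x, d.get(x, None))

class Change(Struct):
    __slots__ = ("offset", "del_cnt", "add_cnt", "add_lines")

def apply_changes(changes, lines):
    # hypothetical helper: apply an ordered change list to a line list
    out, pos = [], 0
    for c in changes:
        out += lines[pos:pos + c.offset]   # copy unchanged lines
        pos += c.offset + c.del_cnt        # skip the deleted lines
        out += c.add_lines                 # emit the added lines
    return out + lines[pos:]

mods = [Change(1, 2, 2, ["X", "Y"])]
print(apply_changes(mods, ["a", "b", "c", "d"]))  # → ['a', 'X', 'Y', 'd']
```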

<p>In addition, since changes will generally be nearby, we'll keep a cache of the current absolute position in the file -- that is, the current change we're looking at, and the line number where the previous change was finished. It would be fantastic if we could cache this for every modification, but that would mean we'd have to update every line number after any insertion/deletion we make, so it's better to just use offsets.</p>

<pre><code>class LineNrCache(Struct):
    __slots__ = ("change", "line")

class Modifications(object):
    def __init__(self):
        self.changes = []  # list of Change objects
        self.line_nr_cache = LineNrCache(change=0, line=0)
</code></pre>

<p>The main operation on this data structure will be inserts, like so:</p>

<pre><code>    def insert(self, i, offset, del_cnt, add_lines):
        self.changes.insert(i,
            Change(offset, del_cnt, len(add_lines), add_lines) )
</code></pre>

<p>But before we can do anything useful with this, we need to be a bit cleverer as to how we manipulate it. The most important aspect is that we want to be able to easily make modifications at a particular line number -- so we'll create a function to prepare our data structure for precisely that. mod_for() will take a line number, and return an index into the array that's calibrated so that deletions and insertions happen at the right line number, based on all the current offsets (it will both start from line_nr_cache and update it, to be a little clever about what's going on).</p>

<p>Implementing the function is then a matter of incrementing/decrementing until either we discover we've never looked at a line number this high before, and have to extend our array, or we find a pre-existing modification that covers our line number and just have to deal with the three possible states we could have: we'd skipped past this line in the offset, this line is in the middle of text we'd added previously, or this is exactly where we started making a change previously.</p>

<pre><code>    def mod_for(self, line):
        assert line &gt;= 0

        i = self.line_nr_cache.change
        pos = self.line_nr_cache.line

        changes = self.changes # abbreviation

        assert (i == len(changes) == 0) or (0 &lt;= i and i &lt; len(changes))

        while True:
            assert 0 &lt;= i and i &lt;= len(changes)
            if i == len(changes):
                self.insert(i, line-pos, 0, [])
                break

            if pos + changes[i].offset + changes[i].add_cnt &lt; line:
                pos += changes[i].offset + changes[i].add_cnt
                i += 1
                continue
            elif line &lt; pos:
                assert i &gt; 0
                i -= 1
                pos -= changes[i].offset + changes[i].add_cnt
                continue

            assert i == 0 or pos &lt;= line
            assert line &lt;= pos + changes[i].offset + changes[i].add_cnt
            if line &lt; pos + changes[i].offset:
                changes[i].offset -= (line-pos)
                self.insert(i, line-pos, 0, [])
                break
            elif pos + changes[i].offset &lt; line:
                split = line - pos - changes[i].offset
                self.insert(i+1, 0, 0, changes[i].add_lines[split:])
                changes[i].add_cnt = split 
                changes[i].add_lines = changes[i].add_lines[:split]
                i += 1
                break
            else: 
                assert line == pos + changes[i].offset
                break

        assert 0 &lt;= i and i &lt; len(changes)
        self.line_nr_cache.change = i
        self.line_nr_cache.line = line - changes[i].offset
        return i
</code></pre>

<p>Of course, endlessly splitting up our data structure is going to get annoying, but there's a simple observation we can make: if a modification's offset is zero, we can easily merge it with the previous modification. So when we manipulate the array, whether by calling mod_for or more manually, we'll make sure to check for this condition, and tidy up after ourselves.</p>

<pre><code>    def merge_next(self, i):
        assert i+1 &lt; len(self.changes) and self.changes[i+1].offset == 0

        c_i = self.line_nr_cache.change
        c_p = self.line_nr_cache.line
        if c_i == i+1:
            c_i -= 1
            c_p -= self.changes[i].offset + self.changes[i].add_cnt
            self.line_nr_cache.change = c_i
            self.line_nr_cache.line = c_p

        nxt = self.changes.pop(i+1)
        self.changes[i].del_cnt += nxt.del_cnt
        self.changes[i].add_cnt += nxt.add_cnt
        self.changes[i].add_lines = self.changes[i].add_lines + nxt.add_lines
</code></pre>

<p>So with all that established, we can now insert lines!</p>

<pre><code>    def insert_lines(self, line, strs):
        mod = self.mod_for(line)

        self.changes[mod].add_cnt += len(strs)
        self.changes[mod].add_lines = strs + self.changes[mod].add_lines

        if mod &gt; 0 and self.changes[mod].offset == 0:
            self.merge_next(mod-1)
            assert mod == len(self.changes) or self.changes[mod].offset &gt; 0

        assert mod-1 &lt;= 0 or self.changes[mod-1].offset &gt; 0
</code></pre>

<p>The other half of our job is, of course, to remove lines. This is a little bit more complicated, because we might find ourselves (a) removing original lines, (b) removing added lines, (c) overlapping previously removed lines, (d) crossing modification boundaries.</p>

<p>So we do this piecemeal. If our current block has added lines, those are the first to go. Having done that, if we want to remove more lines, we can, but only up to the offset to the next block. If we reduce that offset to zero, we need to merge the current block and the next block, at which point we can start again. (Of course, if there isn't a next block, we can remove as many lines as we like without worrying.) Iterating through that eventually gets enough lines deleted, and then we can just merge with the previous block if necessary, and declare ourselves successfully completed.</p>

<pre><code>    def remove_lines(self, line, cnt):
        mod = self.mod_for(line)

        changes = self.changes # abbreviation

        while cnt &gt; 0:
            k = min(changes[mod].add_cnt, cnt)
            changes[mod].add_lines = changes[mod].add_lines[k:]
            changes[mod].add_cnt -= k
            cnt -= k

            if mod+1 &lt; len(changes):
                k = min(changes[mod+1].offset, cnt)
                changes[mod].del_cnt += k
                changes[mod+1].offset -= k
                cnt -= k
                if changes[mod+1].offset == 0:
                    self.merge_next(mod)
            else:
                changes[mod].del_cnt += cnt
                cnt = 0

        if mod &gt; 0 and changes[mod].offset == 0:
            self.merge_next(mod-1)
            assert mod-1 == 0 or changes[mod-1].offset &gt; 0
</code></pre>

<p>Having gotten ourselves a list of modifications, applying them to a file is fairly easy: we copy some lines from input to output, skip some lines from the input, and then write the added lines to the output, and repeat.</p>

<pre><code>    def apply_against_file(self, file, out):
        for change in self.changes:
            for _ in xrange(change.offset):
                out.write(file.readline())
            for _ in xrange(change.del_cnt):
                file.readline()
            assert change.add_cnt == len(change.add_lines)
            for line in change.add_lines:
                out.write(line + "\n")
        while True:
            line = file.readline()
            if line == "":
                break
            out.write(line)
</code></pre>

<h2>Ed-style diffs</h2>

<p>So far we've been pretty generic: we've defined a class with an insert_lines function and a remove_lines function, but not any sort of file format that represents the lines to be inserted or removed. As per our motivation, we want to deal with ed-style diffs. First comes parsing, which we'll use a generator for. We'll only accept commands like "0a" (append after the 0th line), "1,2c" (change lines 1 to 2 to the text that follows) and "1,2d" (delete lines 1 to 2 inclusive). Note that with this syntax it's not possible to express "add a line containing a single dot (.)". You would generally append a line containing two dots and use the "s/.//" command afterwards for that operation; however, lines with a single dot aren't valid in the Packages or Sources files we're dealing with, so we simply won't address this issue.</p>

<pre><code>def red_components(file):
    cmdwanted = True
    lines = []
    for line in file:
        line = line.rstrip("\n")
        if cmdwanted:
            if line == "":
                continue

            line, cmd = line[:-1], line[-1]
            if "," in line:
                vals = line.split(",", 1)
            else:
                vals = line, line
            vals = [int(x) for x in vals]

            offset = vals[0]-1
            del_cnt = vals[1]-vals[0]+1

            if cmd == "d":
                yield (offset, del_cnt, [])
            elif cmd == "c":
                cmdwanted = False
            elif cmd == "a":
                offset += 1
                del_cnt = 0
                cmdwanted = False
            else:
                raise Exception("Could not parse ed command: \"%s%s\"" % 
                    (line,cmd))
        else:
            if line == ".":
                yield (offset, del_cnt, lines)
                lines = []
                cmdwanted = True
            else:
                lines.append(line)
    if not cmdwanted:
        raise Exception("ed input hit eof within text block")
    return
</code></pre>
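<p>Run on a tiny hand-written diff, the parser yields one (offset, del_cnt, add_lines) tuple per hunk. (The function is reproduced here verbatim so the sketch is self-contained.)</p>

```python
# The red_components generator from above, exercised on a small
# ed-style diff supplied as a list of lines.
def red_components(file):
    cmdwanted = True
    lines = []
    for line in file:
        line = line.rstrip("\n")
        if cmdwanted:
            if line == "":
                continue
            line, cmd = line[:-1], line[-1]
            if "," in line:
                vals = line.split(",", 1)
            else:
                vals = line, line
            vals = [int(x) for x in vals]
            offset = vals[0] - 1
            del_cnt = vals[1] - vals[0] + 1
            if cmd == "d":
                yield (offset, del_cnt, [])
            elif cmd == "c":
                cmdwanted = False
            elif cmd == "a":
                offset += 1
                del_cnt = 0
                cmdwanted = False
            else:
                raise Exception("Could not parse ed command: \"%s%s\"" %
                    (line, cmd))
        else:
            if line == ".":
                yield (offset, del_cnt, lines)
                lines = []
                cmdwanted = True
            else:
                lines.append(line)
    if not cmdwanted:
        raise Exception("ed input hit eof within text block")

diff = ["5,6d", "1,2c", "first", "second", ".", "0a", "header", "."]
for hunk in red_components(diff):
    print(hunk)
# → (4, 2, [])
#   (0, 2, ['first', 'second'])
#   (0, 0, ['header'])
```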

<p>With that little bit of parsing out of the way, we can extend our previous class to deal with Ed Diffs explicitly:</p>

<pre><code>class EdModifications(Modifications):
    def read_diff(self, file):
        for offset, del_cnt, add_lines in red_components(file):
            if del_cnt &gt; 0:
                self.remove_lines(offset, del_cnt)
            if add_lines != []:
                self.insert_lines(offset, add_lines)
        return
</code></pre>

<p>In addition, we'd like to be able to output a cumulative diff. There are any number of valid ways to do this as far as ed is concerned, but the way diff does it, and the only way apt is willing to accept, is to have each of the changes listed in reverse order -- from the ones affecting the end of the file to those affecting the start. The advantage this offers is that the line numbers specified in the ed file match the line numbers in the original file, which is kinda nifty.</p>

<p>Either way makes very little difference to us though. We can work out the old line numbers by simply counting the offsets (which was the text in the original file that makes it through unchanged) and the deleted text (which was also in the original file), and not counting the added text (which obviously wasn't in the original file). Iterating that way to the last modification lets us work out the line numbers we need, and then it's just a matter of iterating back to the start, and outputting in ed format.</p>

<p>(Note again that we don't take care to avoid lines with single dots, relying on them simply not appearing in the first place)</p>

<pre><code>    def write_diff(self, out):
        line = 0
        for change in self.changes:
            line += change.offset + change.del_cnt

        for change in reversed(self.changes):
            line -= change.del_cnt

            if change.del_cnt &lt;= 1:
                lines = "%d" % (line+1)
            else:
                lines = "%d,%d" % (line+1, line+change.del_cnt)
            if change.add_cnt == 0:
                if change.del_cnt == 0:
                    continue
                else:
                    out.write("%sd\n" % (lines))
            else:
                if change.del_cnt == 0:
                    out.write("%da\n" % (line))   # NB: not line+1
                else:
                    out.write("%sc\n" % (lines))
                for added_line in change.add_lines:
                    out.write("%s\n" % (added_line))
                out.write(".\n")
            line -= change.offset
</code></pre>

<h2>main()</h2>

<p>And that was our module. To demonstrate it we supply a simple main() function that accepts a series of ed diffs on the command line, and either outputs a cumulative diff, or acts as a filter on stdin applying all the patches in bulk.</p>

<pre><code>def main():
    import sys, gzip

    if len(sys.argv) == 1 or (len(sys.argv) == 2 and sys.argv[1] == "-f"):
        print "Usage: %s [-f] &lt;diff&gt;..." % (sys.argv[0])
        print """
One or more diff files must be specified. Gzip compression is assumed if
.gz extension is used.

If -f is specified, patch will be applied to stdin, with result on stdout.
Otherwise, combined diff is produced on stdout."""
        sys.exit(0)

    if sys.argv[1] == "-f":
        just_dump_ed = False
        files = sys.argv[2:]
    else:
        just_dump_ed = True
        files = sys.argv[1:]

    diffs = EdModifications()
    for filename in files:
        if filename.endswith(".gz"):
            file = gzip.GzipFile(filename, "r")
        else:
            file = open(filename, "r")
        diffs.read_diff(file)

    if just_dump_ed:
        diffs.write_diff(sys.stdout)
    else:
        diffs.apply_against_file(sys.stdin, sys.stdout)

if __name__ == "__main__":
    main()
</code></pre>

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/distrocmp</link>
    <title>distrocmp</title>
    <guid isPermaLink="false">distrocmp</guid>
	<pubDate>Wed, 19 Aug 2009 02:53:12 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>Distribution comparisons</h1>

<p>With Debian and Ubuntu so closely related, it's often interesting to see just how close the results are in technical terms. One of the easiest ways to do that is just by a raw package-by-package comparison. I've done that a few times now, including when <a href="http://www.erisian.com.au/wordpress/2004/10/23/yadfw">warty released</a> and when <a href="http://www.erisian.com.au/wordpress/2005/04/14/hedgehogs-and-the-military">hoary released</a> and more recently in <a href="http://lists.debian.org/debian-project/2009/08/msg00254.html">reviewing Debian and Ubuntu security support</a>.</p>

<p>Comparing a release of Debian with a release of Ubuntu in this manner isn't too hard: it's a matter of getting the Sources (or Packages) files for the releases, possibly merging main and universe for Ubuntu, or skipping the universe packages in Debian, and then looking at the version strings for the corresponding packages. It's usually interesting to check three different components of the version strings -- the upstream component gives you some idea how up to date the package is, the maintainer revision hints at how many patches have been applied, and if there's an "ubuntuNN" at the end you know you've got some Ubuntu patches in the package.</p>
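Those three components can be teased apart with a regular expression; as a rough standalone sketch (the script proper does this below, with apt_pkg handling the actual version ordering):

```python
import re

# Sketch: split a Debian/Ubuntu version string into upstream version,
# Debian revision, and any trailing ubuntuN marker.
re_ver = re.compile(r"^(.*?)(?:-([^-]*?))?(ubuntu[0-9]+)?$")

def split_version(v):
    return re_ver.match(v).groups()

print(split_version("1.2.3-4ubuntu2"))  # ('1.2.3', '4', 'ubuntu2')
print(split_version("1.2.3-4"))         # ('1.2.3', '4', None)
print(split_version("20090808"))        # ('20090808', None, None)
```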

<p>So here we go. Boilerplate to start:</p>

<pre><code>#!/usr/bin/env python

# Copyright (C) 2009  Anthony Towns &lt;aj@erisian.com.au&gt;
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
</code></pre>

<p>For parsing purposes we need a few external libraries -- regexps to pull version strings apart, compression libraries to get at the actual package
info, and apt_pkg to do accurate comparisons of Debian version strings. Unfortunately apt_pkg needs to be initialised before it'll do anything useful, so we'll get that out of the way here too.</p>

<pre><code>import re, bz2, gzip, apt_pkg
apt_pkg.init()
</code></pre>

<p>So the first bit of work we'll do is to parse the Packages/Sources files. We'll make a class that does that for us, and only keep the things we're really interested in -- package name, version, and section (so we can pull out main-v-universe info). We'll also allow ourselves to specify a couple of files at once, so that merging of main+universe in Ubuntu (or even main+contrib+non-free in Debian) is trivial.</p>

<pre><code>class Packages(object):
    def __init__(self, filenames):
        if type(filenames) != list:
            filenames=[filenames]

        self.packages={}
        self.section={}
        for f in filenames:
            self.read(f)

    def read(self, filename):
        if filename.endswith(".gz"):
            f = gzip.GzipFile(filename)
        elif filename.endswith(".bz2"):
            f = bz2.BZ2File(filename)
        else:
            f = open(filename)

        p,v,s=None,None,None
        for l in f.xreadlines():
            if l.startswith("Package: "):
                p=l[9:].strip()
            elif l.startswith("Version: "):
                v=l[9:].strip()
            elif l.startswith("Section: "):
                s=l[9:].strip()
            elif l == "\n":
                if p is not None and v is not None: self.packages[p]=v
                if p is not None and s is not None: self.section[p]=s
                p,v,s=None,None,None
</code></pre>

<p>So having gotten that far the next interesting thing to do is the actual comparisons. For probably no good reason we combine the packages objects into a hash -- the "d" entry is the Debian instance, the "u" entry is the Ubuntu one. The regular files are called "norm" (as in normal), the security ones "sec".</p>

<p>We handle the basic comparison by iterating through all packages and assigning a two-letter code to each package; "==" if they're the same, otherwise the first letter is "U" or "D" representing whoever's better, and the second letter is "*" if it's not in the other distro, or "u" if it's a newer upstream, "m" if it's a newer Debian revision, or "p" if it's just an ubuntuN patch.</p>

<pre><code>def compare_debian_ubuntu(norm):
    d=norm["d"].packages
    u=norm["u"].packages

    pkgs=list( set(d.keys()) | set(u.keys()) )
    pkgs.sort()

    re_deb=re.compile("^(.*?)(-([^-]*))?$")
    re_ubu=re.compile("^(.*?)(-([^-]*?))?(ubuntu[0-9]+)?$")

    res={}

    for p in pkgs:
        if p not in d:
            res[p]="U*"
            continue
        elif p not in u:
            res[p]="D*"
            continue

        if d[p]==u[p]:
            res[p]="=="
            continue

        (d_up,_,d_de) = re_deb.match(d[p]).groups()
        (u_up,_,u_de,u_ub) = re_ubu.match(u[p]).groups()

        def vercmp(x,y,char):
            x=apt_pkg.VersionCompare(x,y)
            if x&lt;0:
                return "U"+char
            elif x&gt;0:
                return "D"+char
            else:
                return "="+char

        if d_up != u_up:
            res[p]=vercmp(d_up,u_up,"u")
            continue
        elif d_de != u_de:
            res[p]=vercmp("%s-%s"%(d_up,d_de),"%s-%s"%(u_up,u_de),"m")
            continue
        elif u_ub is not None:
            res[p]="Up"
            continue
        else:
            res[p]="??"
            continue
    return res
</code></pre>

<p>We also supply a function to summarise those results, and expand the meaning of the two letter codes.</p>

<pre><code>def print_cnts(res):
    cnts={}
    str=""
    for x in res.values():
        cnts[x]=cnts.get(x,0)+1

    codes=["==","D*","Du","Dm","U*","Uu","Um","Up"]
    what=["same version", "only in Debian", 
            "Debian has newer upstream","Debian has newer patches",
            "only in Ubuntu", "Ubuntu has newer upstream", 
            "Ubuntu has newer Debian patches",
            "Ubuntu has ubuntuX patches"]
    total=sum(cnts.values())
    for c,w in zip(codes, what):
        if cnts.get(c,0) == 0: continue
        str += "%7d (%2.0f%%) %s\n" % (cnts[c],cnts[c]*100.0/total,w)
    return str
</code></pre>

<p>Comparing security responses builds on the earlier comparison -- with the added conundrum that packages are occasionally introduced in security updates even though they weren't in the original release. But otherwise it's pretty much the same -- it's mostly about returning a three character string that's a "D" or a "U" if there's only an update in one distro, or an "=" if both included an update, followed by a separating space and the two characters we already worked out.</p>


<pre><code>def compare_debian_ubuntu_security(norm,sec,res):
    deb_nonsec,deb=norm["d"],sec["d"]
    ubu_nonsec,ubu=norm["u"],sec["u"]
    d,u=deb.packages,ubu.packages

    pkgs=set(d.keys()) | set(u.keys())
    pkgs=list(pkgs)
    pkgs.sort()

    not_present = {"d":[],"u":[]}
    pull_for = {"d":[],"u":[]}
    counts={}

    for p in pkgs:
        if p in d and p not in deb_nonsec.packages:
            not_present["d"].append(p)
        if p in u and p not in ubu_nonsec.packages:
            not_present["u"].append(p)

        if p not in res:
            continue

        resp=res[p]

        if p not in d:
            l="U"
            if resp == "==": pull_for["d"].append(p)
        elif p not in u:
            l="D"
            if resp == "==": pull_for["u"].append(p)
        else:
            l="="
        which = "%s %s" % (l,resp)
        counts[which] = counts.get(which,0)+1

    return (counts,not_present,pull_for)
</code></pre>

<p>Because of all the special cases, summarising the results isn't quite as easy. We simplify it a bit by having a sub-function to collapse the results that are all essentially the same, but otherwise it's a matter of just looking at every possibility, and adding an explanation of what it means.</p>

<pre><code>def print_sec_cmp(x,norm,secs):
    counts,not_present,pull_for = x

    eq = ["= ==", "= Du", "= Dm", "= Uu", "= Um", "= Up",
          "D D*", "U U*",
          "D ==", "U ==",
          "D Du", "D Dm", "D Uu", "D Um", "D Up",
          "U Du", "U Dm", "U Uu", "U Um", "U Up"
        ]
    counts_o = [counts.get(x,0) for x in eq]

    for x in counts:
        if x in eq: continue
        print "Weird result: %s %d" % (x,counts[x])

    what=["same version",
            "Debian has newer upstream","Debian has newer patches",
            "Ubuntu has newer upstream", "Ubuntu has newer Debian patches",
            "Ubuntu has ubuntuX patches"]
    def report_subset(str,cmps):
        print str % sum(cmps)
        for n,w in zip(cmps, what):
            if n == 0: continue
            print "    %3d %s" % (n,w)

    report_subset("%d packages with security updates in both Debian and Ubuntu",
        counts_o[0:6])

    print "\n%d updates in Debian to Debian only packages" % (counts_o[6])
    print "%d updates in Ubuntu to Ubuntu only packages" % (counts_o[7])

    print "\n%d updates in Debian to packages with the exact same source in Ubuntu" % (counts_o[8])
    print "%d updates in Ubuntu to packages with the exact same source in Debian" % (counts_o[9])

    report_subset("\n%d packages updated in Debian but not Ubuntu",
        [0]+counts_o[10:15])

    report_subset("\n%d packages updated in Ubuntu but not Debian",
        [0]+counts_o[15:20])

    for distro in ["Debian","Ubuntu"]:
        l=distro[0].lower()
        if not_present[l]:
            print "\nUpdates in %s for packages not in the %s release:" % (distro,distro)
        for np in not_present[l]:
            print "    %s %s %s" % (np,secs[l].packages[np],secs[l].section[np])
        if pull_for[l]:
            print "\nPackages %s should pull advisories for:" % (distro)
        for pf in pull_for[l]:
            print "    %s %s %s" % (pf,norm[l].packages[pf],norm[l].section[pf])
    print ""
</code></pre>

<p>With all that worked out, we can do the actual deed! This is all hardcoded -- I pulled the various Sources files from the Debian and Ubuntu archives and snapshot.debian.net, and put them in the current working directory, and if you want the same results, you'll have to do that too... Anyway, the idea is to populate a dictionary with two Packages instances for each distro -- the original release, and the latest security updates.</p>

<pre><code>suites = {
    "etch": (Packages("Sources_etch_20070408.bz2"), 
             Packages("Sources_etch_security.bz2")),

    "lenny": (Packages("Sources_lenny_20090215.bz2"),
              Packages("Sources_lenny_security.bz2")),

    "feisty": (Packages(["Sources_feisty_main_20070417.bz2", 
                "Sources_feisty_universe_20070417.bz2"]),
               Packages(["Sources_feisty_main_security.bz2", 
                "Sources_feisty_universe_security.bz2"])),
    "hardy": (Packages(["Sources_hardy_main_release.bz2", 
                "Sources_hardy_universe_release.bz2"]),
              Packages(["Sources_hardy_main_security.bz2", 
                "Sources_hardy_universe_security.bz2"])),
    "intrepid": (Packages(["Sources_intrepid_main_20081120.bz2",
                  "Sources_intrepid_universe_20081120.bz2"]),
                 Packages(["Sources_intrepid_main_security.bz2",
                "Sources_intrepid_universe_security.bz2"])),
    "jaunty": (Packages(["Sources_jaunty_main_20090423.bz2",
                "Sources_jaunty_universe_20090423.bz2"]),
               Packages(["Sources_jaunty_main_security.bz2",
                "Sources_jaunty_universe_security.bz2"]))
}
</code></pre>

<p>And having done that, we add a little function that does the full comparison between two distros (using the right input format for the previously defined functions)...</p>

<pre><code>def pretty_compare(dname,uname):
    name="%s vs %s" % (dname,uname)
    norm={"d":suites[dname][0], "u":suites[uname][0]}
    sec={"d":suites[dname][1], "u":suites[uname][1]}

    cmp=compare_debian_ubuntu(norm)
    print "%s\n%s\n\n%s" % (name,"="*(len(name)),print_cnts(cmp))

    x=compare_debian_ubuntu_security(norm,sec,cmp)
    print_sec_cmp(x,norm,sec)
</code></pre>

<p>And then compare the distros that might be interesting:</p>

<pre><code>pretty_compare("etch","feisty")
pretty_compare("lenny","intrepid")
pretty_compare("lenny","jaunty")

pretty_compare("etch","hardy")
pretty_compare("lenny","hardy")
</code></pre>

<p>And finally make sure you get linked on <a href="http://lwn.net/Articles/345534/">LWN</a>, and you're done!</p>

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/cloneproof-ssd</link>
    <title>cloneproof-ssd</title>
    <guid isPermaLink="false">cloneproof-ssd</guid>
	<pubDate>Tue, 18 Aug 2009 00:28:36 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>Cloneproof Schwartz Sequential Dropping</h1>

<p>This script is designed to calculate the winner of a Debian vote, using the Cloneproof Schwartz Sequential Dropping method. This is described on Wikipedia as <a href="http://en.wikipedia.org/wiki/Schulze_method">the Schulze method</a>, and basically takes a list of preferences from voters, works out which options are preferred by a majority of the voters to which other options, and then resolves any cycles.</p>

<p>Anyway: boilerplate, first.</p>

<pre><code>#!/usr/bin/perl -w

# Copyright (c) 2001, 2002 Anthony Towns &lt;ajt@debian.org&gt;
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

use strict;
</code></pre>

<h2>Parse input</h2>

<p>On stdin, we expect the processed votes, of the form:</p>

<blockquote>
    <p>V: -324-1</p>
</blockquote>

<p>which is interpreted as ranking the 6th option as your 1st preference, etc, and
ranking the 1st and 5th options as equal last. Equal rankings are allowed and treated as expected, as is skipping a ranking.</p>

<p>Note that SPI's vote tabulation has in the past treated those as "unranked", and not expressing <em>any</em> preference between unranked options and ranked options. This code doesn't support that behaviour.</p>
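The pairwise tallying the Perl below performs can be sketched in Python (assuming, for simplicity, that every vote string is exactly as long as the number of options):

```python
def pairwise_prefs(votes, options):
    """Count how many votes rank option l strictly above option r, for each
    ordered pair (l, r).  '-' means unranked, i.e. worse than any explicit
    rank; ranks are hex digits so more than 9 preferences work."""
    beats = {}
    for vote in votes:
        ranks = [options + 1 if c == "-" else int(c, 16) for c in vote]
        for l in range(options):
            for r in range(l + 1, options):
                if ranks[l] < ranks[r]:
                    beats[(l + 1, r + 1)] = beats.get((l + 1, r + 1), 0) + 1
                elif ranks[r] < ranks[l]:
                    beats[(r + 1, l + 1)] = beats.get((r + 1, l + 1), 0) + 1
    return beats

b = pairwise_prefs(["-324-1"], 6)
print(b[(6, 3)])  # 1: option 6 (ranked 1st) beats option 3 (ranked 2nd)
```

(Unlike the Perl, this sketch simply omits pairs nobody expressed a preference between, rather than storing explicit zeroes.)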

<pre><code>my %beats;   # number of votes preferring A to B == $beats{"$a $b"}
my $candidates = 0;

while(my $line = &lt;&gt;) {
    chomp $line;
    next unless ($line =~ m/^V: (\S+)\s*$/);
    my $vote = $1;
    $candidates = length($vote) unless ($candidates &gt;= length($vote));
    for my $l (1..($candidates-1)) {
        for my $r (($l+1)..$candidates) {
            unless (defined $beats{"$l $r"}) {
                $beats{"$l $r"} = 0;
                $beats{"$r $l"} = 0;
            }

            my $L = substr($vote, $l-1, 1); $L=hex($L) unless $L eq "-";
            my $R = substr($vote, $r-1, 1); $R=hex($R) unless $R eq "-";

            $L = $candidates + 1 if ($L eq "-");
            $R = $candidates + 1 if ($R eq "-");

            if ($L &lt; $R) {
                $beats{"$l $r"}++;
            } elsif ($R &lt; $L) {
                $beats{"$r $l"}++;
            } else {
                # equally ranked, tsk tsk
            }
        }
    }
}
</code></pre>

<h2>Determine defeats</h2>

<p>We determine "defeats" based on how many votes ranked some candidate above another candidate. This is both to abbreviate some of the code later, and more importantly to allow us to easily drop defeats as part of the "sequential dropping".</p>

<pre><code>my %defeat = ();

print "Calculating defeats\n";
for my $l (1..$candidates) {
    for my $r (1..$candidates) {
        next if ($l == $r);

        my $LR = $beats{"$l $r"};
        my $RL = $beats{"$r $l"};
        if ($LR &gt; $RL) {
            $defeat{"$l $r"} = "$LR $RL";
        } elsif ($l &lt; $r &amp;&amp; $LR == $RL) {
            print "Exact tie between $l and $r : $LR $RL\n";
        }
    }
}
print "\n";
</code></pre>

<h2>Cloneproof SSD Calculation</h2>

<p>Finally we determine the winner according to Cloneproof SSD. The process is as follows:</p>

<ol>
    <li>Calculate Schwartz set according to uneliminated defeats.</li>
    <li>If there are defeats amongst the Schwartz set, then eliminate the weakest defeat(s), and go back to 1.</li>
    <li>Otherwise, there are no defeats left amongst the Schwartz set!</li>
    <li>If there is only one member in the Schwartz set, it wins.</li>
    <li>Otherwise, there is a tie amongst the Schwartz set.</li>
</ol>

<p>We do pretty verbose output, since this is meant for double-checking the vote results, not just figuring it out.</p>

<pre><code>my $phase = 0;
while(1) {
    $phase++;
    print "Defeats at beginning of phase $phase:\n";
    for my $d (sort keys %defeat) {
        my ($l, $r) = split /\s+/, $d;
        print "    $l beats $r: $defeat{$d}\n";
    }

    my @schwartz = calculate_schwartz();
    print "Schwartz set: " . join(",", @schwartz) . "\n";

    my @schwartzdefeats = 
        grep { defined $defeat{$_} } crossproduct(@schwartz);

    if (!@schwartzdefeats) {
        print "No defeats left in Schwartz set!\n";
        if (@schwartz == 1) {
            print "Winner is: $schwartz[0]\n";
        } else {
            print "Tie amongst: " . join(", ", @schwartz) . "\n";
        }
        last;
    }

    my $weakest = (sort { defeatcmp($defeat{$a},$defeat{$b}) }
         @schwartzdefeats)[0];

    my $weakstrength = $defeat{$weakest};
    print "Weakest defeat amongst schwartz set is $weakstrength\n";

    for my $d (@schwartzdefeats) {
        die "Defeat weaker than weakest! $d, $weakstrength, $defeat{$d}" 
            if (defeatcmp($defeat{$d}, $weakstrength) &lt; 0);
        if (defeatcmp($defeat{$d}, $weakstrength) == 0) {
            print "Removing defeat $d, $defeat{$d}\n";
            delete $defeat{$d};
        }
    }
    print "\n";
}
</code></pre>

<h2>Subs</h2>

<p>Okay, we cheated a bit there, just using crossproduct, defeatcmp and calculate_schwartz. Here are their definitions:</p>

<pre><code>sub crossproduct {
    my @l = @_;
    return map { my $k = $_; map { "$k $_" } @l } @l;
}

sub defeatcmp {
    my ($Awin, $Alose) = split /\s+/, shift;
    my ($Bwin, $Blose) = split /\s+/, shift;

    return 1 if ($Awin &gt; $Bwin);
    return -1 if ($Awin &lt; $Bwin);
    return 1 if ($Alose &lt; $Blose);
    return -1 if ($Alose &gt; $Blose);

    return 0;
}
</code></pre>

<h3>Calculate Schwartz set</h3>

<p>This is a bit complicated. Here's the theory.</p>

<p>We note that given two unbeaten subsets, S and T, their intersection S^T (ie, the elements in both) is also unbeaten; so either S^T is empty, or S^T is a smaller unbeaten subset.</p>

<p>We can thus find a unique, smallest unbeaten set containing each candidate
by a simple iterative method. This is find_unbeaten_superset.</p>

<p>So given the smallest supersets for each candidate, we have all the smallest
unbeaten subsets (since each one will be the smallest superset of any of
its members). So, each set is either a proper superset of another set
(and can thus be discarded), or it's a smallest unbeaten subset.</p>

<p>We eliminate the sets and then union the remainder (which are either
equal or disjoint), and we've thus found the Schwartz set. This is 
done in calculate_schwartz.</p>

<pre><code>sub find_unbeaten_superset {
    my @l = @_;
    for my $r (1..$candidates) {
        my $add = 1;
        for my $l (@l) {
            if ($l == $r) {
                $add = 0;
                next;
            }
        }
        next unless ($add);
        $add = 0;
        for my $l (@l) {
            if (defined $defeat{"$r $l"}) {
                $add = 1;
            }
        }
        if ($add) {
            return find_unbeaten_superset(@l, $r);
        }
    }
    return sort(@l);
}


sub is_subset {
    my @l = @{$_[0]};
    my @r = @{$_[1]};

    for my $x (@r) {
        last if (!@l);
        shift @l if ($l[0] == $x);
    }

    return !@l;
}


sub calculate_schwartz {
    my @schwartz = ();
    for my $k (1..$candidates) {
        my @us = find_unbeaten_superset($k);
        my $new = 1;
        for my $x (@schwartz) {
            if (is_subset($x, \@us)) {
                $new = 0;
            } elsif (is_subset(\@us, $x)) {
                $x = \@us;
                $new = 0;
            }
        }
        if ($new) {
            push @schwartz, \@us;
        }
        #print "$k : " . join(",", @us) . "\n";
        #print "schwartz : " . join(":", map { join(",", @{$_}) } @schwartz) . "\n";
    }
    my @result = ();
    for my $x (@schwartz) {
        if (!is_subset($x, \@result)) {
            @result = sort(@result, @{$x});
        }
    }
    return @result;
}
</code></pre>
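As a cross-check on calculate_schwartz, the Schwartz set can equivalently be computed from the transitive closure of the defeat relation: a candidate stays in iff no other candidate transitively defeats it without being transitively defeated in return. A hypothetical sketch in Python rather than Perl:

```python
def schwartz_set(candidates, defeats):
    # path[a][b]: is there a chain of (uneliminated) defeats from a to b?
    path = {a: {b: (a, b) in defeats for b in candidates} for a in candidates}
    for k in candidates:          # Floyd-Warshall style transitive closure
        for a in candidates:
            for b in candidates:
                if path[a][k] and path[k][b]:
                    path[a][b] = True
    # x is excluded iff some y beats x transitively but x never beats y back
    return sorted(x for x in candidates
                  if not any(path[y][x] and not path[x][y]
                             for y in candidates if y != x))

# hypothetical defeats: 1, 2 and 3 form a cycle, and each of them beats 4
defeats = {(1, 2), (2, 3), (3, 1), (1, 4), (2, 4), (3, 4)}
print(schwartz_set([1, 2, 3, 4], defeats))  # [1, 2, 3]
```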

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/rm-arch-qual</link>
    <title>rm-arch-qual</title>
    <guid isPermaLink="false">rm-arch-qual</guid>
	<pubDate>Thu, 06 Aug 2009 21:23:28 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h2>Introduction</h2>

<p>So this is a reasonably quickly hacked up script to make it easier to maintain the Debian release team's <a href="http://release.debian.org/squeeze/arch_qualify.html">architecture status page for squeeze</a>. It was prompted by a <a href="http://lists.debian.org/debian-devel/2009/08/msg00142.html">mail from Philipp Kern</a> complaining that managing the HTML tables was a bit of a pain.</p>

<p>So, boring preamble.</p>

<pre><code>#!/usr/bin/env python

# Copyright (c) 2009 Anthony Towns
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
</code></pre>

<p>The obvious idea to maintain the page would be exporting from a spreadsheet -- it's already tabular, so minimal interface changes, and there's plenty of usable spreadsheets around. But it's not easy to do things that way, and automating it further would also suck.</p>

<p>So the next obvious solution is to have some plain text files, and generate HTML from them. And the easiest way to do structured text files at the moment is YAML.</p>

<pre><code>import sys, yaml
</code></pre>

<h2>Formatting Helpers</h2>

<p>The way we'll handle the YAML contents is to pull each particular bit of information about an architecture out, interpret it, and then dump some HTML code for the table cell. In order to make that a bit easier, we'll use a bunch of interpretation functions that will take the data from YAML, and spit back a state (pass/warn/fail), and the actual contents for the cell (colouring happens later).</p>

<p>Boring helpers:</p>

<pre><code>def FAIL(value): return ("fail",value)
def WARN(value): return ("warn",value)
def PASS(value): return ("pass",value)
</code></pre>

<p>Dealing with "yes"/"no" states -- note YAML automatically translates "yes" to True, etc.</p>

<pre><code>def c_truth(value):
    if value == True:
        return PASS("yes")
    elif value == False:
        return FAIL("no")
    else:
        return WARN(value)

def c_untruth(value):
    if value == True:
        return FAIL("yes")
    elif value == False:
        return PASS("no")
    else:
        return WARN(value)

def c_str(value):
    if not value: return FAIL("-")
    return PASS(value)
</code></pre>

<p>For the porterbox field, we'll try turning it into a link to the machine db:</p>

<pre><code>def c_host(value):
    if not value: return FAIL("none")
    return PASS('&lt;a href="http://db.debian.org/machines.cgi?host=%s"&gt;yes&lt;/a&gt;'%(value))
</code></pre>

<p>For numbers, the function needs to know what value ranges are acceptable -- so we specify a maybe value (which will demote a failure to a warning), and an okay value (which gets a pass). Bigger numbers are assumed to be better.</p>

<pre><code>def c_num(maybe,okay):
    def c_list_f(value):
        if value &lt; maybe: return FAIL(value)
        if value &gt;= okay: return PASS(value)
        return WARN(value)
    return c_list_f
</code></pre>

<p>Same thing for lists of values: how many elements are okay, how many just need a warning? Since the cell can only display a count, we also have a mouse-over title that gives the full list.</p>

<pre><code>def c_list(maybe,okay):
    def c_list_f(value):
        n=len(value)
        str='&lt;acronym title="%s"&gt;%s&lt;/acronym&gt;' % (", ".join(value),n)
        if n &lt; maybe: return FAIL(str)
        if n &gt;= okay: return PASS(str)
        return WARN(str)
    return c_list_f
</code></pre>

<h2>Criteria</h2>

<p>So that gets us to the point where we can define the release criteria for architectures:</p>

<pre><code>criteria = [
    ("available",         c_truth),
    ("portbox",           c_host),
    ("porters",           c_list(2,5)),
    ("users",             c_num(30,50)),
    ("installer",         c_str),
    ("archive-coverage",  c_num(90,95)),
    ("archive-uptodate",  c_num(95,97)),
    ("buildds",           c_list(2,2)),
    ("buildd-redundancy", c_truth),
    ("buildd-fast-sec",   c_truth),
    ("buildd-247",        c_truth),
    ("concerns-rm",       c_untruth),
    ("concerns-srm",      c_untruth),
    ("concerns-dsa",      c_untruth),
    ("concerns-sec",      c_untruth),
    ("candidate",         c_truth),
]
</code></pre>

<h2>Table Output</h2>

<p>Output of the HTML table is pretty straightforward.</p>

<pre><code>def dump_table(info,waivers):
    arches=info.keys()
    arches.sort()

    colour = {"pass":"lime", "warn":"yellow", "fail":"red", "waiv":"cyan"}

    candidacy_at_risk = {}

    print "&lt;table&gt;"
    print "&lt;tr&gt;&lt;td&gt;&lt;/td&gt;"
    for arch in arches:
        print "&lt;td&gt;%s&lt;/td&gt;" % (arch)
        candidacy_at_risk[arch] = False
    print "&lt;/tr&gt;"

    for c,c_fn in criteria:
        print "&lt;tr&gt;&lt;td&gt;%s&lt;/td&gt;" % (c)
        for arch in arches:
            v=info[arch].get(c,None)

            if c=="candidate" and candidacy_at_risk[arch]:
                if v == True: v = "at risk"

            if v is not None:
                col,contents = c_fn(v)
            else:
                col,contents = FAIL("-")

            w = waivers.get(arch,{}).get(c,None)
            if w:
                col="waiv"
                contents += ' &lt;a href="%s"&gt;(w)&lt;/a&gt;' % (w)

            if col=="fail":
                candidacy_at_risk[arch]=True

            print '&lt;td bgcolor="%s"&gt;%s&lt;/td&gt;' % (colour[col],contents)

        print "&lt;/tr&gt;"

    print "&lt;/table&gt;"
</code></pre>

<h2>main()</h2>

<p>Having done all the interesting work, we just need a way to invoke it. The main() function here loads one or two YAML files (the arch details and the waivers list), and then hands them over.</p>

<pre><code>def main(argv):
    if len(argv) &lt; 2 or argv[1] == "-h":
        print "Usage: %s &lt;arch-spec.yaml&gt; [&lt;waivers-spec.yaml&gt;]" % argv[0]
        sys.exit(1)

    info = yaml.load(open(argv[1]))
    if len(argv) &gt;= 3:
        waivers=yaml.load(open(argv[2]))
    else:
        waivers={}

    dump_table(info,waivers)

if __name__ == "__main__":
    main(sys.argv)
</code></pre>
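For reference, an arch-spec YAML file matching the criteria above might look something like this -- all the values here are invented for illustration, not real archive data:

```yaml
armel:
  available: yes
  portbox: "examplebox"           # hypothetical hostname
  porters: [porter1, porter2, porter3]
  users: 42
  installer: "d-i beta1"
  archive-coverage: 96
  archive-uptodate: 98
  buildds: [buildd1, buildd2]
  buildd-redundancy: yes
  buildd-fast-sec: yes
  buildd-247: no
  concerns-rm: no
  concerns-srm: no
  concerns-dsa: no
  concerns-sec: no
  candidate: yes
```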

<p>Voila!</p>

]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/imo2009q6</link>
    <title>imo2009q6</title>
    <guid isPermaLink="false">imo2009q6</guid>
	<pubDate>Wed, 05 Aug 2009 18:11:25 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<p>Apparently it's IMO season, and in honour of such, Terry Tao 
<a href="http://terrytao.wordpress.com/2009/07/20/imo-2009-q6-as-a-mini-polymath-project/">re-posed one of the questions</a> on his blog as a "polymath" project -- something to be solved collaboratively over the web, rather than by individual effort. Two hundred or so comments later, a complete proof was achieved, though how much that benefited from the polymath aspect might be debatable. Anyway, it's a pretty neat problem:</p>

<blockquote>
    <p><strong>Problem 6.</strong> Let a<sub>1</sub>, a<sub>2</sub>, ..., a<sub>n</sub> be
    distinct positive integers and let M be a set of n-1 positive integers not
    containing s = a<sub>1</sub> + a<sub>2</sub> +...+ a<sub>n</sub>. A
    grasshopper is to jump along the real axis, starting at the point 0 and
    making n jumps to the right with lengths a<sub>1</sub>, a<sub>2</sub>, ...,
    a<sub>n</sub> in some order. Prove that the order can be chosen in such a way
    that the grasshopper never lands on any point in M.</p>
</blockquote>

<p>Depending on your interest in such things, you might have fun trying to puzzle it out on your own. If so, you might or might not want to read the comment thread above for ideas, but be cautious about the comments to Terry's <a href="http://terrytao.wordpress.com/2009/07/21/imo-2009-q6-mini-polymath-project-cont/">followup post</a> which include at least one complete proof.</p>

<p>Anyway, since I'm obviously not a l33t enough maths dude to come up with a proof myself, I thought it might be fun to see about converting the proof into a program (ie, "find a sequence of hops the grasshopper can take without landing on any point in M"). Something along the lines of <a href="http://elrinconde-ex.blogspot.com/2009/07/solving-simple-international.html">this blog post</a> from not too long ago. Happily it's a constructive proof, so that's an entirely reasonable challenge.</p>

<p>First, I'm going to generalise the problem slightly to make it a little easier to deal with: we'll allow the grasshopper to start at any point, z, on the number line; and we'll allow M to contain between zero and n-1 values, instead of exactly n-1 values, and we'll sort the two sequences from least to greatest. (That turns out to not be any more general in practice, but will make the notation easier)</p>

<p>As such, we'll define a function steps(a,m,z=0) that takes the two sorted lists of numbers and the starting position (defaulting to 0), and returns a list of steps that, taken in order, misses everything in m. We'll solve it recursively, starting with the base case of m being empty, in which case we can just take the steps in any order, without worrying about hitting a mine:</p>

<pre><code>#!python
def steps(a,m,z=0):
    assert(sorted(a)==a and sorted(m)==m)
    assert(len(m) &lt; len(a))
    if m == []: return a
</code></pre>

<p>The ideal first step to take would be the longest possible -- that's the most likely to pass a bunch of values in m, and generally gets us moving as fast as possible. The key insight from the proof (IMO) is taking this approach, and looking at how that interacts with the various possible arrangements of the first few values of m. There are two important issues: one, the most obvious, is whether one of the values of m is hit immediately by taking the largest possible step; and two, whether the largest possible step actually passes any of the values of m. We continue on by establishing these possibilities:</p>

<pre><code>    a_max = a[-1]
    m_min = m[0]

    a_max_in_m = (z+a_max) in m   # NB: an O(n) or O(lg(n)) step
    a_max_passed_any_m = (m_min &lt; z+a_max)
</code></pre>

<p>The most complicated case turns out to be when a_max_in_m and a_max_passed_any_m are both True. Since we know there are at least two distinct values in m (m_min and z+a_max), we can approach that by considering two jumps together, a_other followed by a_max. This won't hit (z+a_max), and there are at most n-2 other values in m, while there are n-1 values for a_other. So some pair of (z+a_other, z+a_other+a_max) has neither value in m, and avoids at least two members of m. That gives us an inductive step, though we have to check every element of a to find it:</p>

<pre><code>    if a_max_in_m and a_max_passed_any_m:
        for i,a_other in enumerate(a):
            if z+a_other not in m and z+a_other+a_max not in m:
                z_new = z+a_other+a_max
                a_rem = a[:i]+a[i+1:-1]
                m_rem = [m_i for m_i in m if m_i &gt; z_new]
                return [a_other,a_max]+steps(a_rem, m_rem, z_new)
        assert(False) # unreachable
</code></pre>

<p>Having dealt with that, we have the nice result that removing the smallest member of m ensures that there's no longer anything getting in the way of taking a_max as the next step. By doing that, we can recurse into a simpler problem, and manipulate the answer we get back to come up with a solution:</p>

<pre><code>    simpler = steps(a[:-1], m[1:], z+a_max)
</code></pre>

<p>Now we know this series of steps avoids every value of m with the possible exception of m_min, but we need to figure out whether it does hit m_min, and if so where. So let's do that:</p>

<pre><code>    sofar = z+a_max
    for i, s in enumerate(simpler):
        if sofar &gt;= m_min:
            break
        sofar += s
</code></pre>

<p>At this point we've either safely passed m_min, safely reached the end of our path, or hit m_min. Avoiding m_min is great: if we managed that, we're done.</p>

<pre><code>    if sofar != m_min:
        return [a_max]+simpler
</code></pre>

<p>But if we didn't, we're still not too badly off: we have a sequence of steps, s[:i], that ends on m_min, followed by steps s[i:] that safely avoid all the other values in m. But we can rearrange those steps to avoid all the mines entirely by taking step s[i] first, then steps s[1:i] and step s[0], and finally steps s[i+1:]. Since our first step, s[0], was the largest possible step, we know z+s[i]+sum(s[1:i]) is less than z+s[0]+sum(s[1:i])=m_min, and since m_min is the smallest value in m, all of those steps avoid every value in m. We also know that z+s[i]+sum(s[1:i])+s[0] > m_min, because that is precisely z+sum(s[:i+1]), which is a step past m_min, and known to avoid all the other values of m. And at that point we're really done:</p>

<pre><code>    return [simpler[i]] + simpler[0:i] + [a_max] + simpler[i+1:]
</code></pre>

<p>So there you have it. (That actually turned out slightly neater than the original proof in some respects, since a couple of the cases ended up merged)</p>
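<p>To convince myself the result is actually safe, a small checker (again my addition, not part of the script above) can replay a path against m:</p>

```python
def safe(path, m, z=0):
    # replay the hops from z and confirm none of them lands on a point in m
    pos = z
    for step in path:
        pos += step
        if pos in m:
            return False
    return True
```

<p>So safe(steps(a, m), m) should hold for any valid instance; eg, safe([2, 3, 1], [1, 3]) is True, while safe([1, 2, 3], [1, 3]) is False.</p>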

<p>Also somewhat interesting is where the assumptions of the problem make their way into the code. The fact that m doesn't include the total sum is relied upon to ensure simpler[i] actually exists; that each value in a is distinct is relied upon to ensure a_max is actually larger than simpler[i]; and that there are fewer distinct values in m than distinct values in a is relied upon to apply the pigeon-hole principle in the first branch.</p>

<p>Anyway, pretty neat, I think!</p>
]]></description>
   </item>
   <item>
    <link>http://junkcode.erisian.com.au/</link>
    <title>index</title>
    <guid isPermaLink="false">index</guid>
	<pubDate>Tue, 04 Aug 2009 19:25:28 +0000</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <description><![CDATA[

<h1>Pages</h1>

<h2>Programs</h2>

<ul>
    <li><a href="/imo2009q6">imo2009q6</a></li>
    <li><a href="/rm-arch-qual">rm-arch-qual</a></li>
    <li><a href="/cloneproof-ssd">cloneproof-ssd</a></li>
    <li><a href="/distrocmp">distrocmp</a></li>
    <li><a href="/redcat">redcat</a></li>
    <li><a href="/hackoff10-redpill">hackoff10-redpill</a></li>
    <li><a href="/dir2tree">dir2tree</a></li>
</ul>

<h2>Techniques</h2>

<ul>
    <li><a href="/py-struct">py-struct</a> -- usable "struct"-like classes in Python</li>
    <li><a href="/key-deck">key-deck</a> -- convert a 128 bit secret key into an arrangement of a deck of playing cards</li>
    <li><a href="/py-dir">py-dir</a> -- pure python reimplementation of dir()</li>
</ul>

<h2>Future</h2>

<h3>Easy</h3>

<ul>
    <li>bubblesim</li>
    <li>budget</li>
    <li>nounweave?</li>
    <li>tagpending.py</li>
    <li>canonize-vote</li>
    <li>metaparse?</li>
    <li>foafrss</li>
    <li>nexusdigital sms script</li>
    <li>confluence/spreadsheet-kpi-thingy</li>
    <li>confluence-import-thingy</li>
    <li>pastel/hsoft-import-thingy</li>
    <li>my communicate() hack</li>
    <li>la trac bits</li>
    <li>blosxom2wp</li>
    <li>x500?</li>
</ul>

<h3>Multiple file tangle support</h3>

<ul>
    <li>britney-cell</li>
    <li>my hackfest mandelbrot</li>
    <li>jigdodl</li>
</ul>

<h3>Real projects</h3>

<ul>
    <li>ifupdown</li>
    <li>debootstrap?</li>
    <li>prediction-market</li>
    <li>pageant</li>
</ul>

<h1>Notes</h1>

<p>Code is a sputnik instance. It can be browsed (cgi/wiki, including revisions) or built (tangle to make/tar) and have bugs tracked. The result should be code that looks attractive, that's easy to edit, and that's usable.</p>

<p>This is my junkcode instance, which is meant to hold a bunch of lightweight modules. Larger projects should have their own wikis.</p>

<h2>How this works</h2>

<ol>
    <li>Set up a sputnik wiki (done!)</li>
    <li>Use python-markdown to go from "wiki" markup to pretty-printed display (done!)</li>
    <li>Have an action to extract code from a page (done!)</li>
</ol>

<h2>What's still needed</h2>

<ol>
    <li>Some way to have <a href="/multiple_files">multiple files</a> in a single page (code and data, tests, examples, formal verification, etc)</li>
    <li>Some way to dump code in directly without having to mark it up or enclose it in lua on disk</li>
    <li>Some way to have a project of multiple files/pages (maybe can setup a tgz action on a page with links to other pages that creates a tarball of those pages, properly renamed and versioned?)</li>
    <li>Some way to do literate-programming-style tangling, for code that's rearranged into readable order, rather than the way the programming language wants it (cf pascal). This may not be worthwhile though...</li>
</ol>

<h2>What would be awesome</h2>

<ol>
    <li>bug tracking, somehow? should be orthogonal to the revision history of the code, which maybe means it might as well be entirely separate...</li>
</ol>

]]></description>
   </item>
 </channel>
</rss>
