<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-1806360094658697411</id><updated>2025-08-27T00:50:49.984-04:00</updated><title type='text'>The Lowly Programmer</title><subtitle type='html'>Musings in code</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>28</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-7774054165909597425</id><published>2021-04-26T01:04:00.001-04:00</published><updated>2021-04-26T01:04:12.156-04:00</updated><title type='text'>The Rise of Datacenter Computing</title><content type='html'>&lt;p&gt;The coming decades will see a 
significant change in the usage and deployment of “compute” (herein CPU cycles, storage, etc), specifically relating to the rise of datacenter computing. I don’t
    know the exact form this change will take - though I will point to the first
    phases of it below - but significant paradigm shifts of some kind seem
    inevitable.&lt;/p&gt;
&lt;p&gt;Note that I’m not talking about IoT or anything in that space, though
      it’s peripherally related. Here I’m specifically talking about the rise of
  datacenters as the majority share of compute &lt;i&gt;power&lt;/i&gt;, and how the economy of computing and our individual access will be reshaped by
      it.&lt;/p&gt;
&lt;h2&gt;Part 1: The Data Center of Gravity&lt;/h2&gt;
&lt;p&gt;Two trends here seem clear: first, the balance of compute power is going
      to continue shifting towards datacenters and away from personal devices
      (even as those devices proliferate); and second, economics will dictate a
      shift in how this capacity is delivered to meet the ever-growing
      demand.&lt;/p&gt;
&lt;p&gt;The balance of compute power will shift inexorably &lt;b&gt;towards Centralized&lt;/b&gt;: as the fraction of GDP &lt;a
      href=&quot;https://www.fortunebusinessinsights.com/semiconductor-market-102365&quot;&gt;put towards compute&lt;/a&gt;
      rises, the portion of it situated in the home and on the person will fall,
      as will non-datacenter corporate compute.&lt;/p&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;p&gt;Centralized is efficient: better power utilization (&lt;a
          href=&quot;https://en.wikipedia.org/wiki/Power_usage_effectiveness&quot;&gt;PUE&lt;/a&gt;), better resource utilization, and more cost-efficient devices (e.g.
          larger processors, rack-oriented computer designs) all favour
          datacenter compute. High-density computing also enables more
          efficient processing in itself: lots of compute, memory, and
          storage together on a high-bandwidth fabric permits algorithms
          that are ill-suited to - or impossible on - highly distributed
          small-scale devices.
    &lt;/li&gt;
    &lt;li&gt;&lt;p&gt;Our willingness to have compute resources poorly utilized and at high
          risk will decline. The compute capability of a cell phone or a
          personal computer already goes almost entirely wasted, which is
          tolerable for the convenience they bring at their relative cost. But
          would you want a $100k computer sitting idle at home, and a $10k cell
          phone in your pocket, if that’s what it took to keep up with compute
          demand growth? Almost certainly not. Relatedly, we’re seeing a
          rapid shift from low-efficiency corporate computing - on-prem, colo,
          and private small datacenters - towards larger datacenters and
          especially &lt;a
          href=&quot;https://www.digitalrealty.com/blog/what-is-hyperscale&quot;&gt;hyperscalers&lt;/a&gt;,
          with cost effectiveness as one of the driving motivators.
    &lt;/li&gt;
    &lt;li&gt;&lt;p&gt;If the electricity consumption of global compute rises
          from the present-day ~3% (including consumer devices, datacenters, and
          networks) to say &lt;a
          href=&quot;https://www.nature.com/articles/d41586-018-06610-y&quot;&gt;21% by 2030&lt;/a&gt;
          (or perhaps a few years later once &lt;a
          href=&quot;https://www.wired.com/story/data-centers-not-devouring-planet-electricity-yet/&quot;&gt;efficiencies are played out&lt;/a&gt;), we couldn’t carry all those watt-hours in our collective pocket
          even if we wanted to. At best the load could shift to remote devices
          that live in the home, but efficiency still pushes strongly for it
          being situated near cheap electricity instead.
    &lt;/li&gt;
  &lt;/ol&gt;
  &lt;p&gt;This is not to say that individuals won’t continue to grow their net
      compute usage. Rather, the trend of &lt;i&gt;comparatively&lt;/i&gt;
      weak computing devices on the person and in the home (e.g. 2-4 core cell
      phones) will continue, augmented by a growing fraction of compute
      happening remotely, as responsibilities are offloaded to centralized
      (datacenter) environments and their hundred-plus core machines.&lt;/p&gt;
  &lt;p&gt;To make this concrete: today the majority of compute capabilities (e.g.
      CPU cores) remain in the hands of consumers, in the form of cell phones,
      tablets, laptops, PCs, game consoles, and other such devices. But
      server-oriented compute devices are rapidly growing, with &lt;a
      href=&quot;https://www.forbes.com/sites/tomcoughlin/2020/05/29/hdd-market-history-and-projections/?sh=7f50485c6682&quot;&gt;disks&lt;/a&gt;
      leading the charge. Note that server-oriented disks do not dominate by
      unit &lt;i&gt;count&lt;/i&gt; yet, but already do by &lt;i&gt;capacity&lt;/i&gt;
      as the server- and consumer-oriented devices diverge. &lt;a
      href=&quot;https://www.investopedia.com/how-nvidia-makes-money-4799532&quot;&gt;GPUs&lt;/a&gt;
      are soon to tip towards server-dominant as well (and may have already when
      measured by capabilities, if we account for GPUs used for crypto mining
      that appear in the consumer “Graphics” segment, the capability
      segmentation between consumer and enterprise parts, and generally higher
      prices / revenue in the consumer market skewing revenue-based
      breakdowns). CPUs and RAM are the most difficult to pinpoint, but
      anecdotally, servers helping to &lt;a
      href=&quot;https://seekingalpha.com/article/4346547-micron-samsung-and-sk-hynix-dram-oligopoly&quot;&gt;hold up the RAM market&lt;/a&gt;, and AMD’s &lt;a
      href=&quot;https://www.forbes.com/sites/moorinsights/2021/03/24/amd-will-make-significant-gains-in-the-enterprise-with-milan/?sh=1dcb67ba3eb9&quot;&gt;datacenter-first strategy&lt;/a&gt;, suggest that we’re nearing the tipping point there too.&lt;/p&gt;
&lt;p&gt;By &lt;i&gt;usage&lt;/i&gt; aka &lt;i&gt;utilization&lt;/i&gt;, datacenters are likely already dominant, and growing rapidly. The
      average consumer device has an extremely low overall utilization
      (single-digit percentage), once you account for time-on-device, core
      scaling, and the frequency scaling devices use to keep power
      consumption low. Datacenters, by contrast, have entire teams dedicated to
      keeping utilization high and timing capacity delivery against demand,
      achieving overall utilizations reaching 50% and likely averaging 30-40%
      industry-wide. Combined with the relative share of compute capabilities,
      this suggests the balance of computing is &lt;i&gt;already&lt;/i&gt;
      happening in datacenters, with incremental compute coming online skewed
      even more heavily towards them.&lt;/p&gt;
  &lt;p&gt;On the economic side, the complexity of producing these datacenters will
      continue to shift as well. &lt;b&gt;“Supply” will tend towards fixed&lt;/b&gt; 
      over the coming decades: not deployed according to short-term needs, but
      rather, at a fixed pace determined over increasingly long time-scales.
      Consider the recurring semiconductor shortages (bottlenecked on
      fabrication capacity) combined with the rising price of chip fabs, the
      lead times on electric grid scaling, the lead time on adjusting mining
      output for rare earth elements, the cost of a modern datacenter building,
      and the general inertia of any industry as its share of global GDP
      rises.&lt;/p&gt;
  &lt;p&gt;This will see datacenter compute capabilities continue the shift into
      “base infrastructure” IaaS / PaaS plus higher levels layered above, with
      Public Clouds being only the first step on this road. One might also
      expect planning to become more centralized: akin to the oil
      industry, I could imagine the producers and consumers of bottleneck
      infrastructure resources - power, computer chips, flash, and disk drives
      today - coming together to commit to a particular capacity curve years or
      decades in advance, pre-committing demand in a way that makes the funding
      of mega fabs (&lt;a
      href=&quot;https://www.taiwannews.com.tw/en/news/3805032&quot;&gt;$20B&lt;/a&gt;
      today; $50B in a couple more generations) possible.&lt;/p&gt;
  &lt;p&gt;Beyond that, I’m not sure. Do whole countries start to take a seat at the
      negotiating table? The world’s reliance on TSMC for top-end chips, and the
      collaboration with the US government to land a fab on US soil, suggests
      it’s decently likely. Does the cryptocurrency share get reined in? I
      guess we’ll see.&lt;/p&gt;
  &lt;h2&gt;Part 2: “Personal” computing&lt;/h2&gt;
  &lt;p&gt;Let’s run a thought experiment: what does it look like for centralized
      “compute” to become available not only to the corporations who pay for it,
      but also directly and practically to the citizens of the world?
      What will it be used for? How will that be decided?&lt;/p&gt;
  &lt;p&gt;“Practically” is important - it’s true that individuals can sign up for
      accounts at public Clouds today, and use the many services available, but
      the vast majority do not have the knowledge necessary to do so
      successfully, whereas almost everyone knows how to download and run an app
      on a local device. As such, the predominant consumer compute patterns
      leveraged today are:&lt;/p&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;p&gt;“&lt;i&gt;Local general-purpose&lt;/i&gt;”: apps and programs running on the capabilities of a device you
          provide.
    &lt;/li&gt;
    &lt;li&gt;&lt;p&gt;“&lt;i&gt;Ad-supported&lt;/i&gt;”: websites, social networks, etc, where the load you personally
          contribute is small enough to be funded entirely by advertising or
          offered for free; requiring no payments either way.
    &lt;/li&gt;
    &lt;li&gt;&lt;p&gt;“&lt;i&gt;Remote fixed-cost&lt;/i&gt;” services: costly enough to host that they’re not offered for free,
          but where the load you personally contribute is still sufficiently
          bounded that a flat-rate subscription pricing model suffices.
    &lt;/li&gt;
    &lt;li&gt;&lt;p&gt;Plus a comparably small amount of “&lt;i&gt;remote compute services&lt;/i&gt;”: e.g. photo storage; paid for intentionally by the user, with
          variable costs or tiered plans.
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;p&gt;The paradigm of yore was “&lt;i&gt;local general-purpose&lt;/i&gt;” - you had a computer or device, and used it however you saw fit; the
      networks didn’t really exist to support anything else. The internet then
      ushered in an era of “&lt;i&gt;ad-supported&lt;/i&gt;” services, e.g. search engines, maps, and social networks. More recently,
      “&lt;i&gt;remote fixed-cost&lt;/i&gt;” has been on the rise, with services like Spotify or Netflix leading the
      charge into the home from the media front with VPNs riding on their
      coattails, team-oriented services like Zoom gaining significant mindshare
      in the “free for home, paid for business” space, and most recently “&lt;a
      href=&quot;https://en.wikipedia.org/wiki/Cloud_gaming&quot;&gt;cloud gaming&lt;/a&gt;” starting to take off as well. And finally, “&lt;i&gt;remote compute services&lt;/i&gt;” exist in limited form, particularly for storage applications, but have
      not yet taken off.
  &lt;/p&gt;
  &lt;p&gt;(Tangent: I looked up the Adobe Creative Cloud and other competitors to
      see if “paid cloud photo/video editing” is a thing, and it seems not yet?
      I’m a little surprised, as data-heavy applications that idle at 0 but
      could easily spike to hundreds of cores during active use seems inherently
      well suited to this model.)
  &lt;/p&gt;
  &lt;p&gt;You’ll note that I did not include “&lt;i&gt;remote general-purpose&lt;/i&gt;” in the list above, as I do not see this as a prevalent paradigm &lt;i&gt;today&lt;/i&gt;, with a bit of niche usage of “&lt;a
      href=&quot;https://en.wikipedia.org/wiki/Desktop_virtualization&quot;&gt;virtual desktop in the Cloud&lt;/a&gt;” style services and little else. But it’s likely coming next - I expect
      the consumer portion of computing to shift in decent part towards remote
      variable-cost compute as the local slice of compute consumption becomes
      relatively smaller, through some mix of proliferating “&lt;i&gt;remote compute services&lt;/i&gt;” offering pay-as-you-go or subscription access to myriad services for
      storage, photo/video editing, data analysis, model generation, data
      generation, etc; as well as new “&lt;i&gt;remote general-purpose&lt;/i&gt;” paradigms that have yet to be developed, such as “apps” executing on
      dynamic remote resources.&lt;/p&gt;
  &lt;p&gt;While this &lt;i&gt;could&lt;/i&gt;
      be presented as “desktop in the Cloud”, I doubt it will be - the desktop
      paradigm is suited to a small, always-on, fixed-size, pre-purchased unit
      of compute (one computer), and less clearly suited to expressing irregular
      bursts of short-lived, high-consumption activity. Solving for burst usage
      with cost predictability will likely require developing paradigms more
      inherently suited to that task, rather than simply transplanting a
      (relatively) dying paradigm into “the Cloud”. It seems much more likely to
      me that subscription services will take the broadest share, and decently
      likely that a cost-predictable paradigm for “remote apps” will be found to
      fill the remaining gap.
&lt;/p&gt;
&lt;p&gt;As a final thought to leave you with: the donation of spare cycles towards
    causes of interest may also look different in that future world. Today that
    means collectivist projects like &lt;a href=&quot;https://foldingathome.org/&quot;&gt;Folding@Home&lt;/a&gt;, or seeding torrents to challenge content “ownership”. Will these still
    exist, and in what form?&lt;/p&gt;
</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/7774054165909597425/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2021/04/the-rise-of-datacenter-computing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/7774054165909597425'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/7774054165909597425'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2021/04/the-rise-of-datacenter-computing.html' title='The Rise of Datacenter Computing'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-2314222549190704708</id><published>2012-11-01T15:53:00.000-04:00</published><updated>2012-11-12T16:31:09.553-05:00</updated><title type='text'>Hardware trends for datacenter computing</title><content type='html'>&lt;p&gt;Moore’s Law (and family) are a significant boon to anyone writing distributed systems. As the systems grow over time, so too does the underlying hardware get more powerful, allowing exponential growth to be readily managed. Programs need to be modified to cope, of course, but the raw compute power is there and available. That said, designing algorithms that can scale with more than an order of magnitude of growth remains a difficult challenge, and regular rewrites are inevitable. In the interest of minimizing that number, here are a few trends I’ve observed that should continue into the next few years. 
Caveat emptor, this is my personal perspective, but if it doesn&#39;t line up with your experience I&#39;d love to hear about it.&lt;/p&gt;&lt;h2&gt;1. Compute Units are Static&lt;/h2&gt;&lt;p&gt;&lt;ol type=&quot;a&quot;&gt;&lt;li&gt;Moore’s Law tells us that the number of transistors on a die doubles every 2 years, and practical experience tells us that performance doubles every 18 months. In the multicore era, however, the performance of a single core has largely plateaued, with further advances coming in the form of more cores per chip.&lt;/li&gt;
&lt;li&gt;RAM and on-die caches roughly keep pace with CPUs, which means that now that we’re into core count growth, the RAM and cache per core is now approximately fixed.&lt;/li&gt;
&lt;li&gt;Disk storage is also keeping the same pace, and hence disk per core is now also approximately fixed.&lt;/li&gt;
&lt;/ol&gt;&lt;/p&gt;&lt;p&gt;Thus, we can define a theoretical “compute unit” as, say, 1 core, 2GB RAM, 20GB flash*, 200 GB disk (adjust numbers as you see fit); this unit should stay roughly stable going forward, with the number of such units increasing exponentially.&lt;/p&gt;&lt;p&gt;Note that the purchasable range of resources-per-core is generally increasing; this represents the most economical average rather than a strict upper or lower bound. Resources do not need to be apportioned to each process in these ratios, either; e.g. disk-heavy and CPU-heavy processes can coexist on a machine and balance each other out. But it&#39;s worth looking closely at anything expected to grow significantly on one axis and not in others, to see if it&#39;s sustainable long term.&lt;/p&gt;&lt;p&gt;* Flash adoption remains spotty, but early indications put it roughly halfway between RAM and disk on performance and price, and it seems to be keeping up the growth pace. I speculate it will become a standard inclusion going forward, but this isn’t as clear as the others.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Update:&lt;/b&gt; One of my colleagues has pointed out that disk performance is not keeping pace with capacity increases (neither seek nor transfer rates). I&#39;m not sure how to include this detail in the compute unit definition itself, but it will likely prove to be significant and help drive the adoption of flash going forward.&lt;/p&gt;&lt;h2&gt;2. The Datacenter is the New Machine&lt;/h2&gt;&lt;p&gt;This is already largely true, but the trend will only get stronger. This is driven by two main trends:&lt;br /&gt;
&lt;ol type=&quot;a&quot;&gt;&lt;li&gt;NUMA architectures are reducing sharing and performance uniformity within machines, and&lt;/li&gt;
&lt;li&gt;Intra-datacenter connectivity is growing faster than individual machine performance, reducing the impact of communicating between machines and racks.&lt;/li&gt;
&lt;/ol&gt;&lt;/p&gt;&lt;p&gt;Thus, the trend of running programs as many smaller components that may or may not share any particular physical hardware will continue to get stronger. Correspondingly, virtualization will be increasingly necessary for isolation, further abstracting processes from the hardware they run on.&lt;/p&gt;&lt;p&gt;We may also start to see this trend spread to the client-side as well, but I’m not sure what form it would take. In-home compute sharing?&lt;/p&gt;&lt;h2&gt;3. Decomposable Programs are Ideal&lt;/h2&gt;&lt;p&gt;As a corollary to the previous point, algorithms that can be decomposed into fixed-sized chunks are best. They can be apportioned among CPUs/machines/racks easily, and scaling is decoupled from machine growth (which occurs in generational bursts when hardware is refreshed). Which means that if sharding is necessary, dynamic sharding - a variable number of fixed-size chunks - should be preferred over having a fixed number of shards that grow separately.&lt;/p&gt;&lt;p&gt;MapReduce (or Hadoop), Pregel (or Giraph), and GFS (or HDFS) are all examples of decomposable programs, tackling data processing, graph computation, and storage, respectively.&lt;/p&gt;&lt;h2&gt;4. Consumer Bandwidth is Lagging&lt;/h2&gt;&lt;p&gt;Consumer network growth is lagging behind intra- and inter-datacenter networks, and hardware trends. Some of this is due to a shift towards mobile, where equivalent resources (primarily images and videos) are smaller, but it holds on wired internet as well. It is unclear to me how long this will last - gigabit customer internet is already being tested in a few places, for example - but for the time being it remains true. This means inter-datacenter communication is growing cheaper relative to communicating with consumers. 
While it remains true, we should expect more work to be pushed remotely to the “cloud”, an increase in datacenter entities cross-communicating, and a slow shift towards resources for customers that are smaller over the wire but more computationally expensive (heavier compression, costlier codecs, etc).&lt;/p&gt;&lt;h2&gt;5. Latency Doesn’t Scale&lt;/h2&gt;&lt;p&gt;A fact that has always remained true: the speed of light is (effectively) constant, and does not decrease in line with other improvements. As bandwidth increases, latency will continue to dominate long-distance performance, and round-trip times will continue to constrain serial throughput. Together, these will continue to push a couple of long-lived trends:&lt;/p&gt;&lt;p&gt;&lt;ol type=&quot;a&quot;&gt;&lt;li&gt;Increasing data proximity. Caching (local and edge) and having data prefetched are key in reducing latency, and get comparatively cheaper over time. One could imagine future efforts that proactively push resources straight into user networks or devices as well.&lt;/li&gt;
&lt;li&gt;Supporting long-lived connections and message parallelism. Connection establishment requires at least one round trip, and serial use of a channel cannot fully utilize it. Together these will lead to more connection sharing where possible, and more long-lived idle connections otherwise.&lt;/li&gt;
&lt;/ol&gt;&lt;/p&gt;&lt;h2&gt;Further Reading:&lt;/h2&gt;&lt;p&gt;&lt;a href=&quot;http://www.ieee802.org/3/ad_hoc/bwa/BWA_Report.pdf&quot;&gt;IEEE Industry Connections Ethernet Bandwidth Assessment&lt;/a&gt;&lt;br /&gt;
(Thorough study of bandwidth trends at all levels)&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/2314222549190704708/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2012/11/hardware-trends-for-datacenter-computing.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2314222549190704708'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2314222549190704708'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2012/11/hardware-trends-for-datacenter-computing.html' title='Hardware trends for datacenter computing'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-4728624078059707982</id><published>2012-08-26T10:41:00.001-04:00</published><updated>2021-07-17T22:51:56.815-04:00</updated><title type='text'>Primes part 2: A segmented sieve</title><content type='html'>&lt;script language=&#39;JavaScript&#39; type=&#39;text/javascript&#39;&gt;
function expandContract(id) {
  var e = document.getElementById(id);
  if (e.style.display == &quot;block&quot;) {
    e.style.display = &quot;none&quot;;
  } else {
    e.style.display = &quot;block&quot;;
  }
}
&lt;/script&gt;&lt;p&gt;&lt;strong&gt;TL;DR: The sum of all primes &lt;= 1,000,000,000,000 is 18,435,588,552,550,705,911,377&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;This post is a followup to &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/03/writing-efficient-seive-of-eratosthenes.html&quot;&gt;Writing an efficient Sieve of Eratosthenes&lt;/a&gt;&lt;/i&gt;&lt;/p&gt;&lt;p&gt;A while back I wrote a post detailing a memory-efficient &lt;a href=&quot;https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes&quot;&gt;Sieve of Eratosthenes&lt;/a&gt;. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a &#39;vanilla&#39; implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% the throughput of the vanilla implementation, and maxed out at 2 billion candidates.&lt;/p&gt;&lt;p&gt;While researching that post, I noted that the most efficient generators at present use either the &lt;a href=&quot;https://en.wikipedia.org/wiki/Sieve_of_Atkin&quot;&gt;Sieve of Atkin&lt;/a&gt; or a &#39;segmented sieve&#39;. As an excuse to play with &lt;a href=&quot;https://golang.org/&quot;&gt;Go&lt;/a&gt;&lt;sup&gt;[&lt;a href=&quot;#ss_footnote1&quot;&gt;1&lt;/a&gt;]&lt;/sup&gt;, a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.&lt;/p&gt;&lt;h2&gt;Implementation&lt;/h2&gt;&lt;u&gt;Language:&lt;/u&gt; Go&lt;u&gt;&lt;br /&gt;
Time complexity:&lt;/u&gt;&lt;b&gt; &lt;/b&gt;O(n&amp;#8729;log(log(n)))&lt;br /&gt;
&lt;u&gt;Space complexity&lt;/u&gt;&lt;b&gt;: &lt;/b&gt;O(&amp;#8730;n)&lt;br /&gt;
&lt;u&gt;Candidates in 1 sec:&lt;/u&gt; ~50,000,000 &lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://gist.github.com/320271#file_segmented_primes.go&quot;&gt;Gist&lt;/a&gt; &lt;a href=&quot;javascript:expandContract(&#39;segmented_primes.go&#39;)&quot;&gt;(expand/contract)&lt;/a&gt;&lt;br /&gt;
&lt;div id=&quot;segmented_primes.go&quot; class=&quot;expandable&quot; style=&quot;display:none;&quot;&gt;&lt;script src=&quot;https://gist.github.com/320271.js?file=segmented_primes.go&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;The algorithm proceeds as follows:&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;Calculate primes up to &amp;#8730;max via a vanilla array sieve&lt;/li&gt;
&lt;li&gt;Slice up segments of about &amp;#8730;max candidates for checking&lt;/li&gt;
&lt;li&gt;To check a range,&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;For each prime p from 1., find the first multiple of p within the range that&#39;s &gt;= p&lt;sup&gt;2&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Cross off every multiple from there to the end of the range&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;Merge the results from the processed segments&lt;/li&gt;&lt;/ol&gt;&lt;/p&gt;&lt;p&gt;You&#39;ll note that other than splitting the full candidate set into segments, this is the standard Sieve of Eratosthenes. Hence, it&#39;s the &lt;i&gt;segmented&lt;/i&gt; Sieve of Eratosthenes.&lt;/p&gt;&lt;p&gt;In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.&lt;/p&gt;&lt;h2&gt;Results&lt;/h2&gt;&lt;p&gt;The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low-level code tuning :). I should point out that in the current implementation I also used a bit array rather than simply using an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact in either direction on CPU time required, and so could reasonably be dropped to shorten the code.&lt;/p&gt;&lt;p&gt;With all primes needing to be marshaled back to the main thread, parallelism maxes out below 2x linear. If all we care about is an aggregate value computed from the primes (the sum in this case), rather than the primes themselves in order, we can achieve &gt;4x parallelism simply by adding more processes.
This is also more efficient in general, and allows &gt;300,000,000 primes to be processed in one second&lt;sup&gt;[&lt;a href=&quot;#ss_footnote2&quot;&gt;2&lt;/a&gt;]&lt;/sup&gt;.&lt;/p&gt;&lt;p&gt;The net result is an implementation that can sieve 50M candidates in one second on one core or sum 300M on four; sum primes up to one trillion in an hour; or sum primes in a range of one billion (10^9) in the region of one quintillion (10^18) in under a minute. I&#39;m happy with that...for now.&lt;/p&gt;&lt;h2&gt;Footnotes&lt;/h2&gt;&lt;ol&gt;&lt;li&gt;&lt;a name=&quot;ss_footnote1&quot;&gt;Let me say right now that Go is a fantastic language to work with, being both easy to write and efficient to run. I fully intend to start writing some of my work projects in Go in the near future.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a name=&quot;ss_footnote2&quot;&gt;As noted in the previous post, we use the generic &quot;primes in one second&quot; metric for making apples-to-oranges comparisons of algorithms and implementations. This is not intended to provide anything more than a rough comparison.&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/4728624078059707982/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2012/08/primes-part-2-segmented-sieve.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/4728624078059707982'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/4728624078059707982'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2012/08/primes-part-2-segmented-sieve.html' title='Primes part 2: A segmented sieve'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-9039816403040700829</id><published>2012-08-19T09:24:00.001-04:00</published><updated>2021-07-17T22:51:36.152-04:00</updated><title type='text'>The algorithms of memory</title><content type='html'>&lt;p&gt;The human brain has the best storage system on the block in a lot of ways. It’s notably lossy and doesn’t have the easiest API to work with, but in terms of flexibility it’s second to none. Computer scientists have been trying to model and mimic its features for a lot of years, but we haven’t got it cracked quite yet. 
Part of the challenge lies in the sheer variety of access styles that human memory allows. I don’t think we even have them all &lt;i&gt;cataloged&lt;/i&gt; yet, much less working together in one system.&lt;/p&gt;&lt;p&gt;I’ve been trying over the last couple days to list all the different patterns I can see myself using. I’ve also tried to pull out systems I know of that do the same for comparison, although I can’t find direct equivalents in all cases. Those &lt;i&gt;without&lt;/i&gt; an existing equivalent are probably the most interesting - would mimicking these patterns be useful in the future?&lt;/p&gt;&lt;br /&gt;
&lt;style type=&quot;text/css&quot;&gt;
.memtable {border-collapse: collapse; padding: 2px;}
.memtable tr {border: 1px solid #C3C3C3; padding: 2px; vertical-align: middle; text-align: center;}
.memtable th {padding: 4px; vertical-align: middle; text-align: center;}
.memtable td {padding: 4px;}
&lt;/style&gt;&lt;table class=&quot;memtable&quot;&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col width=&quot;33%&quot;&gt;&lt;/col&gt;&lt;col width=&quot;33%&quot;&gt;&lt;/col&gt;&lt;col width=&quot;33%&quot;&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;th style=&quot;min-width: 90px;&quot;&gt;&lt;strong&gt;Name&lt;/strong&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;Mind example&lt;/strong&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;System example&lt;/strong&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;Characterized by&lt;/strong&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cache hit&lt;/td&gt;&lt;td&gt;Facts immediately available for use with no delay&lt;/td&gt;&lt;td&gt;In-memory data storage&lt;/td&gt;&lt;td&gt;Synchronous; low latency&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Long term storage&lt;/td&gt;&lt;td&gt;Facts that take a while to look up. “Um...his name... was.... Let&#39;s move on - it&#39;ll come to me in a minute.”&lt;/td&gt;&lt;td&gt;Lookups from disk or tape&lt;/td&gt;&lt;td&gt;Asynchronous (notify on delivery); high latency&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Reminders&lt;/td&gt;&lt;td&gt;Remembering to go buy groceries on your way home from work&lt;/td&gt;&lt;td&gt;Calendar notifications&lt;/td&gt;&lt;td&gt;Time or event based; defined in advance&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Information requests&lt;/td&gt;&lt;td&gt;All the associations that come up when you think of a topic. “New Orleans” brings to mind...&lt;/td&gt;&lt;td&gt;Web search&lt;/td&gt;&lt;td&gt;Web of relationships; can be explored further in any direction&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Background processing&lt;/td&gt;&lt;td&gt;Coming up with answers/solutions while you sleep or otherwise aren’t explicitly working on the problem&lt;/td&gt;&lt;td&gt;Uploading a video to YouTube - when you check again all the different formats and qualities will be available&lt;/td&gt;&lt;td&gt;Processing item while in storage; queued work requests; separated from foreground tasks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Full Scan&lt;/td&gt;&lt;td&gt;Life flashing before your eyes (&lt;a href=&quot;https://en.wikipedia.org/wiki/Life_review&quot;&gt;wiki&lt;/a&gt;)&lt;/td&gt;&lt;td&gt;Processing ordered events during crash recovery&lt;/td&gt;&lt;td&gt;Ordered; sequential scan of all items&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Guided randomness&lt;/td&gt;&lt;td&gt;Brainstorming, free association games, mad libs, improv&lt;/td&gt;&lt;td&gt;?&lt;/td&gt;&lt;td&gt;Random item output or random exploration of web; subject to limited constraints&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Unsolicited triggered reminders&lt;/td&gt;&lt;td&gt;Being reminded of facts/stories/events by things you see or hear&lt;/td&gt;&lt;td&gt;? [&lt;a href=&quot;#mem_footnote1&quot;&gt;1&lt;/a&gt;]&lt;br /&gt;
&lt;/td&gt;&lt;td&gt;Unsolicited notifications; loosely related to recent input&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Unsolicited untriggered reminders&lt;/td&gt;&lt;td&gt;Memories that come to mind with no discernible trigger, e.g. past regrets&lt;/td&gt;&lt;td&gt;?&lt;/td&gt;&lt;td&gt;Unsolicited notifications; no relation to recent input; may be randomly derived&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;State affecting&lt;/td&gt;&lt;td&gt;Memories that change your mood or attitude. E.g. remembering a birthday makes you happy; remembering a specific near miss makes you cautious.&lt;/td&gt;&lt;td&gt;? [&lt;a href=&quot;#mem_footnote2&quot;&gt;2&lt;/a&gt;]&lt;/td&gt;&lt;td&gt;State changes triggered by the contents of the information retrieved&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Expectations&lt;br /&gt;
&lt;font color=&quot;#aaaaaa&quot; style=&quot;font-size:smaller;&quot;&gt;(suggested by &lt;a href=&quot;https://www.reddit.com/user/e-socrates&quot;&gt;e-socrates&lt;/a&gt; on Reddit)&lt;/font&gt;&lt;/td&gt;&lt;td&gt;&quot;We constantly predict from memory what to expect, then compare that to experience&quot;&lt;/td&gt;&lt;td&gt;?&lt;/td&gt;&lt;td&gt;Continuous short-term predictions; characterizing events as unexpected or surprising&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;h2&gt;Footnotes&lt;/h2&gt;&lt;p&gt;&lt;ul&gt;&lt;li&gt;[&lt;a name=&quot;mem_footnote1&quot;&gt;1&lt;/a&gt;] Google Now is trying for this to some extent. Online advertising partially fits; however, it is not bubbling up details you already know - rather, it’s feeding you new data.&lt;/li&gt;
&lt;li&gt;[&lt;a name=&quot;mem_footnote2&quot;&gt;2&lt;/a&gt;] There are some trigger-based examples of this in security, e.g. logging of accesses to documents, but they don’t really change the state of the system itself (they trigger work in others instead).&lt;/li&gt;
&lt;/ul&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/9039816403040700829/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2012/08/the-algorithms-of-memory.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/9039816403040700829'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/9039816403040700829'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2012/08/the-algorithms-of-memory.html' title='The algorithms of memory'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-5436567093424877225</id><published>2012-01-30T00:05:00.000-05:00</published><updated>2012-01-30T00:05:33.736-05:00</updated><title type='text'>On hosting static files</title><content type='html'>&lt;p&gt;You may have noticed that I have a blog. Shocking, I know. It’s predominantly composed of text, as many are, but that’s not the &lt;em&gt;only&lt;/em&gt; content I like to share. Posts are more approachable and memorable if they contain images, &lt;a title=&quot;Marriage Sort, a visualization&quot; href=&quot;http://www.thelowlyprogrammer.com/2012/01/marriage-sort-visualization.html&quot;&gt;sorting visualizations&lt;/a&gt; require Javascript frameworks, etcetera. Unfortunately, all these support files need to be stored somewhere.&lt;/p&gt;&lt;p&gt;This blog is hosted on Blogger, a decision I’m very happy with. 
Blogger doesn’t do arbitrary file hosting (that I know of), but it &lt;em&gt;does&lt;/em&gt; support pushing images to &lt;a title=&quot;Picasa Web Albums&quot; href=&quot;http://picasaweb.google.com&quot;&gt;Picasa&lt;/a&gt;. So images are hosted by Picasa, another Google service, no sweat. Now, what about other files?&lt;/p&gt;&lt;p&gt;Hmm.&lt;/p&gt;&lt;p&gt;It turns out the answer to this question is a lot trickier than I would have hoped. I was hoping for another Google service, to keep my dependencies down, so I started with Sites.&lt;/p&gt;&lt;h2&gt;Google Sites&lt;/h2&gt;&lt;p&gt;“&lt;em&gt;Google Sites&lt;/em&gt; is a free and easy way to create and share webpages”, the tagline goes. And it is – we use it internally all the time. One quick form later, I have a site, and another couple of clicks puts a single attachment on it. Great! Update the post to reference it, double-check in incognito mode, and my work here is done. Right?&lt;/p&gt;&lt;p&gt;Well…not quite. It turns out that Sites uses some interesting redirect magic to actually link you to your content. Redirect magic that, for reasons I don’t understand, doesn’t actually resolve when you’re using it to hotlink content. To block exactly this sort of hotlinking, I’d guess, but I don’t know. Anyways, since I had visited this content &lt;em&gt;while on Sites&lt;/em&gt;, my browser had it cached, and even incognito mode could access it, but it wouldn’t resolve anywhere else. Which is a good reason to test things on more than one computer, I suppose.&lt;/p&gt;&lt;p&gt;Ok, not Sites. App Engine?&lt;/p&gt;&lt;h2&gt;Google App Engine&lt;/h2&gt;&lt;p&gt;App Engine &lt;em&gt;does&lt;/em&gt; do static file hosting. 
Instructions are &lt;a title=&quot;Google App Engine - Using Static Files&quot; href=&quot;http://code.google.com/appengine/docs/python/gettingstarted/staticfiles.html&quot;&gt;here&lt;/a&gt; – you must simply download the SDK, create the app.yaml file appropriately, upload your new App (the files must belong to something, after all), etc. This looks doable, but certainly non-trivial. I also cannot figure out how this is going to be priced, nor the latency to expect from it. So let’s keep looking.&lt;/p&gt;&lt;p&gt;I’m running out of Google services (I considered and rejected a few more). Time to look further afield. I don’t know of any good, trustworthy solutions offhand, and a quick search isn’t taking me anywhere I really want to go, so let’s look at Amazon AWS. They have a service for every occasion, right?&lt;/p&gt;&lt;h2&gt;Amazon AWS&lt;/h2&gt;&lt;p&gt;As it turns out, yes, yes they do. A quick look through the &lt;a title=&quot;Amazon Web Services - Products &amp;amp; Services&quot; href=&quot;http://aws.amazon.com/products/&quot;&gt;options&lt;/a&gt; (there are so many!) says that Amazon &lt;a title=&quot;Amazon Simple Storage Service&quot; href=&quot;http://aws.amazon.com/s3/&quot;&gt;Simple Storage Service&lt;/a&gt; (S3) will do nicely, optionally backed by &lt;a title=&quot;Amazon CloudFront&quot; href=&quot;http://aws.amazon.com/cloudfront/&quot;&gt;CloudFront&lt;/a&gt; if I ever really need to scale. A quick signup, upload of my files, and one new &lt;a title=&quot;Customizing Amazon S3 URLs with CNAME&quot; href=&quot;http://imtips.co/customizing-amazon-s3-urls-with-cname.html&quot;&gt;CNAME&lt;/a&gt;, and I’ve got my files living under &lt;em&gt;static.thelowlyprogrammer.com&lt;/em&gt;, backed by S3. Nice! Doubly so since the free usage tier should make this entirely free for the next year or so, and pennies after.&lt;/p&gt;&lt;p&gt;Finally, in researching this blog post, I found one more option. And it’s a Google service, which was my original goal. 
Let’s check it out!&lt;/p&gt;&lt;h2&gt;Google Cloud Storage&lt;/h2&gt;&lt;p&gt;New on the block, Google Cloud Storage appears to be a competitor to Amazon S3. It’s priced a tiny bit cheaper, and the features seem roughly comparable. The documentation is a bit rough still, which partly explains why I didn’t figure this out earlier, but everything is there if you look hard enough. One significant distinction is that you do not choose where data is stored (unless you want to add restrictions), and it will get replicated wherever needed. Note that this &lt;em&gt;includes&lt;/em&gt; automatic edge caching for popular files, so this is pretty much S3 and CloudFront all rolled into one. Fancy! It supports the same &lt;a title=&quot;Google Cloud Storage - Request URIs&quot; href=&quot;https://developers.google.com/storage/docs/reference-uris&quot;&gt;CNAME aliases&lt;/a&gt;, so I’ve got this set up as a hot-spare for my S3 data. I’ll leave S3 as primary for now since I’ve got it all configured and tested happily, but it looks like I’d be well served either way.&lt;/p&gt;&lt;p&gt;Maybe in a future post I’ll do a head-to-head comparison of S3 and GCS, if I can figure out a fair way of measuring performance all over the globe. Until then, I’m happy to stick with either.&lt;/p&gt;&lt;p&gt;Mission accomplished – static files hosted. 
Time for bed.&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/5436567093424877225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2012/01/on-hosting-static-files.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/5436567093424877225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/5436567093424877225'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2012/01/on-hosting-static-files.html' title='On hosting static files'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-415822424499778135</id><published>2012-01-16T23:49:00.002-05:00</published><updated>2021-07-17T23:11:22.152-04:00</updated><title type='text'>Marriage Sort, a visualization</title><content type='html'>&lt;i&gt;You&#39;ll need a modern browser that handles SVG (a recent Chrome, Firefox, or IE9 will do) to properly see this post.&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
On Saturday, &lt;a href=&quot;http://bost.ocks.org/mike/&quot;&gt;Mike Bostock&lt;/a&gt; posted a link on Hacker News to &lt;a href=&quot;http://bost.ocks.org/mike/shuffle/&quot;&gt;an explanation of the Fisher-Yates Shuffle&lt;/a&gt; he had written. Alongside the explanation he included a very slick visualization where you could see it in action. Now, at a fundamental level, shuffle algorithms and sorting algorithms are simply the reverse of each other. So if this visualization does well on shuffle algorithms, why not a sorting algorithm as well?&lt;br /&gt;
&lt;br /&gt;
A couple years ago I wrote a sorting algorithm, which I gave the tongue-in-cheek name Marriage Sort (an homage to the problem that inspired it - see &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/04/introducing-marriage-sort.html&quot;&gt;the introduction post&lt;/a&gt; for more details). I was looking for a project to play with on the weekend, so I thought, why not try the visualization on that? Mike has kindly let me re-use his code, so here&#39;s the same visualization of Marriage Sort.&lt;br /&gt;
&lt;div id=&quot;chart&quot;&gt;&lt;/div&gt;&lt;style&gt;

.line {
  stroke: #999;
  stroke-width: 2px;
}

.play path {
  stroke: #fff;
  stroke-width: 6px;
}

.play:hover path {
  fill: red;
}

.play rect {
  fill: none;
  pointer-events: all;
  cursor: pointer;
}

#disclaimer {
  text-align: center;
  width: 548px;
  margin-left: 24px;
  margin-right: 24px;
}

&lt;/style&gt;&lt;br /&gt;
&lt;script src=&quot;http://static.thelowlyprogrammer.com/files/d3/2.7.3/d3.min.js&quot;&gt;&lt;/script&gt;&lt;script&gt;

var v_margin = {top: 10, right: 24, bottom: 10, left: 24},
    v_width = 500,
    v_height = 60 - v_margin.top - v_margin.bottom,
    v_size = v_height * .8;

var v_n = 60;

var v_tick = 600,
    v_animate = 450;  // Should be a multiple of 3.

var COLOR_IDLE = &quot;#999&quot;,
    COLOR_DONE = &quot;#000&quot;,
    COLOR_ANIMATING = &quot;#000&quot;,
    COLOR_KEY = &quot;red&quot;,
    COLOR_PARTITION = &quot;#900&quot;,
    COLOR_KEY_ANIMATING = &quot;red&quot;;

var STROKE = &quot;2px&quot;,
    STROKE_BOLD = &quot;4px&quot;;

var v_x = d3.scale.ordinal()
    .domain(d3.range(v_n))
    .rangePoints([0, v_width]);

var v_y = d3.scale.linear()
    .domain([0, v_n - 1])
    .range([-Math.PI / 6, Math.PI / 6]);

// In place shuffle.
function shuffle(list) {
  for (rem = list.length; rem &gt; 0; --rem) {
    var next = Math.floor(Math.random() * rem);  // [0..rem)
    var tmp = list[next];
    list[next] = list[rem-1];
    list[rem-1] = tmp;
  }
  return list;
}

// Swap of two lines. Colors can optionally be specified for during the transition,
// however post-transition all lines are made black.
function swap(line, a, b, color_a, color_b) {
  if (color_a == undefined) {
    color_a = COLOR_ANIMATING;
  }
  if (color_b == undefined) {
    color_b = COLOR_ANIMATING;
  }
  if (a == b) {
    d3.select(line[0][a])
        .style(&quot;stroke&quot;, COLOR_DONE)
        .style(&quot;stroke-width&quot;, STROKE);
    return;
  } else if (a &gt; b) {
    // Swap indices for simplicity.
    var t = b;
    b = a;
    a = t;
    t = color_b;
    color_b = color_a;
    color_a = t;
  }
  var b_out = line[0].splice(b, 1)[0],
      a_out = line[0].splice(a, 1)[0];
  line[0].splice(a, 0, b_out);
  line[0].splice(b, 0, a_out);

  d3.select(a_out)
      .style(&quot;stroke&quot;, color_a)
      .style(&quot;stroke-width&quot;, STROKE_BOLD)
    .transition()
      .duration(v_animate)
      .attr(&quot;transform&quot;, &quot;translate(&quot; + v_x(b) + &quot;, &quot; + v_height + &quot;)&quot;);
  setTimeout(function() {
    d3.select(a_out)
        .style(&quot;stroke&quot;, COLOR_DONE)
        .style(&quot;stroke-width&quot;, STROKE);
  }, v_animate);

  d3.select(b_out)
      .style(&quot;stroke&quot;, color_b)
      .style(&quot;stroke-width&quot;, STROKE_BOLD)
    .transition()
      .duration(v_animate)
      .attr(&quot;transform&quot;, &quot;translate(&quot; + v_x(a) + &quot;, &quot; + v_height + &quot;)&quot;);
  setTimeout(function() {
    d3.select(b_out)
        .style(&quot;stroke&quot;, COLOR_DONE)
        .style(&quot;stroke-width&quot;, STROKE);
  }, v_animate);
}

// Remove line at &#39;src&#39; and place it at &#39;dst&#39;, moving over all the intermediate
//  lines in the process. src will end up immediately to the left of the value
//  currently in &#39;dst&#39;.
function insert(line, src, dst) {
  if (dst - src == 2) {
    swap(line, src, src+1);
    return;
  } else if (src - dst == 1) {
    swap(line, src, dst);
    return;
  }

  var out = line[0].splice(src, 1)[0];
  if (src &gt; dst) {
    line[0].splice(dst, 0, out);
  } else {
    line[0].splice(dst-1, 0, out);
  }

  // Animate primary line:
  d3.select(out)
      .style(&quot;stroke&quot;, COLOR_ANIMATING)
      .style(&quot;stroke-width&quot;, STROKE_BOLD)
    .transition()
      .duration(v_animate)
      .attr(&quot;transform&quot;, &quot;translate(&quot; + v_x(src &gt; dst ? dst : dst-1) + &quot;, &quot; + v_height + &quot;)&quot;);
  setTimeout(function() {
    d3.select(out)
        .style(&quot;stroke&quot;, COLOR_DONE)
        .style(&quot;stroke-width&quot;, STROKE);
  }, v_animate);

  // Animate intermediate lines:
  var min, max, dir;
  if (src &gt; dst) {
    min = dst + 1;
    max = src;
    dir = 1;
  } else {
    min = src;
    max = dst - 1;
    dir = -1;
  }
  var count = max - min + 1;
  var mid = (count - 1) / 2;
  line.filter(function(d, i) { return i &gt;= min &amp;&amp; i &lt;= max; })
    .transition()
      .duration(v_animate/3)
      .delay(function(d, i) { return v_animate/3 + v_animate/6 * (i - mid) / mid * -dir; })
      .attr(&quot;transform&quot;, function(d, i) { return &quot;translate(&quot; + v_x(min+i) + &quot;, &quot; + v_height + &quot;)&quot;; });
}

function chart(stage1, stage2) {
  var m = v_n;

  // Combat disqus script duplication.
  if (d3.selectAll(&quot;svg#chart1&quot;)[0].length != 0) {
    return;
  }

  var svg = d3.select(&quot;div#chart&quot;).append(&quot;svg&quot;)
      .attr(&quot;width&quot;, v_width + v_margin.left + v_margin.right)
      .attr(&quot;height&quot;, v_height + v_margin.top + v_margin.bottom)
      .attr(&quot;id&quot;, &quot;chart1&quot;)
      .style(&quot;margin-left&quot;, v_margin.left + &quot;px&quot;)
      .style(&quot;margin-right&quot;, v_margin.right + &quot;px&quot;)
    .append(&quot;g&quot;)
      .attr(&quot;transform&quot;, &quot;translate(&quot; + v_margin.left + &quot;,&quot; + v_margin.top + &quot;)&quot;);

  var line = svg.selectAll(&quot;.line&quot;)
      .data(d3.range(v_n))
    .enter().append(&quot;line&quot;)
      .attr(&quot;class&quot;, &quot;line&quot;)
      .attr(&quot;x2&quot;, function(d) { return v_size * Math.sin(v_y(d)); })
      .attr(&quot;y2&quot;, function(d) { return -v_size * Math.cos(v_y(d)); })
      .attr(&quot;transform&quot;, function(d, i) { return &quot;translate(&quot; + v_x(i) + &quot;,&quot; + v_height + &quot;)&quot;; });

  var play = svg.append(&quot;g&quot;)
      .attr(&quot;class&quot;, &quot;play&quot;)
      .on(&quot;click&quot;, start);

  play.append(&quot;path&quot;)
      .attr(&quot;d&quot;, &quot;M-30,-30L30,0L-30,30Z&quot;)
      .attr(&quot;transform&quot;, &quot;translate(&quot; + v_width / 2 + &quot;,&quot; + v_height / 2 + &quot;)scale(.6)&quot;);

  play.append(&quot;rect&quot;)
      .attr(&quot;width&quot;, v_width)
      .attr(&quot;height&quot;, v_height);

  function start() {
    play.style(&quot;display&quot;, &quot;none&quot;);

    line = svg.selectAll(&quot;.line&quot;)
        .data(shuffle(d3.range(v_n)))
        .attr(&quot;x2&quot;, function(d) { return v_size * Math.sin(v_y(d)); })
        .attr(&quot;y2&quot;, function(d) { return -v_size * Math.cos(v_y(d)); })
        .attr(&quot;transform&quot;, function(d, i) { return &quot;translate(&quot; + v_x(i) + &quot;,&quot; + v_height + &quot;)&quot;; })
        .style(&quot;stroke&quot;, COLOR_IDLE);

    var pause = 0,
        pass = 1,
        stage1Data = new Object();
    var interval = setInterval(function() {
      if (pause &gt; 0) {
        pause--;
        return;
      }
      switch(pass) {
        case 1:
          if (!(m = stage1(line, stage1Data, m))) {
            m = v_n;
            pass = 2;
            pause = 2;
            // Wait for all trailing animations before running this.
            setTimeout(function() {
              line.transition()
                  .duration(v_animate*1/3)
                  .delay(function(d, i) { return (v_animate*2/3)/v_n * i })
                  .style(&quot;stroke&quot;, COLOR_IDLE)
                  .style(&quot;stroke-width&quot;, STROKE);
            }, v_tick);
          }
          break;
        case 2:
          if (!(m = stage2(line, m))) {
            m = v_n;
            pass = 3;
          }
          break;
        case 3:
          clearInterval(interval);
          setTimeout(function() {
            play.style(&quot;display&quot;, null);
            line = svg.selectAll(&quot;.line&quot;)
                .data(d3.range(v_n))
                .attr(&quot;x2&quot;, function(d) { return v_size * Math.sin(v_y(d)); })
                .attr(&quot;y2&quot;, function(d) { return -v_size * Math.cos(v_y(d)); })
                .attr(&quot;transform&quot;, function(d, i) { return &quot;translate(&quot; + v_x(i) + &quot;,&quot; + v_height + &quot;)&quot;; })
                .style(&quot;stroke&quot;, COLOR_IDLE);
          }, 5000);
          break;
      }
    }, v_tick);
  }

  return svg;
}

function stage1(line, data, m) {
  if (data.prefix == undefined) {
    // Initialize data container.
    data.prefix = null;
    data.on = 0;
    data.key = null;
  }
  if (data.key == null) {
    // Start of a new pass.
    data.prefix = ~~Math.sqrt(m-1);
    if (data.prefix &lt;= 0) {
      // stage is done.
      return 0;
    }

    // Find maximum bar from prefix, to use as the key.
    data.key = 0;
    for (var i = 1; i &lt; data.prefix; ++i) {
      if (line[0][i].__data__ &gt; line[0][data.key].__data__) {
        data.key = i;
      }
    }

    // Re-color the active region for this pass.
    line.select(function(d, i) { return i &lt; m ? this : null })
      .style(&quot;stroke&quot;, function(d, i) { return i &lt; data.prefix ? (i == data.key ? COLOR_KEY : COLOR_PARTITION) : COLOR_IDLE; })
      .style(&quot;stroke-width&quot;, function(d, i) { return i == data.key ? STROKE_BOLD : STROKE; });

    // Next entry, start scanning immediately after the prefix.
    data.on = data.prefix;
    return m;
  }

  // Mid pass: start by checking for elements at end to skip.
  while (m &gt; data.prefix &amp;&amp; line[0][m-1].__data__ &gt;= line[0][data.key].__data__) {
    d3.select(line[0][m-1])
        .style(&quot;stroke&quot;, COLOR_DONE);
    m--;
  }

  // Next, find one element from the front to swap forward.
  while (data.on &lt; m - 1 &amp;&amp; line[0][data.on].__data__ &lt; line[0][data.key].__data__) {
    data.on++;
  }
  if (data.on &gt;= m - 1) {
    // None left - pass is done.
    swap(line, data.key, m-1, COLOR_KEY_ANIMATING);
    data.key = null;
    return m-1;
  }
  swap(line, data.on, m-1);
  return m-1;
}

function stage2(line, m) {
  // Insertion sort, from right to left.
  if (m == v_n) {
    d3.select(line[0][m-1])
        .style(&quot;stroke&quot;, COLOR_DONE);
    return m-1;
  }
  while (m &gt; 0 &amp;&amp; line[0][m-1].__data__ &lt; line[0][m].__data__) {
    d3.select(line[0][m-1])
        .style(&quot;stroke&quot;, COLOR_DONE);
    m--;
  }
  if (m == 0) {
    return 0;
  }
  
  // m-1 needs to be slid forward. Find the destination:
  var dst = m;
  for (; dst &lt; line[0].length; dst++) {
    if (line[0][m-1].__data__ &lt; line[0][dst].__data__) {
      break;
    }
  }
  insert(line, m-1, dst);
  return m-1;
}

chart(stage1, stage2);

&lt;/script&gt;
&lt;p id=&quot;disclaimer&quot;&gt;



&lt;small&gt;&lt;i&gt;Disclaimer: The visualization I&#39;m building off looks great across platforms, so if there is anything wrong with this one, it&#39;s from my changes. Go see the original if you don&#39;t believe me.&lt;/i&gt;&lt;/small&gt;&lt;/p&gt;For those that don&#39;t remember the algorithm, here&#39;s a synopsis. There are two stages:
&lt;ul&gt;
&lt;li&gt;Stage 1:&lt;ol&gt;
&lt;li&gt;Take a prefix of size &amp;radic;m, where m is the size of the working set.&lt;/li&gt;
&lt;li&gt;Pick the maximum of the prefix (in this case, the most right-leaning bar) and take it as the pivot.&lt;/li&gt;
&lt;li&gt;Walk through the list, pulling out anything greater than the pivot and moving it to the end of the working set.&lt;/li&gt;
&lt;li&gt;Move the pivot to the end of the working set.&lt;/li&gt;
&lt;li&gt;Go back to step 1 and repeat.&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;Stage 2: a bog-standard insertion sort.&lt;/li&gt;
&lt;/ul&gt;As the passes proceed the pivots decrease in size and more values are matched, allowing the array to be put into a &quot;mostly sorted&quot; state. On average, every element is within &amp;radic;n spots of the correct position, which the insertion sort then corrects. Overall complexity O(n&lt;sup&gt;1.5&lt;/sup&gt;) time, O(1) extra space.
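For those who'd rather read code than prose, here's the same algorithm as a minimal Python sketch. This is a hypothetical re-implementation for illustration, not the JavaScript driving the animation above; note that correctness rests entirely on the stage-2 insertion sort, so stage 1 only needs to get the array "mostly sorted".

```python
import math

def marriage_sort(a):
    """Sort list a in place and return it."""
    # Stage 1: repeated passes with a pivot drawn from a sqrt(m) prefix.
    m = len(a)                              # size of the working set
    while m > 1:
        prefix = int(math.sqrt(m - 1))
        if prefix < 1:
            break
        # Pick the maximum of the prefix as the pivot.
        key = max(range(prefix), key=a.__getitem__)
        pivot = a[key]
        # Walk the working set; move anything >= pivot to its end.
        i = prefix
        while i < m - 1:
            if a[i] >= pivot:
                a[i], a[m - 1] = a[m - 1], a[i]
                m -= 1                      # shrink working set; recheck a[i]
            else:
                i += 1
        # Move the pivot itself to the end of the working set.
        a[key], a[m - 1] = a[m - 1], a[key]
        m -= 1
    # Stage 2: a bog-standard insertion sort corrects the remaining
    # small displacements.
    for i in range(1, len(a)):
        v, j = a[i], i
        while j > 0 and a[j - 1] > v:
            a[j] = a[j - 1]
            j -= 1
        a[j] = v
    return a
```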


Code for the visualization is at &lt;a href=&quot;https://gist.github.com/1628602&quot;&gt;https://gist.github.com/1628602&lt;/a&gt; for easy perusing.&lt;br /&gt;
&lt;br /&gt;
Looks good to me! What do you think?</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/415822424499778135/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2012/01/marriage-sort-visualization.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/415822424499778135'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/415822424499778135'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2012/01/marriage-sort-visualization.html' title='Marriage Sort, a visualization'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-39495923321889634</id><published>2011-05-10T23:53:00.002-04:00</published><updated>2021-07-17T22:23:26.778-04:00</updated><title type='text'>The Game of Life, part 2: HashLife</title><content type='html'>&lt;p&gt;&lt;a href=&quot;http://www.thelowlyprogrammer.com/2011/03/game-of-life-part-1.html&quot;&gt;&lt;em&gt;Last time I wrote&lt;/em&gt;&lt;/a&gt;&lt;em&gt; I gave a brief introduction to the Game of Life and a very simple Python implementation for visualizing it. 
I will freely admit that was a teaser post; this post gets into the real meat of the topic with an overview of the &lt;/em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Hashlife&quot;&gt;&lt;em&gt;HashLife&lt;/em&gt;&lt;/a&gt;&lt;em&gt; algorithm and a much more interesting implementation.&lt;/em&gt;&lt;em&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-uhb6ASq2WYx0Q2NPGwnuCCA3dazYDSDZfMwsPZnmsX-quPGL6oB3UWUUzjOObq8Gt5mKn_d-AGHO_ZBET_jQ5zgWsSWXOWfI0j6VUVrRXaTQbnuhXkmHPg9OGkhHe1RmMn9yUWSLpJdi/s1600-h/life%252520570%25255B4%25255D.png&quot;&gt;&lt;img style=&quot;background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: block; float: none; border-top-width: 0px; border-bottom-width: 0px; margin-left: auto; border-left-width: 0px; margin-right: auto; padding-top: 0px&quot; title=&quot;life 570&quot; border=&quot;0&quot; alt=&quot;life 570&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn-kxjSMNooFF4RYTQRyS5TnVs3TvV_OyoWCHMli1eWplfzqLZotA3PW-J3Scuc8LjEUI8NHXdMMjEAepozFgKsd6B5NdFTZnU2_QS0i1V5ZNmdQmxz3pNBdISX_1V7YMNL1-j3O4mjBzU/?imgmax=800&quot; width=&quot;590&quot; height=&quot;342&quot; /&gt;&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;h2&gt;Introduction&lt;/h2&gt;&lt;p&gt;This entry has taken me an embarrassingly long time to post. As is my habit, I wrote the code and 90% of the post, and then left it for months and months. Whoops! &lt;/p&gt;&lt;p&gt;If you haven’t played with a Game of Life viewer before they are legitimately fun to toy around with - I encourage you to check this one out (code is &lt;a href=&quot;https://github.com/EricBurnett/GameOfLife/tree/v2&quot;&gt;here&lt;/a&gt;). Since the last version everything is much improved. 
The viewer supports a larger set of controls (see the &lt;a href=&quot;https://github.com/EricBurnett/GameOfLife/blob/v2/README&quot;&gt;README&lt;/a&gt; for details) and basic file reading is implemented so it’s possible to try new starting patterns on the fly. And, as promised, I’ve implemented the HashLife algorithm to massively speed up iterations, so enormous patterns billions of generations forward are easily within your reach.&lt;/p&gt;&lt;h2&gt;Algorithm&lt;/h2&gt;&lt;p&gt;HashLife is a simple yet interesting algorithm. Invented in 1984 by Bill Gosper (of &lt;em&gt;Gosper glider gun&lt;/em&gt; fame), it exploits repeated patterns to dramatically cut down the work required to support large patterns over vast numbers of iterations. Between the &lt;a href=&quot;http://en.wikipedia.org/wiki/Hashlife&quot;&gt;Wikipedia page&lt;/a&gt; and the enigmatically named “&lt;a href=&quot;http://drdobbs.com/high-performance-computing/184406478&quot;&gt;An Algorithm for Compressing Space and Time&lt;/a&gt;” in Dr. Dobb’s Journal I think it’s decently well explained, but it took me a couple of read-throughs to really wrap my head around, so I’m going to try to give an overview of the key insights it utilizes. 
&lt;/p&gt;&lt;p align=&quot;center&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTd3WPv0JttmnwIQWySruW8wW5WltW0tvuf9FLmvgliPaTaf1QmS5-5kfp3-WwLtkJha3okjcZWmfqzBrF_zu3P8bPSeEMwZTenRMxoZFKortGN83am8x6uo2Cv9m77M6hz6-nTarNO2DC/s1600-h/quadtree5.png&quot;&gt;&lt;img style=&quot;background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: right; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px&quot; title=&quot;quadtree&quot; border=&quot;0&quot; alt=&quot;quadtree&quot; align=&quot;right&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjIaG-EnwXLsCqIo_pCffD24TcrNx1r_12NeHlt8pz3pr40ZO5UH2oW-p6tRYpeYLzYVu03wYrarJUYebucWyKgbdcdpCypTnzDt0dSaGNuUDKhjw9kVvgmAPHq6PMzFrjfZ7nd9sWEFqR/?imgmax=800&quot; width=&quot;240&quot; height=&quot;240&quot; /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;At its heart, HashLife is built around the concept of a quadtree. If you’re unfamiliar with it, a quadtree takes a square region and breaks it into four quadrants, each a quarter the size of the original. Each quadrant is further broken down into quadrants of its own, and on down. At the bottom, in squares of some minimum size like 2x2, actual points are stored. This structure is usually used to make spatial queries like “what points intersect this bounding box” efficient, but in this case two other properties are taken advantage of. First, nodes at any level are uniquely defined by the points within their region, which means duplicated regions can be backed by the same node in memory. For the Game of Life, where there are repeated patterns and empty regions galore, this can drastically reduce the space required. Second, in the Game of Life a square of (n)x(n) points fully dictates the inner (n-2)x(n-2) core one generation forward, and hence the inner (n/2)x(n/2) core n/4 generations forward, irrespective of what cells are adjacent to it. 
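&lt;/p&gt;&lt;p&gt;To make that second property concrete, here is a brute-force sketch (illustrative only, not the viewer’s code) that computes the fully-determined inner core of a square block one generation forward:&lt;/p&gt;

```python
def step_core(grid):
    """One Life generation for the inner core of an n x n grid.

    Border cells have unknown neighbours outside the grid, so only the
    inner (n-2) x (n-2) cells are fully determined by the grid alone.
    """
    n = len(grid)
    core = []
    for r in range(1, n - 1):
        row = []
        for c in range(1, n - 1):
            live = sum(grid[r + dr][c + dc]
                       for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                       if (dr, dc) != (0, 0))
            row.append(1 if live == 3 or (grid[r][c] and live == 2) else 0)
        core.append(row)
    return core

# A 4x4 block containing part of a blinker: its 2x2 core is fixed.
block = [[0, 0, 0, 0],
         [0, 1, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
assert step_core(block) == [[0, 1], [0, 1]]
```

&lt;p&gt;However that 4x4 block is embedded in a larger pattern, those four centre cells come out the same.&lt;/p&gt;&lt;p&gt;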
So the future core of a node can be calculated once and will apply at any future point in time, anywhere in the tree. &lt;/p&gt;&lt;p&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDhI_LF4Ng6m_AWt8qubn86mr62EUrHB63Wmo8hRrU-tsHA272Q6In2l9qw_df-HQXFm3kNQYjmJjgGLQAU1YSm7W63v-_FamXdb-lTSTczEBgsomh-uoZMAT4QGWpjIJF5eStqNLyD1u9/s1600-h/Inner%252520nodes%25255B6%25255D.png&quot;&gt;&lt;img style=&quot;background-image: none; border-right-width: 0px; margin: 10px auto; padding-left: 0px; padding-right: 0px; display: block; float: none; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px&quot; title=&quot;Inner nodes&quot; border=&quot;0&quot; alt=&quot;Inner nodes&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXh_hu4Y-3UAzG1gQ0dEeyk4GHNmsgfKmcZ8Zkvb8FLqxNpnu8pJmAgDralZMsxceQ4lvVjjnJEj2jFMKbGDFKuw0GiWDu_bJt4IGw0ComQl3wiSs8BFTqDdwjMotpE2La5SncCpqnWpQG/?imgmax=800&quot; width=&quot;404&quot; height=&quot;111&quot; /&gt;&lt;/a&gt;Together these properties allow for ridiculous speedups. Hashing and sharing nodes drastically reduces the space requirements, with exponentially more sharing the further down the tree you go. There are only 16 possible leaf nodes, after all! From this, calculating the future core for a node requires exponentially less time than a naïve implementation would. It can be done by recursively calculating the inner core of smaller nodes, where the better caching comes into play, and then combining them together into a new node. You might be wondering if the gains from caching are lost to the increasing difficulty of determining which nodes are equal, but with a couple careful invariants we actually get that for free. First, nodes must be immutable - this one’s pretty straightforward. Second, nodes must be unique at all times. 
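&lt;/p&gt;&lt;p&gt;A toy sketch of the uniqueness invariant (hypothetical names, much simplified from the real Node class):&lt;/p&gt;

```python
# Keep one canonical Node per distinct set of quadrants (hypothetical
# sketch; the real code routes this through a Canonical() method).
_cache = {}

class Node(object):
    def __init__(self, nw, ne, sw, se):
        # Quadrants are themselves canonical nodes (or cells at the leaves).
        self._nw, self._ne, self._sw, self._se = nw, ne, sw, se

def canonical(nw, ne, sw, se):
    """Return the unique Node for these quadrants, creating it if needed."""
    key = (id(nw), id(ne), id(sw), id(se))
    if key not in _cache:
        _cache[key] = Node(nw, ne, sw, se)
    return _cache[key]

# Structurally identical nodes collapse to one shared object.
assert canonical(0, 1, 1, 0) is canonical(0, 1, 1, 0)
```

&lt;p&gt;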
This forces us to build the tree from the bottom up, but then checking whether a new node duplicates an existing one is simply a matter of looking for an existing node that points to the same quadrants in the same order, a problem that hash tables trivially solve. &lt;/p&gt;&lt;div style=&quot;padding-bottom: 0px; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; float: none; padding-top: 0px&quot; id=&quot;scid:f32c3428-b7e9-4f15-a8ea-c502c7ff2e88:f9a9f18d-25ee-40e2-a10b-2ed4abf31dc8&quot; class=&quot;wlWriterEditableSmartContent&quot;&gt;&lt;pre class=&quot;brush: python;&quot;&gt;def __hash__(self):
    # Hash is dependent on cells only, not e.g. _next.
    # Required for Canonical(), so cannot be simply the id of the current
    # object (which would otherwise work).
    return hash((id(self._nw), id(self._ne), id(self._sw), id(self._se)))

def __eq__(self, other):
    &quot;&quot;&quot;Are two nodes equal? Doesn&#39;t take caching _next into account.&quot;&quot;&quot;
    if id(self) == id(other):
        return True
    return (id(self._nw) == id(other._nw) and
            id(self._ne) == id(other._ne) and
            id(self._sw) == id(other._sw) and
            id(self._se) == id(other._se))&lt;/pre&gt;&lt;/div&gt;&lt;h2&gt;Implementation&lt;/h2&gt;&lt;p&gt;As before, the code I’ve written is for Python 2.6 and makes use of PyGame, although neither dependency is terribly sticky. The code lives in a &lt;a href=&quot;https://github.com/EricBurnett/GameOfLife/tree/v2&quot;&gt;repository on github&lt;/a&gt;, and I welcome any contributions you care to make. As the code here is complicated enough to be almost guaranteed to contain a bug or two, there is a basic set of unit tests in life_test.py and the code itself is liberally sprinkled with asserts. Incidentally, removing the asserts nets a 20% performance gain (as measured by the time it takes to run the ‘PerformanceTest’ unit test), although I find the development time they save is easily worth keeping them in forever. As noted later, the performance of the implementation isn’t all that important anyway. Which is a good thing, since I coded it in Python! &lt;/p&gt;&lt;p&gt;A comment on rewrites: during the transition from version 1 - a simple brute-force algorithm - to version 2 - the Node class that implements HashLife - I had both algorithms implemented in parallel for a while. This let me render every second frame with the old algorithm, so I could verify that at different times and different render speeds the two algorithms produced the same results. I’ve seen this pattern used at work for migrating to replacement systems, and it’s very much worth the extra glue code you have to write for the confidence it gives. John Carmack recently &lt;a href=&quot;http://altdevblogaday.com/2011/11/22/parallel-implementations/&quot;&gt;wrote about parallel implementations&lt;/a&gt; on his own blog, if you want to hear more on the topic.&lt;/p&gt;&lt;h2&gt;Performance&lt;/h2&gt;&lt;p&gt;Performance is hard to detail objectively for an algorithm like this. 
For example, it takes ~1 second to generate the billionth generation of the &lt;a href=&quot;http://www.conwaylife.com/wiki/index.php?title=Backrake_3&quot;&gt;backrake 3&lt;/a&gt; pattern, which has around 300,000,000 live cells; it takes ~2 seconds to generate the quintillionth generation, with 3x10^17 live cells. But this is a perfect pattern to showcase HashLife - a simple spaceship traveling in a straight line, generating a steady stream of gliders. In comparison, a chaotic pattern like &lt;a href=&quot;http://www.conwaylife.com/wiki/index.php?title=Acorn&quot;&gt;Acorn&lt;/a&gt; takes almost 25 seconds to generate just 5000 generations, with at most 1057 cells alive at any time. As it stands, the properties of the algorithm drastically outweigh the peculiarities of the implementation for anything I care to do. Although I must say, if you run an apples-to-apples comparison against another implementation, I’d be curious to hear the numbers you get.&lt;/p&gt;&lt;p&gt;As always, I’d love to hear what you think! 
&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/39495923321889634/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2011/05/game-of-life-part-2-hashlife.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/39495923321889634'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/39495923321889634'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2011/05/game-of-life-part-2-hashlife.html' title='The Game of Life, part 2: HashLife'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn-kxjSMNooFF4RYTQRyS5TnVs3TvV_OyoWCHMli1eWplfzqLZotA3PW-J3Scuc8LjEUI8NHXdMMjEAepozFgKsd6B5NdFTZnU2_QS0i1V5ZNmdQmxz3pNBdISX_1V7YMNL1-j3O4mjBzU/s72-c?imgmax=800" height="72" width="72"/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-936349929642113617</id><published>2011-03-12T11:03:00.003-05:00</published><updated>2021-07-17T22:48:46.335-04:00</updated><title type='text'>The Game of Life, part 1</title><content type='html'>&lt;p&gt;&lt;i&gt;Update: See &lt;a href=&quot;http://www.thelowlyprogrammer.com/2011/05/game-of-life-part-2-hashlife.html&quot;&gt;part 2&lt;/a&gt; for the implemented HashLife algorithm.&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
The &lt;a title=&quot;Conway&amp;#39;s Game of Life - Wikipedia&quot; href=&quot;https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life&quot;&gt;Game of Life&lt;/a&gt; is a fascinating system. It was invented by &lt;a title=&quot;John Horton Conway - Wikipedia&quot; href=&quot;https://en.wikipedia.org/wiki/John_Horton_Conway&quot;&gt;John Conway&lt;/a&gt; in 1970 and has been studied continuously ever since. For those reading who haven’t heard of it before, a brief explanation: The world is an infinite grid of points, all either alive or dead. After each generation – or ‘iteration’ if you’d prefer – cells are updated according to the following three rules:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;If a cell is alive and it has two or three live neighbours, it stays alive. &lt;/li&gt;
&lt;li&gt;If a cell is dead and it has exactly three live neighbours, it becomes alive (tripartite reproduction?). &lt;/li&gt;
&lt;li&gt;Any other cell is dead. &lt;/li&gt;
&lt;/ol&gt;&lt;p align=&quot;center&quot;&gt;&lt;table border=&quot;0&quot; cellspacing=&quot;30&quot; cellpadding=&quot;2&quot; width=&quot;587&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;         &lt;td valign=&quot;top&quot; width=&quot;210&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzrCcbFs9UKjyFTAIk-9qtjYgbS12aAQxPgoZW-qhYTc-1uimK8Ozi6ofHh-GVg-I73cehp_MDg-QkmI0ytf-5oQuIoe0s7zVo_NHCB2e4Zgt4wG9TrrtiXhARXKOEockmP-qDA7W48mw4/s1600-h/Blinker%5B6%5D.png&quot;&gt;&lt;img style=&quot;background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px&quot; title=&quot;Blinker&quot; border=&quot;0&quot; alt=&quot;Blinker&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI4RzFKAE6CRsr8HDbCdfIPEVbwrtrc66z7277kbfz2roMObjRmHnN-TiftYe7omwUXKBvUbLtpGuFbWsrnnnYhF-J6jBftzTRctzsMmVXFFCximffgqsrqv0sBdKVZMv7gfCpXAFU0HYG/?imgmax=800&quot; width=&quot;203&quot; height=&quot;60&quot; /&gt;&lt;/a&gt;&lt;/td&gt;          &lt;td valign=&quot;top&quot; width=&quot;67&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMzcrZ2FxjjNw0WEkqYctgoaOLdyq9kb4linTwCcVugEzNDus62RoWzyoS2I4nMguIbfBJEsuQfkiZb3l_vD_hS4M_N42i52ZwlQZ_Bo6bQlPx4oAmUuaMm41eaCmpceLz93JDeOo0C_SA/s1600-h/infinite%20growth%5B4%5D.png&quot;&gt;&lt;img style=&quot;background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px&quot; title=&quot;infinite growth&quot; border=&quot;0&quot; alt=&quot;infinite growth&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3ZppwdbZDxJ3ZNhW6Ug8Kths29VX9ksQ_iKo4K5sI_Uv2UAl6gg-VF74yLX7RI301bcBBZP_hvKZer9pPzzvLzEEGjBrbZ67jbciRf1H6rEstxTBhIGZddmYyB88mIdp4BvvmNcEk38K1/?imgmax=800&quot; width=&quot;60&quot; height=&quot;60&quot; /&gt;&lt;/a&gt;&lt;/td&gt;          &lt;td valign=&quot;top&quot; width=&quot;210&quot;&gt;&lt;a 
href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDpfGpVN-tge33u4v4UquYHFmmeLnMZSbFRz9JZuJhZ0pSqtyhHMHXRGJ5lSmS1d-KW6ksl8O2BT9Pac5HvU_x4JMDjtt9QZn0wDeIYRcjf5LPK8qY5TpNUAGO2i9YlO34uJhu3Wv4wEnx/s1600-h/Glider%5B4%5D.png&quot;&gt;&lt;img style=&quot;background-image: none; border-right-width: 0px; padding-left: 0px; padding-right: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px; padding-top: 0px&quot; title=&quot;Glider&quot; border=&quot;0&quot; alt=&quot;Glider&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguEHrIWHG5e0elr9Wt5dHPmc9JphP-bywS5abO83Cmuz3WwBuO_SS7bjec6ojO7c_a1uCwi7KbhDI48VI9AfcFdsWHSTRAWS-wiRir8p_17OwNmA5g68EXwcIBZWT_GKj9zyfD6uj6_kiZ/?imgmax=800&quot; width=&quot;203&quot; height=&quot;60&quot; /&gt;&lt;/a&gt;&lt;/td&gt;       &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/p&gt;&lt;p&gt;From these simple rules amazing complexity can arise. Some configurations are stable, like the period two “blinker” [above left], or the period four “glider” [above right] that moves one column over and one row down with every cycle. Other configurations, like the one above centre, grow infinitely – this one spits out two gliders then lays down a zig-zag strip of blocks forever after. &lt;/p&gt;&lt;p&gt;There is more to the Game of Life than pretty patterns and curious growth, I must hasten to add. It has been studied by a host of people in a variety of fields and has gone on to start a new branch of mathematics (&lt;a title=&quot;Cellular automaton - Wikipedia&quot; href=&quot;https://en.wikipedia.org/wiki/Cellular_automaton&quot;&gt;cellular automata&lt;/a&gt;) and spur discussions on whether a &lt;a title=&quot;Emergence - Wikipedia&quot; href=&quot;https://en.wikipedia.org/wiki/Emergence&quot;&gt;sufficiently complicated pattern&lt;/a&gt; could be considered intelligent. It has also been proven to be &lt;a title=&quot;Life Universal Computer&quot; href=&quot;http://www.igblan.free-online.co.uk/igblan/ca/&quot;&gt;Turing complete&lt;/a&gt;, so any computation your computer can run can be run by simulating the Game of Life with the correct starting state.&lt;/p&gt;&lt;p&gt;I have implemented a basic Python program for simulating the Game of Life &lt;a title=&quot;EricBurnett/GameOfLife - GitHub&quot; href=&quot;https://github.com/EricBurnett/GameOfLife/tree/v1&quot;&gt;on GitHub&lt;/a&gt;. It allows for infinite patterns, grows the field of view automatically, and allows speed to be controlled with Up/Down, but otherwise is a very simple implementation. 
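&lt;/p&gt;&lt;p&gt;The rules are compact enough that a sparse simulation fits in a few lines; here is a minimal sketch (not the repository’s implementation) over a set of live cells:&lt;/p&gt;

```python
from collections import Counter

def step(alive):
    """One generation of Life over a sparse set of live (x, y) cells."""
    # Count how many live neighbours every candidate cell has.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in alive
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Survive on 2 or 3 live neighbours; be born on exactly 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in alive)}

# The blinker returns to its starting shape after two generations.
blinker = {(0, 0), (1, 0), (2, 0)}
assert step(step(blinker)) == blinker
```

&lt;p&gt;Storing only live cells is what lets the field grow without bound.&lt;/p&gt;&lt;p&gt;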
The goal here is to eventually implement some of the more interesting algorithms for speeding up the simulation.&amp;#160; There are numerous such algorithms, although the one I find the most interesting is called &lt;a title=&quot;An Algorithm for Compressing Space and Time&quot; href=&quot;http://www.drdobbs.com/high-performance-computing/184406478&quot;&gt;Hashlife&lt;/a&gt; and exploits repeated patterns through space and time to achieve an exponential speedup in running the simulation. More details in part 2, whenever I write it :).&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/936349929642113617/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2011/03/game-of-life-part-1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/936349929642113617'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/936349929642113617'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2011/03/game-of-life-part-1.html' title='The Game of Life, part 1'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI4RzFKAE6CRsr8HDbCdfIPEVbwrtrc66z7277kbfz2roMObjRmHnN-TiftYe7omwUXKBvUbLtpGuFbWsrnnnYhF-J6jBftzTRctzsMmVXFFCximffgqsrqv0sBdKVZMv7gfCpXAFU0HYG/s72-c?imgmax=800" height="72" 
width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-7720975805796492880</id><published>2010-06-08T02:25:00.000-04:00</published><updated>2010-06-08T02:25:15.053-04:00</updated><title type='text'>New Job!</title><content type='html'>Today I officially started at Google! Exciting stuff. I will be working on the DoubleClick Ad Exchange doing something-or-other for the time being — however as I am neither a senior engineer nor a corporate figurehead, I do not plan to write about my work. This should, however, explain my recent absence :).
&lt;br /&gt;&lt;br /&gt;
That is all.</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/7720975805796492880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/06/new-job.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/7720975805796492880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/7720975805796492880'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/06/new-job.html' title='New Job!'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-8001471963766584930</id><published>2010-04-28T14:23:00.015-04:00</published><updated>2021-07-17T22:48:00.689-04:00</updated><title type='text'>Robustly hot swapping binaries (with code)</title><content type='html'>&lt;table border=&quot;1&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;       &lt;td style=&quot;background-color: black; color: rgb(200,200,200)&quot;&gt;         &lt;pre&gt;Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine&lt;/pre&gt;

&lt;/td&gt;
&lt;/tr&gt;

&lt;/tbody&gt;&lt;/table&gt;

&lt;br/&gt;
&lt;p&gt;
A while ago I remember reading an article by Nathan Wiegand on &lt;a href=&quot;http://nathanwiegand.com/wp/2010/02/hot-swapping-binaries/&quot;&gt;hot swapping binaries&lt;/a&gt;. This was a very eye-opening article for me – before reading it, hot swapping was one of those black arts I never really thought about, and certainly wouldn’t have thought was at all &lt;em&gt;easy.&lt;/em&gt; I highly recommend you read it for yourself. Go ahead, I’ll wait. &lt;/p&gt;


&lt;p&gt;
There. Did you notice it? The elephant in the room? One thing the article doesn’t address is how to design programs to &lt;em&gt;use&lt;/em&gt; this fancy new ability, without being fragile and crashing and all those bad things. I’ve been mulling this over since reading it, and have settled on a basic design that I’ll present here. But don’t worry, this isn’t &lt;em&gt;just &lt;/em&gt;a thought design…I’ve actually coded it up and made sure it works as expected. Feel free to &lt;a href=&quot;#code&quot;&gt;jump down&lt;/a&gt; and check it out before reading the design. Still, caveat emptor and all that.&lt;/p&gt;


&lt;h2&gt;
Design Goals&lt;/h2&gt;


&lt;p&gt;
For this design, I am focusing on three key goals:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;Allow updating to any future version &lt;/li&gt;


&lt;li&gt;Allow updating to any &lt;em&gt;previous&lt;/em&gt; version &lt;/li&gt;


&lt;li&gt;Make it easy to be crash-free &lt;/li&gt;

&lt;/ul&gt;


&lt;p&gt;
Simple enough? Updating forward is a pretty obvious goal, as is crash-free code. I want to allow updating backwards as well for the simple expedient that I don’t expect all new code to be bug free, and so it might be desirable to roll back when a bug is introduced.&lt;/p&gt;




&lt;h2&gt;
Design&lt;/h2&gt;


&lt;p&gt;
To achieve this, I’ve settled on a shortlist of constraints for the code:&lt;/p&gt;


&lt;ol&gt;
&lt;li&gt;Use protocol buffers to store all state &lt;/li&gt;


&lt;li&gt;Provide suitable defaults for everything, and be robust against weird state &lt;/li&gt;


&lt;li&gt;Structure the code as a state machine &lt;/li&gt;

&lt;/ol&gt;


&lt;p&gt;
I did say the list was &lt;em&gt;short&lt;/em&gt;. Let’s look at these in detail:&lt;/p&gt;


&lt;ol&gt;
&lt;li&gt;Protocol buffers are an obvious choice for persisting state, as they are forward and backward compatible by design. Care must be taken to not re-use tag numbers and to never add ‘required’ fields, but this is an easy requirement to satisfy. Now, using protocol buffers to store&lt;em&gt; everything&lt;/em&gt; does incur some overhead, but they are quite efficient by design and we really only need to store all state between state machine transitions so local variables are still quick. &lt;/li&gt;


&lt;li&gt;Hand in hand with 1), we cannot always expect to have the data we want in the format we want version to version. To accommodate this we must pick suitable defaults for fields, and if necessary be able to get new values at runtime. At the same time, if the meaning or format of a field is changing, it is probably better to use a new field. We will always try to handle weird data that may show up, but this shouldn’t be abused. &lt;/li&gt;


&lt;li&gt;Finally, structure the code as a state machine. This surfaces all state machine transitions as potential points for upgrading versions, and forces state to be in the protocol buffer when these transitions are crossed to ensure important data isn’t forgotten. And like everything else, the next state data can be stored in the protocol buffer. &lt;/li&gt;

&lt;/ol&gt;
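&lt;p&gt;As a toy illustration of point 2), here is how defensive loading with defaults might look in plain Python (hypothetical field names, standing in for the protocol buffer defaults):&lt;/p&gt;

```python
# Illustrative sketch only: merge persisted state over defaults so that
# fields an older (or newer) version never wrote still have sane values.
DEFAULTS = {
    'prev_state': 'STATE_NONE',
    'cur_state': 'STATE_ERROR',
    'line_number': 0,   # hypothetical field added in a later version
}

def load_state(persisted):
    """Apply known persisted fields over defaults; ignore unknown ones."""
    state = dict(DEFAULTS)
    for key in DEFAULTS:
        if key in persisted:
            state[key] = persisted[key]
    return state

# A binary that never wrote line_number still loads cleanly after a swap.
assert load_state({'cur_state': 'STATE_INIT'})['line_number'] == 0
```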


&lt;p&gt;
There is one problem with 3), however. What happens when new states are added? Going forward is easy, but if we update to a previous version when we’re in one of these new states, it will have no idea where to start running again. We could try storing fallback states or something like that, but that seems too fragile. Instead, I would recommend not allowing updates to occur when transitioning to these new states. Then, a few versions down the line when you’re sure you won’t need to downgrade past where they were added, remove that restriction.&lt;/p&gt;
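&lt;p&gt;One way to encode that restriction is to record the version in which each state was introduced and refuse swaps to anything older (an illustrative sketch with hypothetical names, not something the demo code implements):&lt;/p&gt;

```python
# Version in which each state first appeared (hypothetical numbers).
STATE_INTRODUCED_IN = {
    'STATE_INIT': 1,
    'STATE_PROCESS_LINE': 1,
    'STATE_MUTATE_LINE': 2,  # new state: downgrades below v2 are unsafe
}

def swap_allowed(cur_state, target_version):
    """Permit a hot swap only if the target binary knows about cur_state."""
    return target_version >= STATE_INTRODUCED_IN[cur_state]

assert swap_allowed('STATE_PROCESS_LINE', 1)
assert not swap_allowed('STATE_MUTATE_LINE', 1)  # wait for a safer state
```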


&lt;pre style=&quot;border-bottom: #000000 1px solid; border-left: #000000 1px solid; padding-bottom: 5px; background-color: #c0c0c0; min-height: 40px; padding-left: 5px; width: 590px; padding-right: 5px; overflow: auto; border-top: #000000 1px solid; border-right: #000000 1px solid; padding-top: 5px&quot;&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;enum State {
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    // Special states supported by the program infrastructure.
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    STATE_NONE  = 0;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    STATE_DONE  = 1;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    STATE_ERROR = 2;
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    // Program states. Unknown state transitions lead to ERROR and terminate the
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    // program, so should be avoided at all costs.
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    STATE_INIT          = 3;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    STATE_PROCESS_LINE  = 4;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    STATE_MUTATE_LINE   = 5;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;  }
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;  optional State prev_state = 2 [default = STATE_NONE];
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;  optional State cur_state  = 3 [default = STATE_ERROR];
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;/pre&gt;


&lt;h2&gt;
What About Threads?&lt;/h2&gt;


&lt;p&gt;
You may have noticed that this design is inherently single-threaded. Threading can be added easily enough if the main thread owns all the work, and can easily and safely wait for or cancel all worker threads without losing anything. In that case, spin down the workers when you’re about to swap, and spin them up again when it completes. If your program doesn’t fit that description, however, this design may not be for you.&lt;/p&gt;
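&lt;p&gt;The spin-down/spin-up dance might look like this (a sketch with hypothetical structure; the demo applications themselves are single-threaded):&lt;/p&gt;

```python
import threading
import time

stop = threading.Event()
workers = []

def worker():
    # Real workers would drain a queue; here they just idle until told to stop.
    while not stop.is_set():
        time.sleep(0.01)

def start_workers(n):
    stop.clear()
    for _ in range(n):
        t = threading.Thread(target=worker)
        t.start()
        workers.append(t)

def quiesce_for_swap():
    """Signal workers to finish and wait for them before swapping binaries."""
    stop.set()
    for t in workers:
        t.join()
    del workers[:]

start_workers(2)
quiesce_for_swap()   # now safe to swap; call start_workers() again afterward
assert not workers
```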


&lt;h2&gt;
Testing?&lt;/h2&gt;


&lt;p&gt;
Of course! I would recommend trying all transitions on a test instance first before upgrading the real process. You could also build in consistency checks that auto-revert if the state doesn’t meet expectations, regression tests for certain upgrade patterns, etc. This design is meant to make it easy to hot swap successfully, but it is no silver bullet.&lt;/p&gt;


&lt;h2&gt;
&lt;a name=&quot;code&quot;&gt;&lt;/a&gt;Let&#39;s See the Code!&lt;/h2&gt;


&lt;p&gt;
As always, the code is up on &lt;a href=&quot;https://github.com/EricBurnett/hotswap&quot;&gt;GitHub&lt;/a&gt; for you to peruse. It is broken into two demonstration applications, ‘v1’ and ‘v2’, that can be swapped between at will. While looping they respond to ‘u’ and ‘q’ (&lt;strong&gt;update&lt;/strong&gt; and &lt;strong&gt;quit&lt;/strong&gt;), although at times you may be prompted for other input. The makefiles build to the same target location, so build whichever one you want to run next and press ‘u’ to swap to it.&lt;/p&gt;


&lt;p&gt;
The code is structured so you can use it as a framework to play with yourself easily enough. You should only need to write an init method, update the state machine and .proto file, and write the respective state methods to do the real work. The state machine and state methods will look something like this: 

&lt;/p&gt;


&lt;pre style=&quot;border-bottom: #000000 1px solid; border-left: #000000 1px solid; padding-bottom: 5px; background-color: #c0c0c0; min-height: 40px; padding-left: 5px; width: 590px; padding-right: 5px; overflow: auto; border-top: #000000 1px solid; border-right: #000000 1px solid; padding-top: 5px&quot;&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;ReturnCode runStateMachine(ProgramState&amp;amp; state) {
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    cerr &amp;lt;&amp;lt; &amp;quot;&lt;span style=&quot;color: #8b0000&quot;&gt;Running state machine\n&lt;/span&gt;&amp;quot;;
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    &lt;span style=&quot;color: #008000&quot;&gt;// Put stdin into non-blocking, raw mode, so we can watch for character&lt;/span&gt;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    &lt;span style=&quot;color: #008000&quot;&gt;// input one keypress at a time.&lt;/span&gt;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    setStdinBlocking(&lt;span style=&quot;color: #0000ff&quot;&gt;false&lt;/span&gt;);
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    &lt;span style=&quot;color: #0000ff&quot;&gt;while&lt;/span&gt; (&lt;span style=&quot;color: #0000ff&quot;&gt;true&lt;/span&gt;) {
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        ProgramState::State next;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        &lt;span style=&quot;color: #0000ff&quot;&gt;switch&lt;/span&gt; (state.cur_state()) {
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;case&lt;/span&gt; ProgramState::STATE_INIT:
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                next = runState_init(state);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                &lt;span style=&quot;color: #0000ff&quot;&gt;break&lt;/span&gt;;
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;case&lt;/span&gt; ProgramState::STATE_PROCESS_LINE:
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                next = runState_process_line(state);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                &lt;span style=&quot;color: #0000ff&quot;&gt;break&lt;/span&gt;;
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;case&lt;/span&gt; ProgramState::STATE_DONE:
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                setStdinBlocking(&lt;span style=&quot;color: #0000ff&quot;&gt;true&lt;/span&gt;);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                &lt;span style=&quot;color: #0000ff&quot;&gt;return&lt;/span&gt; SUCCESS;
&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;case&lt;/span&gt; ProgramState::STATE_NONE:
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;case&lt;/span&gt; ProgramState::STATE_ERROR:
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;default&lt;/span&gt;:
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                setStdinBlocking(&lt;span style=&quot;color: #0000ff&quot;&gt;true&lt;/span&gt;);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;                &lt;span style=&quot;color: #0000ff&quot;&gt;return&lt;/span&gt; FAILURE;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        }
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        ProgramState::State cur = state.cur_state();
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        state.set_prev_state(cur);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        state.set_cur_state(next);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        &lt;span style=&quot;color: #008000&quot;&gt;// For now, simply let the user decide when to swap and quit. We can&lt;/span&gt;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        &lt;span style=&quot;color: #008000&quot;&gt;// always change this later.&lt;/span&gt;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        ReturnCode code = checkForUserSignal();
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        &lt;span style=&quot;color: #0000ff&quot;&gt;if&lt;/span&gt; (code != CONTINUE) {
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            setStdinBlocking(&lt;span style=&quot;color: #0000ff&quot;&gt;true&lt;/span&gt;);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;            &lt;span style=&quot;color: #0000ff&quot;&gt;return&lt;/span&gt; code;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;        }
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    }
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;}
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;ProgramState::State runState_init(ProgramState&amp;amp; state) {
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    cout &amp;lt;&amp;lt; &amp;quot;&lt;span style=&quot;color: #8b0000&quot;&gt;Please provide a line of text for me to repeat ad-nauseum\n&lt;/span&gt;&amp;quot;;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    &lt;span style=&quot;color: #0000ff&quot;&gt;string&lt;/span&gt; line;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    setStdinBlocking(&lt;span style=&quot;color: #0000ff&quot;&gt;true&lt;/span&gt;);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    getline(cin, line);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    setStdinBlocking(&lt;span style=&quot;color: #0000ff&quot;&gt;false&lt;/span&gt;);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    state.set_line_text(line);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    cout &amp;lt;&amp;lt; &amp;quot;&lt;span style=&quot;color: #8b0000&quot;&gt;Thanks!\n&lt;/span&gt;&amp;quot;;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    state.set_line_count(0);
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;
&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;    &lt;span style=&quot;color: #0000ff&quot;&gt;return&lt;/span&gt; ProgramState::STATE_PROCESS_LINE;
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;}
&lt;/pre&gt;&lt;pre style=&quot;background-color: #c0c0c0; margin: 0em; width: 100%; font-family: consolas,&amp;#39;Courier New&amp;#39;,courier,monospace; font-size: 11px&quot;&gt;&lt;/pre&gt;&lt;/pre&gt;


&lt;p&gt;
Easy, right? And here is an example transcript of stepping forward and then back between the versions in the repository (behind-the-scenes compiles not shown):&lt;/p&gt;


&lt;table border=&quot;1&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&quot;background-color: black; color: rgb(200,200,200)&quot;&gt;
&lt;pre&gt;eric:~/code/hotswap/v1$ &lt;span style=&quot;color:green; margin:0em&quot;&gt;../bin/hotswap.out&lt;/span&gt;
HotSwap example started - version 0.1
Initial call
Running state machine
Please provide a line of text for me to repeat ad-nauseum
&lt;span style=&quot;color:green; margin:0em&quot;&gt;All work and no play makes jack a dull boy&lt;/span&gt;
Thanks!
0: All work and no play makes jack a dull boy
1: All work and no play makes jack a dull boy
2: All work and no play makes jack a dull boy
&lt;span style=&quot;color:green; margin:0em&quot;&gt;u&lt;/span&gt;
Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine
3 mutations: All work and no play makes jack a dull boy
4 mutations: All work and no play maXes jack a dull boy
5 mutations: All workqand no play maXes jack a dull boy
6 mutations: All workqand nL play maXes jack a dull boy
&lt;span style=&quot;color:green; margin:0em&quot;&gt;u&lt;/span&gt;
Going down for swap
HotSwap example started - version 0.1
Coming up from swap!
Running state machine
7: All workqand nL play maXes jack a dull boy
8: All workqand nL play maXes jack a dull boy
9: All workqand nL play maXes jack a dull boy
&lt;span style=&quot;color:green; margin:0em&quot;&gt;q&lt;/span&gt;
Terminating with code 0&lt;/pre&gt;

&lt;/td&gt;
&lt;/tr&gt;

&lt;/tbody&gt;&lt;/table&gt;


&lt;p&gt;
&lt;br/&gt;

As you can see, version 0.2 mutates the line as it goes, while version 0.1 simply prints it forever. There are more differences than that, but you can discover the rest in the code.&lt;/p&gt;


&lt;p&gt;
Enjoy! If you do end up playing with it, I’d love to hear about your experiences, or your thoughts on the design even if not.&lt;/p&gt;

&lt;br/&gt;
&lt;p&gt;
&lt;em&gt;This will probably be my last post for a while – on Saturday I leave the continent for 2 weeks and the city for 6. I will try to respond to emails and comments while I’m gone, but I may be a bit slower than usual.&lt;/em&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/8001471963766584930/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/04/robustly-hot-swapping-binaries-with.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/8001471963766584930'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/8001471963766584930'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/04/robustly-hot-swapping-binaries-with.html' title='Robustly hot swapping binaries (with code)'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-579732869948510119</id><published>2010-04-23T16:37:00.010-04:00</published><updated>2021-07-17T22:47:20.774-04:00</updated><title type='text'>Indexing and enumerating subsets of a given size</title><content type='html'>&lt;p&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8e3PZ25ipPp0ZHfRI7lyYOUl-1C5zBxmHe9dv6tmlUMJX-k2j_qBINCjhPWowB36pJ66_ojvwhrroWkpYn6MdkmW-hJdUGruTVSDTuiyn1TIYdPC29B51JvbxkFBstBqbfWrk-8y86HYS/s1600-h/9170.png&quot;&gt;&lt;img style=&quot;border-right-width: 0px; display: block; float: none; border-top-width: 0px; border-bottom-width: 0px; margin-left: 
auto; border-left-width: 0px; margin-right: auto&quot; title=&quot;Subsets of a 9 element set containing up to 3 elements, in Banker&amp;#39;s order&quot; border=&quot;0&quot; alt=&quot;Subsets of a 9 element set containing up to 3 elements, in Banker&amp;#39;s order&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoLDL6KjFdU7dKmDXkzgqW9QuyccLuYNIOLhVMxclg-ncxbYbL6c_Npp5yOk1H7SDvZ4KauBXrBxuPcsMX1ReMHh2xFA5lJQUux9FoI3L3Ry9hC28g83VMkeA9TzPhWKZCnAVn6sSCVTl3/?imgmax=800&quot; width=&quot;391&quot; height=&quot;31&quot; /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;I received an email yesterday from a gentleman named Calvin Miracle, asking my opinion on subset enumeration strategy. He also provided a copy of &lt;a title=&quot;Efficiently Enumerating the Subsets of a Set&quot; href=&quot;http://applied-math.org/subset.pdf&quot;&gt;this paper&lt;/a&gt;, which details an algorithm to generate a sequence of subsets in Banker’s order&lt;sup&gt;&lt;a href=&quot;#bankersOrdering&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, and asked about an algorithm to generate these subsets in a stateless manner. I’ll let him describe it:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Given a call to the method banker(&lt;em&gt;n&lt;/em&gt;,&lt;em&gt;k&lt;/em&gt;,&lt;em&gt;i&lt;/em&gt;), where &lt;em&gt;n&lt;/em&gt; is the size of a set, &lt;em&gt;k&lt;/em&gt; is the subset size under consideration, and &lt;em&gt;i&lt;/em&gt; is the i&#39;th subset of size &lt;em&gt;k&lt;/em&gt;, the method will return a boolean vector of size &lt;em&gt;n&lt;/em&gt;, with only &lt;em&gt;k&lt;/em&gt; TRUE values, that selects the i&#39;th&lt;em&gt; k&lt;/em&gt;-size subset from the overall set.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Now, prior to this I had never heard of Banker’s sequences nor really thought about enumerating subsets, but I’m always willing to be &lt;a href=&quot;http://xkcd.com/356/&quot;&gt;nerd sniped&lt;/a&gt; so I gave it a go. 
Presented here is the algorithm I designed for him.&lt;/p&gt;&lt;h2&gt;Disclaimer&lt;/h2&gt;&lt;p&gt;After writing this algorithm, I did some more digging and found out that some recent research has been done into enumerating subsets of a specific size, although in lexicographic order rather than Banker’s order. &lt;a href=&quot;https://en.wikipedia.org/wiki/Combinatorial_number_system&quot;&gt;Wikipedia&lt;/a&gt; has details on this, and provides an algorithm very similar to the one I created here, except for lexicographic ordering. So none of this should be seen as especially new or groundbreaking, although it was new to me and hopefully will be to you as well.&lt;/p&gt;&lt;h2&gt;Algorithm&lt;/h2&gt;&lt;p&gt;The basic idea of this algorithm is that given any choice of the first element for our subset, we can calculate how many possible ways there are to choose the remaining elements, and we know that every subset with this first element will come before every subset with a later first element according to our ordering scheme. So we can iterate through the possible choices of first element, totalling up how many subsets each represents, until we find the range that the desired index falls within. We can then recursively determine the rest of the subset in the same manner.&lt;/p&gt;&lt;p&gt;So, let &lt;em&gt;n&lt;/em&gt; be the number of items,&lt;em&gt; k&lt;/em&gt; the size of the subset, and &lt;em&gt;i &lt;/em&gt;the specific index we are looking for. We will enumerate sequences by considering all subsets that start with “1”, then all that start with “01”, then all that start with “001”, etc., where “0” represents skipping an item and “1” represents selecting it. There are &lt;em&gt;n&lt;/em&gt;-1 choose &lt;em&gt;k&lt;/em&gt;-1 of the first sort, &lt;em&gt;n&lt;/em&gt;-2 choose &lt;em&gt;k&lt;/em&gt;-1 of the second, &lt;em&gt;n&lt;/em&gt;-3 choose &lt;em&gt;k&lt;/em&gt;-1 of the third, etc. 
So:&lt;/p&gt;&lt;table border=&quot;0&quot; cellspacing=&quot;0&quot; cellpadding=&quot;2&quot; width=&quot;440&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;54&quot;&gt;“1”:&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;143&quot; align=&quot;right&quot;&gt;0&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;52&quot;&gt;&amp;lt;= &lt;em&gt;i&lt;/em&gt; &amp;lt; &lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;189&quot;&gt;&lt;img style=&quot;border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px&quot; title=&quot;n-1Ck-1&quot; border=&quot;0&quot; alt=&quot;n-1Ck-1&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk2dpzfKmSSBAmBSKrzFpSXEDqau9xYtyv7ewvGv1CwYxeiSXRO-XM9UlbE62eoCYmRSjNJEH54DLNPqp7Arttc9kdqV3C-ayjWoqqrYppvgDA5Xe8jrAh197ybNKCjqFN3jxvKfopdssp/?imgmax=800&quot; width=&quot;49&quot; height=&quot;32&quot; /&gt;&lt;/td&gt;     &lt;/tr&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;54&quot;&gt;“01”:&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;143&quot; align=&quot;right&quot;&gt;&lt;img style=&quot;border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px&quot; title=&quot;n-1Ck-1&quot; border=&quot;0&quot; alt=&quot;n-1Ck-1&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk2dpzfKmSSBAmBSKrzFpSXEDqau9xYtyv7ewvGv1CwYxeiSXRO-XM9UlbE62eoCYmRSjNJEH54DLNPqp7Arttc9kdqV3C-ayjWoqqrYppvgDA5Xe8jrAh197ybNKCjqFN3jxvKfopdssp/?imgmax=800&quot; width=&quot;49&quot; height=&quot;32&quot; /&gt;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;52&quot;&gt;&amp;lt;= &lt;em&gt;i&lt;/em&gt; &amp;lt;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;189&quot;&gt;&lt;img style=&quot;border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px&quot; title=&quot;n-1Ck-1 + n-2Ck-1&quot; border=&quot;0&quot; alt=&quot;n-1Ck-1 + n-2Ck-1&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYSSCU_tHgoBnK7d7IcWTxw_20QiTJ6QnJvrxFdnmaVcxh216zVrgi5tkhyphenhyphenGMLJsyk3GauEsq1TWwW7a0HGB_4vNkWZqIRWNnhhSMSHtcpBgTESXAbAQBV5oVyJJT6B1PrOWCpnFuV6w9K/?imgmax=800&quot; width=&quot;115&quot; height=&quot;32&quot; /&gt;&lt;/td&gt;     &lt;/tr&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;54&quot;&gt;“001”:&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;143&quot; align=&quot;right&quot;&gt;&lt;img style=&quot;border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px&quot; title=&quot;n-1Ck-1 + n-2Ck-1&quot; border=&quot;0&quot; alt=&quot;n-1Ck-1 + n-2Ck-1&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYSSCU_tHgoBnK7d7IcWTxw_20QiTJ6QnJvrxFdnmaVcxh216zVrgi5tkhyphenhyphenGMLJsyk3GauEsq1TWwW7a0HGB_4vNkWZqIRWNnhhSMSHtcpBgTESXAbAQBV5oVyJJT6B1PrOWCpnFuV6w9K/?imgmax=800&quot; width=&quot;115&quot; height=&quot;32&quot; /&gt;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;52&quot;&gt;&amp;lt;= &lt;em&gt;i&lt;/em&gt; &amp;lt;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;189&quot;&gt;&lt;img style=&quot;border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px&quot; title=&quot;n-1Ck-1 + n-2Ck-1 + n-3Ck-1&quot; border=&quot;0&quot; alt=&quot;n-1Ck-1 + n-2Ck-1 + n-3Ck-1&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPPptG0wL0Z4Ras3uwEXKa3PMYK9VGzaZRnCTPHUWDJzg97mY0IozsKeA4hEpw_RrEZ3Oa5rmfWzFGBM9tGeUNVH4S3twq3LU4GFA17-USECNdmof1GKxw8DWyZPEfo8O4srMaBRsAU3-u/?imgmax=800&quot; width=&quot;181&quot; height=&quot;32&quot; /&gt;&lt;/td&gt;     &lt;/tr&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;54&quot;&gt;…&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;143&quot; align=&quot;right&quot;&gt;&amp;#160;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;52&quot;&gt;&amp;#160;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;189&quot;&gt;&amp;#160;&lt;/td&gt;     &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;&amp;#160;&lt;/p&gt;&lt;p&gt;Once we know what prefix &lt;em&gt;i&lt;/em&gt; has, we can recursively determine the next sequence using &lt;em&gt;n&#39;&lt;/em&gt; = &lt;em&gt;n&lt;/em&gt;-(prefix length), &lt;em&gt;k&#39; &lt;/em&gt;= &lt;em&gt;k&lt;/em&gt;-1, &lt;em&gt;i&#39; &lt;/em&gt;= &lt;em&gt;i&lt;/em&gt;-(bottom of range).&lt;/p&gt;&lt;h2&gt;Example&lt;/h2&gt;&lt;p&gt;Let &lt;em&gt;n&lt;/em&gt; = 5, &lt;em&gt;k &lt;/em&gt;= 3, &lt;em&gt;i &lt;/em&gt;= 7:&lt;/p&gt;&lt;table border=&quot;0&quot; cellspacing=&quot;0&quot; cellpadding=&quot;2&quot; width=&quot;348&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;91&quot;&gt;“1”&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;15&quot; align=&quot;right&quot;&gt;0&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;47&quot;&gt;&amp;lt;= &lt;em&gt;i&lt;/em&gt; &amp;lt; &lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;193&quot;&gt;6&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; (4C2)&lt;/td&gt;     &lt;/tr&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;91&quot;&gt;“01”:&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;15&quot; align=&quot;right&quot;&gt;6&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;47&quot;&gt;&amp;lt;= &lt;em&gt;i&lt;/em&gt; &amp;lt;&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;193&quot;&gt;9 = 6+3&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; (4C2 + 3C2)&lt;/td&gt;     &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;So, initial prefix is “01”.&lt;/p&gt;&lt;p&gt;We recurse with &lt;em&gt;n&lt;/em&gt; = 3, &lt;em&gt;k&lt;/em&gt; = 2, &lt;em&gt;i&lt;/em&gt; = 1:&lt;/p&gt;&lt;table border=&quot;0&quot; cellspacing=&quot;0&quot; cellpadding=&quot;2&quot; width=&quot;348&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;       &lt;td valign=&quot;top&quot; width=&quot;91&quot;&gt;“1”&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;15&quot; align=&quot;right&quot;&gt;0&lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;47&quot;&gt;&amp;lt;= &lt;em&gt;i&lt;/em&gt; &amp;lt; &lt;/td&gt;        &lt;td valign=&quot;top&quot; width=&quot;193&quot;&gt;2&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; (2C1)&lt;/td&gt;     &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;So, the next piece is “1”.&lt;/p&gt;&lt;p&gt;Finally, we recurse with &lt;em&gt;n&lt;/em&gt;=2,&lt;em&gt; k&lt;/em&gt;=1, &lt;em&gt;i&lt;/em&gt;=1:     &lt;br /&gt;
Trivially, “10” has index 0 and “01” has index 1, so the last piece is “01”.&lt;/p&gt;&lt;p&gt;Putting this together, the sequence is “01101”,&amp;#160; so items 2, 3, and 5 of the 5 compose the 7th subset&lt;sup&gt;&lt;a href=&quot;#indexedFromZero&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; of length 3.&lt;/p&gt;&lt;h2&gt;Complexity&lt;/h2&gt;&lt;p&gt;This algorithm is quite efficient – it needs to check at most &lt;em&gt;n&lt;/em&gt; prefixes in total over all the levels of recursion, since each prefix we consider shortens our candidate set by 1. There will be at most &lt;em&gt;k &lt;/em&gt;recursive levels, and &lt;em&gt;k&lt;/em&gt; ≤ &lt;em&gt;n&lt;/em&gt;. So if we handwave the “&lt;em&gt;a &lt;/em&gt;choose &lt;em&gt;b” &lt;/em&gt;calculations, we find that this can be done in O(&lt;em&gt;n&lt;/em&gt;) time, which is O(1) in the size of the output, and thus optimal.&lt;/p&gt;&lt;p&gt;If we &lt;em&gt;don’t&lt;/em&gt; handwave the choice functions, we find that &lt;em&gt;n&lt;/em&gt;C&lt;em&gt;k&lt;/em&gt; is O(&lt;em&gt;n&lt;/em&gt;!) in the worst case, so the result requires O(log(&lt;em&gt;n&lt;/em&gt;!)) = O(&lt;em&gt;n&lt;/em&gt; log &lt;em&gt;n&lt;/em&gt;) bits to store. If addition is O(1) we still may need to add up to &lt;em&gt;n &lt;/em&gt;of these, so the total worst-case runtime is O(&lt;em&gt;n&lt;/em&gt;&lt;sup&gt;2&lt;/sup&gt; log &lt;em&gt;n&lt;/em&gt;), or O(&lt;em&gt;n&lt;/em&gt; log &lt;em&gt;n&lt;/em&gt;) in the size of the output. However, with small values of &lt;em&gt;k&lt;/em&gt; we don&#39;t get anywhere close to that worst case, and indeed are closer to O(log &lt;em&gt;n&lt;/em&gt;) in the size of the input. This is still quite efficient, and probably about the best that can be achieved in a function of this type. 
We should also consider the time needed to &lt;em&gt;calculate&lt;/em&gt; the choice values, but if we are iterating over these indices the values can be pre-cached first, and then the amortized cost per subset lookup goes to effectively zero.&lt;/p&gt;&lt;h2&gt;Code&lt;/h2&gt;&lt;p&gt;If you want to play with this algorithm yourself, I have placed some Java code implementing it &lt;a href=&quot;https://github.com/EricBurnett/EnumeratedSubsets&quot;&gt;on GitHub&lt;/a&gt;. It is as efficient as expected, and should be plenty fast enough for anything you could conceivably need it for. For example, it takes essentially no time to determine that the 160,000,000,000,000,000,000,000,000,000th subset of size 12 from a 10,000 item set is {0, 1, 2, 69, 1212, 1381, 4878, 5291, 5974, 6139, 6639, 8979}. So have at it!&lt;/p&gt;&lt;p&gt;&lt;b&gt;Update:&lt;/b&gt; Ligon Liu has kindly provided a C# port of this code, which I have added to the repository (in the c# directory).&lt;/p&gt;&lt;p&gt;&lt;b&gt;Update 2:&lt;/b&gt; Josiah Carlson has kindly provided a Python port of this code, also in the repository now.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Update 3:&lt;/b&gt; &lt;a href=&quot;http://richardlyon.co.uk&quot;&gt;Richard Lyon&lt;/a&gt; has kindly provided ports to both JavaScript and PHP, on GitHub as well. Thanks guys!&lt;/p&gt;&lt;p&gt;&lt;b&gt;Update 4:&lt;/b&gt; Corin Lawson has a C implementation in his &lt;a href=&quot;https://github.com/au-phiware/bankers&quot;&gt;GitHub repository&lt;/a&gt;, along with other Banker&#39;s order functions. Great!&lt;/p&gt;&lt;h2&gt;&lt;font color=&quot;#666666&quot;&gt;Footnotes&lt;/font&gt;&lt;/h2&gt;&lt;ol&gt;&lt;li&gt;&lt;a name=&quot;bankersOrdering&quot;&gt;&lt;/a&gt;The easiest way to think of Banker’s ordering is to think of comparing the &lt;em&gt;indices &lt;/em&gt;of the items that make up the subsets. 
So to order two subsets, take the list of indices of elements in the first subset, and compare it to that of the second subset in standard dictionary (lexicographic) order. So {2, 3, 4} comes before {2, 3, 5} but after {1, 3, 5}, for example. &lt;br /&gt;
&lt;br /&gt;
The image at the top of this post depicts the Banker’s ordering for subsets of lengths 1, 2, and 3, respectively, from a set of size 9.&lt;/li&gt;
&lt;li&gt;&lt;a name=&quot;indexedFromZero&quot;&gt;&lt;/a&gt;Subsets are indexed from 0, so you may consider this the 8th subset.&lt;/li&gt;
&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/579732869948510119/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/04/indexing-and-enumerating-subsets-of.html#comment-form' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/579732869948510119'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/579732869948510119'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/04/indexing-and-enumerating-subsets-of.html' title='Indexing and enumerating subsets of a given size'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoLDL6KjFdU7dKmDXkzgqW9QuyccLuYNIOLhVMxclg-ncxbYbL6c_Npp5yOk1H7SDvZ4KauBXrBxuPcsMX1ReMHh2xFA5lJQUux9FoI3L3Ry9hC28g83VMkeA9TzPhWKZCnAVn6sSCVTl3/s72-c?imgmax=800" height="72" width="72"/><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-6097403193265206323</id><published>2010-04-14T01:45:00.013-04:00</published><updated>2021-07-17T22:46:39.926-04:00</updated><title type='text'>Introducing: Marriage Sort</title><content type='html'>&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a 
href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKhDldNozkQank-ISULNXgW_Ge1y_vxPCznkuWeZUdQoH6OzMb_vUV2ZyOrSqZkuYSWpAFtX_VfZXAn6Vdzs3p7rTKgByJm6F5id7bM0z4gEfz5UJcPi1iyj6odf13nvdFw-1GQMDWaLYy/s1600/Marriage.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKhDldNozkQank-ISULNXgW_Ge1y_vxPCznkuWeZUdQoH6OzMb_vUV2ZyOrSqZkuYSWpAFtX_VfZXAn6Vdzs3p7rTKgByJm6F5id7bM0z4gEfz5UJcPi1iyj6odf13nvdFw-1GQMDWaLYy/s576/Marriage.png&quot; width=&quot;576&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;i&gt;Update: I&#39;ve created a live visualization of this algorithm, so you can see it in action - see &lt;a href=&quot;http://www.thelowlyprogrammer.com/2012/01/marriage-sort-visualization.html&quot;&gt;Marriage Sort, a Visualization.&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Two weeks ago, a link showed up on Hacker News on how to (mathematically) &lt;a href=&quot;http://www.mathpages.com/home/kmath018/kmath018.htm&quot;&gt;select the best wife&lt;/a&gt;. Tongue firmly in cheek, of course. The key takeaway from the article is that knowing only the relative rankings of items in a sequence (and assuming you can never go back to a previous one), you can maximize your chances of selecting the maximum one by letting N/e go by and then picking the next one that&#39;s better than all those. Further, to maximize your &lt;i&gt;expected value&lt;/i&gt; you only need to let the first √N - 1 go by. After rattling around in the back of my mind for a couple of weeks, yesterday this tidbit coalesced into an actual Idea. Could I turn this selection routine into a sorting algorithm? &lt;br /&gt;
&lt;br /&gt;
And thus, Marriage Sort was born. I present it here purely out of interest, not because it is especially fast - if nothing else, it is the only sorting algorithm I know of with an average complexity of O(n&lt;sup&gt;1.5&lt;/sup&gt;)&lt;sup&gt;&lt;a href=&quot;#AlgorithmRuntimes&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. &lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;Algorithm&lt;/h2&gt;The basic idea of this algorithm is to repeatedly choose the maximum element from the first √N - 1 elements of our working set, and then scan to the end looking for elements bigger than it. When one is found, swap it to the end of the working set, and decrease the working set by one. After the pass is complete, swap out the maximum element from the first √N - 1 as well, and start again. When everything is finished (√N - 1 &amp;lt;= 0), use insertion sort to put the array into &lt;i&gt;true&lt;/i&gt; sorted order.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_UPzkwv1hbbjC1QKrqBr0peZBatOiOPrHeisx0G3pY1MnvqTAsghqv79-0f9wGKojmr4o8lekq-aG3gzpqfT0m6b8FGbNAfHUyLFtdw572Bqj6ESGILL-xkPX7p_37HUr3C3sdYkkKbgu/s1600/one+pass.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;145&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_UPzkwv1hbbjC1QKrqBr0peZBatOiOPrHeisx0G3pY1MnvqTAsghqv79-0f9wGKojmr4o8lekq-aG3gzpqfT0m6b8FGbNAfHUyLFtdw572Bqj6ESGILL-xkPX7p_37HUr3C3sdYkkKbgu/s400/one+pass.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;One pass of the Marriage Sort. Two elements are found and swapped to the end, followed by the maximum element from the first √N - 1.&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
This works because, as we have noted from the article linked above, each element bigger than the first √N - 1 is expected to be &lt;i&gt;close&lt;/i&gt; to the largest remaining element in the array. By repeatedly taking these elements and moving them to the end, we put the array into an approximately sorted order, where elements should be reasonably close to their &#39;true&#39; positions. When this is done insertion sort will do the final rearranging, moving elements the short distances to their true positions. &lt;br /&gt;
&lt;br /&gt;
In pseudocode:&lt;br /&gt;
&lt;blockquote&gt;def marriagesort(array):&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; end = array.length&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; while true: &lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;skip = sqrt(end) - 1&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;if skip &amp;lt;= 0: break&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;# Pick the best element in the first √N - 1:&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;bestPos = 0, i = 1&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;while i &amp;lt; skip:&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;if array[i] &amp;gt; array[bestPos]: bestPos = i&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;i += 1&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;# Now pull out elements &amp;gt;= array[bestPos], and move to the end:&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;i = skip&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;while i &amp;lt; end:&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;if array[i] &amp;gt;= array[bestPos]: &lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;array.swap(i, end-1)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;end -= 1&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;else: &lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;i += 1&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;# Finally, move our best pivot element to the end&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;array.swap(bestPos, end-1)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;end -= 1&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; # Finish off with insertion sort to put the elements into true sorted order&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; insertionsort(array)&lt;/blockquote&gt;&lt;br /&gt;
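For the curious, the pseudocode above translates into runnable Python fairly directly. The following is an illustrative sketch of my own (not the benchmarked code from the gist), using math.isqrt for the √N computation:&lt;br /&gt;

```python
import math

def marriage_sort(array):
    # In-place marriage sort, following the pseudocode above.
    if not array:
        return array
    end = len(array)
    while True:
        skip = math.isqrt(end) - 1
        if not skip > 0:
            break
        # Pick the maximum of the first sqrt(end) - 1 elements.
        best = max(range(skip), key=array.__getitem__)
        # Sweep right; anything at least as big moves to the end.
        i = skip
        while end > i:
            if array[i] >= array[best]:
                array[i], array[end - 1] = array[end - 1], array[i]
                end -= 1
            else:
                i += 1
        # Retire the pivot itself, then start the next pass.
        array[best], array[end - 1] = array[end - 1], array[best]
        end -= 1
    # Finish with insertion sort to reach true sorted order.
    for j in range(1, len(array)):
        key, k = array[j], j - 1
        while k >= 0 and array[k] > key:
            array[k + 1] = array[k]
            k -= 1
        array[k + 1] = key
    return array
```

Sorting a shuffled list and comparing against sorted() makes an easy sanity check.&lt;br /&gt;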
Here you can see this algorithm working on a randomized array. Many thanks to Aldo Cortesi for the wonderful &lt;a href=&quot;http://corte.si/posts/code/visualisingsorting/index.html&quot;&gt;sorting algorithm visualization framework&lt;/a&gt;! &lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKhDldNozkQank-ISULNXgW_Ge1y_vxPCznkuWeZUdQoH6OzMb_vUV2ZyOrSqZkuYSWpAFtX_VfZXAn6Vdzs3p7rTKgByJm6F5id7bM0z4gEfz5UJcPi1iyj6odf13nvdFw-1GQMDWaLYy/s1600/Marriage.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKhDldNozkQank-ISULNXgW_Ge1y_vxPCznkuWeZUdQoH6OzMb_vUV2ZyOrSqZkuYSWpAFtX_VfZXAn6Vdzs3p7rTKgByJm6F5id7bM0z4gEfz5UJcPi1iyj6odf13nvdFw-1GQMDWaLYy/s576/Marriage.png&quot; width=&quot;576&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;b&gt;Update: &lt;/b&gt;Here is another visualization made using Aldo Cortesi&#39;s tools, of a 512 element array this time. If you take a close look at the insertion sort phase (the slope on the right side), you can see that it is composed out of a lot of little triangles. Each of these triangles corresponds to one pass of the marriage selection routine, and shows how far elements can be from their true sorted position.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKii3puz6G8MTdhKnuP_StfgDBYormdxd1vxuZbJYpWpPQVMNxSETtwExCO-1Dj2wWJlYT8BawawNeRXnH7fMMsdwYFg-9cgnRMTqAbQUU4KtxFKriXZ6T4dgmoe-4DBhnGoh85BXoNhBy/s1600/marriagesort512.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKii3puz6G8MTdhKnuP_StfgDBYormdxd1vxuZbJYpWpPQVMNxSETtwExCO-1Dj2wWJlYT8BawawNeRXnH7fMMsdwYFg-9cgnRMTqAbQUU4KtxFKriXZ6T4dgmoe-4DBhnGoh85BXoNhBy/s576/marriagesort512.png&quot; width=&quot;576&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;h2&gt;Complexity&lt;/h2&gt;Some quick back-of-the-napkin calculations indicate that this algorithm takes O(n&lt;sup&gt;1.5&lt;/sup&gt;) compares in the average case, although the worst case looks to be O(n&lt;sup&gt;2&lt;/sup&gt;). For the first stage (i.e. without the final insertion sort) each pass will take at most n compares and move approximately √N items to the end, requiring about √N passes. I don&#39;t know if this is a tight bound - the passes will actually speed up and require fewer compares as the working set shrinks, but I didn&#39;t want to try to include that in the calculations. Similarly, after each pass all the items moved are guaranteed to be greater than all the items left, so the average distance a moved item sits from its &quot;home&quot; position is capped at √N. This holds the insertion sort to O(n&lt;sup&gt;1.5&lt;/sup&gt;) compares on average as well. &lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhw00OpgCUoUuwU9vVVBfGwQ6iGO5bWLBmD7n9Qf1qpXjioeIKPD4UNhRkIUXzQ9POB55L32Z8zzcAvabISe9wW3GbP2hV1iWOUPkuZP3-o-ZbuiU9326N15yoT4XI_txgARI2t9ygF0OsQ/s1600/algorithm+complexity+-+Comparisons.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhw00OpgCUoUuwU9vVVBfGwQ6iGO5bWLBmD7n9Qf1qpXjioeIKPD4UNhRkIUXzQ9POB55L32Z8zzcAvabISe9wW3GbP2hV1iWOUPkuZP3-o-ZbuiU9326N15yoT4XI_txgARI2t9ygF0OsQ/s576/algorithm+complexity+-+Comparisons.png&quot; width=&quot;576&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
Similarly, this algorithm takes O(n&lt;sup&gt;1.5&lt;/sup&gt;) swaps in the average case (and O(n&lt;sup&gt;2&lt;/sup&gt;) in the worst case). The number of swaps could certainly be decreased if a different final sort were used - the first pass only takes Θ(n) by itself - although in practice this isn&#39;t much use since the compares will still dominate the runtime.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8gT4BagKLdDpyCXfp1gLQjMPEFCQFu-I3ZrHbVMRuP8jPbcq0z212YG5MH25QANXloYwdx5TgDdVyUgeUMVSL9hXjqrviWIVn7bU2Z5mMiHuuNF70kiHu-he4M5bJ5k3jQ5PfhMOJR9C4/s1600/algorithm+complexity+-+swaps.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8gT4BagKLdDpyCXfp1gLQjMPEFCQFu-I3ZrHbVMRuP8jPbcq0z212YG5MH25QANXloYwdx5TgDdVyUgeUMVSL9hXjqrviWIVn7bU2Z5mMiHuuNF70kiHu-he4M5bJ5k3jQ5PfhMOJR9C4/s576/algorithm+complexity+-+swaps.png&quot; width=&quot;576&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
Note that both of these graphs are log-log plots, so the important feature to look at is the slope of the lines rather than their absolute position. For example, in the &quot;Swaps&quot; plot quicksort appears to be hovering around the n∙log(log(n)) line; looking at the slope, however, we can see that it is actually approximating n∙log(n), just with a better constant factor. &lt;br /&gt;
&lt;br /&gt;
Note 2: The x axis of these graphs is the number of elements in the array, and the y axis is the number of comparisons/swaps required. I should have labeled these better. &lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;Postscript&lt;/h2&gt;Some sample code implementing this algorithm is up &lt;a href=&quot;https://gist.github.com/365449&quot;&gt;on GitHub&lt;/a&gt; - this is the code to generate the log-log plots shown above. &lt;br /&gt;
&lt;br /&gt;
As always, questions and comments are welcome! &lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;&lt;span style=&quot;color: #666666;&quot;&gt;Footnotes&lt;/span&gt;&lt;/h2&gt;&lt;ol&gt;&lt;li&gt;&lt;a name=&quot;AlgorithmRuntimes&quot;&gt;&lt;/a&gt;Most sorting algorithms are either Θ(n&lt;sup&gt;2&lt;/sup&gt;) or Θ(n∙log(n)) in the average case. This one falls between those two extremes, making it faster (asymptotically) than algorithms like insertion sort or bubble sort but slower than quicksort or merge sort.&lt;/li&gt;
&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/6097403193265206323/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/04/introducing-marriage-sort.html#comment-form' title='22 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/6097403193265206323'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/6097403193265206323'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/04/introducing-marriage-sort.html' title='Introducing: Marriage Sort'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKhDldNozkQank-ISULNXgW_Ge1y_vxPCznkuWeZUdQoH6OzMb_vUV2ZyOrSqZkuYSWpAFtX_VfZXAn6Vdzs3p7rTKgByJm6F5id7bM0z4gEfz5UJcPi1iyj6odf13nvdFw-1GQMDWaLYy/s72-c/Marriage.png" height="72" width="72"/><thr:total>22</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-3705254461079453753</id><published>2010-03-17T00:38:00.005-04:00</published><updated>2021-07-17T22:45:27.995-04:00</updated><title type='text'>Visualizing RGB, take two - Update</title><content type='html'>&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a 
href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwJyg7oMhiEIBWjLIuizjPF_byouDhBLTalB8mQ0Uc-QNIFptyBcqiiaIsQoL-5yuYjGQWNFMcja9SkAyq3BmcFbH_Sp6CrabHWgHmiC1Ujwp98ygoyTdjvXE-rByagBKnIWmLnDimudA/s1600-h/MonaLisa_allRGBv2_cap40_downsampled.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwJyg7oMhiEIBWjLIuizjPF_byouDhBLTalB8mQ0Uc-QNIFptyBcqiiaIsQoL-5yuYjGQWNFMcja9SkAyq3BmcFbH_Sp6CrabHWgHmiC1Ujwp98ygoyTdjvXE-rByagBKnIWmLnDimudA/s320/MonaLisa_allRGBv2_cap40_downsampled.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
When I posted my &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/02/visualizing-rgb-take-two.html&quot;&gt;&#39;Visualizing RGB, take two&#39;&lt;/a&gt; article last month, an anonymous commenter going by the name &lt;i&gt;Full-size&lt;/i&gt; had a good suggestion - I should use a form of &lt;a href=&quot;https://en.wikipedia.org/wiki/Error_diffusion&quot;&gt;error diffusion&lt;/a&gt; to better hide the pixel errors when selecting the colours to use. Within a couple days I had this implemented, and wow does it make an improvement! Unfortunately, between school and needing to reinstall Windows I never ended up posting the results. So, three weeks late, here they are.&lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;Results for Barcelona image&lt;/h2&gt;As you may recall, the goal of the algorithm was to transform a source image into a 4096x4096 version that uses every RGB colour exactly once. As a reminder, here is the original image, and what my algorithm previously did with it:&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoadbu-wXRqQfMMdiADlbKS1U24qESKnz9qBreSg3xRLMZhX8-BAwUqrfIyFpHSAvBg0kYq8GtrdJNfCA4H7_XUtqlwMxIYq5-t_2nMsBErCvh2l7xcgnwSpQJe_hTk_o4nR7e3xLsKGA4/s1600-h/2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoadbu-wXRqQfMMdiADlbKS1U24qESKnz9qBreSg3xRLMZhX8-BAwUqrfIyFpHSAvBg0kYq8GtrdJNfCA4H7_XUtqlwMxIYq5-t_2nMsBErCvh2l7xcgnwSpQJe_hTk_o4nR7e3xLsKGA4/s200/2.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv0v_Fpb9sKglVAH0F3OqtRYDRHeRdNN02yrcmIjZvKwCX6DKQ7LCewJi_oFSbEPF3Et59rjfbe7c2NNfwxxzEvnze2V4Z5Avbd1WpjdA6jDBwijuEdYnmsV3aF3PAXH_tl7ulxdUQg0uP/s1600-h/2_allRGBv2_big_downsampled.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv0v_Fpb9sKglVAH0F3OqtRYDRHeRdNN02yrcmIjZvKwCX6DKQ7LCewJi_oFSbEPF3Et59rjfbe7c2NNfwxxzEvnze2V4Z5Avbd1WpjdA6jDBwijuEdYnmsV3aF3PAXH_tl7ulxdUQg0uP/s200/2_allRGBv2_big_downsampled.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div width=&quot;400px&quot;&gt;&lt;div style=&quot;float: left; text-align: center; width: 200px;&quot;&gt;Original&lt;/div&gt;&lt;div style=&quot;float: right; text-align: center; width: 200px;&quot;&gt;Result from previous version&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
Now, take a look at how it looks using a form of error diffusion (with error terms capped at 60):&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiqsGRTbPkvC344rjb7W6vO-Q66eqZN8NSTJVCytCODalPlkOjgprBLoB2uwEfyoEj8yirkdNWf7ISg6r2MIYQH-ozW6ZKMDbe29M-sRM93uMZSmPadjehAB8m_cRwIp3X5BvKoD_Y_K_U/s1600-h/3_allRGBv2_cap60_downsampled.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;320&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiqsGRTbPkvC344rjb7W6vO-Q66eqZN8NSTJVCytCODalPlkOjgprBLoB2uwEfyoEj8yirkdNWf7ISg6r2MIYQH-ozW6ZKMDbe29M-sRM93uMZSmPadjehAB8m_cRwIp3X5BvKoD_Y_K_U/s320/3_allRGBv2_cap60_downsampled.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
(Full 50 meg png &lt;a href=&quot;http://www.mediafire.com/file/2gmfkxhzmyo/3_allRGBv2_cap60_big.zip&quot;&gt;available here&lt;/a&gt;). Much nicer to look at! Take a look at the sky in particular - where it used to go psychedelic trying to deal with all that near white, large portions now end up simply turning into a uniform gray. The new version is worse in a couple spots (e.g. the wheel wells of the car), but overall I think it is hugely improved. Now, I wonder how the new version would do on a harder image?&lt;br /&gt;
&lt;br /&gt;
&lt;h2&gt;Results for the Mona Lisa&lt;/h2&gt;As promised, I am also posting the results of running this algorithm on an image of the Mona Lisa. This is an especially difficult image, because the colour palette is very limited - notice the lack of blue in particular. First, let&#39;s take a look at the original, and the result from the previous version of the code:&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjozHhLM_0vAXKKIlU2CqXmn7UST04_jLLs-zNoD6omS2crcVVFIptO3tBsHcohpiOuUolnaZQDrh0Uy587H7WpLYmgyKGrP8eY82aKJfrXfJoygkG5NV44ESWlqhTqCAwJ2z8WIZANVx3N/s1600-h/MonaLisa_downsampled.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjozHhLM_0vAXKKIlU2CqXmn7UST04_jLLs-zNoD6omS2crcVVFIptO3tBsHcohpiOuUolnaZQDrh0Uy587H7WpLYmgyKGrP8eY82aKJfrXfJoygkG5NV44ESWlqhTqCAwJ2z8WIZANVx3N/s200/MonaLisa_downsampled.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkGEJA8OFoGqP8J2hf7FVuA7np-z1gAQVSYt2qV7opwMhKnfmd9tl5i2vMjT2cRzPvBEHxzjhF0ce6zGO9gS0BSoxC9T_BlNISQozvTT03yVE0R9zuEVlGBek30GeyHb_Rvz3QlxWBHVdl/s1600-h/MonaLisa_allRGBv2_cap0.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkGEJA8OFoGqP8J2hf7FVuA7np-z1gAQVSYt2qV7opwMhKnfmd9tl5i2vMjT2cRzPvBEHxzjhF0ce6zGO9gS0BSoxC9T_BlNISQozvTT03yVE0R9zuEVlGBek30GeyHb_Rvz3QlxWBHVdl/s200/MonaLisa_allRGBv2_cap0.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div width=&quot;400px&quot;&gt;&lt;div style=&quot;float: left; text-align: center; width: 200px;&quot;&gt;Original&lt;/div&gt;&lt;div style=&quot;float: right; text-align: center; width: 200px;&quot;&gt;Result from previous version&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
Ouch. Poor Lisa. Still, let&#39;s forge on and see how the new version does, shall we?&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwJyg7oMhiEIBWjLIuizjPF_byouDhBLTalB8mQ0Uc-QNIFptyBcqiiaIsQoL-5yuYjGQWNFMcja9SkAyq3BmcFbH_Sp6CrabHWgHmiC1Ujwp98ygoyTdjvXE-rByagBKnIWmLnDimudA/s1600-h/MonaLisa_allRGBv2_cap40_downsampled.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwJyg7oMhiEIBWjLIuizjPF_byouDhBLTalB8mQ0Uc-QNIFptyBcqiiaIsQoL-5yuYjGQWNFMcja9SkAyq3BmcFbH_Sp6CrabHWgHmiC1Ujwp98ygoyTdjvXE-rByagBKnIWmLnDimudA/s320/MonaLisa_allRGBv2_cap40_downsampled.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
(Full 50 meg png &lt;a href=&quot;http://www.mediafire.com/file/cd4mzzniuzl/MonaLisa_allRGBv2_cap40_big.zip&quot;&gt;available here&lt;/a&gt;). Considerably better overall, although the colour burning on the forehead and neck is pretty ugly to look at. Still, considering we are trying to use every RGB colour once, I think the results are quite decent.&lt;br /&gt;
&lt;br /&gt;
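For readers who want the flavour of the technique in code, here is a minimal grayscale sketch of capped error diffusion. It is assumption-laden - a Floyd-Steinberg kernel and a plain 0/255 threshold stand in for the real code&#39;s every-colour-once palette search - but the clamp on the error term is the same trick as the caps used above:&lt;br /&gt;

```python
def dither(gray, cap=60):
    # gray: a list of rows of 0-255 intensities. Returns a 0/255
    # dithered copy, diffusing quantization error to later pixels
    # with each error term clamped to the range [-cap, cap].
    h, w = len(gray), len(gray[0])
    img = [list(map(float, row)) for row in gray]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = img[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = new
            # The cap: huge errors get truncated instead of smearing
            # across the whole image.
            err = max(-cap, min(cap, old - new))
            # Floyd-Steinberg weights: 7/16 right, 3/16 down-left,
            # 5/16 down, 1/16 down-right.
            if w > x + 1:
                img[y][x + 1] += err * 7 / 16
            if h > y + 1:
                if x > 0:
                    img[y + 1][x - 1] += err * 3 / 16
                img[y + 1][x] += err * 5 / 16
                if w > x + 1:
                    img[y + 1][x + 1] += err * 1 / 16
    return out
```

Raising the cap preserves tone better at the cost of the burning artifacts mentioned above; lowering it keeps errors local.&lt;br /&gt;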
&lt;h2&gt;Postscript&lt;/h2&gt;The code to achieve this is in the same place as before, updated &lt;a href=&quot;https://github.com/EricBurnett/AllRGBv2&quot;&gt;on GitHub&lt;/a&gt;. Right now you have to change the source code to edit the error diffusion cap, sorry - but feel free to fix that!</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/3705254461079453753/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/03/visualizing-rgb-take-two-update.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3705254461079453753'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3705254461079453753'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/03/visualizing-rgb-take-two-update.html' title='Visualizing RGB, take two - Update'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwJyg7oMhiEIBWjLIuizjPF_byouDhBLTalB8mQ0Uc-QNIFptyBcqiiaIsQoL-5yuYjGQWNFMcja9SkAyq3BmcFbH_Sp6CrabHWgHmiC1Ujwp98ygoyTdjvXE-rByagBKnIWmLnDimudA/s72-c/MonaLisa_allRGBv2_cap40_downsampled.png" height="72" width="72"/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-3027817597499477553</id><published>2010-03-02T22:57:00.008-05:00</published><updated>2021-07-17T22:43:59.578-04:00</updated><title type='text'>Writing an efficient Sieve of Eratosthenes</title><content type='html'>&lt;script 
language=&#39;JavaScript&#39; type=&#39;text/javascript&#39;&gt;
function expandContract(id) {
  var e = document.getElementById(id);
  if (e.style.display == &quot;block&quot;) {
    e.style.display = &quot;none&quot;;
  } else {
    e.style.display = &quot;block&quot;;
  }
}
&lt;/script&gt;&lt;p&gt;&lt;i&gt;See also the &lt;a href=&quot;http://www.thelowlyprogrammer.com/2012/08/primes-part-2-segmented-sieve.html&quot;&gt;followup post&lt;/a&gt; containing a segmented sieve implementation in Go that beats all these.&lt;/i&gt;&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;I recently came across &lt;a href=&quot;http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf&quot;&gt;a paper&lt;/a&gt; by Melissa E. O&#39;Neill on implementing a lazily evaluated &lt;a href=&quot;https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes&quot;&gt;Sieve of Eratosthenes&lt;/a&gt;. It was a very interesting read, although as a non-Haskeller I certainly found the code tricky to follow. And by happy happenstance, not two weeks later I ended up needing a prime generator of my own for one of the &lt;a href=&quot;http://projecteuler.net/&quot;&gt;Project Euler&lt;/a&gt; problems. This, I thought, was the perfect opportunity to try implementing the algorithm from that paper! That thought ended up leading me on a many-day journey to see how efficient an implementation I could make. This post details the key changes my code went through in that time, culminating in a C# algorithm that can sieve around 20 million candidates per second (or a Python version that can do 1.4 million). &lt;/p&gt;&lt;h2&gt;Disclaimers&lt;/h2&gt;&lt;p&gt;&lt;ol&gt;&lt;li&gt;During the course of this journey I learned about segmented sieves and the &lt;a href=&quot;https://en.wikipedia.org/wiki/Sieve_of_Atkin&quot;&gt;Sieve of Atkin&lt;/a&gt;, but I decided not to make use of either of these, instead sticking to the algorithm I chose from the start. Maybe next time.&lt;/li&gt;
&lt;li&gt;If you are simply looking for the fastest implementation possible, I point you towards &lt;a href=&quot;http://cr.yp.to/primegen.html&quot;&gt;http://cr.yp.to/primegen.html&lt;/a&gt;, which is written in highly optimized C and at least an order of magnitude faster than mine.&lt;/li&gt;
&lt;/ol&gt;&lt;/p&gt;&lt;h2&gt;Version 1: Straight re-implementation of algorithm from paper&lt;/h2&gt;&lt;u&gt;Language:&lt;/u&gt; Python&lt;u&gt;&lt;br /&gt;
Time complexity:&lt;/u&gt;&lt;b&gt; &lt;/b&gt;O(n&amp;#8729;log(n)&amp;#8729;log(log(n)))&lt;br /&gt;
&lt;u&gt;Space complexity&lt;/u&gt;&lt;b&gt;: &lt;/b&gt;O(&amp;#8730;n/log(n))&lt;br /&gt;
&lt;u&gt;Candidates in 1 sec:&lt;/u&gt; ~300,000 &lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://gist.github.com/320271#file_v1.py&quot;&gt;Code&lt;/a&gt; &lt;a href=&quot;javascript:expandContract(&#39;v1.py&#39;)&quot;&gt;(expand/contract)&lt;/a&gt;&lt;br /&gt;
&lt;div id=&quot;v1.py&quot; class=&quot;expandable&quot; style=&quot;display:none;&quot;&gt;&lt;script src=&quot;https://gist.github.com/320271.js?file=v1.py&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;This is simply a re-implementation in Python of the Haskell code from the paper above, although it loses some of the elegance in the translation.  You&#39;ll note that I am measuring performance with the very unscientific &#39;candidates in 1 sec&#39; - since I am trying to compare algorithms with different asymptotic complexities in different languages with order-of-magnitude performance differences, it seemed like the easiest way to get a feel for the general magnitudes in this apples-to-oranges comparison. &lt;/p&gt;&lt;br /&gt;
&lt;p&gt;The key idea here is that for each prime we encounter, we make a generator for multiples of it (starting at p&lt;sup&gt;2&lt;/sup&gt;) that we can use to filter out our candidates. For efficiency, all multiples of 2 and 3 are skipped automatically as well, since that reduces the number of candidates to check by two thirds. These generators are all added to a min-heap, keyed by the next value each will generate. Then we can keep checking the next candidate and seeing if it matches the top of the heap. If it does, pop that, increment it, and push it back on keyed by the next value. If not, the candidate must be prime, so build a new generator and add that instead. &lt;/p&gt;&lt;br /&gt;
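Translated out of Haskell, that scheme looks something like this in Python (an illustrative sketch rather than the v1 gist code; heapq supplies the min-heap, and the wheel skipping multiples of 2 and 3 is omitted for clarity):&lt;br /&gt;

```python
import heapq
from itertools import count, islice

def primes_heap():
    # Lazy sieve: a min-heap of entries [next multiple to cross off, prime].
    heap = []
    for c in count(2):
        if heap and heap[0][0] == c:
            # c is composite: advance every generator currently sitting at c.
            while heap[0][0] == c:
                nxt, p = heap[0]
                heapq.heapreplace(heap, [nxt + p, p])
        else:
            yield c
            # Start crossing off at c*c; any smaller multiple of c has a
            # smaller prime factor and is already covered.
            heapq.heappush(heap, [c * c, c])
```

For example, list(islice(primes_heap(), 10)) yields the first ten primes.&lt;br /&gt;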
&lt;p&gt;So, let&#39;s try an example of this. The first candidate we try is 5 (remembering that all multiples of 2 and 3 are already gone), and our heap is empty. This becomes the first item on the heap, and then 7 is checked: &lt;blockquote&gt;5? []&lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;7? [25: 25,35,55,...]&lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;11? [25:25,35,55,... 49:49,77,91,...]&lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;13? [25:25,35,55,... 49:49,77,91,... 121:121,143,187...]&lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;17? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...] &lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;19? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...] &lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;23? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...] &lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;25? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]&lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Aha! Matches the top of our heap...not prime&lt;/div&gt;29? [35:35,55,65,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]&lt;br /&gt;
&lt;div class=&quot;indented&quot;&gt;Nope, prime&lt;/div&gt;...&lt;/blockquote&gt;&lt;/p&gt;&lt;p&gt;Make sense? This is the algorithm straight from the paper, so if my explanation is lacking you could always try there. When I write it down this way, however, an optimization springs out at me - why are we using a heap at all? At any given time we are querying for a single value, and don&#39;t actually care about the other items. Instead of updating a heap structure all the time, we can simply use a hash table instead. Strictly speaking a multi-way hash table, since some values will have multiple prime factors. &lt;/p&gt;&lt;h2&gt;Version 2: Using a hash table&lt;/h2&gt;&lt;u&gt;Language:&lt;/u&gt; Python&lt;u&gt;&lt;br /&gt;
Time complexity:&lt;/u&gt;&lt;b&gt; &lt;/b&gt;O(n&amp;#8729;log(log(n)))&lt;br /&gt;
&lt;u&gt;Space complexity&lt;/u&gt;&lt;b&gt;: &lt;/b&gt;O(&amp;#8730;n/log(n))&lt;br /&gt;
&lt;u&gt;Candidates in 1 sec:&lt;/u&gt; ~1,400,000 &lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://gist.github.com/320271#file_v2.py&quot;&gt;Code&lt;/a&gt; &lt;a href=&quot;javascript:expandContract(&#39;v2.py&#39;)&quot;&gt;(expand/contract)&lt;/a&gt;&lt;br /&gt;
&lt;div id=&quot;v2.py&quot; class=&quot;expandable&quot; style=&quot;display:none;&quot;&gt;&lt;script src=&quot;https://gist.github.com/320271.js?file=v2.py&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;Not bad, under 50 lines of code and with good complexity. This is as far as I know how to go with Python, and as a dynamic language it isn&#39;t really suited to this kind of number crunching anyway, so for the next version I&#39;m switching to C#. There I know the time required for most operations, and more importantly, I have a good profiler I know how to use. &lt;/p&gt;&lt;h2&gt;Version 3: Re-implement in C#&lt;/h2&gt;&lt;u&gt;Language:&lt;/u&gt; C#&lt;u&gt;&lt;br /&gt;
Time complexity:&lt;/u&gt;&lt;b&gt; &lt;/b&gt;O(n&amp;#8729;log(log(n)))&lt;br /&gt;
&lt;u&gt;Space complexity&lt;/u&gt;&lt;b&gt;: &lt;/b&gt;O(&amp;#8730;n/log(n))&lt;br /&gt;
&lt;u&gt;Candidates in 1 sec:&lt;/u&gt; ~10,000,000 &lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://gist.github.com/320271#file_v3.cs&quot;&gt;Code&lt;/a&gt; &lt;a href=&quot;javascript:expandContract(&#39;v3.cs&#39;)&quot;&gt;(expand/contract)&lt;/a&gt;&lt;br /&gt;
&lt;div id=&quot;v3.cs&quot; class=&quot;expandable&quot; style=&quot;display:none;&quot;&gt;&lt;script src=&quot;https://gist.github.com/320271.js?file=v3.cs&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;Wow, significant speedup there! And with a modest 25 line code hit (plus a main method for convenience), that is certainly reasonable. A little profiling tells me that the bottleneck is the list operations, which isn&#39;t too much of a surprise. Of course, C#&#39;s native list implementation isn&#39;t necessarily going to be the fastest for our exact situation - as is often the case when optimizing, the next step is to write our own data structure. Here I use a dead simple data structure, essentially just an array with an Add operation that dynamically resizes it. &lt;/p&gt;&lt;h2&gt;Version 4: Use a custom structure for the &#39;multi&#39; in multi dictionary&lt;/h2&gt;&lt;u&gt;Language:&lt;/u&gt; C#&lt;u&gt;&lt;br /&gt;
Time complexity:&lt;/u&gt;&lt;b&gt; &lt;/b&gt;O(n&amp;#8729;log(log(n)))&lt;br /&gt;
&lt;u&gt;Space complexity&lt;/u&gt;&lt;b&gt;: &lt;/b&gt;O(&amp;#8730;n/log(n))&lt;br /&gt;
&lt;u&gt;Candidates in 1 sec:&lt;/u&gt; ~12,000,000 &lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://gist.github.com/320271#file_v4.cs&quot;&gt;Code&lt;/a&gt; &lt;a href=&quot;javascript:expandContract(&#39;v4.cs&#39;)&quot;&gt;(expand/contract)&lt;/a&gt;&lt;br /&gt;
&lt;div id=&quot;v4.cs&quot; class=&quot;expandable&quot; style=&quot;display:none;&quot;&gt;&lt;script src=&quot;https://gist.github.com/320271.js?file=v4.cs&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;Not bad, a modest speedup in this version. But I must admit, I cheated a little. The main benefit of this new data structure is not the speedup gained at this step. No, the real speedup is allowing an essentially free Clear operation. Because as useful as a hash table is, we can re-write it too to get even more of a speedup. The idea here is that most of our current iterators will have &#39;next&#39; values that differ by at most &amp;#8730;n or so, so a circular array will be better for cache locality, remove the cost of hashing, and let us re-use the &#39;multi&#39; objects. Note that a circular array is essentially a giant array, except only a small sliding window is actually backed by memory and allowed to contain values at any given time. &lt;/p&gt;&lt;h2&gt;Version 5: Use a multi cyclic array instead of a hash table&lt;/h2&gt;&lt;u&gt;Language:&lt;/u&gt; C#&lt;u&gt;&lt;br /&gt;
Time complexity:&lt;/u&gt;&lt;b&gt; &lt;/b&gt;O(n&amp;#8729;log(log(n)))&lt;br /&gt;
&lt;u&gt;Space complexity&lt;/u&gt;&lt;b&gt;: &lt;/b&gt;O(&amp;#8730;n)&lt;br /&gt;
&lt;u&gt;Candidates in 1 sec:&lt;/u&gt; ~20,000,000 &lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://gist.github.com/320271#file_v5.cs&quot;&gt;Code&lt;/a&gt; &lt;a href=&quot;javascript:expandContract(&#39;v5.cs&#39;)&quot;&gt;(expand/contract)&lt;/a&gt;&lt;br /&gt;
&lt;div id=&quot;v5.cs&quot; class=&quot;expandable&quot; style=&quot;display:none;&quot;&gt;&lt;script src=&quot;https://gist.github.com/320271.js?file=v5.cs&quot;&gt;&lt;/script&gt;&lt;/div&gt;&lt;br /&gt;
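&lt;p&gt;To make the circular-array idea concrete, here is a toy Python sketch of the structure (illustrative only - the real version is the C# code above): a fixed block of buckets backs a sliding window of keys, with key k living in bucket k mod size.&lt;/p&gt;

```python
class CyclicMultiMap:
    """Sliding-window multi-map over consecutive integer keys.

    Key k is stored in bucket k % size, so lookups need no hashing and
    adjacent keys sit adjacent in memory (good cache locality). This
    is only valid while all live keys fit inside one window of `size`
    consecutive keys - true in the sieve, where every parked iterator
    stays within roughly sqrt(n) of the current candidate.
    """
    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def add(self, key, value):
        self.buckets[key % self.size].append(value)

    def pop(self, key):
        """Return and clear everything stored at `key`."""
        i = key % self.size
        values = self.buckets[i]
        self.buckets[i] = []
        return values
```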
&lt;p&gt;Ok, that is about as far as I&#39;m willing to go. Beyond this point I expect I&#39;ll start getting into low-benefit, high-complexity optimizations, and I don&#39;t want to go down that road. Although by some counts I am already there - from version 3 there has been a 2x speedup, but at the cost of doubling the code size and ending up with algorithms that fall firmly within the realm of &#39;tricky&#39;. If I actually had to maintain a prime sieve in a professional setting that wasn&#39;t absolutely time critical, I think I would be going with version 3 - the later code starts to impose too much of a mental tax. &lt;/p&gt;&lt;br /&gt;
&lt;p&gt;And there you have it, a reasonably efficient Sieve of Eratosthenes, and for the most part without &#39;tricky&#39; optimizations. Let me know what you think! &lt;/p&gt;&lt;h2&gt;Postscript&lt;/h2&gt;&lt;p&gt;In the comments, Isaac was wondering how these compared to &quot;vanilla array&quot; implementations (essentially, pre-allocate the array and cross-off), so I have added two array implementations to GitHub (&lt;a href=&quot;https://gist.github.com/320271#file_array1.py&quot;&gt;Python&lt;/a&gt;, &lt;a href=&quot;https://gist.github.com/320271#file_array2.cs&quot;&gt;C#&lt;/a&gt;) for comparison. Both are about 4 times faster than the comparative best in their language, testing 5.5 million and 75 million candidates in one second (respectively). The Python version runs out of memory somewhere before 500 million candidates, and the C# version can&#39;t get beyond about 2 billion due to limitations of the BitArray class.&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/3027817597499477553/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/03/writing-efficient-seive-of-eratosthenes.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3027817597499477553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3027817597499477553'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/03/writing-efficient-seive-of-eratosthenes.html' title='Writing an efficient Sieve of Eratosthenes'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-8407591851905561974</id><published>2010-02-20T01:40:00.006-05:00</published><updated>2021-07-17T22:42:35.313-04:00</updated><title type='text'>Visualizing RGB, take two</title><content type='html'>&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://allrgb.com/barcelona&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv0v_Fpb9sKglVAH0F3OqtRYDRHeRdNN02yrcmIjZvKwCX6DKQ7LCewJi_oFSbEPF3Et59rjfbe7c2NNfwxxzEvnze2V4Z5Avbd1WpjdA6jDBwijuEdYnmsV3aF3PAXH_tl7ulxdUQg0uP/s320/2_allRGBv2_big_downsampled.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;b&gt;Update:&lt;/b&gt; The follow-up post can be found &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/03/visualizing-rgb-take-two-update.html&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;p&gt;About 3 weeks ago, I &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/01/visualizing-rgb.html&quot;&gt;wrote about&lt;/a&gt; a program I created to transform pictures into all-RGB images (4096x4096 images using each possible RGB colour exactly once). It worked by ordering pixels based on a Hilbert ordering of the 3d colour cube and then re-colouring them in order, and while it produced interesting images, it pretty much failed at the stated goal of “keep[ing] the overall look untouched”. The problem was that the general hue of the pixels was often very shifted from the input, so the overall features were preserved but the colour balance was not. So for the past week or so I’ve been working on a new program, one that will (hopefully!) do a better job of keeping the base look intact. As with last time, I’m using an image I took in Barcelona for testing – let me know if you have a different one you’d like to see.&lt;/p&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoadbu-wXRqQfMMdiADlbKS1U24qESKnz9qBreSg3xRLMZhX8-BAwUqrfIyFpHSAvBg0kYq8GtrdJNfCA4H7_XUtqlwMxIYq5-t_2nMsBErCvh2l7xcgnwSpQJe_hTk_o4nR7e3xLsKGA4/s1600-h/2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoadbu-wXRqQfMMdiADlbKS1U24qESKnz9qBreSg3xRLMZhX8-BAwUqrfIyFpHSAvBg0kYq8GtrdJNfCA4H7_XUtqlwMxIYq5-t_2nMsBErCvh2l7xcgnwSpQJe_hTk_o4nR7e3xLsKGA4/s200/2.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh-h4lod0I53U8-LDs58TO0KFl1_JfMrhg6Brv-QFl4WKYSUKpvY53G6kyxX-loDcU3z_risMJz0Tlc7vKYnw-cwm8BG9eJLTwExv7ItD6wrx9EmbpVaKRSLUwN7wPNFGmn1dBh7o20rNc/s1600-h/2_allrgbify_small.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh-h4lod0I53U8-LDs58TO0KFl1_JfMrhg6Brv-QFl4WKYSUKpvY53G6kyxX-loDcU3z_risMJz0Tlc7vKYnw-cwm8BG9eJLTwExv7ItD6wrx9EmbpVaKRSLUwN7wPNFGmn1dBh7o20rNc/s200/2_allrgbify_small.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div width=&quot;400px&quot;&gt;&lt;div style=&quot;float: left; text-align: center; width: 200px;&quot;&gt;Original&lt;/div&gt;&lt;div style=&quot;float: right; text-align: center; width: 200px;&quot;&gt;Result From Take One&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;
&lt;h2&gt;Choose the closest colour…&lt;/h2&gt;&lt;p&gt;My idea this time was that instead of choosing an ordering of the pixels, it would be better to try to minimize the distance between the source and destination colours overall. The easiest way I could think of was to simply choose pixels at random, and assign them the “closest” colour remaining. Hopefully deviations would occur in all directions equally, so the average colour of a region would be as close as possible to the source. By popular demand, I will try to make this algorithm a little more explicit this time: &lt;br /&gt;
&lt;ol&gt;&lt;li&gt;Go through the pixels in the source image in a random order. &lt;br /&gt;
&lt;ol type=&quot;a&quot;&gt;&lt;li&gt;For each, select the closest remaining unused colour, by Euclidean distance between the coordinates in the colour space.&lt;/li&gt;
&lt;li&gt;Assign the found colour as the output colour of the pixel, and mark it used.&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/p&gt;&lt;h2&gt;But in which colour space?&lt;/h2&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL05rf9tu2ONRx9Id-iUk86HDhZQJCy-d_lnHVU0yJXEi_9Rm9esLxm90qVVv7vLh0QMdQn3P_7zRLVA64s8efjyi8jrjTxRnMBkZg_4vpKEOkwcbJ_LCdD7L8St5tKYVJUrWA2jdG-m_l/s1600-h/colour+spaces.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;150&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgL05rf9tu2ONRx9Id-iUk86HDhZQJCy-d_lnHVU0yJXEi_9Rm9esLxm90qVVv7vLh0QMdQn3P_7zRLVA64s8efjyi8jrjTxRnMBkZg_4vpKEOkwcbJ_LCdD7L8St5tKYVJUrWA2jdG-m_l/s400/colour+spaces.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;Sources: &lt;a href=&quot;https://en.wikipedia.org/wiki/File:RGB_farbwuerfel.jpg&quot;&gt;one&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/File:Hsl-hsv_models.svg&quot;&gt;two&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
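&lt;p&gt;The steps above can be sketched as a toy Python version. This is a brute-force O(n&lt;sup&gt;2&lt;/sup&gt;) illustration for small inputs only - the full 16.7M-pixel case needs a spatial index (more on that in the postscript) - and the function names are mine.&lt;/p&gt;

```python
import random

def recolour(pixels, palette, seed=0):
    """Greedily assign each pixel the closest unused palette colour.

    pixels and palette are equal-length lists of (r, g, b) tuples.
    Toy version: finds the nearest remaining colour by linear scan.
    """
    rng = random.Random(seed)
    remaining = list(palette)
    output = [None] * len(pixels)
    order = list(range(len(pixels)))
    rng.shuffle(order)                      # step 1: random pixel order
    for i in order:
        px = pixels[i]
        # step 1a: closest remaining colour by Euclidean distance
        j = min(range(len(remaining)),
                key=lambda k: sum((a - b) ** 2
                                  for a, b in zip(px, remaining[k])))
        output[i] = remaining.pop(j)        # step 1b: assign, mark used
    return output
```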
&lt;p&gt;A key question I had was which colour space would be best for preserving hues. There are a number of different “colour solids” that I could use coordinates from, with RGB being only one of many. I had a strong suspicion that something like HSL would do better than using RGB directly, but the easiest way to find out was to do a direct comparison. I tried the RGB cube as well as HSL and HSV cylinders for the comparison. My test images are presented below. &lt;/p&gt;&lt;br /&gt;
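&lt;p&gt;Concretely, &quot;Euclidean distance in a colour space&quot; means first embedding each colour as a point in that solid. For the HSV cylinder, that might look like the following Python sketch (one plausible embedding - hue is an angle, so it can&#39;t be used as a linear coordinate directly; this is my formulation, not necessarily the program&#39;s exact one):&lt;/p&gt;

```python
import colorsys
import math

def hsv_cylinder(r, g, b):
    """Embed an RGB colour (components in 0..1) as a point in the HSV
    cylinder: hue becomes an angle, saturation the radius, and value
    the axis. Euclidean distance between these points then respects
    the cylinder's geometry (e.g. hue wraps around correctly)."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (s * math.cos(2 * math.pi * h),
            s * math.sin(2 * math.pi * h),
            v)

def hsv_distance(c1, c2):
    return math.dist(hsv_cylinder(*c1), hsv_cylinder(*c2))
```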
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoadbu-wXRqQfMMdiADlbKS1U24qESKnz9qBreSg3xRLMZhX8-BAwUqrfIyFpHSAvBg0kYq8GtrdJNfCA4H7_XUtqlwMxIYq5-t_2nMsBErCvh2l7xcgnwSpQJe_hTk_o4nR7e3xLsKGA4/s1600-h/2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoadbu-wXRqQfMMdiADlbKS1U24qESKnz9qBreSg3xRLMZhX8-BAwUqrfIyFpHSAvBg0kYq8GtrdJNfCA4H7_XUtqlwMxIYq5-t_2nMsBErCvh2l7xcgnwSpQJe_hTk_o4nR7e3xLsKGA4/s200/2.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkCCsVyv0Qj3AfOrq3RxV-t90SA5wtSzUkDC9P5M_6971ZKKwNmD417xb3MSQi-TAXKstkQEZYZrb4GenH8wdyBxAQMo7NtMOkVbu7LWA5tGQCOnttazIRvPI1F_-HE1WjlxNm2gbUiTNT/s1600-h/2_allRGBv2_RGB.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkCCsVyv0Qj3AfOrq3RxV-t90SA5wtSzUkDC9P5M_6971ZKKwNmD417xb3MSQi-TAXKstkQEZYZrb4GenH8wdyBxAQMo7NtMOkVbu7LWA5tGQCOnttazIRvPI1F_-HE1WjlxNm2gbUiTNT/s200/2_allRGBv2_RGB.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;div style=&quot;float: left; text-align: center; width: 200px;&quot;&gt;Original&lt;/div&gt;&lt;div style=&quot;float: right; text-align: center; width: 200px;&quot;&gt;RGB&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDHqag23pojoFAG6WzmNbCDI9FC2I-rK-VdfyAHB0TAuI1EAfyw-T3YtVqtGMxrpflL_Wj-rU8euoxbugrJVYVtk29NI9nXeWJzobKuH-7YFqt7N8KisRG3aC_HkLVltc0bdIsNeq7bcYn/s1600-h/2_allRGBv2_HSL.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDHqag23pojoFAG6WzmNbCDI9FC2I-rK-VdfyAHB0TAuI1EAfyw-T3YtVqtGMxrpflL_Wj-rU8euoxbugrJVYVtk29NI9nXeWJzobKuH-7YFqt7N8KisRG3aC_HkLVltc0bdIsNeq7bcYn/s200/2_allRGBv2_HSL.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFrcjkEOzjNBKxqB542NNhADkV1tG56tVohSiNbtNRxr6gIMQGw8FshUfT4Za_xxue9E5JaOWPmT8ymsguNECfPoHxEopBzUPwdc12y5J2GvGzwVCDbgmO92YOYAmXtm2B4DBLjDaYpezr/s1600-h/2_allRGBv2_HSV.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFrcjkEOzjNBKxqB542NNhADkV1tG56tVohSiNbtNRxr6gIMQGw8FshUfT4Za_xxue9E5JaOWPmT8ymsguNECfPoHxEopBzUPwdc12y5J2GvGzwVCDbgmO92YOYAmXtm2B4DBLjDaYpezr/s200/2_allRGBv2_HSV.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;div style=&quot;float: left; text-align: center; width: 200px;&quot;&gt;HSL&lt;/div&gt;&lt;div style=&quot;float: right; text-align: center; width: 200px;&quot;&gt;HSV&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;As you can see, HSL and HSV give essentially the same results, which are both much better than RGB (look closely at the wheel wells, or the buildings in the trees on the right to see the differences). I like to think that HSV is slightly better, but I might be imagining differences that really aren’t there. Either way, I chose to use HSV for the final copy. &lt;/p&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://allrgb.com/barcelona&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv0v_Fpb9sKglVAH0F3OqtRYDRHeRdNN02yrcmIjZvKwCX6DKQ7LCewJi_oFSbEPF3Et59rjfbe7c2NNfwxxzEvnze2V4Z5Avbd1WpjdA6jDBwijuEdYnmsV3aF3PAXH_tl7ulxdUQg0uP/s320/2_allRGBv2_big_downsampled.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;p&gt;Looks good! Certainly a lot closer to the source image – I’m satisfied with this one for now. &lt;/p&gt;&lt;h2&gt;Postscript&lt;/h2&gt;&lt;p&gt;As with last time I am using a conceptually simple algorithm; this time, however, the implementation was considerably more difficult. The problem is that choosing the closest remaining colour to a source pixel is hard to do efficiently, especially since the set of candidate colours changes at every step. I wrote the code in C# for performance this time, but I have still had to spend quite a few hours optimizing the code to get the program to finish at all. The final version can take 30+ hours to generate an image, and peak at over 4 GB of RAM. I based my code around a &lt;a href=&quot;http://www.cs.wlu.edu/%7Elevy/software/kd/&quot;&gt;KD-tree I found&lt;/a&gt; online, then rewrote it to optimize for the 3D, single-nearest-neighbour case as well as to support branch pruning on delete. The rewritten tree – as well as the rest of my code – is available in a repository on GitHub: &lt;a href=&quot;https://github.com/EricBurnett/AllRGBv2&quot;&gt;http://github.com/EricBurnett/AllRGBv2&lt;/a&gt;. 
Feel free to try it out for yourself - if you do, I’d love to hear about it!&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/8407591851905561974/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/02/visualizing-rgb-take-two.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/8407591851905561974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/8407591851905561974'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/02/visualizing-rgb-take-two.html' title='Visualizing RGB, take two'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv0v_Fpb9sKglVAH0F3OqtRYDRHeRdNN02yrcmIjZvKwCX6DKQ7LCewJi_oFSbEPF3Et59rjfbe7c2NNfwxxzEvnze2V4Z5Avbd1WpjdA6jDBwijuEdYnmsV3aF3PAXH_tl7ulxdUQg0uP/s72-c/2_allRGBv2_big_downsampled.png" height="72" width="72"/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-2507128517942818763</id><published>2010-02-12T19:45:00.000-05:00</published><updated>2010-02-12T19:45:15.875-05:00</updated><title type='text'>Buzz: The perfect spam platform?</title><content type='html'>Buzz is out, genius or a privacy fiasco or just another me-too, depending on your view. Those topics have been covered to death already, but one thing I haven&#39;t seen talked about is how easy it makes spamming. 
And I don&#39;t mean shouting inanities to all your friends - that&#39;s what it&#39;s for, after all - I&#39;m talking about targeted spam by spammers, like the blog spam that used to be a huge problem.&lt;br /&gt;
&lt;br /&gt;
Here is the problem as I see it:&lt;br /&gt;
&lt;ol&gt;&lt;li&gt;You can follow anyone you want simply by adding their email address.&lt;/li&gt;
&lt;li&gt;When they &quot;buzz&quot;, you are notified.&lt;/li&gt;
&lt;li&gt;When you comment, &lt;i&gt;they see your response&lt;/i&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;br /&gt;
Does this seem like a bad idea to anyone else? I have a feeling that Google is going to have to allow people to &quot;accept&quot; followers, bringing it that much closer to Facebook.</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/2507128517942818763/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/02/buzz-perfect-spam-platform.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2507128517942818763'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2507128517942818763'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/02/buzz-perfect-spam-platform.html' title='Buzz: The perfect spam platform?'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-3251657052186216410</id><published>2010-02-12T15:31:00.001-05:00</published><updated>2021-07-17T22:41:14.608-04:00</updated><title type='text'>Straining the limits of C#</title><content type='html'>&lt;p&gt;Two weeks ago I &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/01/visualizing-rgb.html&quot;&gt;wrote about&lt;/a&gt; an algorithm to generate All-RGB images from pictures. I am currently working on a follow-up post about a new algorithm, in C# this time. This one is a bit more computationally intensive, and despite the language change it is running into scaling issues. 
So while I wait for it to finish, I thought I&#39;d write about a few of them.&lt;/p&gt;&lt;h2&gt;Good data structures are hard to find&lt;/h2&gt;&lt;p&gt;When you start processing large numbers of items in different ways, choosing the right data structure to store them becomes an absolute necessity. They can mean the difference between an O(n&amp;#8729;log(n)) and an O(n&lt;sup&gt;2&lt;/sup&gt;) algorithm, which can be the difference between taking 1 hour to run or 100 years. For this project, the requirements were simple – a data structure mapping points to objects that supported nearest-neighbour searches and deletion. To me, that immediately translated to a &lt;a href=&quot;https://en.wikipedia.org/wiki/Kd-tree&quot;&gt;kd-tree&lt;/a&gt;.&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;Usually in cases like this I end up needing to roll my own structure, but this time I was lucky. After some Googling I found exactly &lt;a href=&quot;http://www.cs.wlu.edu/~levy/software/kd/&quot;&gt;one&lt;/a&gt; viable implementation to use, and better yet, it was open source. I&#39;m glad it was; it turned out later that there was a bug&lt;sup&gt;&lt;a href=&quot;#KDTreeBug&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; that needed fixing, and I needed to compile a 64-bit version anyways (I wonder if there&#39;s a lesson in here?). It is unfortunate that this was the only option, however. I mean, there are a ton of data structure libraries for most languages you can imagine, but the vast majority of them implement the same set of structures, are buggy, unsupported, and incompatible. I would love to see a Stack Overflow-style site to address this – community edited and supported code, implementations ranked together for comparison, searching by requirements if you don&#39;t know what you need, and the list goes on.&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;But even with the appropriate structure, the algorithm I have chosen will take more than a day to run and use 4+ GB of memory. That is fine, I knew the approximate complexity when I started, but it does lead to the next set of issues.&lt;/p&gt;&lt;h2&gt;Good algorithms are hard to find&lt;/h2&gt;&lt;p&gt;Or should I say, good implementations of algorithms are hard to find. By way of introduction, a brief digression: my computer is unstable. Not terribly unstable, not enough for me to actually take the time to fix it, but my sound card is slightly unsupported on Windows 7 so every once in a blue moon something like skipping a song will blue-screen the computer. Just about all my programs know how to pick up where they left off, but of course that doesn&#39;t hold for these projects I throw together in an evening. So when my computer decided to crash this morning, I decided to add some basic checkpointing. Checkpointing is easy, right? Hah!&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;&lt;u&gt;Attempt 1&lt;/u&gt;: tag classes with [Serializable], run needed structures through a BinaryFormatter, streaming to file.&lt;/p&gt;&lt;p&gt;So, anyone want to guess what the problem is here? If you said object-graph size, you&#39;re right on the money. BinaryFormatter doesn&#39;t support object graphs with more than 6M items or so, and arrays get no special treatment. So serializing an array of 16.7M items throws a very unhelpful error message (&amp;quot;The internal array cannot expand to greater than Int32.MaxValue elements&amp;quot;)&lt;sup&gt;&lt;a href=&quot;#BinaryFormatterBug&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Fine, I can unwrap my own arrays easily enough.&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;&lt;u&gt;Attempt 2&lt;/u&gt;: unwrap arrays manually.&lt;/p&gt;&lt;p&gt;With each array element being serialized as a separate object, the overhead in the file is huge. If I had to guess, I&#39;d say that the size on disk is about 10 times the size in memory. And since I&#39;m trying to write about 1 GB of data...you can probably guess where this is going. Something in the output stack explodes when more than 4 GB of data is written, a number suspiciously close to the max size of an Int32. This is simply a poor implementation, since it&#39;s not like I&#39;m trying to mmap the data, and large files have been supported in all modern OSes for years. Not a big deal though, that data is going to be very redundant and I/O is expensive, so writing a compressed stream is probably faster in the long run.&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;&lt;u&gt;Attempt 3&lt;/u&gt;: write to the file using a System.IO.Compression.GZipStream wrapper.&lt;/p&gt;&lt;p&gt;With compressed data, I expect the on-disk size to be comparable to the in-memory size, or a bit better. So the 4 GB limit should be no problem, right? Wrong! The GZipStream has the same problem, and refuses to support more than 4 GB uncompressed. The fix here is even simpler – swap in a better GZip library.&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;&lt;u&gt;Attempt 4&lt;/u&gt;: write to the file using a &lt;a href=&quot;http://www.icsharpcode.net/OpenSource/SharpZipLib/&quot;&gt;SharpZipLib&lt;/a&gt;.GZipOutputStream wrapper.&lt;/p&gt;&lt;p&gt;Success! The output file is about 700 MB and takes somewhere around 20 minutes to write, for a compression rate of about 9 MB/sec and space savings of about 93%.&lt;/p&gt;&lt;br /&gt;
&lt;p&gt;Now, I could chalk these problems up as a failing of C#, but that wouldn&#39;t be accurate. By playing with this much data I am working outside the limits expected by the library designers, and I know it. I have focused on C#, but the issues are far more general than that – I can&#39;t even find a 64-bit version of python 2.6 for Windows to test with at all, but I&#39;m sure I would run into a different set of problems if I could use it, and the same goes for the rest of the languages out there. The real issue is that versatile algorithm implementation is &lt;i&gt;hard&lt;/i&gt;, and not getting much easier with time. And that I don&#39;t have a workaround for.&lt;/p&gt;&lt;h2&gt;&lt;span style=&quot;color: rgb(102, 102, 102);&quot; &gt;&lt;br /&gt;
Footnotes&lt;/span&gt;&lt;/h2&gt;&lt;p&gt;&lt;ol&gt;&lt;li&gt;&lt;a name=KDTreeBug&gt;&lt;/a&gt;The problem is that &amp;quot;deletions&amp;quot; are supported by tombstoning, so you periodically have to rebuild the index to clean them up. That is fine, except the range() method used to get the current entries out doesn&#39;t check the deleted flag! Don&#39;t worry, I&#39;ll be sending a fix back upstream.&lt;/li&gt;
&lt;li&gt;&lt;a name=BinaryFormatterBug&gt;&lt;/a&gt;Someone else did the digging, and it seems the list of prime sizes for some internal hash table stops at 6 million, so the next prime it tries to resize to is something enormous (-1 unsigned?). Microsoft decided this is a non-issue, so no fix is coming. Their suggested workaround was to use the NetDataContractSerializer and a binary xml writer, but when I tested it the performance was too terrible to consider.&lt;/li&gt;
&lt;/ol&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/3251657052186216410/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/02/straining-limits-of-c.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3251657052186216410'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3251657052186216410'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/02/straining-limits-of-c.html' title='Straining the limits of C#'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-3628252323629566646</id><published>2010-01-31T02:50:00.009-05:00</published><updated>2021-07-17T22:39:23.769-04:00</updated><title type='text'>Visualizing RGB</title><content type='html'>&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://allrgb.com/mandelbrot&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSqtyg_qgjbtI2YcgCPviRRTYcprDsScd2SbHFA01OE4PRgdZbHuOKyJEXYrCilAE3g9uZ8BFe1pJLif1Z32M_7yFy0vPSDbkKn_AjHpaOubpRrpWR5ahmLKMsBGxwe7dRpMeRGQypiabc/s320/10_allrgbify_downsampled.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;b&gt;Update:&lt;/b&gt; See the follow-up post &lt;a href=&quot;http://www.thelowlyprogrammer.com/2010/02/visualizing-rgb-take-two.html&quot;&gt;here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;http://corte.si/index.html&quot;&gt;Aldo Cortesi&lt;/a&gt; posted a link today to &lt;a href=&quot;http://allrgb.com/&quot;&gt;allrgb.com&lt;/a&gt;, a site dedicated to images visualizing the RGB colour space - in particular, 4096x4096 images that use each RGB value exactly once. Inspired by his &lt;a href=&quot;http://corte.si/posts/code/hilbert/portrait/index.html&quot;&gt;Hilbert curve visualization&lt;/a&gt; and the urge to spend a day programming, I present to you: the all-RGB Mandelbrot set.&lt;br /&gt;
&lt;h2&gt;Sort by colour...&lt;/h2&gt;My idea was this: instead of trying to visualize the colour space directly, why not use a base image for the &quot;shape&quot;, and then map the RGB spectrum onto it? I thought that if I could find an image with an even spread of colours, this would let me make each pixel unique yet keep the overall look untouched.&lt;br /&gt;
To perform this mapping, I chose to define an ordering based on the 3 dimensional Hilbert curve. &lt;a href=&quot;http://corte.si/posts/code/hilbert/portrait/index.html&quot;&gt;Cortesi explains it&lt;/a&gt; far better than I can, but the basic idea is this: the Hilbert curve can be used to find an ordering of all 16.8 million colours so that if you were to stretch them out on a line, every colour would be there and they would flow smoothly from one to another. Like this, except a lot smoother and a &lt;i&gt;lot&lt;/i&gt; longer.&lt;br /&gt;
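A full 3D Hilbert index takes a fair amount of code, but the idea of linearizing the colour cube can be sketched with a simpler space-filling order. This is a minimal Python sketch using Morton (Z-order) bit interleaving as a stand-in for the Hilbert curve the post actually uses: it likewise gives every 24-bit colour a unique position on one long line, just with more abrupt jumps between neighbouring colours.

```python
def morton_key(r, g, b):
    """Interleave the bits of an (r, g, b) triple into one integer.

    This is the Morton (Z-order) curve -- a simpler space-filling order
    used here as a stand-in for the 3D Hilbert curve described in the
    post. It also visits every 24-bit colour exactly once, just less
    smoothly than the Hilbert ordering does.
    """
    key = 0
    for bit in range(8):  # 8 bits per channel
        for i, channel in enumerate((r, g, b)):
            key |= ((channel >> bit) & 1) << (3 * bit + i)
    return key

# Every colour maps to a distinct position on the line:
keys = {morton_key(r, g, b)
        for r in range(16) for g in range(16) for b in range(16)}
assert len(keys) == 16 ** 3
```

The real script prefers the Hilbert ordering precisely because it avoids Morton's large jumps, so colours adjacent on the line stay perceptually close.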
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio9Quu4IpDhOislw1xPVea8NZk9ianxbiPxpq0P1ekaGB76Bj7ejadnljyuXCrD5aJgFLzaIhQ39yBRa6RJACrmx_mADFAtgH_hswQPOsHXVXJLF-9O7reHmeqPYEyj2NZ-v_dHFtVrRHG/s320/hilbertLine.png&quot; /&gt;&lt;/div&gt;With this ordering in hand it is easy to find the index into this line for each pixel in the source image, sort the pixels, and then assign them the colour they line up with.&lt;br /&gt;
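The sort-and-assign step can be sketched in a few lines of Python. This is a hedged, self-contained sketch (not the actual gist code): `key` stands in for the Hilbert ordering function, and a tiny colour cube stands in for the full 2&lt;sup&gt;24&lt;/sup&gt;-colour spectrum that a 4096x4096 image pairs with.

```python
def remap_to_all_rgb(pixel_colours, key):
    """Assign each source pixel a unique colour from the full spectrum.

    pixel_colours: the source image's colours, one per pixel, in raster order.
    key:           total ordering on colours -- the 3D Hilbert index in the
                   post, but any ordering function shows the mechanics.

    Pixels are ranked by where their original colour falls along the curve,
    then matched one-to-one with the spectrum ranked the same way, so the
    result uses every spectrum colour exactly once.
    """
    n = len(pixel_colours)
    # Demo spectrum: a small colour cube. The post pairs a 4096x4096 image
    # with all 256**3 RGB values, so pixel count == colour count.
    side = round(n ** (1 / 3))
    assert side ** 3 == n, "pixel count must equal the number of colours"
    spectrum = sorted(((r, g, b) for r in range(side)
                       for g in range(side) for b in range(side)), key=key)
    order = sorted(range(n), key=lambda i: key(pixel_colours[i]))
    out = [None] * n
    for rank, i in enumerate(order):
        out[i] = spectrum[rank]  # rank-th pixel gets the rank-th curve colour
    return out
```

At full scale this is dominated by a sort over 16.8 million pixels, which fits the half-hour runtime mentioned in the postscript.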
&lt;h2&gt;Choosing an image&lt;/h2&gt;When I started this morning, I had the idea that the output images would look reasonably close to the source images. I was half right; the images certainly have all the same features as before, but the colouring is all wrong. In hindsight the reason is obvious - unless the original had a perfectly even spectrum of colours, the mapping would be stretched in some places and shifted in others, and in general not line up nicely.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdstMnzqAYdy8qkoK_1ZrVhQTXbOdP_x2ab1OQ0D3ANJ9LHPIsGRfysc2j57qWx_P3bkPULyOYmidqJB_A4SDrUpMt0vKrAiL1KDZUIodmN9EOoGAi9pQzpBi2FPywdwvzuZxCZmLh5UVx/s1600-h/2_small.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdstMnzqAYdy8qkoK_1ZrVhQTXbOdP_x2ab1OQ0D3ANJ9LHPIsGRfysc2j57qWx_P3bkPULyOYmidqJB_A4SDrUpMt0vKrAiL1KDZUIodmN9EOoGAi9pQzpBi2FPywdwvzuZxCZmLh5UVx/s200/2_small.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh-h4lod0I53U8-LDs58TO0KFl1_JfMrhg6Brv-QFl4WKYSUKpvY53G6kyxX-loDcU3z_risMJz0Tlc7vKYnw-cwm8BG9eJLTwExv7ItD6wrx9EmbpVaKRSLUwN7wPNFGmn1dBh7o20rNc/s1600-h/2_allrgbify_small.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;200&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh-h4lod0I53U8-LDs58TO0KFl1_JfMrhg6Brv-QFl4WKYSUKpvY53G6kyxX-loDcU3z_risMJz0Tlc7vKYnw-cwm8BG9eJLTwExv7ItD6wrx9EmbpVaKRSLUwN7wPNFGmn1dBh7o20rNc/s200/2_allrgbify_small.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
While interesting, this wasn&#39;t exactly what I was going for. Hmm...what image could I use where it wouldn&#39;t matter if the colours were all shifted? The first thing that came to mind was a visualization like the Mandelbrot set, where the colours are arbitrarily chosen anyways. A quick Google search found me &lt;a href=&quot;http://gwydir.demon.co.uk/jo/numbers/interest/i.htm#mandel&quot;&gt;this&lt;/a&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://gwydir.demon.co.uk/jo/numbers/interest/i.htm#mandel&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;233&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZYGVVuSvzlJWwb2i8qoyqkLLXB8vvkEFJ0cq5teuSg2lW0j46h6i8Zrs2ujCY3HVRM7tmqHtDpGeede87OjxuYWCed37h7DLUjZ4Qc3pv27okf7atc6aA3v1LyxApINDvYuQwaX1qzab_/s400/mandelbrot.jpg&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
Which, when transformed, came out as this:&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://allrgb.com/mandelbrot&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSqtyg_qgjbtI2YcgCPviRRTYcprDsScd2SbHFA01OE4PRgdZbHuOKyJEXYrCilAE3g9uZ8BFe1pJLif1Z32M_7yFy0vPSDbkKn_AjHpaOubpRrpWR5ahmLKMsBGxwe7dRpMeRGQypiabc/s320/10_allrgbify_downsampled.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;Perfect!&lt;br /&gt;
&lt;h2&gt;Postscript&lt;/h2&gt;I created all the visualizations here using Python 2.6 and the Python Imaging Library. It isn&#39;t the most efficient code (it takes half an hour to render a 4096x4096 image), but the quick development time easily makes up for it. If anyone is interested in playing with it, I have placed the code &lt;a href=https://gist.github.com/300780&gt;on github&lt;/a&gt;. Or if you just want to see what something would look like, feel free to &lt;a href=&quot;mailto:ericburnett@gmail.com&quot;&gt;send it to me&lt;/a&gt; and I&#39;d be happy to run it for you.</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/3628252323629566646/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/01/visualizing-rgb.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3628252323629566646'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/3628252323629566646'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/01/visualizing-rgb.html' title='Visualizing RGB'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSqtyg_qgjbtI2YcgCPviRRTYcprDsScd2SbHFA01OE4PRgdZbHuOKyJEXYrCilAE3g9uZ8BFe1pJLif1Z32M_7yFy0vPSDbkKn_AjHpaOubpRrpWR5ahmLKMsBGxwe7dRpMeRGQypiabc/s72-c/10_allrgbify_downsampled.png" height="72" 
width="72"/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-2872458671001020305</id><published>2010-01-30T14:22:00.001-05:00</published><updated>2021-07-17T22:27:24.474-04:00</updated><title type='text'>Moved!</title><content type='html'>&lt;p&gt;This blog has moved! It now lives at &lt;a href=https://www.thelowlyprogrammer.com&gt;www.thelowlyprogrammer.com&lt;/a&gt;. As is probably obvious, I have also changed the name to &lt;i&gt;The Lowly Programmer&lt;/i&gt; (see the updated &lt;a href=https://www.thelowlyprogrammer.com/2008/05/about-me.html&gt;about me&lt;/a&gt; for details). Nothing else is changing at the moment; in particular, I doubt I will be posting to it with any regularity yet :).&lt;br /&gt;
&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/2872458671001020305/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/01/moved.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2872458671001020305'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2872458671001020305'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/01/moved.html' title='Moved!'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-2343174528989396499</id><published>2010-01-16T16:29:00.001-05:00</published><updated>2021-07-17T22:27:12.055-04:00</updated><title type='text'>The value of a point</title><content type='html'>&lt;span xmlns=&#39;&#39;&gt;&lt;p&gt;Recently I have been spending a lot of time on &lt;a href=&quot;https://news.ycombinator.com/&quot;&gt;Hacker News&lt;/a&gt;. I&#39;ve lurked for quite a while, but it is only in the last month or so that I have actively started to comment. I usually try for informative comments rather than going for the inflammatory or popular topics. This only gets me a &lt;a href=&quot;http://www.randsinrepose.com/archives/2009/12/13/gaming_the_system.html&quot;&gt;point&lt;/a&gt; or two per comment I post, but these are points I have &lt;i&gt;earned&lt;/i&gt; by contributing to the discussion in a positive way. 
I had almost reached 50 points this morning and was feeling quite good about my contributions overall.&lt;br /&gt;
&lt;/p&gt;&lt;p&gt;But then I made a submission.&lt;br /&gt;
&lt;/p&gt;&lt;p&gt;Most of my submissions go the way of my comments, earning me a point or two plus some interesting responses. It&#39;s these responses I am after; I find an interesting article somewhere, so I submit it to see what the HN community has to say about it. But &lt;a href=&quot;https://news.ycombinator.com/item?id=1057133&quot;&gt;this one&lt;/a&gt;, for whatever reason, the community really appreciated, and within 20 minutes it was at the top of the front page. The discussion was lively and as interesting to read as I could hope for, and it was raking in the points.&lt;br /&gt;
&lt;/p&gt;&lt;p&gt;And therein lies the problem. Someone, a self-proclaimed member of HN no less, took the time and effort to write a thought-provoking piece, and there I was earning scores of points off it, which left me feeling like a bit of a cheat. But even had I been the original author, I don&#39;t think I would want that many points for it. By the time it falls off the front page it will likely be over 50 points itself, surpassing a month&#39;s effort in one fell swoop and devaluing the points accordingly. I was looking at points as a general reflection of my value to the site, but I cannot do that anymore, at least not directly.&lt;br /&gt;
&lt;/p&gt;&lt;p&gt;So what can be done about the &#39;problem&#39; with points? I can think of a few suggestions.&lt;br /&gt;
&lt;ol start=&quot;1&quot; style=&quot;margin-top: 0in;&quot; type=&quot;1&quot;&gt;&lt;li&gt;Nothing. After all, I am only a single opinion, and the system seems to be working fine as it is. &lt;/li&gt;
&lt;li&gt;Don&#39;t count any points past a certain cutoff. They would still be useful for display and ranking, but for karma the extra would be discarded. I kind of like this idea, but any cutoff would be an arbitrary one. &lt;a href=&quot;https://stackoverflow.com/&quot;&gt;Stack Overflow&lt;/a&gt; uses a system like this, although there it&#39;s points per day rather than per action.&lt;/li&gt;
&lt;li&gt;Count points on a log scale. Display them as normal, but change the way karma is calculated. This would help address crazy outliers like &lt;a href=&quot;https://news.ycombinator.com/item?id=1048800&quot;&gt;this one&lt;/a&gt;, and in general promote sustained quality over hot-button topics (for people actively trying for points).&lt;/li&gt;
&lt;/ol&gt;I know &lt;a href=&quot;http://www.paulgraham.com/&quot;&gt;pg&lt;/a&gt; is always tweaking the system to make it work better; it will be interesting to see what, if anything, actually changes in the next while.&lt;/p&gt;&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/2343174528989396499/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2010/01/value-of-point.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2343174528989396499'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2343174528989396499'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2010/01/value-of-point.html' title='The value of a point'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-6239963055035128102</id><published>2009-06-06T04:13:00.001-04:00</published><updated>2009-06-06T04:14:36.278-04:00</updated><title type='text'>Google Wave. Can you imagine?</title><content type='html'>&lt;span xmlns=&#39;&#39;&gt;&lt;p&gt;I&#39;m sure you&#39;ve all heard of &lt;a href=&#39;http://wave.google.com/&#39;&gt;Google Wave&lt;/a&gt; by now, but for those who haven&#39;t, think of a hybrid email-chat-collaboration program all rolled into one. Participants can message back and forth on their own time like email, but it&#39;s also much more than that. 
As you type, your keystrokes are streamed in real-time to everyone else, letting them read what you write &lt;em&gt;as you write it&lt;/em&gt;. Couple that with the ability to comment and edit others&#39; comments, and what do you get? I&#39;m not sure, but we&#39;ll find out soon enough.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;I don&#39;t have any earth-shattering insights about how this product is going to revolutionize how we communicate (or fail completely), but I would like you to consider one scenario. Imagine a group of Digg (or 4chan or reddit or...) users, all getting onto a wave at once. They&#39;d be able to fight for the top, change each others&#39; comments to include &lt;em&gt;even more memes&lt;/em&gt;, spam, complain, and of course mock, and all in real-time. Pedobears would be flying, grammar would be mutilated in the name of speed, and still people would be replying in anger to posts not even finished. Would it be fun? Quite possibly. Carnage? Undoubtedly.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Welcome to the internet, Wave.&lt;/p&gt;&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/6239963055035128102/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2009/06/google-wave-can-you-imagine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/6239963055035128102'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/6239963055035128102'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2009/06/google-wave-can-you-imagine.html' title='Google Wave. 
Can you imagine?'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-2404775663018489669</id><published>2008-09-17T20:10:00.002-04:00</published><updated>2008-09-17T20:13:16.305-04:00</updated><title type='text'>How fast do you type?</title><content type='html'>&lt;span xmlns=&#39;&#39;&gt;&lt;p&gt;According to &lt;a href=&#39;http://www.typingtest.com&#39;&gt;typingtest.com&lt;/a&gt;, I can type at 58 words per minute with 97% accuracy, which is probably about average among my roommates. But that is &lt;em&gt;words&lt;/em&gt; per minute; I am a programmer, and I spend as much time typing code filled with hyphens and tildes and ampersands as typing &quot;words&quot;. And I must admit, I never learned to touch-type numbers or symbols, so typing code involves a lot of glancing down at the keyboard to find out exactly where a symbol is. Try it yourself – here is a string I find myself typing fairly often during a day at work (and no, it isn&#39;t confidential!):&lt;br /&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;dbisql.exe -c &quot;UID=dba;PWD=123;ENG=myEngine;DBF=..\testdb.db&quot;&lt;br /&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;How quick can you type that? And I will point out that this string doesn&#39;t even have any really difficult characters, just lots of punctuation. 
&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;a href=&#39;http://steve-yegge.blogspot.com&#39;&gt;Steve Yegge&lt;/a&gt; wrote a really good rant recently on the subject of typing (or as he calls it, &lt;a href=&#39;http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html&#39;&gt;Programming&#39;s Dirtiest Little Secret&lt;/a&gt;), and it got me thinking. As good as I am at typing normal text, I fall apart when I run into anything else. Before reading his rant, I would never have thought of putting practicing typing on my professional development list, despite the fact that it will serve me better than any of the books I&#39;d read or languages I&#39;d learn. But right now I am going to put learning python on hold (again) and spend some quality time with my old friend Mavis Beacon.&lt;/p&gt;&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/2404775663018489669/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2008/09/how-fast-do-you-type.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2404775663018489669'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/2404775663018489669'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2008/09/how-fast-do-you-type.html' title='How fast do you type?'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-1658695802214436730</id><published>2008-05-31T18:50:00.004-04:00</published><updated>2008-05-31T19:20:37.627-04:00</updated><title type='text'>What’s next for copyright?</title><content type='html'>&lt;span xmlns=&#39;&#39;&gt;&lt;p&gt;Everyone knows there is a copyright war going on right now. Big companies own the copyright to virtually all music, movies, and television shows, and they want to make large profits off these rights. People, on the other hand, want to be able to play their media when and where they want, and ideally they&#39;d like to get it for free. This conflict shows up in a number of guises – battles over internet privacy, digital rights management, and bittorrent, to name a few – and shows no signs of being resolved any time soon. I do think there is hope for a solution, but the industry needs to accept that they can&#39;t control everything.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;To see it, let&#39;s take a look at one website that ignores copyright and gets away with it: YouTube. Users can upload and watch virtually anything, and due to the sheer number of uploads, copyright holders can&#39;t keep up. It survives because there is a big company protecting it from lawsuits, but that isn&#39;t why it&#39;s popular. The biggest feature here is ease of use – users go to YouTube because it is the best way to get the content they&#39;re looking for. YouTube already makes lots of money; if publishers were to learn from this, I think they&#39;d have a winning strategy.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Of course, the challenge is to turn this example into a general solution. I would say that big publishers can do it, but they need to change their business model.  
What they should do is try to control the brand and be the best for distribution, to make it as easy as possible for people to get the content they want. I&#39;ll give a couple examples of how I think this could go.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;I picture a website much like YouTube, but controlled by a publisher. Users can visit it to watch movies and TV shows for free, at their own leisure. To fund this they use the tried and true internet model: advertising. Most of it is general advertising, keyed to the content in the movie, and the rest is good for the publisher – collector&#39;s edition DVDs, upcoming movies with the same director, etcetera. Sure, people still pass movies around on bittorrent, but this is quick and convenient. Then persecuting people who share files isn&#39;t really necessary, since many of them will make it to the publisher&#39;s site on their own. Similarly, people won&#39;t fight as hard to protect other sites hosting the content, since there is an easy, legal, and free channel to get the same. Naturally this wouldn&#39;t be a replacement for physical media, but the shows will be on the internet regardless of what publishers want, so they may as well get a slice out of it.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;For music, this could go even further. A music company could run their own download site or tracker, putting up the content themselves. It would be high quality &lt;em&gt;official&lt;/em&gt; content, for free. They could try to get users to pay if they like the music, buy physical copies, etcetera. Remind you of anything? &lt;a href=&#39;http://www.techcrunch.com/2007/10/04/the-inevitable-march-of-recorded-music-towards-free/&#39;&gt;Radiohead has done this&lt;/a&gt;. &lt;em&gt;After&lt;/em&gt; this is in place, they could go after other distributors, to keep people coming to the &#39;official&#39; site. Again, people will share the music. But who cares? 
It&#39;s free unless they choose to pay, and let&#39;s face it, they are going to share anyways.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;In that case, why would this kind of scheme work? A number of reasons:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;It is easy to do. Laws don&#39;t need to be changed, and people won&#39;t fight it.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Profits are still good. Maybe not as high as they once were, but that industry is dying anyways. And compared to physical media, costs are virtually non-existent, so people wouldn&#39;t need to pay nearly as much per piece of media for this to work.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Publishers have already lost control; they just don&#39;t know it yet. If you can&#39;t control your users, make them &lt;em&gt;want&lt;/em&gt; to come to you.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;There are lots of signs that this kind of model is coming. The biggest problem is that it isn&#39;t being done by the publishers. Either artists are breaking off and trying it themselves, or companies are springing up to sign deals and try it for themselves. Somehow we need to get the publishers to try it, and I think the whole problem of draconian copyright enforcement will slowly disappear. 
Or at least, it won&#39;t be the users themselves getting slapped with lawsuits.&lt;/p&gt;&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/1658695802214436730/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2008/05/whats-next-for-copyright.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/1658695802214436730'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/1658695802214436730'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2008/05/whats-next-for-copyright.html' title='What’s next for copyright?'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-4189459878832255218</id><published>2008-05-29T22:16:00.005-04:00</published><updated>2012-08-16T04:14:15.436-04:00</updated><title type='text'>NOT Puzzle Solution</title><content type='html'>&lt;p&gt;&lt;i&gt;&lt;strong&gt;Edited 2012/08/16:&lt;/strong&gt; Updated problem attribution.&lt;/i&gt;&lt;/p&gt;&lt;p&gt;Last week, I closed with a puzzle: &quot;Invert three inputs using only two NOT gates.&quot; I should start off by pointing out that I cannot take credit for this problem; it dates back to at least the 70s and was &lt;a href=&quot;http://www.inwap.com/pdp10/hbaker/hakmem/boolean.html#item19&quot;&gt;featured in HAKMEM&lt;/a&gt; by Richard Schroeppel. I myself learned about it via &lt;a href=&quot;http://www.pldesignline.com/187202855&quot;&gt;Clive Maxfield&lt;/a&gt;. 
I&#39;m also told that Guy Steele beat me to the generalization to the &lt;strong&gt;n&lt;/strong&gt;-input case; however, I cannot verify this, and the solution presented here is entirely my own.&lt;/p&gt;&lt;p&gt;The solutions that follow are just two of many possible ways of looking at this. Structurally, they are by no means the simplest solutions, but I find them to be &lt;em&gt;conceptually&lt;/em&gt; the simplest. If you haven&#39;t tried to solve the puzzle yet, I heartily recommend it; it makes knowing the solution so much more rewarding if you have.&lt;/p&gt;&lt;h2&gt;Solution for 3 inputs&lt;/h2&gt;&lt;p&gt;My solution to this puzzle involves reasoning on the number of inputs with value &#39;1&#39;. If none of the inputs are 1s, then all the outputs must be, and vice versa. To do this, we create a set of temporary variables with the goal of knowing exactly how many inputs were 1.&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;3 := A &amp;and; B &amp;and; C&lt;br /&gt;
2or3 := (A &amp;and; B) &amp;or; (B &amp;and; C) &amp;or; (A &amp;and; C)&lt;br /&gt;
0or1 := ¬ 2or3&lt;br /&gt;
1 := 0or1 &amp;and; (A &amp;or; B &amp;or; C)&lt;br /&gt;
0or2 := ¬ (1 &amp;or; 3)&lt;br /&gt;
0 := 0or1 &amp;and; 0or2&lt;br /&gt;
2 := 0or2 &amp;and; 2or3&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;From here we can just build our outputs directly:&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;X = 0 &amp;or; (1 &amp;and; (B &amp;or; C)) &amp;or; (2 &amp;and; (B &amp;and; C))&lt;br /&gt;
Y = 0 &amp;or; (1 &amp;and; (A &amp;or; C)) &amp;or; (2 &amp;and; (A &amp;and; C))&lt;br /&gt;
Z = 0 &amp;or; (1 &amp;and; (A &amp;or; B)) &amp;or; (2 &amp;and; (A &amp;and; B))&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;Solution for &lt;strong&gt;n&lt;/strong&gt; inputs&lt;/h2&gt;&lt;p&gt;This proof sketch is for odd &lt;strong&gt;n. &lt;/strong&gt;If &lt;strong&gt;n&lt;/strong&gt; is even, just invert the last input directly with a NOT, and use this pattern for the first &lt;strong&gt;n&lt;/strong&gt;-1.&lt;/p&gt;&lt;p&gt;As before, we are going to reason on how many inputs (I&lt;sub&gt;1&lt;/sub&gt; to I&lt;sub&gt;n&lt;/sub&gt;) were 1. For this we will use a bit more syntax: {a-b} means anywhere from a to b inputs (inclusive) were 1, and {a, b} means either a &lt;em&gt;or&lt;/em&gt; b inputs were 1, with no other possibilities allowed. First, we can generate all the ranges ending in &lt;strong&gt;n&lt;/strong&gt; directly:&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;{1-&lt;strong&gt;n&lt;/strong&gt;} := (I&lt;sub&gt;1&lt;/sub&gt; &amp;or; I&lt;sub&gt;2&lt;/sub&gt; &amp;or; … &amp;or; I&lt;sub&gt;n&lt;/sub&gt;)&lt;br /&gt;
{2-&lt;strong&gt;n&lt;/strong&gt;} := ((I&lt;sub&gt;1&lt;/sub&gt; &amp;and; I&lt;sub&gt;2&lt;/sub&gt;) &amp;or; (I&lt;sub&gt;1&lt;/sub&gt; &amp;and; I&lt;sub&gt;3&lt;/sub&gt;) &amp;or; … &amp;or; (I&lt;sub&gt;n-1&lt;/sub&gt; &amp;and; I&lt;sub&gt;n&lt;/sub&gt;))&lt;br /&gt;
…&lt;br /&gt;
{&lt;strong&gt;n&lt;/strong&gt;-&lt;strong&gt;n&lt;/strong&gt;} := (I&lt;sub&gt;1&lt;/sub&gt; &amp;and; I&lt;sub&gt;2&lt;/sub&gt; &amp;and; … &amp;and; I&lt;sub&gt;n&lt;/sub&gt;)&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Clearly, this doesn&#39;t need any NOTs. Next, we are going to make all the ranges starting in 0 and ending in an odd number:&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;{0-1} := ¬ {2-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
{0-3} := ¬ {4-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
…&lt;br /&gt;
{0-(&lt;strong&gt;n&lt;/strong&gt;-2)} := ¬{(&lt;strong&gt;n&lt;/strong&gt;-1)-&lt;strong&gt;n&lt;/strong&gt;}&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This requires (&lt;strong&gt;n&lt;/strong&gt;-1)/2 NOT gates. Using these together, we can have variables for all the odd numbers of inputs directly:&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;1 := {0-1} &amp;and; {1-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
3 := {0-3} &amp;and; {3-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
…&lt;br /&gt;
&lt;strong&gt;n&lt;/strong&gt;-2 := {0-(&lt;strong&gt;n&lt;/strong&gt;-2)} &amp;and; {(&lt;strong&gt;n&lt;/strong&gt;-2)-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
&lt;strong&gt;n&lt;/strong&gt; := {&lt;strong&gt;n&lt;/strong&gt;-&lt;strong&gt;n&lt;/strong&gt;}&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Again, this doesn&#39;t require any NOTs. Lastly, we use one more NOT to let us get all the even values:&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;{0,2,4,…,(&lt;strong&gt;n&lt;/strong&gt;-1)} := ¬ (1 &amp;or; 3 &amp;or; … &amp;or; &lt;strong&gt;n&lt;/strong&gt;)&lt;br /&gt;
&lt;br /&gt;
0 := {0-1} &amp;and; {0,2,4,…,(&lt;strong&gt;n&lt;/strong&gt;-1)}&lt;br /&gt;
2 := {0-3} &amp;and; {0,2,4,…,(&lt;strong&gt;n&lt;/strong&gt;-1)}  &amp;and; {2-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
4 := {0-5} &amp;and; {0,2,4,…,(&lt;strong&gt;n&lt;/strong&gt;-1)} &amp;and; {4-&lt;strong&gt;n&lt;/strong&gt;}&lt;br /&gt;
…&lt;br /&gt;
&lt;strong&gt;n&lt;/strong&gt;-1 := {0,2,4,…,(&lt;strong&gt;n&lt;/strong&gt;-1)}  &amp;and; {(&lt;strong&gt;n&lt;/strong&gt;-1)-&lt;strong&gt;n&lt;/strong&gt;}&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;There! Now we have a variable for each possible number of inputs with value 1, so we can just plug in all possibilities to get our outputs:&lt;/p&gt;&lt;blockquote&gt;&lt;p style=&quot;font-family: sans-serif&quot;&gt;O&lt;sub&gt;1&lt;/sub&gt; = 0 &amp;or; (1 &amp;and; (I&lt;sub&gt;2&lt;/sub&gt; &amp;or; I&lt;sub&gt;3&lt;/sub&gt; &amp;or; … &amp;or; I&lt;sub&gt;n&lt;/sub&gt;)) &amp;or; … &amp;or; (&lt;strong&gt;n&lt;/strong&gt;-1 &amp;and; (I&lt;sub&gt;2&lt;/sub&gt; &amp;and; I&lt;sub&gt;3&lt;/sub&gt; &amp;and; … &amp;and; I&lt;sub&gt;n&lt;/sub&gt;))&lt;br /&gt;
O&lt;sub&gt;2&lt;/sub&gt; = 0 &amp;or; (1 &amp;and; (I&lt;sub&gt;1&lt;/sub&gt; &amp;or; I&lt;sub&gt;3&lt;/sub&gt; &amp;or; … &amp;or; I&lt;sub&gt;n&lt;/sub&gt;)) &amp;or; … &amp;or; (&lt;strong&gt;n&lt;/strong&gt;-1 &amp;and; (I&lt;sub&gt;1&lt;/sub&gt; &amp;and; I&lt;sub&gt;3&lt;/sub&gt; &amp;and; … &amp;and; I&lt;sub&gt;n&lt;/sub&gt;))&lt;br /&gt;
…&lt;br /&gt;
O&lt;sub&gt;n&lt;/sub&gt; = 0 &amp;or; (1 &amp;and; (I&lt;sub&gt;1&lt;/sub&gt; &amp;or; I&lt;sub&gt;2&lt;/sub&gt; &amp;or; … &amp;or; I&lt;sub&gt;n-1&lt;/sub&gt;)) &amp;or; … &amp;or; (&lt;strong&gt;n&lt;/strong&gt;-1 &amp;and; (I&lt;sub&gt;1&lt;/sub&gt; &amp;and; I&lt;sub&gt;2&lt;/sub&gt; &amp;and; … &amp;and; I&lt;sub&gt;n-1&lt;/sub&gt;))&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;As you have seen, this took a total of (&lt;strong&gt;n&lt;/strong&gt;-1)/2 + 1 NOT gates. Since &lt;strong&gt;n&lt;/strong&gt; is odd, (&lt;strong&gt;n&lt;/strong&gt;-1)/2 + 1 = &amp;lfloor;&lt;strong&gt;n&lt;/strong&gt;/2&amp;rfloor; + 1, and we&#39;re done! Whew, I wouldn&#39;t want to implement that.&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/4189459878832255218/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2008/05/not-puzzle-solution.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/4189459878832255218'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/4189459878832255218'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2008/05/not-puzzle-solution.html' title='NOT Puzzle Solution'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1806360094658697411.post-7869519820726152618</id><published>2008-05-24T14:54:00.004-04:00</published><updated>2021-07-17T22:26:06.721-04:00</updated><title type='text'>Boolean Logic</title><content type='html'>&lt;span 
xmlns=&quot;&quot;&gt;&lt;p&gt;Today we discuss Boolean Logic. It may be the first in a series on different kinds of logic, or it may not. We&#39;ll see.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;A concept that should be intimately familiar to any programmer is &lt;em&gt;Boolean logic.&lt;/em&gt; True or False, 1 or 0, Yes or No: these are all different representations of a Boolean variable. Invented by George Boole in the 1800s, Boolean logic forms the basis of modern computing, used in everything from transistor-level logic gates to bitmasks and search queries. &lt;a href=&quot;https://en.wikipedia.org/wiki/Boolean_algebra&quot;&gt;Wikipedia&lt;/a&gt; has a good introduction to the theory for anyone who needs a refresher. Rather than re-teach it, here I am going to draw your attention to a few of the more interesting features of Boolean logic.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;First off, everything can be defined from only one basic operation, &lt;em&gt;alternative denial&lt;/em&gt; (aka NAND, ↑). Alternatively, the same can be done with NOR, albeit in different patterns. For example, NOT can be created by splitting the input into both sides of a NAND gate. For this reason – and many others – NAND is often used as a building block of transistor circuits. It is a bit awkward for general use, however, so for higher-level Boolean logic we usually deal in three main operations: &lt;em&gt;negation&lt;/em&gt; (NOT, or ¬), &lt;em&gt;conjunction&lt;/em&gt; (AND, or ∧), and &lt;em&gt;disjunction&lt;/em&gt; (OR, or ∨). There are others, but these are good enough for now. Take a look at a few examples:&lt;br /&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;A ∨ B  =  ¬(¬A ∧ ¬B)      (De Morgan&#39;s Law)&lt;br /&gt;0  =  A ∧ ¬A&lt;br /&gt;1  =  ¬(A ∧ ¬A)&lt;br /&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Even just using these three operators, there are multiple ways to express any given function. 
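&lt;/p&gt;&lt;p&gt;Identities like these are easy to check mechanically: two formulas denote the same function exactly when their truth tables agree. A quick Python sketch (the helper name is mine, not standard):&lt;/p&gt;

```python
# Check Boolean identities by comparing truth tables exhaustively.
from itertools import product

def equivalent(f, g, arity):
    """True if f and g agree on all 2**arity input assignments."""
    return all(bool(f(*args)) == bool(g(*args))
               for args in product([False, True], repeat=arity))

# De Morgan's Law: A or B  =  not(not A and not B)
assert equivalent(lambda a, b: a or b,
                  lambda a, b: not (not a and not b), 2)

# Contradiction and tautology: 0 = A and not A, 1 = not(A and not A)
assert equivalent(lambda a: a and not a, lambda a: False, 1)
assert equivalent(lambda a: not (a and not a), lambda a: True, 1)
```

&lt;p&gt;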
For example, which is the &#39;best&#39; way to represent XOR (the ⊕ operator)?&lt;br /&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;A ⊕ B  =  (¬A ∧ B) ∨ (A ∧ ¬B)&lt;br /&gt;A ⊕ B  =  (A ∨ B) ∧ ¬(A ∧ B)&lt;br /&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The first is easier for most people to understand (&quot;A is true and B is false, or the other way around&quot;), while the second has fewer logical operations. If we wanted to implement XOR in software, that might matter, although in hardware it would usually be implemented as a custom unit such as the CMOS 4070, or failing that, four NAND gates:&lt;br /&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;N  :=  A ↑ B&lt;br /&gt;A ⊕ B  =  (A ↑ N) ↑ (N ↑ B)&lt;br /&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;span xmlns=&quot;&quot;&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://en.wikipedia.org/wiki/Image:XOR_Using_NAND.jpg&quot;&gt;&lt;img style=&quot;cursor: pointer; display: block; margin-left: auto; margin-right: auto;&quot; src=&quot;https://upload.wikimedia.org/wikipedia/commons/f/f1/XOR_Using_NAND.jpg&quot; alt=&quot;&quot; border=&quot;0&quot; /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Another interesting feature of hardware-level Boolean logic is &lt;em&gt;propagation delay&lt;/em&gt;. Essentially, we think of logical operations as instantaneous, but in hardware they take a measurable time to switch. The propagation delay is the total time it takes for a change in the inputs to be reflected as a change in the outputs of a function. This gives us a tradeoff between time and transistor count – basically, it might be better to use more transistors to implement a function, but structured in parallel. This tradeoff is sometimes reflected in software: if your computer can run multiple instructions in parallel, it is sometimes possible to use more instructions to perform a function in fewer clock cycles. 
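&lt;/p&gt;&lt;p&gt;To put a rough number on that: combining &lt;strong&gt;n&lt;/strong&gt; inputs with a chain of 2-input gates puts &lt;strong&gt;n&lt;/strong&gt;-1 gates on the longest path, while regrouping the same gates into a balanced tree cuts that to about log&lt;sub&gt;2&lt;/sub&gt;(&lt;strong&gt;n&lt;/strong&gt;). A toy sketch (the function names are mine):&lt;/p&gt;

```python
# Compare circuit depth (worst-case propagation path) for two ways of
# combining n inputs with 2-input gates: a linear chain vs. a balanced tree.
from math import ceil, log2

def chain_depth(n):
    """Gates on the longest path when inputs are folded in left to right."""
    return n - 1

def tree_depth(n):
    """Gates on the longest path when inputs are paired up in a balanced tree."""
    return ceil(log2(n))

for n in (2, 8, 64):
    print(n, "inputs: chain depth", chain_depth(n), "vs. tree depth", tree_depth(n))
```

&lt;p&gt;Here the gate count is identical and only the wiring changes; real circuits often do spend extra transistors to win depth, as in carry-lookahead adders.&lt;/p&gt;&lt;p&gt;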
The compiler usually handles this kind of micro-optimization, however.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;One other interesting tidbit: take a look at &lt;a href=&quot;https://en.wikipedia.org/wiki/Karnaugh_map&quot;&gt;Karnaugh maps&lt;/a&gt; for simplifying expressions and finding &lt;a href=&quot;https://en.wikipedia.org/wiki/Race_condition&quot;&gt;race conditions&lt;/a&gt;. Essentially, you draw out the truth table for an expression and then cover the &#39;1&#39;s with as few overlapping rectangles as possible, optionally adding extras to prevent race conditions.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;span xmlns=&quot;&quot;&gt;&lt;a onblur=&quot;try {parent.deselectBloggerImageGracefully();} catch(e) {}&quot; href=&quot;https://commons.wikimedia.org/wiki/Image:K-map_6%2C8%2C9%2C10%2C11%2C12%2C13%2C14_anti-race.svg&quot;&gt;&lt;img style=&quot;cursor: pointer; display: block; margin-left: auto; margin-right: auto;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY8dI7cpRbY9SIrV5juJBe0U_2I3D-XKvZ3tpj2l5OfhqwYDiqxtcKqaKvSD2vtnn0VS22V0src-cmyR3qRlwlRQf_Pm1Qn33ctdbcYeGuiHnMD_iwZAMweEj1yJCpI5G_MAWJxy_QyE_p/s400/K-map_6,8,9,10,11,12,13,14_anti-race.png&quot; alt=&quot;&quot; id=&quot;BLOGGER_PHOTO_ID_5204033728254623746&quot; border=&quot;0&quot; /&gt;&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Lastly, I&#39;ll leave you with a puzzle: invert three inputs using only two NOT gates. To be specific, you have a function with 3 inputs, A B C, and you want three outputs, X Y Z, such that&lt;br /&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;X  =  ¬A&lt;br /&gt;Y  =  ¬B&lt;br /&gt;Z  =  ¬C&lt;br /&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;However, you only have two NOT gates (although you can use as many AND or OR gates as you&#39;d like). 
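&lt;/p&gt;&lt;p&gt;If you want to test a candidate answer mechanically, an exhaustive harness is only a few lines. A sketch in Python (it checks correctness only; counting your NOT gates is up to you):&lt;/p&gt;

```python
# Exhaustive harness for the puzzle: do candidate functions x, y, z
# compute not A, not B, not C on all 8 input combinations?
from itertools import product

def check_inverter(x, y, z):
    """True if (x, y, z) invert (a, b, c) for every Boolean input."""
    return all(x(a, b, c) == (not a) and
               y(a, b, c) == (not b) and
               z(a, b, c) == (not c)
               for a, b, c in product([False, True], repeat=3))

# The obvious candidate passes, but it spends three NOT gates;
# the puzzle is to pass with only two.
assert check_inverter(lambda a, b, c: not a,
                      lambda a, b, c: not b,
                      lambda a, b, c: not c)
```

&lt;p&gt;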
Note: to solve this in written form (rather than by drawing a circuit diagram), you may find that you need temporary variables, such as &#39;N := ¬A ∧ B&#39;.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Bonus points if you can generalize it to an algorithm for inverting &lt;strong&gt;n&lt;/strong&gt; inputs using ⌊&lt;strong&gt;n&lt;/strong&gt;/2⌋ + 1 NOT gates.&lt;/p&gt;&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.thelowlyprogrammer.com/feeds/7869519820726152618/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.thelowlyprogrammer.com/2008/05/boolean-logic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/7869519820726152618'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1806360094658697411/posts/default/7869519820726152618'/><link rel='alternate' type='text/html' href='http://www.thelowlyprogrammer.com/2008/05/boolean-logic.html' title='Boolean Logic'/><author><name>Eric Burnett</name><uri>http://www.blogger.com/profile/10741882872804697111</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY8dI7cpRbY9SIrV5juJBe0U_2I3D-XKvZ3tpj2l5OfhqwYDiqxtcKqaKvSD2vtnn0VS22V0src-cmyR3qRlwlRQf_Pm1Qn33ctdbcYeGuiHnMD_iwZAMweEj1yJCpI5G_MAWJxy_QyE_p/s72-c/K-map_6,8,9,10,11,12,13,14_anti-race.png" height="72" width="72"/><thr:total>0</thr:total></entry></feed>