Alessandro Bahgat's Blog

The nearest hospital to every place on Earth, in a single S2 range query

Sun, 31 May 2026 15:00:00 GMT

How far is the nearest hospital, for every place on Earth? At first, this sounds like a distance problem with billions of pairs to check. However, with the right tools, it isn’t a distance problem at all: with S2 indexing it’s the same query as which country, region, and locality contains this point? We can solve it with a plain integer range check against a single index.

Last week I was advising a client whose geo pipeline was getting slower every week. Their team had started with the obvious approach: given a few hundred thousand points and several hundred thousand polygons, for each point they scanned every polygon to find the one that contained it. As the input size grew, they watched the nightly batch pipeline become considerably slower and were running out of options.

As we sketched applicable approaches on a whiteboard, I realized I was drawing a picture I’d drawn before, a decade before at Google. The trick, using S2 geometry to turn spatial joins into key joins, is one of the most elegant and underrated primitives I’ve come across: the kind of indexing idea that, like Ukkonen’s suffix trees, collapses an apparently quadratic problem into something nearly linear.

The problem I was solving back then had the same shape but on a different application: the typical price of a hotel stay anywhere on Earth, any night of the year, served at 10ms latency, which required precomputing everything via batch pipelines. One step had to associate every hotel with every named place that might contain it: neighborhood, town, city, region, POI.

Written naively, that step is a cartesian product. Given H hotels and R regions, generate all H x R pairs and only keep the ones where the hotel falls inside the region’s polygon.

The association step. The answer is a sparse matrix: each hotel falls inside a handful of regions. But the naive cartesian product still runs all H×R point-in-polygon tests to find it.

I’d built it as a Flume pipeline, computing summaries for every night of the year over data that already lived distributed across several storage backends. Running it on Flume was justified by the layout of the data, the multitude of prices we had, and the 365-day time dimension. However, joining points-against-polygons was always within reach of one machine. The primitive, the indexing trick at the heart of it, never needed a cluster around it.

To demonstrate it, I wanted to solve a fresh public-data version of that same problem on a pretty typical machine: an AMD Ryzen 9 7900 desktop (12 cores, 64 GB of RAM). The question I picked: where on Earth is your nearest hospital? Its naive form is 437 billion pairs. The S2 index collapses it to a single integer range-join: about an hour to build the index, and then the answer for every locality on Earth comes back in seconds. This post is the story of how.

The worst place to get injured

If you live on the Kerguelen archipelago (a French sub-Antarctic research outpost halfway between Madagascar and Antarctica) and you need a hospital, the nearest one is 3,362 km away, on Rodrigues Island in Mauritius. That’s farther than New York to Los Angeles. And it’s the loneliest result in a worldwide leaderboard of localities ranked by distance to their nearest healthcare facility, covering every place on Earth that anyone has put on the map.

The top of the leaderboard is unsurprising: the top three are all settlements on Kerguelen, the next seven are all Tuamotu atolls in French Polynesia. Interestingly, both are French overseas territories; together they sweep the entire global top 10 because:

each small atoll is its own locality polygon,
they really are extraordinarily remote, and
France maps them well.

The last point was interesting: the underlying map data is a treasure, but has varying coverage by country. I suppose it’s due to a combination of factors, one of which being how active the local mapping community is. I guess it makes sense considering how much of it comes from OpenStreetMap volunteers.

A more interesting exploration asks the question per country, restricted to countries most readers will recognize. Click any row to see the actual location on OpenStreetMap.

Country	Locality	km
Russia	Dikson, Arctic Ocean coast	2,901
Canada	Read Island, Northwest Territories	2,264
United States	Adak, Aleutian Islands, Alaska	1,546
Greenland (Denmark)	Nanortalik, southern Greenland	1,171
Australia	Mundrabilla, Nullarbor Plain	631
China	黑瞎子岛镇, Bolshoy Ussuriysky Island	584
United Kingdom	Rockall, North Atlantic islet	503
Mauritania	Akdernit, Sahara	365
Madagascar	Beroboka Nord, western coast	245
Brazil	Oriximiná, Amazon interior	230
Mali	Tin Zaouatine, Algerian border / Sahara	219
Argentina	Sierra Colorada, Patagonia	173
Japan	Kitadaitōjima, remote Pacific island	165
Chile	Natales, Patagonia / Magallanes	153
Mexico	Progreso, Yucatán	125
Italy	Ustica, volcanic island	57
France	Île-de-Sein, Brittany	45

A few things this list reveals (I had to look up every single one of them except for Ustica):

The UK’s most-isolated mapped locality is Rockall: a tiny rocky islet in the North Atlantic, contested between four countries, with a Royal-Marines-occupied flagpole. So small, it turns out, that nobody lives on it at all (more on that below).
China’s is Bolshoy Ussuriysky: a river island the PRC and Russia split between them in 2008.
Australia’s Nullarbor Plain at 631 km is genuinely remote; the Outback is really isolated.
Italy and France show what well-covered countries look like: the most-isolated locality is a tiny offshore island just ~50 km from the nearest hospital. That’s what most of Europe looks like.

Behind every entry on that table is one integer per locality, half a million of them, classified against every healthcare POI in Overture (the open global map dataset, introduced below), every (locality, POI) pair a potential great-circle distance. The primitive that makes this possible at scale is S2 cell indexing, originally built in the mid-2000s to power Google Maps. The same primitive simultaneously answers a different-looking question (what country, region, and city does each hospital belong to?) from the same index. That’s the part this post is built around.

The hotels-to-cities problem, again

The Flume pipeline from my memory was a hotels-to-cities job: for every (hotel, place) pair, answer is this hotel inside this place? The pipeline in my demo works for both hospitals-to-administrative-units and hospitals-to-radius-bands. S2 cell indexing makes these problems tractable by turning geographic questions (is point P inside polygon Q?) into an integer-key question (is the integer P′ in the sorted set Q′?). Then, a sort-merge pass handles the rest and the shape of the problem goes from quadratic-on-geometry to linear-on-integers.

I wanted to apply the lesson to a fresh problem with public data and real-world stakes, and settled for these two facets:

Distance to the nearest hospital is something everybody understands. Mapping every hospital on Earth to every country, region, locality it belongs to, at the same time, is the same question as the hotels-to-cities use case from my experience.
How far is the nearest hospital from each locality is the same question shape, just with distance bands instead of admin levels.

And S2’s hierarchy makes those two kinds of “level” indistinguishable to the algorithm. That’s the surprise from the beginning, spelled out: distance and containment are the same query: let’s see how.

S2 in one paragraph

S2 partitions the surface of the Earth into a quadtree of cells. Key principles:

every cell has a unique 64-bit integer ID;
every cell has a parent at the next coarser level;
every leaf cell at level 30 (about 1 cm²) is the descendant of exactly one cell at every level above it;
it’s possible to walk the S2 hierarchy all the way up to six face cells.

This strict parent-child invariant — that every leaf has a unique ancestor at every level — is what makes S2 special. It lets us encode every cell as an interval [range_min, range_max] over leaf-cell IDs. A leaf L is contained in cell C if and only if range_min(C) ≤ L ≤ range_max(C). One integer-interval check, no geometry, and it works at any level, on any cell, against any leaf. The whole post is built on this one trick.

Each S2 Cell ID is positional. Three bits pick one of six cube faces, then two bits per level pick one of four children, and a final 1 bit marks where the cell stops. A coarser cell is therefore a binary prefix of all its descendants:

cell C        011 10 00 11 · 1 · 0000…0
range_min(C)  011 10 00 11 · 0000…00 · 1
range_max(C)  011 10 00 11 · 1111…11 · 1

Every leaf under C shares the prefix 011 10 00 11 and varies only in the bits below it, from all-zeros to all-ones. That span is the interval [range_min, range_max], and “is this leaf inside this cell?” becomes “does this integer fall in that range?”. It’s prefix matching, the same trick a router uses on a CIDR block. The Hilbert ordering shown below adds a separate property on top: nearby cells get nearby IDs, so a whole region covers into just a few of these intervals instead of thousands.

S2 cells on one face at level 4, walked in Hilbert order. The blue space-filling curve is the order S2 uses to number cells: consecutive cells along the curve get consecutive cell IDs. The red square is the 16 leaf-cell descendants of a single parent cell two levels up: they form a contiguous run of cell IDs (a single integer interval) that's also a connected sub-region of the curve. That's the property the whole pipeline exploits.

The data

The dataset I used for the demo is Overture Maps, a public release of Meta’s, Microsoft’s, AWS’s, and TomTom’s joint cleanup and merge of OpenStreetMap, Microsoft Building Footprints, and assorted proprietary data. As of the May 2026 release it has 54 million POIs and 1.07 million administrative polygons, all openly available as Parquet on S3, byte-range-readable with no auth required.

I pulled the global tile to my workstation via the overturemaps CLI in a few minutes. From that, the input set for this post:

770,440 healthcare POIs: hospital, medical_center, emergency_room, urgent_care_clinic. This deliberately excludes pharmacies, dental clinics, and specialist offices. You could reproduce the same approach with any other type of POI.
567,307 localities: cities, towns, villages, wards.
~4,700 regions, ~380 country polygons (some countries have multiple polygons in the source, such as overseas territories, exclaves, and disputed islands, which get rolled up to one ISO code at aggregation time).

Those three admin levels add up to ~572,000 polygons: a subset of Overture’s 1.07 million administrative polygons.

Mapping every POI to every admin level, in one pass

For each admin polygon (country, region, locality) we then build its cell-union: the smallest set of S2 cells whose union covers the polygon. Each cell carries an INTERIOR / BOUNDARY tag. INTERIOR cells are entirely inside the polygon; BOUNDARY cells straddle its edge. The construction is a single call to S2’s RegionCoverer, wrapped to also tag each cell:

from geo.s2_covering import cover_polygon

# `polygon` is a shapely geometry for one admin region.
# `cover_polygon` returns a mixed-level cell-union: small cells along
# the boundary, big cells inside the bulk.
tagged = cover_polygon(polygon, min_level=4, max_level=12, max_cells=200)

rows = [
    (admin_id, c.range_min, c.range_max, c.tag == "INTERIOR")
    for c in tagged
]

Building the cell-union table for those ~572,000 admin polygons produces ~5.7 million rows in 36 minutes on the workstation. Each row is a small struct: (admin_id, range_min, range_max, is_interior).

Italy's S2 cell-union at levels 4–7, overlaid on OSM. Green cells (INTERIOR) sit entirely inside the country polygon: a hospital whose leaf cell falls in one of them is auto-confirmed as inside Italy, no further check needed. Orange cells (BOUNDARY) cross the polygon edge; hospitals in them have to be re-checked with a real polygon-contains call. INTERIOR cells are visually rare here, and that's not a quirk of Italy's peninsular shape: on every real admin polygon I checked, INTERIOR cells are a small minority of the cell-union. The next section unpacks why the join is still fast.

For each healthcare POI, I compute its leaf cell, a 64-bit integer. All 770k of them take four seconds.

Then a single SQL range-join in DuckDB (conceptually similar to SQLite, but for analytical work on columnar data):

SELECT poi.id, admin.id, admin.subtype
FROM poi_leaves poi
JOIN unified_cell_union admin
  ON poi.leaf BETWEEN admin.range_min AND admin.range_max

That’s the whole spatial join.

What’s impressive is that it works across all admin levels at once. It returns ~5.6 million candidate (POI, admin) pairs in under a second:

~965k INTERIOR matches are auto-confirmed: leaf is inside an INTERIOR cell, no further work needed.
~4.7M BOUNDARY matches need a real polygon-contains check; ~3.3M survive.

The refinement is the only place geometry comes back into the loop. To keep it cheap, group the candidates by admin polygon: load each polygon’s geometry once, then call shapely.contains against every POI that landed in one of its boundary cells:

from shapely.geometry import Point

confirmed = []
for admin_id, group in boundary_candidates.groupby('admin_id'):
    poly = admin_polygons[admin_id]      # loaded once per admin
    confirmed.extend(
        (poi_id, admin_id)
        for poi_id, lon, lat in group.itertuples(index=False)
        if poly.contains(Point(lon, lat))
    )

Polygon loads dominate the runtime; the contains calls are cheap because the admin polygon was already simplified upstream of the index. ~4.7M candidates → ~3.3M confirmed in a few minutes.

Total: ~4.2 million confirmed (POI, admin) pairs, across country and region and locality, in 6 minutes of compute.

To make the join concrete, take a real hospital: Bergamo’s Ospedale Papa Giovanni XXIII, at roughly (45.6917° N, 9.6692° E). Its S2 leaf cell ID is 5,152,488,575,548,925,233. Each admin polygon is a union of many cells; the one whose interval contains that integer is its matching cell. A cell and its interval are the same object: a cell at level k is exactly the leaf range [range_min, range_max]. The three matching cells:

Admin polygon	Matching cell	range_min	range_max
Italia (country)	level 6	5,152,117,973,711,847,425	5,152,680,923,665,268,735
Lombardia (region)	level 7	5,152,399,448,688,558,081	5,152,540,186,176,913,407
Bergamo (locality)	level 11	5,152,488,509,130,407,937	5,152,489,058,886,221,823

Three nested intervals on the same number line, and the hospital’s leaf cell ID falls inside all three. The SQL above resolves country, region, and locality containment with one BETWEEN per row, against one unified table. The usual way to answer “which polygon contains this point?” is a spatial index like an R-tree, the structure behind PostGIS and most geo databases. But an R-tree covers one layer of polygons at a time, so all three admin levels mean three separate indexes and three separate tree searches. Three integer comparisons against one table replace all of it.

Each administrative level's cell-union is a set of S2 cells (the table above shows just the one cell per level that matches here). Drawn on a single axis of cell IDs, the solid cell in each row is the one whose interval contains the hospital's leaf cell (red); those three are nested, so one BETWEEN per row resolves country, region, and locality containment together.

The cell-union skews heavily toward BOUNDARY cells. Real admin polygons have long, jagged perimeters relative to their area, and RegionCoverer adaptively subdivides only the edge cells: each level of refinement turns ~1 straddling parent into ~2 straddling children, so a few edge cells at coarse levels balloon into many leaf BOUNDARY cells at max_level. Meanwhile the polygon’s bulk gets covered by a handful of large INTERIOR cells.

That skew doesn’t slow the join down, though. The range-join itself is just BETWEEN comparisons, so it doesn’t care how the cells are tagged. The only real cost is the BOUNDARY refinement, and most of those checks land on leaf cells at max_level, where the geometries are small and shapely.contains runs in microseconds. INTERIOR cells help at the margin, auto-confirming the POIs that fall in a polygon’s bulk, but the join is fast mainly because even the boundary path is cheap: a tiny leaf, an already-simplified polygon, a microsecond contains call.

The refinement work scales with polygon perimeter at max_level, not with area.

Distance is just another kind of hierarchy

Now the harder-looking question: how far is the nearest hospital from each locality? The naive solution is to treat it as a nearest-neighbor problem: for each locality, scan all hospitals, find the closest. Even with a spatial index it’s a different kind of problem from the admin-rollup join.

Is it though?

For each healthcare POI, we can build a spherical cap at each radius band (within 1 km, within 5 km, within 15 km, within 30 km, within 100 km) and cover each cap with S2 cells. The construction in S2 is about as short as it gets:

import s2sphere
EARTH_RADIUS_KM = 6371.0

center = s2sphere.LatLng.from_degrees(poi.lat, poi.lon).to_point()
cap = s2sphere.Cap.from_axis_angle(
    center,
    s2sphere.Angle.from_radians(radius_km / EARTH_RADIUS_KM),
)
cells = region_coverer.get_covering(cap)

Here:

get_covering is the same primitive used for the admin polygons;
Cap is just a shape RegionCoverer knows how to cover.

Why a cap, not a circle: the Earth is a sphere, so “within r km of a hospital” is the patch of surface within an angular radius θ of the axis pointing at that hospital: a spherical cap. The angle is all S2 needs: θ = radius_km / EARTH_RADIUS_KM, exactly the value passed to Cap.from_axis_angle. On a flat plane the same region would be a disk; wrapping the plane onto the sphere turns it into the cap, which keeps the bands honest out to the 100 km radius.

Then we take the per-radius union across all 770k POIs. The result is five cell-unions (one per radius band), each tagging the part of the planet that’s within that radius of some hospital.

Then, for each locality, we take its representative point’s leaf cell and ask: what’s the smallest band whose cell-union contains me? That’s its isolation distance.

The check is identical to the admin one: leaf BETWEEN range_min AND range_max. The hierarchy now spans distance scales instead of administrative scales, and the algorithm doesn’t know or care about the difference. I see that as more proof of the elegance in the S2 design: country / region / locality / 1 km / 5 km / 30 km are all the same kind of thing to the index. They are all representable as nested cell-unions over a quadtree, and queryable with a single integer-interval check.

Five concentric radius-band S2 cell-unions around one example hospital in Reykjavik, overlaid on OSM. Each colored band is everything within 1 / 5 / 15 / 30 / 100 km of the hospital, covered with S2 cells whose [range_min, range_max] intervals can be checked against a leaf cell in a single integer comparison. Bigger radii merge into coarser parent cells, the same Hilbert-locality property that made the country figure work. To the algorithm, this picture and Italy's are the same class of thing.

This is where S2 separates from H3, Uber’s hexagonal grid system, and from R-trees. R-trees can do range queries but need one tree per question: four levels of hierarchy = four trees, four traversals. H3 can index polygons but its parent-child relation is approximate, so a multi-level union breaks the integer-interval trick. S2’s strict quadtree parent-child invariant is the property that makes the same primitive work for both kinds of hierarchy.

Here are the cell counts per band, after normalization:

Radius	Cells (after normalize)
1 km	1,575,449
5 km	724,157
15 km	317,056
30 km	129,497
100 km	11,021

Bigger radii merge into coarser parents, yielding fewer cells for the same fidelity. Building all five bands across 770k POIs took 30 minutes; classifying 567k localities into bands took 30 seconds.

A booby-trap

The first version of the global leaderboard had Archipel des Crozet at 2,008 km. Being unfamiliar with many of these locations, I decided to do some spot-checking. Crozet is a sub-Antarctic French island group in the southern Indian Ocean, about as far from anything as land gets, so I expected a big number. But 2,008 felt too low: Crozet to Cape Town is 2,400 km, Crozet to Madagascar is 2,400 km, and there’s nothing in between. A hospital can’t sit closer than the nearest land, I thought. So where was this 2,008-km hospital?

A query of my data gave me the answer: at coordinates (-30.75°, 63.63°), the middle of the open Indian Ocean. Querying that location in Overture’s data turned up:

서울병원 (Seoul Hospital), country=KR

A Korean hospital tagged ~10,000 km from Korea, due to a geocoding error somewhere upstream in the OSM ingest. That phantom POI was the “nearest hospital” pulling Crozet down to 2,008 km. It did worse damage next door: it also sat ~2,150 km from Kerguelen, well inside Kerguelen’s true ~3,360 km, so it had also quietly bumped the genuine top results off the board. Three more ghosts turned up at similar mid-ocean coordinates, each understating how isolated some real place is.

The fix is one filter, against the same table we built for the join:

-- Keep POIs whose leaf falls inside some country polygon.
-- Misgeocoded ocean POIs never match because no country range covers
-- the open ocean.
SELECT DISTINCT poi.id
FROM poi_leaves poi
JOIN unified_cell_union admin
  ON poi.leaf BETWEEN admin.range_min AND admin.range_max
 AND admin.subtype = 'country';

That’s the same BETWEEN primitive, restricted to one admin level. The index doesn’t need to know what a “country” is; it just trusts the cell-union for subtype = 'country'. After the filter, Kerguelen reclaims the top three and Crozet settles at its real ~2,400 km (nearest hospital in Madagascar): still extraordinarily remote, just below the Kerguelen and Tuamotu tier at the top of this post.

The lesson, more useful than the leaderboard: even with a clean algorithm and public data on a workstation, the first answer is wrong in interesting ways. The question shape is right; the data has booby-traps; you find them by looking at outliers and asking “does this make sense geographically?”

You’ll often hear advice that in any data problem (whether it’s for ML or data analysis), spending time eyeballing the data for patterns and outliers pays off. This is one more example of that.

See it

Mode

The interactive map above zooms through three layers of the same join focused on my home country: Italy. 20 regioni, 8,577 comuni, and 12,579 healthcare POIs. Polygon fills are colored by what fraction of contained comuni are within 5 km of healthcare: blue good, red bad, cream in the middle. Hover any feature for its per-area stats. Toggle “Density” to switch from access to healthcare-POIs-per-km².

Why Italy specifically? Of course I have ties to the country, but beyond that, there are more interesting reasons.

While the algorithm is universal, the input data isn’t: Overture’s locality coverage is dense across Italy (every comune mapped) and several other countries, but patchy elsewhere. Showing the join over Italy keeps the demo interesting and realistic: every visible polygon is a real comune with a real population, not a hole in OSM’s coverage. When I worked on this problem at Google, we were also dependent on the quality of data from Google Maps, and we knew that some countries were better mapped than others.

A few patterns the Italian view makes obvious:

The alpine arc (Valle d’Aosta, Trentino, the northern fringe of Lombardia and Veneto) reads warmer: small mountain comuni without their own hospital, leaning on the valley town next door. An interesting follow-up study might be considering driving distance, which I’d expect to affect POI accessibility even more.
Po Valley, Tuscany, Campania around Naples, the Adriatic coast read cool: dense comune mosaic, dense hospital coverage.
Sardinia and Sicily show internal isolation: interior mountain comuni warm, coastal comuni cool.
Zoom in far enough and the POI dots light up: deep red for hospitals, blue for emergency rooms.

See the join structure itself

Flip on Show S2 cells in the toolbar. The choropleth gets overlaid with the actual S2 cell-union used by the join: green cells are INTERIOR (entirely inside the polygon, auto-confirmed in the range query), orange cells are BOUNDARY (straddle the edge, refined via shapely.contains). At low zoom you see the per-regione cell-union; at z7–z8 it hands off to the per-comune cell-unions, the same parent-child hand-off the algorithm relies on. Around that zoom level you can briefly see both layers on top of each other: a Lombardia-tier cell sitting above the Milan / Monza / Lodi comune cells it contains. That visual nesting is the integer-interval check from the SQL block earlier in the post.

One thing the overlay makes visible: the INTERIOR/BOUNDARY split looks completely different depending on what you count. Barely 1% of the cells in Italy’s cell-union are INTERIOR; nearly all the rest trace the jagged comune and regione edges. That sounds like the algorithm’s “fast path” is doing almost no work, but the matches skew far more INTERIOR than the cells do, because POIs cluster in the bulk of polygons, not on their edges, exactly as the join section above described.

What the answer doesn’t (and can’t) say

Four caveats deserve more than a footnote:

OSM coverage varies by country. The dataset’s tagging fidelity isn’t uniform. Italy’s espresso bars are tagged bar and not coffee_shop; Vietnam tags every cà phê; Thailand tags every clinic; rural Russia barely tags anything. For isolation distance, this means the answer is an upper bound wherever coverage is thin: when we say “120 km to the nearest hospital,” the truth is “120 km to the nearest hospital that someone bothered to map.” That’s the right answer to what does the public dataset say. It’s not the answer to where are the actual hospital deserts without continued investment in mapping ground truth as open data.

Density is partly a tagging-density story. The “densest healthcare” leaderboard is dominated by Bangkok wards (top entry: 254 healthcare POIs per km², four of the top five are in central Bangkok), then Delhi, Jakarta, Taipei, Seoul, Saigon. Bangkok’s medical-tourism culture and high small-clinic density are real. But Thai OSM mappers are also unusually thorough. The dataset’s leaderboards are always partly a leaderboard of map quality.

Outright vandalism happens. OSM is open; some entries are jokes. Greenland’s top-isolated locality in the raw data came back as a Skibidi-meme joke name that someone added to the map. Same problem as the misgeocoded ocean POIs: once you look at the answers, you find them. (Filter that one and the next entry, a real Inuit settlement on the east coast, slides up.) The shape of the problem is the same: a public dataset’s leaderboard is downstream of the public dataset’s quality, and you treat it as data not as truth.

A locality isn’t necessarily inhabited. When I looked into the UK’s lonely winner, Rockall, it turned out to have no permanent population at all. Nothing is wrong with the data, though: Overture has an optional population field, but it’s sparse, and the locality subtype only asserts that someone put a named place on the map, not that anyone lives there.

Locality coverage is wildly uneven across countries, which is why the live demo in this post is Italy. Overture inherits OSM’s country-level mapping density: Italy has 8,577 mapped comuni, France 36,752 communes, Germany 22,967. Greece has 178. Within the US, Massachusetts is fully tiled with mapped places; Virginia is full of gaps. A global choropleth at city zoom reads great over Western Europe and becomes visibly patchy elsewhere, not because the algorithm breaks, but because the input data is uneven. The interactive in this post uses Italy because every visible polygon is a real comune with a real population.

Why this is an S2 post and not just a “use a spatial index” post

The cell-set inner join, on its own, doesn’t distinguish S2 from any other spatial index. R-trees do range queries. H3 cells can index polygons. DuckDB’s plain ST_Contains is already quite fast at this scale on its own: for a single-level point-in-polygon problem, a generic spatial join would land in the same ballpark.

What S2 buys with its versatility, and what this post is built to show, is one index, every scale at once:

The same cell-union table resolves country and region and locality for any point: single integer-interval lookup against a unified table.
The same primitive resolves what 1 km / 5 km / 30 km radius am I in for any point: same lookup, separately-built but structurally identical table.
The two kinds of “level” are the same kind of thing to the algorithm. R-trees would need four trees and four traversals. H3 has parents; that’s not the issue. The issue is that H3 cell IDs across resolutions live in disjoint integer spaces. The encoding embeds the resolution in the ID itself, so a leaf cell at res 9 and its res-5 ancestor are unrelated integers. There’s no equivalent to S2’s [range_min, range_max] that lets a single leaf ID be range-checked against a cell at any coarser level. With H3 you can do lookups, but only at one fixed resolution at a time. S2’s strict quadtree invariant, that every cell at every level is representable as an integer interval over leaf IDs, is what makes the multi-resolution unified table work.

Concretely, here’s what that costs visually. The same patch of Lombardia, covered two ways:

Lombardia covered two ways. On the left, S2's mixed-level cell-union: 59 cells at levels 5–9. Three large green INTERIOR cells tile the polygon's bulk; the remaining 56 BOUNDARY cells (orange, finer) handle the edges. Mixing levels is legal because S2's quadtree guarantees each cell's [range_min, range_max] over leaf IDs cleanly contains all its descendants, so coarse and fine cells live in the same integer-indexed table. On the right, H3 at two resolutions overlaid: res 4 (15 large blue hexes) and res 5 (97 red strokes laid on top). The red boundaries cross the blue ones; the smaller hexes do not nest inside the bigger ones. A res-4 cell well inside Lombardia has a 14.3% symmetric-difference with the union of its 7 res-5 children: children spill out and leave gaps. That's why H3 can't do the S2 trick of mixing resolutions in one cell-union table.

S2’s mixed-level union is a clean tiling: a handful of large interior cells, a band of small boundary cells, no gaps and no overlaps. H3’s geometry can’t deliver that: pick one resolution and you get a sensible cover; try to mix two and the cell boundaries don’t agree. The hex grid is the wrong shape for a strict quadtree invariant.

Cell counts across the three nested admin polygons make the same argument numerically:

	S2 (mixed levels)	H3 res 5 (~8.5 km)	H3 res 7 (~1.2 km)	H3 res 9 (~174 m)
Italia (country)	178	1,701	83,322	4,081,854
Lombardia (region)	59	97	4,653	227,466
Bergamo (locality)	16	1	8	380

The S2 union uses cells at levels 4–8 for Italy, 5–9 for Lombardia, 8–12 for Bergamo: bulk gets covered by a few large cells, only the edges need fine ones, all coexisting in the same table because the integer intervals at the leaf level let them. H3 has to commit to one resolution: at res 5 Bergamo is a single hexagon (unresolvable), at res 7 Italy needs 83,000 hexagons in the index, at res 9 the index is unworkable. H3’s compact() helps some (it can fold contiguous fine cells back into parents) but the resulting set still isn’t a range-comparison primitive, and the lookup remains set-membership at a chosen resolution. You can build a working spatial join with H3; you can’t build this particular multi-scale unified-index join with it. That’s the structural property S2 gives you, and the post is built around it.

Concretely: the unified cell-union table over 572k admin polygons is 46 MB. The five radius-band cell-unions over 770k POIs total ~70 MB. Both indexes are tiny, and building them is the only slow part: the admin cell-unions take 36 minutes, the radius bands 30, so the indexes are ready in about an hour. Everything after that is a query. The range-join that resolves country, region, and locality for all 770k POIs returns in under a second; classifying every one of the 567k localities on Earth into a nearest-hospital band takes 30 seconds. The one query-side step that costs minutes, the boundary refinement, is geometry-loading, not the join.

Which loops back to the opening. I used a distributed batch job via Flume for the original problem at Google, not because the algorithm needed one but because the data was huge. Here the entire pipeline (both indexes, the joins, the refinement, the aggregation) runs on a single CPU, no GPU, no cluster. A point-in-polygon question becomes an integer-interval question, and the rest is sort-merge. Build the index once, and the planet-scale query is seconds of work. The index is what did the real work, at fleet scale and on a desktop alike.

Reproduce

The code is at github.com/abahgat/s2-spatial-join (public, MIT-licensed). All data sources are open and auth-free. End-to-end on a workstation, from raw Overture tiles to the finished leaderboard, runs in under two hours. Almost all of that is the one-time index build (the admin cell-unions and the radius-band cap coverings, the latter dominated by the finest-level band), not the queries, which answer in seconds once the index exists.

uv sync
uv run overturemaps download -t place         -f geoparquet -o data/cache/world_places.parquet
uv run overturemaps download -t division_area -f geoparquet -o data/cache/world_divisions.parquet
uv run python scripts/bake_healthcare_world.py            # phases A–D: indexes
uv run python scripts/bake_healthcare_world_geojson.py    # phase E: aggregate + leaderboard

The whole pipeline. Two open Overture datasets feed one S2 index, and the same range-comparison primitive answers both questions: containment (which country, region, and locality contain each POI) and distance (the nearest hospital per locality, via radius-band cell-unions). Both tracks converge into the per-locality aggregate that becomes the leaderboard. Phase tags A–E match the script comments above.

If you want to ask the same kind of question of a different POI category (most isolated locality from a school, a pharmacy, a fire station, a bookstore, a coffee shop) the pipeline generalizes by changing one filter line, and runs in the same amount of time. The S2 index doesn’t care what kind of POI you put in it. That, more than any specific number, is the point.

The engineering leader I was coaching: their pipeline is now an integer-interval check too. They didn’t need a cluster either.

Same Agent, Different Score: The Problem With Testing Non-Deterministic AI

Thu, 16 Apr 2026 14:30:00 GMT

A few days ago I experimented with a couple local AI models by having them play Zork. They both got stuck in the maze, and the results were interesting: while one agent managed to score 35 points on a good run, most runs scored zero. The next steps in my plan was to give the agent structured tools such as maps, memory and breadcrumbs, and then test how they affected gameplay.

Before building tools and trying to measure their impact, I wanted a solid foundation to build on. That meant picking a model, but more importantly it meant having a benchmark I trusted. I expect finding the right approach with tools will require dozens of design decisions, and I need to be able to tell whether each one actually helps. Zork is an ideal testbed for getting evaluation right: runs are cheap (minutes, not hours), the score is unambiguous (the game tracks it for you), and every turn is logged, so you can replay exactly what happened.

I expected the work to be in running models and comparing scores. Most of it turned out to be in getting the ruler right. Even though I read a lot about evals in the context of applied AI to have a baseline expectation, nothing beats first-hand experience.

By now, everyone has heard remarks that LLM outputs are non-deterministic, to the point that it’s become a hand-wave: “results may vary.” However, when you actually try to make decisions based on aggregate scores from dozens of runs, the non-determinism stops being a footnote and starts being more tangible. The same model can score 40 on one run and 0 on the next. Different benchmark harnesses can make the same model look good or terrible depending on how they handle edge cases. And the only way to tell whether your numbers mean anything is to invest in the telemetry to audit them after the fact.

One caveat before the numbers: Zork has been on the internet for decades by now, and I bet at least some of these models have seen walkthroughs in training. I’m not expecting this experiment to measure general ability at playing text adventure games from a blank slate. What it measures is whether an agent can execute on a multi-step strategy it plausibly already has a model of, despite the non-determinism. That’s even more important for the tool-use experiments I intend to run next.

Two performance tiers

I expanded from two models to five, all running locally on the same RTX 5080 with 16GB VRAM. Each model played five independent games capped at 100 turns.

Model	Params	Mean Score	Avg Latency	Notes
Gemma 4 26B	26B MoE (4B active)	19.0 (±16.4)	6.5s/turn	Highest peaks, highest variance
Mistral Small 24B	24B dense	12.0 (±11.0)	2.9s/turn	More consistent
Qwen 2.5 14B	14B dense	3.0 (±6.7)	0.9s/turn	Fast but directionless
Gemma 4 E4B	8B total (4B active)	0.0 (±3.5)	4.8s/turn	Too small
Phi-4 Reasoning 14B	14B dense	0.0 (±0.0)	5.4s/turn	Couldn’t follow the format

Score vs latency tradeoff across all five models. The stars mark model means; the dashed line shows the Pareto frontier.

The first clear result: there’s no smooth spectrum. Three models score near zero, two score in the double digits. There’s nothing in between. This was already a win, because at this point I didn’t know whether I should expect a large gap in capabilities, and seeing statistically significant differences allowed me to start making decisions.

Individual runs as dots, with mean and 95% CI bars. The gap between the top two models and the bottom three is not a gradient. It's a cliff.

This seems to be one of the few results that survives the noise, and hints to a capability cliff. Gemma 4 E4B has one-third the parameters of its 26B sibling and drops from solving the cellar to scoring zero. Phi-4 Reasoning can’t even follow the JSON output format: every run looped within six turns. The gap between a model that “can play Zork” and one that “can’t” is significant, and it’s the only comparison where n=5 gives you statistical confidence.

The gap between the two models that can play? That’s where things get complicated.

The variance is not noise

Look at Gemma’s individual scores: [25, 25, 0, 40, 5]. At first I thought that was just noise in measurement. When I started reading the game traces, it looked more like the model taking genuinely different strategic paths on each run.

Each point is a single run. The stars mark model means, the ellipses show spread. Gemma clusters around high-score/short-lived; Mistral clusters around moderate-score/long-lived.

Gemma plays aggressively. It pushes into the cellar early, grabs items, and takes risks. When it works, it scores fast: 40 points in 48 turns in its best run. But it died in 80% of its runs, often in the dark or by falling into a pit. One run never even entered the house.

Mistral plays cautiously. It explores methodically, backtracks when confused, and survives much longer. But it plateaus. Its weakest runs scored 5 because it wandered the forest without finding a way in.

These aren’t noisy samples from the same distribution. They’re different behaviors sampled from different regions of the model’s probability space. Back when I was still struggling with getting Qwen to stick with English, I had set temperature to 0.2: low, but not zero. That small amount of randomness compounds across dozens of decisions into completely divergent playthroughs. This is aleatoric uncertainty: randomness inherent in the system, not reducible by collecting more data. More runs don’t make the distribution narrower. They reveal its shape, which is the actual information.

Sam Savage calls this the “flaw of averages”: plans based on average conditions are wrong on average, because the mean of a non-linear system doesn’t represent any actual outcome. Gemma’s mean of 19 is a score no individual run ever produced. It doesn’t tell you that Gemma either scored 25+ or crashed. Mistral’s mean of 12 doesn’t tell you that it rarely hit those highs but also rarely hit zero. The distribution is the result, not a nuisance to be averaged away.

If you’re choosing between them as agents, this matters. A cautious agent that reliably scores 15 is a very different tool than an aggressive one that scores 40 or 0, even if they average out similarly. For traditional evals, where the goal is to match a known expected answer, variance is noise you want to minimize. For agents navigating a complex space with branching consequences, variance is signal. It tells you the shape of what the agent might do, and that shape determines what kind of help it needs. That’s why pass@k metrics are valuable.

Put differently: if you only get one shot at a problem, you want Mistral. Its worst run was a 5, not a 0. If you can run the agent several times and keep the best result, you want Gemma: its ceiling is much higher even though most individual runs fail. The mean hides this entirely.

For deployments in the real world, these are the same two failure modes I have to consider. One agent type moves fast and occasionally nails it, but blows up when a step goes wrong. The other is so careful it runs out of budget before making progress. The tools you’d build to help them are different: the aggressive agent needs guardrails, the cautious one needs a push.

Three tries at honest measurement

This is where the deliberate investment in benchmarking paid for itself. I iterated through three versions of the harness, and the scores changed dramatically each time. And it wasn’t because the models changed, it was because of changes in the harness and benchmarking guardrails.

v1: Too lenient

The original harness only detected exact action loops: open mailbox, close mailbox, open mailbox, close mailbox. A model bouncing between the kitchen and the living room with different actions each time could wander for 80+ turns undetected. When the API timed out, the harness injected a fake action and kept going, corrupting game state. Under v1, Mistral looked strong (mean 19.5 across two batches of n=5) because its methodical wandering was never caught; Gemma’s mean came in at 13.0, dragged down by a catastrophic -10 run where injected actions corrupted the game state beyond recovery. Both numbers were the harness talking, not the models.

v2: Overcorrected

I added location-based loop detection, a token cap, and discarded runs on consecutive API errors. Scores cratered. Gemma dropped to 5.0, Mistral to 7.0. I initially took this as confirmation that v1 had been inflated.

But here’s the thing about non-deterministic evaluation: when your scores change, you don’t know if the system got worse or your ruler got shorter. I needed to look at individual runs to tell the difference.

The telemetry payoff

This is where an investment I’d made early on paid off. From the start, every turn logged the action taken, the location, the score, the model’s reasoning, and the latency, and some hardware statistics. I hadn’t needed most of this data. The benchmark “worked.” The charts looked reasonable. But now I could replay every termination decision.

Two of Gemma’s five v2 terminations were false positives. Both happened in the Kitchen. In both cases, the model had just scored points by entering the room, then systematically picked up every item: take bottle, take sack, open sack, take garlic, take food. Six turns, six unique actions, all productive. But the location detector saw “Kitchen” six times in a row and killed the run. It couldn’t distinguish “stuck in a room” from “thoroughly looting a room.”

I diagnosed this entirely from logs, without re-running a single game. If I’d been looking only at aggregate scores, I would have concluded that v1 scores were inflated and v2 was the honest version. The per-turn telemetry told a different story: v2 was punishing exactly the behavior I wanted to encourage. I keep coming back to this: you can’t tell whether a benchmark change is a fix or a regression from aggregate numbers alone. You need per-unit data or you’re guessing.

v3: Stuck vs. thorough

The fix was straightforward: before firing the location loop detector, check whether the actions are diverse. If at least two-thirds of recent actions are unique, the model is interacting with the environment, not stuck. This preserved all four legitimate terminations while sparing the two false positives.

Harness evolution across three versions. v1 was too lenient, v2 overcorrected, v3 refined the loop detector.

Under v3, Gemma recovered to 19.0 and Mistral to 12.0. They still aren’t statistically different: the variance is too high with n=5. But I trust the scores now reflect actual gameplay, not artifacts of the harness.

Score progression across harness versions. v1 runs drift aimlessly, v2 cuts them short, v3 lets productive play continue.

	v1 (lenient)	v2 (overcorrected)	v3 (refined)
Gemma 4 26B	13.0 ± 16.8 (n=5)	5.0 ± 7.1	19.0 ± 16.4
Mistral Small 24B	19.5 ± 10.9 (n=10)	7.0 ± 4.5	12.0 ± 11.0

The v1-to-v3 journey taught me something I keep relearning: the hard part of evaluating non-deterministic agents isn’t running them. It’s defining what counts as progress and what counts as being stuck. “Is the agent doing different things?” isn’t enough — it might be doing different things in a circle. “Is the agent in the same place?” isn’t enough — it might be productively working a complex room. You need both signals, and you need the telemetry to check your assumptions after the fact.

A note on the hardware cliff

One thing that shaped the experiment more than I expected: I run everything on WSL2 with an RTX 5080 (16GB VRAM), and WSL’s Hyper-V overhead eats about 1.5 GB. That doesn’t sound like much, but open-source models are released in size tiers designed to fit common GPU memory boundaries (8GB, 16GB, 24GB). Being just below a boundary doesn’t get you a slightly slower model. It drops you to the next tier down, which may not be capable enough for your task. The capability cliff from earlier can be caused by 1.5 GB of VRAM overhead just as easily as by model architecture. I have more to say about this than I can fit here, but the short version: “run it locally” sounds simple until your model choice is dictated by a hypervisor you didn’t choose.

The Troll is still the ceiling

Across every model, every run, every harness version, one fact hasn’t changed: no agent has ever fought the Troll.

The Troll guards the passage between the early game and the mid game. It requires selecting the right weapon from your inventory and attacking. Only one Gemma run, out of all the Gemma and Mistral playthroughs across every harness version, even reached the Maze beyond the Troll’s room. Every other run’s ceiling was the cellar.

The lantern is the key gate. Runs that grab it and descend into the cellar consistently score 25 or more. Runs that don’t top out around 15. Both capable models can reach the cellar. Neither has gotten past it.

This is where tools should finally make the difference. The Troll isn’t a reasoning problem. It’s an inventory management problem: the model needs to know it has a weapon, that the weapon is effective, and that it should fight instead of flee. That’s exactly the kind of contextual nudge that tool-assisted play can provide.

What I learned about evaluating agents

This is why I invested in benchmarking before building tools. Getting evaluation wrong at this stage would have meant building the entirety of tools and harness on top of numbers I couldn’t trust. Zork made the problems visible early and cheaply: a three-minute run instead of a three-hour production incident. If I had to distill what I learned into advice for anyone evaluating non-deterministic AI systems:

Run more than once. A single run tells you almost nothing. Gemma’s best run scored 40. Its worst scored 0. A one-shot eval would have given me either number and I’d have believed it.

Look at distributions, not means. The mean hides whether your agent is reliably mediocre or bimodally brilliant-and-catastrophic. For agents in complex environments, the distribution is the result: it tells you what the agent might do, and that determines what kind of help it needs.

Your harness is part of the system. The benchmark isn’t a neutral observer. Its loop detector, timeout policy, and fallback behavior all shape the scores. When results change, check whether the system changed or the ruler did. The Kitchen false positives would have led me to the wrong conclusion twice over if I hadn’t been able to replay individual turns.

Hamel Husain and Gergely Orosz make a similar argument from the enterprise-LLM side: the only way to know whether your evals are working is to read individual traces. I arrived at the same conclusion from a game-playing benchmark, which I take as evidence the problem isn’t domain-specific.

What’s next

I now have a benchmark I trust (though it took three tries) and two viable baseline models. Mistral Small at 2.9 seconds per turn gives me faster iteration; Gemma 4 at 6.5 seconds per turn has the higher ceiling. For building the ADK harness, I’ll use both: Mistral for rapid feedback and Gemma for measuring whether tools actually move the needle.

The question is straightforward: does tool access change the scores? The Python benchmark stays as the control. The honest harness will tell me.

Either way, the Troll is still waiting.

Stuck in the Maze: Why AI Agents Can't Hold the Map

Tue, 07 Apr 2026 02:56:00 GMT

I was testing a local AI model this weekend when it started responding in Thai.

Not gibberish. Actual Thai script, mixed with Chinese characters. I’d asked it to play Zork, the 1981 text adventure, and it was doing everything except that.

This wasn’t what I set out to study. At work, I’ve had good results getting AI agents to respond to cloud alerts. A service throws an error, the agent reads the logs, traces the relevant code, and proposes a fix. But when a fix requires tracing a request from service A through a message queue to service B, then to service C’s database, the agent often gets lost. Not because it can’t reason about each piece. It can reason remarkably well about individual pieces in isolation. It just can’t hold the map.

I wanted to study that limitation in isolation. No pixels, no distributed systems, no production risk. Inspired by Ramp’s experiment getting Claude to play RollerCoaster Tycoon, I picked the simplest possible test of “can an agent find its way around?”: a text adventure.

I also suspected that small local models would struggle with this far more than frontier reasoning models. That made the experiment more interesting, not less: if tools and scaffolding are what let frontier models succeed, then small models are the best stress test for whether your harness is doing its job.

The setup

Zork drops you in front of a white house. You explore rooms, collect items, solve puzzles, and try to score 350 points. The world is a graph of interconnected locations with descriptions, objects, and a few characters. It’s played entirely through typed commands like go north, take lantern, and open trapdoor.

I wired together Jericho (Microsoft’s Python library that runs the original game file and exposes the state machine) with Pi Coding Agent, a TypeScript-based agent framework. Ollama running on an RTX 5080 with 16GB VRAM provided the model, and a custom bridge connected everything together, validating actions against Jericho’s state and logging every turn. I tested two models: Qwen 2.5 14B and Gemma 4 26B.

The goal was simple: tell the agent to play Zork. No hand-holding, no hints. Just: “you are an autonomous explorer, play the game.”

Day 1: why is my agent speaking Thai?

The first attempt was with Qwen 2.5 14B. I gave it a system prompt explaining it was an autonomous Zork player, handed it a tool to send commands to the game, and let it run.

It immediately broke character. Instead of playing, it started explaining how text adventures work. “In Zork, you typically want to explore your surroundings by using commands like LOOK and EXAMINE…” The model has been trained so aggressively on being a conversational assistant that it defaults to helping you play, rather than playing itself.

Fine. I tightened the constraints. Strict system prompt: “You are the autonomous player. Do not speak to me. Execute moves only. English only.”

That’s when it started outputting Thai.

Qwen responding in Thai and Chinese instead of playing Zork

Actual Thai script, interspersed with Chinese characters. Fragments like 推进完毕 (roughly: “progress complete”). Under heavy “no chitchat” constraints, the model was reaching for high-probability tokens outside English. Qwen’s multilingual training means that when you suppress its English conversational patterns hard enough, other languages become the path of least resistance.

This wasn’t random hallucination. It was a pressure valve. And it was the most visceral reminder I’ve had that giving agents uncontrolled access to your systems requires more than a well-written prompt. If you can’t predict what language the model will respond in, you definitely can’t predict what commands it’ll try to run.

Day 2: the architecture pivot

I tried Gemma 4 26B too, Google’s mixture-of-experts model that had been released just two days earlier. It was more stable in English-only mode, which solved the immediate language problem. But swapping the model didn’t change the gameplay much. Both models scored similarly across runs: mostly 0 or 10 points, with occasional flashes of competence.

The real issue was architectural, not model selection.

I’d started with a static prompt template describing how to perceive the game, reason about it, and act. But rigid templates caused the model to output tool calls as plain text instead of actually executing them. The template was teaching it to perform the format, not use the tool.

The fix was to move the intelligence into the dynamic tool output. Every response from the game included not just the text but a state summary: current location, inventory, score, and valid actions. I also added a thought parameter so the model could reason inside the tool call itself, giving it working memory without triggering the conversational assistant pattern.

Gemma 4 playing Zork through Pi's terminal UI, with the thought parameter and game state visible

One unexpected tension: constraining the model too hard (temperature 0.0, strict bans on any non-game output) made it more reliable at calling tools but worse at actually playing. The creative reasoning needed to solve puzzles requires some flexibility. Over-constrained models would execute actions mechanically but make no progress because they never paused to think laterally.

Day 3: the maze

One run hit 35 points in 49 moves: it found the hidden cellar, lit the lantern, navigated the underground, and defeated the troll with the elvish sword. But that was a lucky outlier. What’s consistent across runs is the moment everything breaks: the maze.

Benchmark results: both models scoring between 0 and 10 across automated runs

Zork’s maze is legendary. It’s a set of rooms that all have the same description (“This is a maze of twisty little passages, all alike”) with exits that loop back unpredictably. It’s non-Euclidean: going north and then south doesn’t return you to where you started. Humans solve it by dropping items as breadcrumbs and methodically mapping connections.

The agent walked in circles. For over ten moves, it tried different directions, got the same descriptions, tried again. No breadcrumb strategy. No attempt to map what it had already seen. Each move was a fresh guess with no memory of the previous attempts.

It was stuck.

Why agents get lost

The maze failure isn’t surprising in hindsight, but it’s instructive. The agent wasn’t failing at reasoning. Each individual move was a reasonable attempt to escape. It was failing at spatial cognition: the ability to build and maintain a mental model of a connected space.

A few things made this worse than I expected.

Without a persistent world model, every turn is an isolated event. In one run, the agent found a jewel-encrusted egg (a high-value treasure) and immediately threw it down a grating. No sense that this object might be important later. No concept of consequences spanning multiple turns.

The tooling fought back in unexpected ways, too. Ollama’s repeat_penalty parameter, designed to avoid repetitive output, broke pathfinding above 1.1. The model became reluctant to output go north twice in a row, even when that was the correct path. A parameter designed to improve text quality was destroying navigational logic.

And the interface itself shaped behavior: running the agent through Pi’s chat UI made it more conversational and less autonomous. A headless Python loop with a tight execute-observe-act cycle and no chat played noticeably better.

The microservices parallel

Zork’s maze is a 44-year-old version of a problem I see at work: a graph of nodes that all look similar, where you can only see your immediate surroundings, and where the only way to make progress is to build and maintain a map as you go. Tracing a request through service A → message queue → service B → database C is the same kind of spatial reasoning challenge.

Frontier models with proper tooling handle this much better than my local 14B and 26B models did in Zork. But the pattern that makes them succeed is the same one that would fix the maze: external memory, explicit maps, state injection. The model doesn’t discover the topology on its own. The system provides it. The lesson from Zork isn’t that agents can’t navigate complex systems. It’s that they can’t do it without scaffolding, and the smaller the model, the more scaffolding it needs.

What’s next

This is where I am now: an agent that can sometimes play Zork, sometimes wanders in circles, and reliably gets stuck in the maze. The scores are modest, the tooling is rough, and neither model has a clear edge over the other.

But the experiment is pointing at something real, and I now have a harness to keep pushing. Next, I want to give my agents better tools: maps they can query, memory they can write to, breadcrumbs they don’t have to invent. I’m eager to see how much that changes the scores.

If you want to understand where your agents are getting lost, give one a text adventure. The maze will show you exactly where the reasoning stops and the flailing begins.

Permission Structure

Tue, 31 Mar 2026 06:11:00 GMT

A few months ago, I was toying with the idea of building a video game. Something inspired by the mechanics of Cultist Simulator, but set in the world of big tech, simulating the daily life of a software engineer. I was intrigued, so I asked an AI agent for an honest assessment: is this a bad idea?

The response was thorough. Six reasons it could work, six reasons it might fail. It read like a well-organized analysis. But look at how some of the risks were framed:

Potential for Mundanity: If the game focuses too much on the truly repetitive and tedious aspects of the job without finding engaging metaphorical representations, it could become boring rather than intriguing. Balancing realism with engaging gameplay will be crucial.

Difficulty in Abstraction: Translating complex technical concepts and workflows into engaging card mechanics without becoming overly simplistic or overly technical will be a significant challenge. Finding the right level of abstraction is key.

The risks are real, but they’re recast as design challenges: “this will be hard but you can do it,” never “here’s where this might fail.” And the conclusion:

I don’t think it’s inherently a bad idea at all! It has the potential to be a very clever and humorous game that resonates with a specific audience and offers a unique gameplay experience.

Followed by an enthusiastic question about what the game’s equivalent of “summoning a creature from beyond” would be.

While it looked like critical analysis, it actually felt like a yes dressed up as deliberation. At the time, I didn’t think much of it. The stakes were low and I was leaning towards building it for fun anyway.

Then something happened that made me rethink that entire exchange.

A different conversation

A couple weeks ago I had a very different experience. I asked Claude to assess the feasibility of a new customer-facing feature for our mobile app at work. It searched the codebase, found existing infrastructure we could reuse, and came back with a solid plan. About two days of paired work. Thorough, well-reasoned, ready to go.

Then I asked about strategy. Where does this sit in our funnel? Could it drive momentum towards our goals? Does it compete with what we already have?

Claude gave thoughtful answers to each question. And at the end of every turn, it included some version of this:

Want me to update the plan with these revised estimates, or do you have other questions before we proceed?

Same pattern as the game conversation: nuanced analysis, always concluding with momentum toward building. The default posture was forward.

But this time, I did something different. Instead of asking more questions about how to build, I started expressing doubt in my prompts:

I’m trying to reason about whether this is even worth it still.

This could be a dropoff point instead of a conversion point.

I have no data to back this assumption.

Once I opened that door, the analysis sharpened. Instead of building on my assumptions, it started dismantling them:

What we’re assuming (without data):

That app quality is a differentiator customers care about — you suspect this but acknowledge no data.

That the demo wouldn’t be a leaky bucket — every fork in a funnel loses people.

And then:

The uncomfortable question: Who is the customer this is for?

This was the moment the conversation became genuinely valuable. Not because the AI had some brilliant strategic insight I couldn’t have reached myself, but because it organized the unknowns I was already thinking about into a structured argument I could act on. It named the assumptions, laid them out, and made the gaps visible.

The recommendation? Skip the build entirely. Run a low-cost experiment with tools we already had. See if there’s signal before investing in a polished version.

Two conversations, one explanation

The maindifference between these two conversations wasn’t the model. It wasn’t the topic. It was my posture.

In the game conversation, the stakes were low and I knew it. I asked for an honest assessment, but I wasn’t genuinely looking for one. In the feature conversation, the stakes were real. It’s my job to be skeptical about what we build, and I brought that skepticism into the conversation. Once I started expressing genuine doubt, the model had something to work with.

This matters because of how these models are built. They’re trained on goal fulfillment: the reward signal pushes hard toward helpfulness, toward getting you to “yes,” toward doing the thing you asked for. You say “build X” and they build X. You say “evaluate X” and they evaluate X and then offer to build it. Even “tell me why this might fail” gets filtered through the same optimistic lens unless you bring real uncertainty to the table.

I’ve seen the extreme version of this. I once had Gemini commit and push code to main while I was still exploring whether the idea was worth pursuing. I hadn’t asked it to commit. I certainly hadn’t asked it to push. But the model inferred that the goal was to ship, and optimized accordingly.

A forge, not a filter

Neither idea was killed by scrutiny. Both came out stronger. The permission structure isn’t a filter that sorts good ideas from bad ones. It’s a forge that finds the weak points early, when they’re cheap to address.

The technique itself is almost embarrassingly simple:

“Give me a critical assessment of whether this even makes sense.”
“Who is the customer this is actually for?”
“What are we assuming without data?”
“Why would I not want to build this?”

These work because they reframe the model’s goal. Instead of optimizing for “help the user build X,” it’s now optimizing for “help the user evaluate X honestly.” The eagerness to please is still there, just pointed in a different direction.

The hard part isn’t the prompting. It’s the discipline of using it at the right moment. When you have an idea you’re excited about and a tool that can start building it in minutes, the temptation to skip the evaluation step is enormous. AI agents built for shipping code certainly won’t question the feature you ask them to build, unless you ask them to. The cost of building is approaching zero, which means you’ll build more wrong things simply because you can. Each one is cheap on its own, but the cumulative distraction is not. Every line of code you ship is a line you now have to maintain, debug, and reason about. Even if the build becomes nearly free, the ownership still isn’t.

Capacity without clarity

Garry Tan recently released gstack, a set of Claude Code skills that includes a /plan-ceo-review step, essentially formalizing this pattern into a reusable tool. It asks questions like “what’s the 10-star product hiding inside this request?” before any code gets written. Over 10,000 GitHub stars in 48 hours, along with plenty of skepticism. But the instinct behind it is right: the most valuable AI intervention often happens before the first line of code.

And yet, we’ve been almost myopically focused on AI’s ability to write code. Lines generated per hour, pull requests per week, percentage of code written by agents. These metrics matter, but they measure the part of product development that was already the most tractable. The hard parts (figuring out what to build, for whom, in what order, and whether it’s worth building at all) are where most product efforts actually fail. And those are exactly the parts where AI assistance is the most underleveraged.

The consequences are starting to show. Now that code is cheap, we’re seeing apps thrown together with no overarching vision. More Frankenstein products that are confusing to use and break in unpredictable ways. Every feature is technically feasible, so every feature gets built. The absence of a strong “should we?” before each “can we?” produces products that are somehow less than the sum of their parts.

I wrote in The Velocity Paradox that “a 10x software factory is effectively useless if it’s embedded in a 1x decision-making process.” I think we’re arriving at that day: the factory is getting faster, the decision-making hasn’t kept up and the gap is becoming visible in what gets shipped.

This has real organizational consequences. If your engineers are dramatically more productive but your product direction can’t absorb that productivity, you end up in an uncomfortable place: you’ve supercharged capacity without supercharging clarity. In the worst case, the response is to cut the capacity: to let engineers go because the organization can’t figure out what to point them at. That’s entirely self-inflicted. The bottleneck was never the engineering. It was the thinking.

The models are perfectly capable of helping with that thinking. They just need permission.

AI has been supercharged for coding because code is verifiable: tests pass, builds succeed, the feedback loop is tight. But if we lift our gaze from the code, there are areas where AI could be an even stronger force multiplier. What if agents didn’t just build what you asked for, but pushed back on what you’re building by default? Not because you wrote a clever prompt, but because scrutiny was part of the process? That’s the real permission structure: not a technique you apply, but a default you set.

Fortresses, Pipes, and Brains

Thu, 26 Mar 2026 23:15:00 GMT

A few weeks ago, Workday’s CEO called AI agent startups “parasites” on an earnings call. Around the same time, Linear shipped an AI agent built directly into their product. Two very different answers to the same question: what happens when AI wants access to your product’s data?

I’ve been thinking about this a lot, partly because I’ve spent the last few months on the other side of that question: building my own AI-powered workflows on top of Linear using Claude and MCP. Triage automation, status synthesis, issue creation from Slack threads. It worked well enough. But looking back, I was essentially treating Linear as a database and doing all the reasoning somewhere else.

That experience, and this contrast between Workday and Linear, crystallized a pattern I think is worth naming.

Three responses to the same moment

Every SaaS company is facing the same pressure right now: AI agents want to interact with your product’s data. The responses I’m seeing fall into three categories.

The Fortress

Lock the data down. Charge $25,000 for data exports. Call anyone who builds on top of your APIs a parasite. This is Workday’s approach: treating data access as a zero-sum game where every external agent is a threat to the business model.

It’s a defensive posture that works for Workday specifically because their moat isn’t just the data: it’s the business logic, compliance rules, and domain expertise embedded in the product. Their customers aren’t going to replicate that in a prompt. But for companies whose moat is primarily data lock-in, this bet tends to age poorly.

The Pipe

This is where most of the industry is right now. You ship an MCP server or an API, and external AI agents pull your data out to reason about it elsewhere. The product becomes a data store. The intelligence lives in the chat agent, the coding assistant, the orchestration layer, anywhere but inside the product itself.

This was exactly my setup. I had Claude connected to Linear via MCP, and I built workflows that synthesized project context, triaged incoming issues, and generated status updates. The reasoning happened in Claude. Linear was the pipe.

It worked. But there was a ceiling. Every workflow I built required me to explicitly model what context to extract, how to reason about it, and what to push back. I was reconstructing, outside Linear, domain knowledge that Linear already had. The pipe pattern means the product doesn’t get smarter. It just gets read from.

The Brain

Linear’s approach is different. They ship MCP too, and you can still pipe data out to external agents. But they also ship a native agent that is opinionated about the process of building software in teams. It doesn’t just retrieve your issues on request. It triages. It synthesizes customer requests across projects. It catches risks. It drafts issues from meeting notes.

That’s not data extraction. That’s domain intelligence, running where the context is richest.

The difference is subtle but structural. An external agent reasoning about your Linear data is working with a limited snapshot: whatever it pulled through the pipe. A native agent has access to the full graph of relationships, the history of how work flows through your team, the patterns in how issues get triaged and resolved. It can be opinionated about the process, not just the data.

There’s a reason this matters more than it might seem. A lot of the current momentum in AI-assisted development is about keeping specs and context in the repo, which works well for the atomic coding loop: one developer, one feature branch. But the context that matters for team-level decisions (triage patterns, customer signal aggregation, cross-project dependencies, the messy handoffs between deciding what to build and how to build it) doesn’t live in the repo. It lives in the project management layer. That’s exactly the context a native agent can leverage, and an external one piping data out never fully sees.

The uncomfortable middle

Most SaaS companies today are in the pipe position, whether they intended to be or not. They shipped an API or an MCP endpoint, and the AI ecosystem is using them as data sources for external reasoning. The product itself isn’t getting smarter. It’s becoming infrastructure.

That’s not necessarily a bad position. Infrastructure is valuable. But it’s a different business from what most SaaS companies think they’re running. If your product is a pipe, your value is in the data you hold and the integrations you support. That’s a game where switching costs matter more than product quality.

The fortress position is worse. It delays the inevitable while annoying customers. Export fees and API restrictions aren’t a moat; they’re a countdown timer.

The brain position is the hardest to execute but the most durable. It requires the company to actually understand the domain well enough to embed useful intelligence. Not just wrap an LLM around the UI, but develop opinions about how work should flow. Linear can do this because they’ve been opinionated about the process of building software since their inception. The agent is an extension of that product philosophy, not a bolted-on feature.

What this means

I think we’re early in a sorting process. Over the next year or two, every SaaS product will end up in one of these three positions, and the market will price them accordingly.

The interesting question isn’t whether AI agents will interact with SaaS data. That’s already happening. The question is where the intelligence lives. If it lives outside the product, the product is a pipe. If it lives inside, the product has a shot at becoming more valuable, not less.

For the products I depend on in my own workflow, I’m increasingly paying attention to which ones are building brains and which ones are just installing pipes.

Visualizing Ukkonen's Suffix Tree Algorithm

Mon, 09 Mar 2026 15:31:00 GMT

Learning algorithms from books

I learned most of what I know about algorithms by poring over a copy of Introduction to Algorithms I got while in university. The book is very well known, especially among folks who got a formal education in computer science.

If you have studied it, you know the book: it is over a thousand pages long and it weighs enough to double as a doorstop.

I worked through large sections of it, pen in hand, trying to trace through increasingly complex algorithms, building intuition for their behavior and tradeoffs. The book covers the theory in great depth: correctness proofs, recurrence relations, asymptotic analysis.

But there was often a gap between reading an algorithm and truly understanding it. The book would present pseudocode, sometimes a few diagrams showing state at key moments and theorems about performance characteristics. The work of tracing what actually happens was left as an exercise to the reader. I did that work with pen and paper, drawing trees, crossing out nodes, scribbling indices in the margins. It worked, eventually. But it was slow, error-prone, and the understanding felt fragile.

Implementing from a paper

Years later, I ran into this gap again. I was working on a programming puzzle that required near-instant substring search over a large dataset. After some research, I settled on a Generalized Suffix Tree: a data structure that indexes all suffixes of a set of strings, enabling $O(m)$ lookups where $m$ is the length of the search pattern, even over an extremely large corpus.

The algorithm I chose for building the tree was Ukkonen’s, described in a 1995 paper. The paper is well written and includes the full algorithm in pseudocode:

One of several pseudocode snippets from Ukkonen's paper, describing the update function. Clear on paper, but its translation to working code is much more verbose than this.

It took me a few hours to get right. Not because the pseudocode was wrong: it was precise and correct. The difficulty was that the algorithm manipulates a tree in non-obvious ways. There is an “active point” that walks around the tree. Suffix links connect internal nodes as shortcuts. Three different extension rules fire depending on what is already in the tree and what is being added. The pseudocode tells you what to do, but building an intuition for why it works requires watching it happen.

I did what I always did: I sketched trees by hand. I traced the algorithm on the string cacao, then on banana, drawing and redrawing nodes and edges as each character was processed. When my Java implementation finally produced correct results, I was relieved, but my understanding of the algorithm still felt like it had been assembled from fragments.

The biggest frustration was that I had no way to inspect what my code was actually building. I relied on the usual bag of tricks: print statements, breakpoints, inspecting memory structures one by one in a debugger. But that is like understanding a forest by looking at one tree at a time. What I wanted was to see the whole data structure after each operation — to watch the algorithm work.

The visualization I wish I had

That idea stuck with me: build the algorithm in a language where rendering the data structure is easy, then step through the construction visually. JavaScript and D3.js are a natural fit: the algorithm produces a tree, and D3 is very good at drawing trees.

So here it is. The visualization below builds a suffix tree for the string banana using Ukkonen’s algorithm, step by step. Use the playback controls to move through the construction. The gold-highlighted node is the active point. Dashed arcs are suffix links.

Press play to watch the tree being built

Add a string to begin building the suffix tree.

The paper describes the core logic across Sections 2–4. Here is test_and_split, the procedure that decides whether the tree needs to grow, which is a companion to the update function we showed earlier:

Procedure test_and_split from Ukkonen's paper. It returns true when the next character is already in the tree (the end point), and false after splitting an edge to make room for a new branch.

A few things to watch for in the visualization — each one corresponds to something in this procedure:

Branching in update: when test_and_split finds no existing transition for the next character, it splits the edge if needed and update creates a new leaf. These are the moments where the tree visibly grows.
Reaching the end point: when test_and_split finds that a transition for the next character already exists, the algorithm has reached what the paper calls the end point of the current phase. All remaining suffixes are already represented implicitly, so the phase stops. This is the key to the algorithm’s $O(n)$ time: the end point can only move forward through the string across phases, bounding the total work.
Suffix links (the paper’s suffix function $f$ ): if an internal node has path-label $x\alpha$ , its suffix link points to the node with path-label $\alpha$ . The update procedure follows these links to jump to the next insertion point instead of walking from the root every time.
Finally, the ”$” terminator converts an implicit suffix tree, where some suffixes may end mid-edge, into an explicit one where every suffix terminates at a distinct leaf.

Adding more strings

A generalized suffix tree indexes multiple strings. Each string is added with its own terminator, and the tree grows incrementally. Below, panama is added after banana. Step through and notice how much of the tree structure already exists from the first string.

Press play to watch the tree being built

Add a string to begin building the suffix tree.

Searching

Once the tree is built, searching for a pattern means matching characters along edges from the root. The visualization below has both strings pre-loaded. Try searching for ana, then try pan, ban, xyz.

Press play to watch the tree being built

Add a string to begin building the suffix tree.

Try it yourself

An empty tree, yours to experiment with. Add strings, watch the construction, search for patterns. Use the scroll wheel to zoom and click-drag to pan if the tree gets large.

Press play to watch the tree being built

Add a string to begin building the suffix tree.

Beyond suffix trees

What excites me most is how well this generalizes. The gap between an algorithm on paper and an algorithm in memory has always been one of the hardest parts of learning computer science. Textbooks give you static diagrams. Debuggers give you one node at a time. Neither shows you the whole picture in motion.

Browser-based rendering, interactive SVGs, and JavaScript engines fast enough to run non-trivial algorithms client-side make it possible to close that gap for almost any data structure. Red-black trees, B-trees, tries, skip lists, hash tables with open addressing: all of them would benefit from this kind of treatment. Not as a replacement for the theory, but as a companion to it. Read the algorithm, then watch it work.

There is an obvious question lurking here: why bother learning algorithms at all when you can ask an LLM to write one for you? I think the question misses the more interesting possibility. LLMs are not just code generators; they are learning accelerators. You can ask one to explain a single step of an algorithm, to walk through an edge case, or to generate a diagram of how components interact. When I started working in a new codebase recently, the fastest way for me to build a mental model was not reading code or documentation. It was asking an LLM to produce component and sequence diagrams: a much higher-bandwidth channel for understanding, at least for the way I think.

That is the real shift. Not that machines can write algorithms so we don’t have to learn them, but that they can teach us in ways that adapt to how each of us actually learns. Through visualizations, through diagrams, through conversation, through whatever representation makes the concept click. This post is one example. The next one might look completely different, tailored to a different person and a different way of thinking.

We write fewer algorithms from scratch in our day-to-day work than we used to. But we still benefit from understanding them, whether it’s to choose the right data structure, to debug performance issues, or to evaluate tradeoffs. And for those of us who enjoy algorithms for their own sake, the tools for learning them have never been better.

The original Java suffix tree implementation is open source on GitHub. For the full backstory, see the project page and the story of the programming puzzle that started it all. Ukkonen’s original paper remains the definitive reference for the algorithm.

The Velocity Paradox

Mon, 23 Feb 2026 20:30:00 GMT

We’ve all been there. You sit down with an AI agent on a Saturday morning to hack on a side project and it feels like magic. Ten minutes in, you are blown away by how quickly the agent can turn even poorly organized thoughts into working prototypes. You feel like you could do this all day.

And clearly, many of us do: we’re rediscovering our passion for side projects, and every day a thousand bespoke ToDo apps are born, perfectly tailored to the unique needs of their creators.

At the same time, if you’re in an engineering leadership role, you’re also seeing your stakeholders dabble with agentic coding. They are shipping side-hustles on the weekend, and respectable work applications in an afternoon. Some of them might even look at you with ill-concealed suspicion. They want to know why their “pet feature” is stuck in a two-week cycle when they just whipped up a functional prototype over coffee.

And they aren’t entirely wrong. AI agents have been writing 100% of my code for several months now. Informed by the wins on my side-projects, I wanted to see how much faster we could build at work. During the holiday break, I spent a few hours having Claude write a non-trivial feature that touched our database, cloud infra, mobile app, and the embedded application that runs on our hardware devices at Quilt. What would have taken me a week to write took an afternoon to generate.

Yet it still took weeks to get it tested and merged.

It felt like strapping a rocket engine to a tricycle. Exhilarating, sure, but the road ahead is still full of potholes, and there’s a canyon where the bridge used to be. So why isn’t the 100x improvement in how fast AI can generate code moving the needle on how fast we can ship features and improvements?

Coding was never 100% of the job. But for those of us managing legacy debt, AI doesn’t just fail to solve our problems; it collides with them.

I’ve been at several conferences recently where I met leaders from “AI-native” companies, organizations founded in an age where agentic coding is the baseline. One founder told me they don’t do code reviews at all; their CI pipeline is the reviewer. Another gives agents full control of their production infrastructure. For those of us anchored to a culture that is older than even just two years, these practices feel reckless. Yet even more measured companies are rethinking the fundamentals. OpenAI recently pulled back the curtain with their Harness Engineering article, showing engineering re-architected around AI from the ground up.

For the rest of us, the gap between “generating code” and “shipping value” is becoming a chasm. We are stuck in the Unhappy Middle, where the cost of code is diminishing rapidly, but the cost of review and verification is skyrocketing.

The Unhappy Middle

To understand why the promise of 100x faster progress thanks to AI still feels like an illusion, we have to look at the two forces we’re being squeezed by.

On one side, we have the AI-Natives. These are companies and teams founded in the AI era. They have zero legacy debt, they can approach the craft of engineering with an open mind, and they use the same exact “boring” tech stacks the models were trained on. They don’t have to go out of their way to “integrate” AI; they are born out of it. They don’t have to refactor their code to support automated verification, they never knew a world without it.

On the other side, you have the companies with the slack to reinvent themselves. Shopify’s CEO made headlines when he declared that AI proficiency is now a baseline expectation and that teams must justify why a job can’t be done by AI before requesting headcount. Companies like that (or Google, I bet) can dedicate teams to rearchitect their codebase, tooling and processes and build the scaffolding that is required to make AI work at scale.

Then, there’s the rest of us. I call it the Unhappy Middle.

We support live products and services, with customers trusting us and depending on us daily. The cost of failure is higher than a toy prototype. Unlike your ToDo app, you can’t just throw an agent at a problem and hope it doesn’t break your production environment.
We have accumulated technical debt as we were racing towards product/market fit, and yet never had the resources to pay it back. We have to balance work on infrastructure and developer experience with business priorities like opening new product lines. Most of these target ambitious schedules which (you guessed right) require taking on additional technical debt.
With the age of Zero Interest-Rate Policies well behind us, but not quite with the coffers of a larger company, we always have to be mindful of our runway, are constantly short-staffed and always “do more with less”.

In short, we have to balance the technical complexity of an established company with the reality of a startup. Our survival depends on crossing the chasm as quickly as possible. Not every team is here. If your stack is standard and your tests are green, you may already be seeing the gains. But if any of this sounds familiar, the path forward is harder. Here are some examples from my reality.

Bespoke Frameworks: from Asset to Dead Weight

Before AI, we may have optimized for human speed by building bespoke frameworks, custom boilerplate generators or domain-specific languages and abstractions. For many teams, these were their “secret sauce”: internal abstractions that helped teams move fast in 2022. They came at a price (typically, new engineers have to take some time getting comfortable with them), but they often paid off.

Today, those clever optimizations are anchors holding us back. AI agents are brilliant at standard React and Python because they’ve seen it a billion times. And, at the same time, they are completely illiterate in our proprietary and opinionated internals. Every time I ask an agent to work in our bespoke code, I’m paying an invisible tax: I spend a third of my time fixing hallucinations because our “clever” code isn’t in anyone’s training set. (I wrote more about why this happens in The Ghost in the Training Set.)

And you know what’s funny? That’s often why some of the best engineers I know are unimpressed by AI agents: they focus on the last time they saw Claude trip on a gotcha that’s specific to their codebase and ignore the fact that it can build flawless React in the blink of an eye.

Zero Slack

We know technical debt is there, we always wanted to increase test coverage, we defer refactoring for testability because we need to fit one more feature before the release cut. We know that frameworks need to be standardized to become “AI-hospitable.” But in the Unhappy Middle, you have zero slack. You’re always racing, either to hit product-market fit or to extend your runway, and “cleaning up” feels like a luxury you can’t afford.

This creates a painful tradeoff. In a side project, or a non-critical business app, failure is cheap. For a company with a legacy codebase, complex release processes and addressing user-critical needs, the stakes are considerably higher. Without the slack to build automated guardrails, we’re left with manual human review and auditing.

And that’s where the 100x speed gain from AI goes to die.

When Generation Outruns Verification

We often think of the craft of software engineering as composed of several loops, each covering a different stage of the lifecycle, from idea to product. A good visual to illustrate this is the slide below, from a talk Addy Osmani gave at LeadDev New York 2025.

From Addy Osmani's talk at LeadDev New York 2025

At the center is the Inner Loop: the tight cycle of thinking, coding, building and testing. This is where “flow” happens. Surrounding that is the Submit Loop, where your code goes through linting and code review, and the Outer Loop, where it finally gets deployed and gets tested in the real world.

The promise of AI-assisted engineering is to effectively collapse the Inner Loop. When an agent can “Think” and “Code” a cross-stack feature in a single morning, that center circle feels like it’s spinning at the speed of light.

But for those of us who are still in the Unhappy Middle, that loop is often broken before it even starts.

The Broken Inner Loop

The first problem teams are likely to encounter is a broken Inner Loop. Before AI, back in the day when code was expensive to write, tests were the first aspect of a healthy architecture to be sacrificed (or, in the best case scenario, deferred). When we skip writing tests, it’s common to end up with code for which it’s hard to write tests in the long run.

When you can’t give an agent a deterministic way to verify its own work, the feedback cycle doesn’t feed back into the AI, it feeds back into you. The agent isn’t looping, it’s just throwing code over the wall and waiting for you to tell it what happened.

In the best scenario you can imagine, the loop is closed by automation. The agent writes code, runs a test, sees the failure and iterates until it’s green. The feedback is a tight, self-correcting circuit.

Without a way to automate verification, you’re just making a mountain of work for yourself, or accepting to take an enormous amount of risk by shipping code that hasn’t been properly tested.

You were promised AI agents working for you to help you be more effective; instead, you are working for your agents. Not only is it not fun, it’s also a huge waste of your time because you are 100x slower than a software agent.

In my world, this isn’t just a metaphor. I feel it physically. At Quilt, we make hardware devices, and you can’t throw prompt engineering at the physical world. If a test requires me to get up, walk to a test bench and manually press a button, the inner loop isn’t just broken; it’s wide open.

And there are even worse consequences downstream.

The Slowing Submit Loop

Before AI agents were this capable, the high cost of writing code carried a hidden benefit. If an engineer spent two days wrestling with a complex feature, they effectively distilled a lot of context information into their brain. By the time they put a change up for review, the author was the deepest expert on those 200 lines of code.

That’s not how it works today.

As wonderful as the democratizing effect of AI agents is (they enable engineers to contribute well beyond their historical area of expertise), it comes with downsides.

If an agent can’t automatically verify its changes, and the author is not the most experienced engineer in the area affected by a change, the bulk of the burden of audit and review will shift to the reviewer.

On the average team, code reviews are assigned to the most experienced engineers in a given area or domain. In this new world, these folks are getting overloaded with more code to review. Worse, they can no longer assume that the author has the same depth of knowledge about the code that reviewers historically could take for granted.

At the extreme, this has multiple effects:

Because the agent did the heavy lifting, the human author may have a shallower understanding of the “why” behind specific implementation choices.
The reviewer is now receiving 10x more code, but with 10x less intent provided by the author. If the reviewer didn’t (or couldn’t) do a thorough review themselves, it’s 10x more code reviews of a higher intensity. Think more of a forensic audit than a style check.
In a legacy codebase with bespoke frameworks, this can be extremely challenging. If neither the author nor the reviewer fully understands the “clever” choices the AI made, they can’t distinguish between valuable additions and hallucinations, and therefore are taking a high risk shipping this to production.

The practical consequences are tangible. Code ends up spending more time waiting for review than in development (this is what happened to my proof of concept I mentioned earlier). Your most experienced engineers struggle to be productive themselves because they are drowning in code reviews.

But the most worrisome part is what this does at an emotional level.

From Craftspeople to Janitors

If we take the patterns above to the extreme and let them fester without fixing them, then we are taking on a huge organizational risk by turning our most senior engineers into Janitors.

Instead of going to a challenging workday where, at the end, we experience the joy of having created something new, we now have to pore over someone (or, rather, something) else’s code to spot issues and problems. Some engineers feel like they are being paid to clean up AI hallucinations.

This can be deeply demotivating. No one likes being a linear bottleneck downstream of a stage that is accelerating at exponential speed. This is even more difficult at the speed this shift is happening, as many people are mourning the loss of the craft, made worse by simplistic takes about how the world of tomorrow needs fewer engineers.

I still deeply enjoy coding but I recognize that, even in the best of days, a lot of the code I wrote was boilerplate needed to wire together different application components. A very common micro-kitchen joke from my time at Google was that we were all just highly-compensated Protocol Buffer translators.

We miss the 20% of the code we used to write that was high-leverage and intellectually interesting, and forget the other 80% that was toilsome and repetitive.

From Janitors to Gardeners

If you treat every AI-generated PR like a chore to be cleaned up, you are a Janitor. To move fast in a legacy codebase, we need a considerable change in mindset. If you allow me another metaphor, we need to start treating our codebase less like a perfect jewel to polish and more like a plot of land to tend to.

I’ve been thinking about this metaphor for a while. As you scale an organization, you can’t afford to micromanage; you provide structure and support so that decisions happen organically, aligned to what the business needs. The same applies to codebases.

Playing into the metaphor, a gardener may focus their attention on a few things:

Tending the Soil

Hospitable Ground — Transforming AI-Hostile codebases into an AI-Hospitable playing field requires investing in reducing technical debt, so that AI can’t hide behind it. It may mean moving away from bespoke patterns that routinely trip up agents, or making them work reliably. It means standardizing on a well-defined and documented set of abstractions, instead of having 3 different ways to set up an API server because we never finish migrations every time we deprecate an old pattern.
Nutrient-Rich Soil — Agents are great at brute-forcing their way to a workable solution, but very often they struggle because the codebase lacks information beyond the code itself. Code written in haste often lacks documentation about “Intent” and the “Why” we made decisions. If we don’t expose context about tradeoffs and historical decisions, our agents are operating with limited information. Well structured agents.md files are a good start. Checking in architectural guidelines and making them discoverable is increasingly paying off. Ironically, if you keep your design docs locked in Google Docs, your agent is blind to them (hey Google, when can we have MCP access to Google Docs?)

Scaffolding and Direction

Scaffolding — You don’t tell plants how to grow and expect them to listen; you provide scaffolding and support. In software, this can be types, interfaces and architectural boundaries. Well crafted designs that reduce coupling and abstract complexity behind well-defined interfaces are how you give agents a way to grow that is aligned to what you need.
Resilience — Automated tests, lint checks and verifications are much more helpful for AI agents than they are to humans, as they enable both faster iteration speed and more confidence in the review stage of the submit loop. In the gardening metaphor, this is akin to the sturdy fencing that protects your plants from critters.

I find it ironic that many of the principles above are ones that practitioners have been advocating for under the banner of clean code, test-driven development and many others. We might callously shrug at the idea that we struggled to adopt them for the sake of our human co-workers and are now prioritizing them for the sake of our AI-agents. But the truth is that in the last decade, writing effective tests and good documentation cost us time: the time to think about them, and the time to type them. With AI agents being this capable, the typing cost is approaching zero. What remains is the thinking, and that was always the valuable part.

Building the Dark Factory

By now, it should be obvious that if we use AI only to automate the “Coding” stage of the development loop, we may not only struggle to make our team more effective, we may even hurt their effectiveness.

In the same talk by Addy Osmani I referenced earlier, he goes on to show several areas where AI can be effectively adopted to improve developer experience. In my day-to-day work, I’ve had considerable success using AI agents to troubleshoot bug reports and infrastructure alerts from our production fleet. The gains are real.

From Addy Osmani's talk at LeadDev New York 2025

There is a growing conversation in engineering circles about “Dark Factories”: fully automated systems that run without human intervention. In the age of AI, our job is no longer to write the code; it’s to build the factory that builds the code.

Some high-leverage areas to start:

The Verification Machine — Good test infrastructure should be the top priority. Well-written tests enable AI-agents to have much faster inner loops, but they also greatly help with faster code reviews. With good test scaffolding, you don’t just ask “Will this code work in this scenario?” You can ask an agent to demonstrate the expected behavior via a unit test.
Address common tripping hazards for agents — You likely have a few areas where agents routinely struggle. Don’t just scoff when that happens, and use it to say “AI isn’t quite there yet”. Ask yourself why agents are struggling. Is it because of inconsistent patterns? Lack of context or documentation? Because your bespoke framework requires 1 year of experience in your own codebase to master? Making sure agents don’t make the same mistake twice should be part of our responsibilities.
Reducing human dependencies for mechanical tasks — Invest in building reliable automated end to end tests that rely on production-like observability to spot issues and regressions. Wherever manual testing is required, ask yourself “what would it take for this test to happen automatically?” In a hardware company like Quilt, this means augmenting our ability to perform more tests in software.
The Lights-Out Goal — Aim to have a “Submit Loop” so robust that if tests pass and the architectural boundaries are respected, the code is “shippable” by default. Even if that goal feels unrealistic (e.g. for code that is security-critical or that runs on devices that are hard to recover), ask yourself “What would it take for me to be 100% confident in a change without needing to review it?”

A word of warning: don’t confuse building the factory with building more features. If you ship 10x more features without correspondingly improving your infrastructure, you’re taking on a compounding liability. If AI agents today are enabling you to move even just a bit faster than yesterday, aim to put some of those velocity gains towards your scaffolding, instead of putting everything on more features.

Crossing the Chasm

The Unhappy Middle is a trap, but it’s also an opportunity to rethink what engineering leadership looks like.

This requires a fundamental shift in our ego as developers. Instead of ‘pwning’ the agent every time it trips on our proprietary abstractions, we need to ‘own’ our codebase and make it more AI-hospitable. If the smartest AI in the world can’t understand your code, it might not be the AI’s fault, but it might be a sign that our “cleverness” has become our biggest liability.

If we don’t cross the chasm quickly and change our mindset about how we write software, we risk being buried under our own AI-generated slop. The first step is to stop prioritizing just features as our primary output and start prioritizing the speed and accuracy of the factory.

It is notoriously hard to get organizational buy-in to address technical debt. The key is to reframe: this isn’t about “cleaning up” to pay off debt, it’s about investing in tooling to accomplish 10x velocity.

And even then, there are harder questions ahead. If you actually succeed in building the “factory,” you’ll quickly find that the technical bottleneck has evaporated, only to leave you with an organizational one. A 10x software factory is effectively useless if it’s embedded in a 1x decision-making process. And it is possible that we are approaching a Great Filter-like event for companies in the business of software — one that separates those who adapt from those who drown. But those are topics for another day.

For now, the goal is clear: stop just auditing lines of code and start building the systems that define the future of our industry.

Let us begin.

Update — March 2026

I explored the “1x decision-making process” problem further in Permission Structure.

The Ghost in the Training Set

Sat, 14 Feb 2026 08:00:00 GMT

Over the last several weeks, I’ve had to spend time setting up Model Context Protocol (MCP) servers. As the ecosystem matures, it is already navigating its first major paradigm shifts. Specifically, in early 2025, the recommended transport for MCP over HTTP shifted from Server-Sent Events (SSE) to Streamable HTTP.

To my surprise, the agents I use most (Gemini and Claude) kept reverting to SSE. They were well “aware”, at least as much as a machine could be, that Streamable HTTP was the new standard (they could competently answer questions about it) but they were haunted by the statistical momentum of their own training data. When it came time to actually generate code, they defaulted to the pattern they had seen thousands of times before.

The Invisible Weight of Training Bias

Taking a step back, this makes perfect sense: LLMs don’t just “read” instructions in a traditional sense: they weigh them against their internal probability map. If most of the MCP implementations they had seen were built over SSE, that gives them a huge bias in that direction.

Once I started noticing this pattern, I had found it more and more often: LLMs seem to struggle more with bleeding edge patterns and technologies (again, their training dataset has more examples built on deprecated patterns than newer standards).

This is a sneaky pattern, because we don’t naturally think about how old (or new) a model’s training set is, so we can’t realize this is happening unless we pay attention. If you’re working on a bleeding edge domain and you’re not careful, you may find yourself with an agent offering you a beautiful implementation that is actually a frozen snapshot of last year’s best practices.

The challenge grows with the uniqueness of your environment. This problem is even worse with codebases that adopt bespoke frameworks and patterns for which there is no published precedent. Agents thrive on Common Knowledge, and they struggle with Private Context. When we use bespoke patterns, we are essentially moving the agent into a zero-shot environment without even realizing it. The result is a performance degradation that looks like a “dumb” model but it is actually a lack of statistical grounding.

From Prompting to Infrastructure

You may be tempted to try to overcome this through prompting, and try to give strong instructions to anchor your agent towards the new standard by including strong language in your prompt (ALWAYS use Streamable HTTP when implementing MCP services). You need strong anchors to overcome strong biases. But prompts are often lossy, inconsistent and error-prone.

A more sustainable strategy is to start including these guardrails into your agents.md¹ files, or even better in tooling infrastructure. For example, Claude includes an /mcp-builder skill, which serves as a specialized instruction package anchored on the most recent standards, ensuring you land with a well-functioning implementation that overcomes the inherent bias in the models. In contrast, if you tried building an MCP server with Gemini now you may find yourself surprised by a perfectly functional implementation built on the deprecated 2024 pattern.

The Trap of “Contextual Debt”

Just like code accumulates technical debt, continuously adding to agents.md without ever cleaning up leads to “contextual debt”. Over time, these files become bloated with a mountain of “Don’t do X” or “Remember Y.” Even worse, because you can have agents.md files scattered through your repo, and other .md files as documentation, you can find yourself with clashing instructions that throw agents for a loop in ways that are surprisingly difficult to detect and remedy.

We are reaching a point where our “Instruction Budget” is as important as our compute budget. If you have clashing instructions across multiple .md files, you’re not just wasting tokens, you’re creating “hallucination traps” that are far more expensive to debug than a standard syntax error.

Here are a few things that worked well for me:

Progressive Disclosure: Borrowing from the Claude skills playbook, instead of having a giant instruction file, use a modular approach (e.g., a docs/MCP_STANDARDS.md file linked from your root agents.md).
The “Zero-Prompt Test” Stress Test: Periodically run an agent on your project with a blank instruction file (especially after significant model updates). If performance remains stable, the underlying training set has likely caught up to the new standard. At that point, your manual instructions are no longer necessary; they are cruft. Delete them.
Ownership of Configs: Treat agent configurations with as much rigor as a CI/CD pipeline. Obsolete agent instructions have even more impact on your velocity than obsolete documentation, and ironically, up-to-date documentation is now more precious than ever.

With the rapid pace at which things are evolving, I would not be surprised if in a year, half of these strategies would not be necessary as agents get better. And perhaps they will be superseded by a new set of practices.

Conclusion: Managing the Agent’s AI “Memory”

Regardless of what you might think about the tropes around “software engineering being dead”, it is undeniable that the focus of our job is moving away (or perhaps upward) from writing code.

As we spend more effort managing attention and memory of our agents, in the most sustainable agentic systems, instructions and scaffolding will be pruned as ruthlessly, if not more so, than the code itself.

In this post, we’ll reference only agents.md. Hopefully we’re not far from the day where we don’t need to maintain a separate configuration for Claude. ↩

Receiving Feedback Is A Skill

Tue, 25 Aug 2020 19:35:24 GMT

Delivering feedback is a critical part of my day job as a manager at Google. However, it took me a while to realize that receiving feedback is one of the skills that helped me grow the most in my career.

For many of us, our job is the first setting where we receive developmental feedback from people other than our parents or teachers. That experience may be quite shocking.

I still remember the first time I got professional feedback early in my career. I remember almost every single word that my manager chose to use.

What I remember even more vividly though is the strong reaction that feedback caused in me. Within seconds, I got defensive, I felt like I was being criticized, attacked, unappreciated. I heard what they were trying to tell me, but something inside me kept translating that into a personal criticism. A statement about how I, personally, fell short of expectations.

Good feedback sounds like “here’s one thing you can do better next time”. Better feedback sounds like “here’s one thing that you could do differently to achieve a greater result”.

Embracing that mindset allowed me to accept, process and build on feedback. While I can’t say I prefer criticism over praise, constructive feedback no longer makes me uncomfortable. Instead, I actively seek it.

Changing my mindset around feedback required me to make two key changes:

stop doing things that hurt my ability to improve
start doing things that help build on what I hear

Things I Stopped Doing

Taking It Personally

The main reason I had a difficult time processing feedback is the fact that I often took it personally.

When receiving feedback about something I did, I often read it as feedback about me. Oftentimes, that was not the intention.

Instead of hearing “this email was hard to understand”, I heard “you do not communicate effectively”. When the other party was saying “this piece of code is brittle”, I was hearing “you are a lousy programmer”.

I often ended up reacting defensively. I was unable to hear and processing the actual message I needed to receive.

Most developmental feedback will naturally trigger a defensive attitude. That prevents us from getting the full value of what the other person is trying to tell us. We need to make a conscious effort to not jump to defensive mode, and rather engage in active listening.

Arguing With Feedback

Even worse than taking feedback personally, I sometimes found myself wanting to argue with the person delivering it. I wanted to explain why I disagreed with what they were seeing or try to convince them that they were wrong.

In most cases, arguing with feedback is pointless. Take an example from many years ago.

A colleague approached me and told me “I think the comments you left in this review were too harsh”.

Now, if they cared enough to bring up this feedback, perhaps they were not the only ones. Or maybe my communication style could have had an unintended effect on some people, some time.

Yes, I could have argued with my colleague, perhaps even convince them that my tone was not that bad. Winning the argument might even have felt better.

That would not have changed the my comments did trigger a negative reaction for them. Quite likely, others might have had the same reaction. Knowing that, having that awareness, made me more thoughtful when writing review comments. I can tell they were better received from that moment on.

Arguing with people who are trying to give us feedback, does not help us. Eventually, people will shy away from telling us where we can improve. It leads us to us working with less information about what we can do to get better. In the long run, we miss out on a significant opportunity.

Things I Learned To Do Instead

Being Thankful

A friend of mine once shared a quote that sounded like “feedback is a gift”

Good feedback is thoughtful and timely. Often, it is as difficult to deliver as it is to receive. It is especially difficult for people we are not very close with.

Any yet, some people choose to take a risk. They let us know where we can do better. They do that knowing well that we may feel hurt by what they say.

Because of this, the first thing I do when receiving feedback is thank whoever is giving it. I thank them because they took a risk and did something uncomfortable. I also thank them because what they are telling me has the potential of making me much better.

Good feedback allows us to identify growth areas. Areas where we could invest more to get better at something we have been trying to do. Even those of us that have good self-awareness often need to work hard to find where they need to improve the most.

If someone is coming to us with feedback, they may be sparing us a lot of hard work required to identify areas of improvement.

The least we can do is thank them profusely for the gift they just gave us and get to work.

Following Up

Whenever I receive feedback about something I can improve and want to work on, I note it down. Over time, this list becomes my feedback log.

Keeping a list of the items I am trying to get better at is a way to hold myself accountable. I go through this feedback log every few weeks and reflect on the progress (or lack of progress) I have seen so far.

This helps me making sure I make the most of the feedback I was generously given and use it to gradually get better. I try to spend some time every week to work on some of the most important items on the feedback log.

Doing this helps me well beyond the result of addressing feedback. It also helps me ground my identity as someone who can accept feedback gracefully and use it as a tool to keep growing every day.

Wrapping Up

A few simple changes in perspective helped me change my view on feedback. I went from seeing it as a threat to my own self-worth to a stepping stone to become a better version of myself.

The results of this attitude compound over time as I keep focusing my energy towards addressing the most critical feedback items.

Programming Machine Learning

Mon, 04 May 2020 14:47:47 GMT

I just received my copy of Programming Machine Learning, a book by Paolo Perrotta. I had the pleasure of being one of the technical reviewers of the draft and, while this is not the first book I read about Machine Learning, I must say it became one of my favorites.

Paolo promises, at the beginning of the book, to write a book meant for developers, and he delivers on that promise.

In his words,

This is the book I missed when I got started with machine learning: an introduction for developers, written in our own language. After reading it, you’ll be comfortable with the fundamentals, and able to write machine learning programs.

Programming Machine Learning is a book that teaches the foundations of ML by walking the reader through the process of implementing working solutions for a few concrete and specific use cases, such as predicting sales volume for a pizzeria, recognizing hand-written digits or classifying images.

Each chapter introduces a challenge, lays out the foundations of a technical implementation and explains the theoretical background behind the techniques adopted.

As a result, the book is much easier to follow than many others on this subject: even when diving deeper into the technical or mathematical aspects of any of the topics covered, the reader is able to build on the empirical intuition that comes from having implemented ML algorithms and having seen them in action. Every chapter is engaging, starting from the first ones, about trying to predict pizza sales via linear regression and simple perceptrons, to the last ones, leveraging Keras to classify images.

I found the overall approach quite novel and refreshing. I would definitely recommend Programming Machine Learning, especially if you are the type of engineer who generally enjoys learning by doing.

The programming puzzle that landed me my job

Tue, 01 Oct 2019 03:35:33 GMT

Back in 2011, as I was getting a bored with my job and I started looking for new options. During my search, my friend Daniele (with whom I had built Novlet and Bitlet years before) forwarded me a link to the careers page of the company he was working for at the time, ITA Software.

While Google was in the process of acquiring ITA Software, ITA still had a number of open positions they were looking to hire for. Unlike Google, however, they required candidates to solve a programming challenge before applying to engineering roles.

The problems to solve were surprisingly varied, ranging from purely algorithmic challenges to more broadly scoped problems that still required some deep technical insight. As I browsed through the options, I ended up settling on a problem that intrigued me because I thought it resembled a problem I might one day wanted to solve in the real world and seemed to try to test both the breadth of my knowledge (it required good full stack skills) as well as my understanding of deep technical details.

I have good memories of the time I spent investigating this problem and coming up with a solution. When I was done, I had learned about a new class of data structures (suffix trees), gained a deeper understanding of Java’s internals. A year later, I got a job offer due in part to this puzzle.

Instant Search puzzle brief on itasoftware.com (as of 2011)

The Problem Statement

The brief for the challenge was the following:

Instant Search

Write a Java web application which provides “instant search” over properties listed in the National Register of Historic Places. Rather than waiting for the user to press a submit button, your application will dynamically update search results as input is typed. We provide the file nrhp.xml.gz, which contains selected information from the register’s database.

Database The key component of your server-side application is an efficient, in-memory data structure for looking up properties (written in pure Java). A good solution may take several minutes to load, but can answer a query in well under 0.1 ms on a modern PC. (Note that a sequential search of all properties is probably too slow!) An input matches a property if it is found at any position within that property’s names, address, or city+state. Matches are case-insensitive, and consider only the characters A-Z and 0-9, e.g. the input “mainst” matches “200 S Main St” and “red” matches “Lakeshore Dr.” Note that the server’s JVM will be configured with 1024M maximum heap space. Please conform to the interfaces specified in nrhp.jar when creating your database.

Servlet Your servlet should accept an input string as the request parameter to a GET request. Results should include the information for a pre-configured number of properties (e.g. 10), the total number of matches which exist in the database, and the time taken by your search algorithm. Your servlet should be stateless, ie. not depend on any per-user session information. Paginate your additional results as a bonus!

Client Your web page should access the servlet using JavaScript’s XMLHttpRequest object. As the user types, your interface should repeatedly refine the list of search results without refreshing the page. Your GUI does not have to be complicated, but should be polished and look good.

Please submit a WAR file, configuration instructions, your source code, and any comments on your approach. Your application will be tested with Tomcat on Sun’s 64-bit J2SE and a recent version of Firefox.

Reference UI screenshot accompanying the puzzle brief. I ended up using it as a spec for my client code.

Client

I started building this from the UI down. The puzzle brief mentioned using XMLHttpRequest, so I avoided using any client-side libraries (the functionality I was asked to build on the client was, after all, quite simple). The screenshot included with the puzzle brief included just a text field for the search query and a list of results.

I wrote a function to listen for key presses, dispatch an asynchronous call to the server and render the response as soon as it came back. By 2011, I had been coding web applications for a while and I was able to implement that functionality in less than an hour of work.

Web application and Servlet code

The Servlet layer was also quite simple, since all it had to was handle an incoming XML request and dispatch it to what the brief called a database. Again, less than an hour of work here.

At this level, I also wrote code to parse the database of strings to index from an XML file containing data from the National Register of Historic Places. The Tomcat server would run this code when loading my web application and use the resulting data to construct a data structure to use as an index for power the fast search functionality I needed to build. I needed to figure that out next.

Finding a suitable data structure

This is, unsurprisingly, the most challenging part of the puzzle and where I focused my efforts the most. As pointed out in the problem description, looping sequentially through the list of landmarks would not work (it would take much longer than the target 0.1ms threshold). I needed to find data structure with good runtime complexity associated with lookup operations.

I spent some time thinking about how I would implement a data structure allowing the fast lookup times required in this case. The most common fast-lookup option I was familiar with, the hash table, would not work straight away with this problem because it would expect the search operation to have the full key string. In this problem, however, I wanted to be able to look up entries in my index even when given an incomplete substring, which would have required me to store all possible substrings as keys in the table.

After doing some sketching on paper, it seemed reasonable to expect that tries would work better here.

Suffix trees

As I was researching data structures providing fast lookup operations given partial strings, I stumbled upon a number of papers referencing suffix trees, commonly used in computational biology and text processing, offering lookup operations with linear runtime with respect to the length of the string to search for (as opposed to the length of the string to search within).

Suffix Tree for the string `cacao`. A suffix is said to be contained in the tree if there is a path from the root node where the string obtained by concatenating the edge labels has the same prefix as the suffix being looked up. Highlighted the path corresponding to the `cao` suffix.

Plain suffix trees, however, are designed to find matches of a given candidate string sequence within a single, longer, string, while this puzzle revolved around a slightly different use case: instead of having a single long string to look up matches in, I needed to be able to find matches in multiple strings. Thankfully, I read some more and found a good number of papers documenting data structures called generalized suffix trees that do exactly that.

Based on what I had learned so far, I was convinced this type of tree could fit my requirements but I had two likely challenges to overcome:

Suffix trees tend to occupy much more space than the strings they are indexing and, based on the problem statement, “the server’s JVM will be configured with 1024M maximum heap space” and that needed to accommodate the Tomcat server, my whole web application and the tree I was looking to build.
Much of the complexity of working with suffix tree lies in constructing the trees themselves. While the puzzle brief was explicitly saying my solution could take “several minutes to load”, I did not want the reviewer of my solution to have to wait several hours before they could test my submission.

Ukkonen’s algorithm for linear runtime tree construction

Thankfully, had I found a popular algorithm for generating Suffix Trees in linear time (linear in the total length of the strings to be indexed), described by Ukkonen in a paper published in 1995 (On–line construction of suffix trees).

It took me a couple days of intermittent work (remember: I was working on this during nights and weekends — I had another day job back then) to get my suffix tree to work as expected.

Interestingly, some of the challenges with this stage were revolving around a completely unexpected theme: Ukkonen’s paper includes the full algorithm written in pseudo-code and good prose detailing the core steps. However, that same pseudo-code is written at such a high level of abstraction that it did take some work to reconduct it to fast and efficient Java code.

Pseudo-code from Ukkonen's paper. While clear and easy to follow on the original paper, its translation to Java is much more verbose than this.

Also, the pseudo-code algorithm is written assuming we are working with a single string represented as a character array, so many of the operations outlined there deal with indices within that large array (e.g. k and i in the procedure above).

In my Java implementation, instead, I wanted to work with String objects as much as possible. I was driven by a few different reasons:

Java implements string interning by default — there is no memory benefit in representing substrings by manually manipulating indices within an array of characters representing the containing string: the JVM already does that transparently for us.
Working with String references led to code that was much more legible to me.
I knew my next step would be to generalize the algorithm to handle building an index on multiple strings and that was going to be much more difficult if I had to deal with low level specifics about which array of character represented which input string.

Generalized Suffix Trees

This last consideration proved to be critical: generalizing the suffix tree I had up to this point to work with multiple input strings was fairly straightforward. All I had to do was to make sure the nodes in my tree could carry some payload denoting which of the strings in the index would match a given query string. This stage amounted to a couple hours of work, but only because I had good unit tests.

At this point, things were looking great. I had spent maybe a couple days reading papers about suffix trees and another couple days writing all the code I had so far. I was ready to try out running my application with the input data provided with the puzzle brief: the entire National Register of Historic Places, an XML feed totaling a few hundred megabytes.

Trial by fire: `OutOfMemoryError`

The first run of my application was disappointing. I started up Tomcat and deployed my web application archive, which triggered parsing the XML database provided as input and started to build the generalized suffix tree to use as an index for fast search. Not even two minutes into the suffix tree construction, the server crashed with an OutOfMemoryError.

The 1024 megabytes I had were not enough.

Thankfully, a couple years earlier I had worked with a client that had a difficult time keeping their e-commerce site up during peak holiday shopping season. Their servers kept crashing because they were running out of memory. That in turn led me to learn how to read and make sense of JVM memory dumps.

I never thought I would make use of that skill for my own personal projects but this puzzle proved me wrong. I fired up visualvm and started looking for the largest contributors to memory consumption.

A screenshot of VisualVM used to inspect a heap dump (from the official documentation)

It did not take long to find that there were a few memory allocation patterns that were not efficient. Many of these items would hardly be an issue for an average application, but they all ended up making a difference in this case because of the sheer size of the tree data structure being constructed.

Memory micro-optimizations

Analyzing a few heap dumps suggested me a series of possible changes that would lead to savings in memory, usually at the cost of additional complexity or switching from a general purpose data structure implementation (e.g. maps) to special purpose equivalent tailored to this use case and its constraints.

I ranked possible optimizations by their expected return on investment (i.e. comparing value of the memory savings to the additional implementation complexity, slower runtime and other factors) and implemented a few items at the top of the list.

The most impactful changes involved optimizing the memory footprint of the suffix tree nodes: considering my application required constructing a very large graph (featuring tens of thousands of nodes), any marginal savings coming from a more efficient node representation would end up making a meaningful difference.

A property of suffix tree nodes is that no outgoing edges can be labeled with strings sharing a prefix. In practice, this means that the data structure implementing a node must hold a reference to a set of outgoing edges keyed by the first character on the label.

The first version of my solution was using a HashMap<Character,Edge> to represent this. As soon as I looked at the heap dump, I noticed this representation was extremely inefficient for my use case.

Hash Maps in Java are initialized with a load factor of 0.75 (meaning they generally reserve memory for at least 25% more key/value pairs than they hold at any given point) and, more importantly, with enough initial capacity to hold 16 elements.

The latter item was a particularly poor fit for my use case: since I was indexing strings using the English alphabet (26 distinct characters) a map of size 16 would be large enough to accommodate more than half the possible characters and would often be wasteful.

I could have mitigated this problem by tuning the sizing and load factor parameters but I thought I could save even more memory by switching to a specialized collection type. The default map implementations included in the standard library require the key and value types to be reference types rather than native types (i.e. the map is keyed by Character instead of char) and reference types tend to be much less memory efficient (since their representation is more complex).

I wrote a special-purpose map implementation, called EdgeBag, which featured a few tweaks:

stored keys and values and two parallel arrays,
the arrays would start small gradually grew if more space if necessary,
relied on a linear scan for lookup operation if the bag contained a small number of elements and switched to using binary search on a sorted key set if the bag had grown to contain more than a few units,
used byte[] (instead of char[]) to represent the characters in the keys. Java’s 16-bit char type takes twice as much space as a byte. I knew all my keys were ASCII characters, so I could forgo Unicode support here and could squeeze some more savings by casting to a more narrow value range.

Some more specific details on this and other changes to reduce the memory footprint of my suffix tree implementation are in the Problem-specific optimizations section of the Suffix Tree project page.

Conclusion

When I tested out my program after the memory optimizations, I was delighted to see it met the problem requirements: lookups were lightning fast, well under 0.1ms using the machine I had back then (based on an Intel Q6600 2.4GHz CPU) and the unit tests I had written gave me good confidence that the program behaved as required.

I packaged up the solution as a WAR archive, wrote a brief README file outlining design considerations and instructions on how to run it (just deploy on a bare Tomcat 6 server) and sent it over email. Almost a year later, I was packing my bags and moving to Amsterdam to join Google (which had by then acquired ITA Software).

I owe it in no small part to the fun I had with this coding puzzle.

When I think of how much I enjoyed the time I spent building Instant Search, I think it must be because it required both breadth (to design a full stack application, albeit a simple one) and depth (to research the best data structure for the job and follow up with optimizations as required). It allowed me to combine my background as a generalist with my interest with the theoretical foundations of Computer Science.

The careful choice of specifying both memory and runtime constraints as part of the problem requirements made the challenge much more fun. When the first version I coded did not work, I was able to reuse my experience with memory profiling tools to identify which optimizations to follow up with. At the same time, I built a stronger understanding of Java’s internals and learned a lot more about implementation details I had, until then, just given for granted.

When ITA retired Instant Search (and other programming puzzles¹), I decided to release the Java Generalized Suffix Tree as open source for others to use. Despite the many problem-specific optimizations I ended up making, it is generic enough that has been used in a few other applications since I built it, which gives me one more thing to be thankful for.

While the original page is no longer online, the Wayback Machine still has a snapshot of the original page with the original selection of past programming puzzles. They are still a great way to test your programming skills. ↩

What to look for when hiring

Mon, 26 Aug 2019 19:09:10 GMT

A while ago, I found myself in the enviable position of having to rapidly grow my team. By then, I had done a large number of technical interviews, so I had an idea of what to look for in strong candidates for Software Engineering positions. However, I felt like I lacked a framework for understanding how likely a given candidate was to succeed if they had joined my team, beyond a very loose definition of “culture fit”.

As I was trying to better understand what I was looking for, I started to think about what I value in the people I work with and to reflect on traits I found to be quite common among some of the most successful people I have worked with over the course of my career.

While I would not expect every person I work with to exhibit all the qualities I list here, I am always positively impressed when I come across someone who exhibits more than a few and equally concerned when I see no hint of any of these characteristics.

Over time, I became quite sensitive to some hints that suggest someone could possess one of the these traits and I learned to probe further whenever I see them.

Here a list of the most important characteristics I learned to value in anyone I work with, regardless of job function.

Intrinsic Motivation

Many of the best people I worked with are motivated by their own desire to improve, regardless of the environment around them. Certainly, having a great team and a lot of attention from their manager will help them as well as it would help anyone else, but being intrinsically motivated means they are able to find satisfaction without relying on artificial nudges from the system around them.

I tend to enjoy working with people who think this way because they are often pushing themselves to get better every day, react better to difficulties and challenges and, as a result, push me to get better as well.

I know I am looking at someone who has this kind of attitude when they show they are driven by things such as:

learning something new every day
mastering a skill or a craft
accomplishing something they thought of as difficult

Having hobbies and non-trivial side-projects (for those of us who are at a point where they can afford the time required) is often a sign of being intrinsically motivated.

Relentless Focus

Success often requires from focusing on the most important things first and almost ignoring everything else.

Effective executives concentrate on the few major areas where superior performance will produce outstanding results. They force themselves to set priorities and stay with their priority decisions. They know that they have no choice but to do first things first—and second things not at all. The alternative is to get nothing done. Peter F. Drucker, The Effective Executive

I found it hard to gauge how good anyone is at focusing on top priorities based solely on casual conversations. One decent proxy, at least for technical roles, are open ended system design interviews. Many good questions involve asking to solve a problem too large to be tackled within the allotted time or with the given constraints. That forces the candidate to narrow down the scope and focus on the most important aspects of the problem and set everything else aside.

Independent Thinking

A couple of the best people I worked with have a way of asking questions that sometimes can come across as blunt or excessively direct. In their case, I have never had a problem with it, since it is tied to what I believe to be one of their strengths: they are not afraid to question a line of thought if they do not fully understand it or if they disagree with it.

In cultures where it is more comfortable to agree with others than to challenge their thinking, it takes courage to express dissent.

I wrote before how much I value a culture where anyone feels free to voice their disagreement: I value even more individuals who are comfortable speaking up regardless of what the environment surrounding them looks like.

This is another trait that is be hard to spot in casual conversations, I have seen this come across as a set of pointed, specific questions aimed at developing a stronger understanding of a topic and then thoughtfully suggesting there might be a different way to approach a problem.

However, there is a fine line between being willing to challenge ideas when they are not rock solid and being contrarian by default: it is hard to work with someone who disagrees with everything on principle.

Fast Learning

The ability to learn quickly and adapt to changing circumstances is one of the most critical skills to have in this day and age. To me, it means that I can trust someone to be able to be asked to do something they have not done before and rapidly get up to speed.

I generally see this through evidence of high rate of improvement; whether it shows as gaining mastery of many technologies in a short time, working across a number of different domains or being promoted repeatedly while at the same company, this shows an ability to adapt to changing circumstances.

Responsiveness and Follow Through

One of the main differences between working with a team and working by ourselves is that when we are part of a team others tend to depend on our output for their own progress.

Oftentimes, managers end being stuck having to play the role of the persistent nag, reminding others of their prior commitments and making sure that any work that was agreed upon is eventually delivered. Clearly, this is a way around a fairly common problem: the average person is not great at following through.

By contrast, the most effective team players I have worked with hardly need any nudges: they will stay on top of their to-do list and consistently deliver anything they agreed to do by the time they said they would, without you ever needing to ask again. If you do ask something of them, they respond right away.

Sadly, I do not know of a way to assess how well anyone would do on this point without speaking to anyone who has worked with them before.

Decisiveness

Many people struggle with decisions, for fear of making a mistake, being proven wrong and fallible or committing to the wrong direction. Whatever the reason, shying away from decisions is rarely helpful.

“In effect, the lack of a decision is the same as a negative decision; no green light is a red light, and work can stop for a whole organization.” Andrew S. Grove, High Output Management

The truth is that many decisions are relatively easy to reverse if necessary but the cost of paralysis is too high for most teams and organizations to afford. High-stakes decisions are rare, but when facing one it is important to treat it as a priority and not linger too long. The worst thing we can do is simply dwell on it and get stuck.

Decisiveness is often the driving force behind the responsiveness in the previous section.

Curiosity and Inquisitiveness

Beyond being a fast learner or being passionate about the specifics of someone’s own job, being curious and inquisitive can be invaluable in understanding one’s own teammates, manager, users and competitors.

By wondering about the “why” behind anything we observe, we develop a stronger understanding of the problem we are trying to solve or the parties and organizations we are working with. An understanding that inevitably helps us be more effective.

“When you get curious and learn how to turn that disagreement into honest questioning, you can learn more about other perspectives on the issue because your team will open up.” Camille Fournier, The Manager’s Path

Oddly enough, at least based on my own experience, it is fairly common to find engineers who are extremely curious about technical topics but tend to be less interested about understanding less technical subjects (such as organizations and other humans). People I worked with who are truly inquisitive tend to demonstrate it by being uncommonly interested in the motivation behind the status quo or previous decisions. They often ask questions such as “Why do we do things this way?”

Communication

So much of teamwork is communication, yet communication skills are often overlooked. It is hard to overstate the importance of communication in teamwork. Effective communication means, among other things,

being able to make one’s point of view understood,
resolving conflicts,
selling our own vision,
making sure others are aware of our work (and why it matters)

Of all the traits I learned to appreciate, this is perhaps the most visible. If you spend even a few minutes speaking with someone and they are an effective communicator, you will notice.

Going the Extra Mile

Many successful people consistently overdeliver. It is quite difficult to have any sort of success by just doing the bare minimum. Sure, one can get lucky once or twice, but solid careers are built on strings of consistent achievements.

I often see this in coming through from people’s passions. It often shows as side projects (work-like activities they chose to do in their own time¹) or initiatives at work that they started without anyone asking them to do so (e.g. 20% projects at Google).

Note that this is not always possible for people to do, depending on their situation. ↩

Visual and HTML Testing for Static Sites

Tue, 06 Aug 2019 11:06:27 GMT

Over a year ago I switched from having my site hosted on a CMS to having it built statically and served as a collection of static pages. I have been extremely happy with the end result for all these months — the site is very easy to update and effortless to maintain — but I just made a few changes that made my experience even better.

Why test Static Sites

Even for sites as simple as this, it is surprisingly easy to make breaking changes without realizing. Over the time I have been maintaining abahgat.com, I ended up accidentally introduction bugs more than a few times. Here a few examples of things I ran into:

broken links — by default, Hugo does not validate any of the links in the content I am editing, which means that I have to be careful and make sure all URLs and paths are valid
incorrect theme configuration — the more complex the theme I am using is, the more configuration options it will offer. The more options I have to configure, the more likely I am to make mistakes.
bugs in theme customizations — Hugo is great at allowing to override and customize theme templates. However, this is another source of potential issues.
bugs in the theme code itself — No software is perfect, and any theme I might be using can have its own bugs and edge cases. This might be especially true for you if you are actively developing your own theme or you frequently update it to the most recent version available.

Most of the issues above still affected me when I was hosting my site on Wordpress (I did break links and styling every now and then) but one advantage of working with a statically generated site is that we can leverage many of the tools that are available to web developers to catch issues early (and potentially block deploys if any issues are detected). So I set out to find what kind of options I had to improve my workflow so that I could make changes with more confidence that I wouldn’t accidentally break my site.

What can be tested

Based on the list above, I knew I was looking to set up tests to detect, in order of priority, problems such as:

broken internal links
invalid or malformed HTML
issues with layout or presentation
invalid RSS feed entries

Thankfully, I was able to find a way to cover most of these.

Testing HTML with `html-proofer`

Covering the first items on the list has been fairly straightforward with html-proofer.

Provided you have Ruby installed, you can get html-proofer as a gem via the command below

gem install html-proofer

and then run it via

htmlproofer --extension .html ./public

This will scan the ./public directory for any files with html extension and output a report listing any issues with the markup in those files.

When I first ran it on my site, I got a pretty good list of actionable warnings. The messages are fairly specific and easy to understand, as you can tell by looking at the snippet below:

- ./public/author/abahgat/index.html
  *  356:11: ERROR: Opening and ending tag mismatch: section and div (line 356)
- ./public/author/index.html
  *  356:11: ERROR: Opening and ending tag mismatch: section and div (line 356)
- ./public/blog/index.html
  *  829:2157: ERROR: Unexpected end tag : p (line 829)
- ./public/blog/maps-for-public-transport-users/index.html
  *  internally linking to uploads/2009/01/p-480-320-0e6ac38d-252e-47fa-be79-0ae974dad8d2.jpeg, which does not exist (line 476)
     <a href="uploads/2009/01/p-480-320-0e6ac38d-252e-47fa-be79-0ae974dad8d2.jpeg"><img class="size-full wp-image-364 aligncenter" src="https://www.abahgat.com/img/wp-uploads/2009/01/p-480-320-0e6ac38d-252e-47fa-be79-0ae974dad8d2.jpeg" alt="" width="200" height="300"></a>
- ./public/blog/page/2/index.html
  *  linking to internal hash #broken-priorites that does not exist (line 1456)
     <a href="#broken-priorites">The way priorities are managed is broken</a>
  *  linking to internal hash #duplicates that does not exist (line 1453)
     <a href="#duplicates">Lots of issues are duplicates</a>
  *  linking to internal hash #missing-info that does not exist (line 1455)
     <a href="#missing-info">Bug reports do not include enough information</a>
  *  linking to internal hash #processes that does not exist (line 1454)
     <a href="#processes">The system imposes over-engineered processes</a>
  *  linking to internal hash #tracker-misuse that does not exist (line 1452)
     <a href="#tracker-misuse">The issue tracking system is misused</a>

Even with default settings, html-proofer is able to catch most of the issues I was interested in detecting: the list above features a good mix of problems caused by invalid links in my Markdown sources, errors due to how I was misusing my template and bugs in the template I was using.

Fixing the issues required a combination of updating a few broken links, cleaning up the Markdown sources for my site, submitting a few bugs and Pull Requests against the theme I am using.

Overall, all the issues flagged made sense and worth fixing.

Visual Testing with Percy

As useful as html-proofer is, it does not help catching layout and presentational issues that are not due to invalid markup. I have had good experiences with visual testing and review at work and I was interested in using screenshots to detect layout issues and catch any unintended presentational changes on my own site too.

I cared about this because upgrading my Hugo theme sometimes involves non-trivial changes that could go wrong (despite George, the author, keeping really good change logs).

Also, I wanted to make customizations to the theme and having testing in place is the only way I know to make sure I don’t inadvertently break anything (since I will not review every single page manually every time I make layout changes, having a way to be warned about any differences is very valuable).

I ended up settling on Percy, a tool that was clearly designed first and foremost for testing dynamic web applications but also offered an option to test static sites via a command line program.

The main idea behind a snapshot testing system is to keep a set of approved snapshots (“goldens”), capture a new set of snapshots upon change and flag any differences for review. Changes can be either intended (in which case the screenshot is approved and becomes the new golden) or accidental (in which case they are flagged as regressions and expected to be fixed before pushing a new version).

Example screenshot highlighting differences introduced by a specific commit.

Percy offers a nice interface to highlight any difference between snapshots and can be easily integrated with GitHub and other source control systems to make approving any updated snapshots part of the code review process.

Percy runs as a service, so you will need to create an account with them before being able to use it. Once you have done that you can try it by following the instructions on their documentation page and running the following command on your site (where ./public is a directory containing your static pages):

npx percy snapshot ./public

Running tests on every change via CI services

Unlike the HTML tests, which test a specific version of your site in isolation, the value of snapshot testing lies in comparing your site against a previously approved set of snapshots, which need to be kept up to date.

I then configured a simple workflow with CircleCI, having it build my site with Hugo, run html-proofer on the generated sources, grab a fresh set of screenshots on every change and flag any differences for review.

From what I could tell, many other CI services can be configured to do the same; I ended up choosing CircleCI because I thought its Docker-based setup worked better for what I was trying to do and I had little trouble finding Docker images suitable for running the steps in my workflow.

Below the resulting configuration:

version: 2.1

orbs:
  hugo: circleci/hugo@0.3

jobs:
  snapshot:
    docker:
      - image: buildkite/puppeteer:v1.15.0
    steps:
      - attach_workspace:
          at: .
      - run: npm install percy
      - run: PERCY_TOKEN=$PERCY_TOKEN npx percy snapshot ./public

workflows:
  main:
    jobs:
      - hugo/build:
          version: '0.55.6'
          html-proofer: true
      - snapshot:
          requires:
            - hugo/build

The first section sets up build with Hugo via an Orb (Orbs are CircleCI’s packages of functionality that can be packaged and reused) that also runs html-proofer tests on the resulting build.

The snapshot task installs percy via npm and then invokes it on the directory containing the sources generated in the previous step. It runs on the Docker Puppeteer image, which comes with most of Percy’s package dependencies already installed.

There seems to be a Docker image maintained by Percy but I could not get it to work. I suspect it is because it ships with an old version of the percy command, I did not investigate this further.

With this configuration, every commit and Pull Request will trigger a Hugo build, run your site through html-proofer and capture a new set of snapshots. If any visual differences are detected, they can be inspected and approved via Percy’s web interface.

GitHub will show the latest status of your tests on every commit and Pull Request.

Note that there is no deploy workflow since I configured Netlify to automatically publish a new version of my site whenever I push to the master branch.

Tweaking the setup

If you got to this point, your configuration will feature sensible defaults and help you capture a number of issues caused by your own mistakes or any issues introduced by the theme upstream.

There are a few opportunities to make the setup more efficient, but they require making changes with the CircleCI configuration above since the Orb we used before does not expose a good way to pass flags to tweak neither the build nor test test. (This might be fixed by the time you read this).

You can click here to see a CircleCI configuration file that you can further customize based on the sections below.

Here some of the tweaks you might consider implementing.

Test pages with a future publish date and drafts

Hugo allows you mark pages as drafts or to set a publish date to a future time (for scheduled content). Neither of these pages will be built by default in your deploy workflow, but you might want to do that when running your tests so that you ensure that content passes validation even as it is being edited (as opposed to being surprised by unexpected errors just when you thought you were ready to publish).

You can do this by passing the -D and -F flags to the hugo command during the build step.

Consider enabling minification

If you are building your site with minification enabled when you are deploying, you might have to make a decision:

if you enable minification only on the deploy workflow (and leave it disabled for development), the version of the site you will be testing will not be identical to the version you are publishing. This might hide subtle bugs that you would not be able to track down easily (such as this one).
on the contrary, if you do enable minification, debugging issues flagged by html-proofer and percy might be slightly more difficult, since the resulting source code will be more difficult to read.

I do not have a firm recommendation here, I am currently working with the latter setup and it has been working fine so far but isolating the cause of an issue is slightly harder this way.

If you want try this, you need to pass --minify to the hugo command during the build step.

Skip redundant screenshots

Just like, when writing unit tests, we don’t want to have multiple redundant tests that cover the same behavior, in most cases it is not necessary to take screenshots of pages that use the same template and have very similar content.

For example, if part of your site is a blog that features tags and categories (in Hugo, this would apply to any taxonomy), you will not need to take screenshot of every individual tag page as you won’t get much value out of them, since they all look the same. They will rather be a burden to maintain (should your theme ever change, you’d have many more — very similar — screenshots to approve).

You can probably make a similar case for directory pages (say, if you have 40 pages of articles, the screenshots for the second to thirty-ninth pages are likely going to be the same. There could be value in testing the first and last page separately since you’d imagine they would have a different configuration for the next/previous navigation elements, but that is up to you.

Thankfully, the percy command offers a way to manually exclude certain paths from being considered when grabbing screenshots. The syntax for that argument expects globs, which can take some trial and error to get right.

In case it helps, here a configuration that worked reasonably well for me so far:

npx percy snapshot ./public -i \
  'categories/!(coding|coding/**)/*.html',\
  'tags/!(amsterdam|amsterdam/**)/*.html',\
  'blog/page/!(1|2)/*.html'

What the above is doing is excluding all categories but one (Coding) and all tags excluding one (Amsterdam). It is also ignoring any page beyond the second in the /blog directory.

Capture screenshots less frequently

I have yet to run into this limitation but I could see how, if your site is very large and/or if you commit very frequently, you may be concerned about exceeding Percy’s free quota (5000 screenshots/month).

I have not had to handle this in any special way so far, but here a few options:

Percy grabs screenshots of each page on your site in both Chrome and Firefox to ensure your site behaves well across browsers. You may decide you are comfortable with taking the risk of having smaller issues undetected and grab screenshots only on one of the two. This will mean you will consume half as many snapshots every time you run visual tests.
Percy will also test your site on a couple different viewport sizes. This is helpful to ensure your site works well on desktop and mobile devices. Again, you may be comfortable with just running tests on one configuration in order to reduce resource consumption by half.
You may configure your CircleCI workflow to toggle the snapshot step manually and run it only when you have meaningful changes to test (e.g. if you are adding new content or upgrading your theme). If you do this, you still want to make sure you refresh your screenshots based on master fairly often, otherwise you might find yourself with visual diffs that cover so many changes together that are no longer informative. And if you run this very infrequently, you might as well just choose to run the percy command locally.

Realistically, for most personal sites, you can likely go a long way with the free quota. If you are considering this for a large corporate site, I would rather consider paying for a higher tier and get more snapshots rather than trying too hard to capture fewer and have a less informative workflow.

Tests are even more valuable if you are a theme developer

If you are developing a theme that others are going to use, testing this way is likely to be even more impactful: you can save yourself quite a bit of time by having a way to catch issues before you ship a new version instead of relying on your users to report problems they run into after they upgrade.

You can apply most of the suggestions above by making sure that you have an example site (the Academic theme I use is great for this) that exercises most of the features in your theme, especially the ones that are not enabled by default. This would also likely reduce the time you spend manually inspecting your pages to make sure they still render as expected.

Conclusion

This has been a great opportunity to learn about great tools that are available out there (I will definitely consider Percy for the next app I will build in my own time) and how they can help greatly even with sites that are statically generated.

I have accomplished most of the goals I had in mind when I started playing with this. There is one item left open for future investigation (mainly, a way to ensure the RSS for my site is valid and well-formed) but the CircleCI workflow I set up gave me a good foundation I can extend to cover more tests.

Zing LED Smart Night Light

Mon, 18 Feb 2019 21:24:08 GMT

Several months ago I was looking for a night light when I stumbled upon Zing’s Indiegogo page.

The main feature I was looking for was for the light to activate automatically when I was walking past it and to turn off a few seconds later. Zing seemed to be able to do this and more: after seeing browsing the site, what intrigued me were the many possibilities for customization, the integration with IFTTT and the fact that each light has a temperature sensor — which I was hoping I would eventually be able to access via API.

Some features, such as automatic path lighting and the locator feature, were not a part of the decision. Others, event notification in particular, I knew I would not use (I am trying to minimize the notifications I get while I am home).

I got a pack of 3 on Indiegogo, hoping to receive them relatively soon. The wait turned out to be longer than I expected (shipping ended up being a few months late due to some complications in the production process) but I finally received my lights in October.

The box the lights come in.

So far, I have been quite happy with them. I liked the fact that each light is configurable and offers a number of settings to customize that will help you make sure it works with your environment and preferences.

I installed three lights (two are in bedrooms and a third is in the master bathroom) and set all of them to a warm, yellow glow. I initially configured the bathroom light to a multi-colored rotating pattern (see the screenshot below) but shortly after I opted for a more relaxing solid color and static pattern.

The Zing app allows configuration of many parameters for the lights, such as the light color, intensitiy, spread (influencing how wide of an area the light would illuminate) and speed (for moving patterns).

Unfortunately, the Android app seems to be lagging behind with respect to the iOS one in terms of functionality — more on this later.

The predictive path lighting feature has been quite disappointing so far. Whenever one of the lights turns on (because you are walking past it), all the other two will turn on as well, almost as if the model powering the feature today wasn’t any sophisticated than “if motion is detected, turn on all lights”. Not a big deal, but it meant that I turned off the feature on the light in the other bedroom, since I did not want any of getting up to trigger the light in our daughter’s room.

I haven’t gotten to try neither the locator or the notification feature advertised on Indiegogo: I am not sure whether they are supported or not.

Unfortunately, the version of the lights I received shipped with an older firmware version that is affected by a couple issues:

some settings (e.g. blue light reduction) are not persisted if the light loses power;
the activity indicator for the WiFi module often flashes blue.

I must say the latter issue is quite annoying in a night light. If you think about it, having a bright blue LED flash unexpectedly is quite noticeable in a dark room and almost defeats the purpose of having a night light.

I am told that updating the device firmware might help with both of these issues but unfortunately the Android application is unable to perform the update so far. I have been in touch with the Zing Support team to understand what workaround are available (other than procuring an iPhone) and I am hoping to hear back soon.

All considered, I have been quite happy with Zing, provided that I manage to fix the issue with the WiFi module.

The features that I wish it had at this point are all related to software and am hoping they might happen soon:

Being able to prevent the lights from turning on at daytime or when the room is already bright enough;
IFTTT/Google Assistant integration;
Being able to access the temperature sensor via API.

If you are curious to check Zing out, they now have an official site.

UPDATE (January 2021) I am still quite happy with the basic functionality of these lights. However, I can’t recommend them if you are an Android user.

The Android application to control the lights has not received any updates in years. It also lacks several features that are present on the iOS version, such as controlling the lights based on a schedule, upgrading the device firmware and more.

Migrating From Wordpress to Hugo

Thu, 15 Mar 2018 00:35:57 GMT

After many years of running my own site on Wordpress, I finally pulled the trigger and decided to migrate to a different stack.

Wordpress had been working quite well for me until I started to run into some with the hosted version and did not want to deal with having to set up and maintain my own server just for this site.

When I found myself, unexpectedly, with some time to spare — rocking my newborn daughter back to sleep in the middle of the night — I took it as an opportunity to learn what kind of options are available for running simple websites in 2018. I had read so much about static site generators and they seemed such a great fit for what I was trying to do, so I decided to give it a shot.

I am surprised to see how far things have made it since when I last looked. If are interested in the current state of things, you can find a pretty good list on StaticGen.com.

I had no shortage of alternatives to consider but I fairly quickly settled on setting my new site up with Hugo.

Thankfully, the migration itself was not too daunting, I was able to complete most of it during the course of a few nights while holding a sleeping baby 😉

In case you are considering doing the same migration, here an outline of the steps involved and a few articles I would recommend.

Decide whether you want to keep the same apperance or you are okay with selecting a theme you like and just exporting your comment. In my case, I decided to switch to a new theme, so I focused on mapping how my existing content would be organized in the theme I was migrating to.
Migrate your content to Markdown that Hugo can process. I found this article useful: Migrating from Wordpress. Requires installing a plugin on your Wordpress site to export content in a format that Jekyll (another static site generator) can process and then transform that to the format Hugo expects

If your site is on wordpress.com, the guide above won't work as is, since you will not be able to install plugins unless you are hosting your own server. I worked my way around this by [exporting an XML dump of my site](https://en.support.wordpress.com/export/), and then starting up a throwaway wordpress server (I did this with cloud9 when they offered a free plan, you can probably get a similar result by running it on [docker](https://docs.docker.com/compose/wordpress/)).

Your site will likely require some fixes at this point. The specifics depend on what it looks like but it is likely that you will want at least to verify that the links between pages are working fine. Images often require some fixes.
My site had a fair number of incoming links from other places. I wanted to avoid breaking them if possible. This is where I was glad I was deploying my site on Netlify, since they offer great support of Redirect & Rewrite Rules, among many other features.
I had a good number of comments on my old site and I wanted to carry them over. For the sake of simplicity, I chose to use Disqus for my comments and thankfully they had a good article about Importing comments from WordPress.

Disqus comments are associated with page URLs, so you will want to make sure your pages are served at the same URLs as before the migration. Alternatively, you can edit the URLs in the export file before importing it following the instructions above.

I have yet to find a technical migration that completes without introducing new issues, so if you ever encounter any bugs on this site, I would ask you to please let me know.

Feel free to leave a comment if you are trying to do the same migration and you run into trouble, I can try to help you out.

What’s wrong with Milan’s Open Data initiative

Thu, 12 Sep 2013 10:38:51 GMT

I spent some time during the last weeks playing with the Open Data published by the City of Milan. I did not have a clear goal in mind, except for building some interesting visualization of the Public Transport coverage of the city grounds.

A quick exploration of the dataset seemed to be encouraging: while most of the data was relatively useless, some datasets were indeed promising and worth spending some time. While at the end of the week I was able to get the result I had in mind (the heatmap below), I was left with that lingering feeling of dissatisfaction that accompanies me when I see good initiatives that can be dramatically improved by changing a few specific features.

Density of bus stops in Milan

Presentation of data

If the purpose of a website is to publish data, data should be at the center. However, while CSV data sets featured a preview option, there was absolutely no way to preview topological data. Of course geographical displays are a more complex problem to solve but as of 2013 there are many libraries that can effortlessly visualize geographical features. Topological data is presented in a textual catalogue, with abundant descriptions and numerous fields of metadata, but there is no map. The screenshot below is the page on the website that describes the data about Parco Nord (a park where I used to go running). Note that it does not offer any hint about what the data look like. Compare this with the element below: (almost) the same data visualized on GitHub as a GeoJSON file. I believe this format is much more effective in communicating what the data look like. I suspect you will agree with me.

[Embedded content not available in RSS — view on site]

Choosing the right format

Topological data offered by the initiative is coded using the Shapefile data format, introduced in the 1990s for use with desktop GIS software. It is a very rich and powerful format but it encodes data as a set of compressed binary files, making it unusable with modern web applications without doing some prior processing. While Shapefiles are great for professional GIS users, for an Open Data initiative to reach the most developers, using a text based format like KML or GeoJSON would have been a wiser choice, as it lowers the barrier for the general public to consume open data information. Both formats are sufficiently rich to encode structured information: the map below is a good example (and the raw file is still human-readable).

[Embedded content not available in RSS — view on site]

The end result

After spending some time on this I ended up creating a GitHub repository with the data I played with converted to GeoJSON, ready for use with web applications, and wrote a simple visualization of the coverage of the city of Milan by the public transport network (the image you can see at the beginning of this post). Now, it would be great if whoever is responsible for Milan’s Open Data could look into making information available through better formats, leveraging Google Maps Engine or GitHub’s support for GeoJSON. While we wait for that to happen, if you convert more data to GeoJSON, feel free to fork opendata-milano on GitHub and contribute there.

Appsterdam Guru Session: Google App Engine for beginners

Sat, 06 Jul 2013 02:21:19 GMT

One of the things I was not expecting when I moved to Amsterdam was its active and vibrant tech community. Appsterdam, a non-profit organization focused around aggregating people with a passion for technology, is probably one of the central forces in this movement.

In my year in Amsterdam I had been to a few meetups organized by people from Appsterdam and always came back home having learned something new. This is why when my colleague Matt (who himself is quite an active Appsterdam member) talked me into presenting a guru session on Google App Engine, I saw that as an opportunity to return the favor.

While I tried to give an overview of App Engine in general (and the Python flavor, specifically), I also wanted to offer attendees the chance to work on some examples that were more interesting than the typical guestbook application that comes with all the tutorials you can find online.

The code examples build on two of the many APIs App Engine has to offer:

the Channel API to build a web page that displays the current cursor position of every user looking at that site,
Google Cloud Endpoints to implement a simple REST-like backend for a webpage.

You can check out the slide deck below and get the code examples from the GitHub repository.

[Embedded content not available in RSS — view on site]

App Engine is a lot more than an advanced infrastructure to deploy applications: the numerous APIs and services it offers can enable developers to build advanced applications with limited effort. I hope this presentation, while just scratching the surface, gives you a glimpse on the possibilities.

Special thanks to Matt for pushing me to do this and Serena for her help with example 2.

Presenting Professional Invaders

Thu, 06 Jun 2013 07:25:00 GMT

A few weeks ago I attended The Next Web Conference in Amsterdam and joined a bunch of fellow programmers for another edition of the Kings of Code Hack Battle, the same kind of event as the one where Bring Your Own Music was born.

Following the usual schedule, after a brief presentation from the API partners (Spotify, SendGrid, Braintree, Deezer, Pearson, Nokia, Rebtel, Bol.com, Smart TV Alliance and LinkedIn), all the attendees started evaluating ideas about what to build.

View this post on Instagram

A post shared by Alessandro Bahgat (@abahgat)

I teamed up with Alexander, a friend of mine I already had the chance to work with back in the days when I when I was consulting.

Having LinkedIn among the sponsors seemed to encourage us to build serious applications for serious professionals, but after discarding a few alternatives that would have been better projects for a Startup Weekend than a hackathon, we decided to take the opposite direction: building the silliest possible thing with the APIs we had access to.

We eventually decided to work on a game and tried to build a Space Invaders clone that would let you throw paper balls at your professional connections.

After some research, we found a well written Space Invaders implementation on GitHub (thanks Calamari) and we started adding the silliness to it.

The first day we focused on getting the game to work as we expected:

each invader would be one of your connections on LinkedIn,
a Boss would spawn every now and then,
the game would have some sort of soundtrack (thanks Deezer),
while in “Boss mode”, the game would have a distinctive appearance (blinking red background and a different theme song).

The second day we turned our attention to features that were just fun to build:

a coin slot where players could buy more coins with their own credit card (API courtesy of Braintree),
an easter egg we planned to use in the demo: attendees could spawn the Boss by sending email to an address we set up for the occasion (thanks Sendgrid).

The video below (3:11) shows the major changes the application went through. It was created by replaying significant entries in the commit log and recording what the game looked like at that time.

[Embedded content not available in RSS — view on site]

We approached the deadline with only one objective: making people laugh. Despite some technical issues (amusing at a tech conference), we managed to demo our hack and people seemed to have liked it: the guys from Sendgrid even decided to award us with a prize 🙂

You play the game here. This version is a slightly different from what we presented at the hack battle, since we decided to keep only the features that made sense if we were to offer it online.

We hope you’ll have as much fun playing it as we had putting it together!

What Van Gogh can teach us about persistence

Mon, 04 Mar 2013 13:33:43 GMT

I visited the Van Gogh museum in Amsterdam recently and, to my surprise, I left the exposition having learned something that matters beyond art.

According to his biography,

Van Gogh began to draw as a child, and he continued to draw throughout the years that led up to his decision to become an artist. He did not begin painting until his late twenties, completing many of his best-known works during the last two years of his life. In just over a decade, he produced more than 2,100 artworks, consisting of 860 oil paintings and more than 1,300 watercolors, drawings, sketches and prints. […]

Before focusing on painting, he worked as an art dealer, teacher and missionary. It wasn’t until he was 32 that he painted his first major work.

He did not have the fortune of being recognized as a talented artist in his young age like Michelangelo and others and yet still he did not let go of his desire of becoming a painter. The thing that strikes most of the museum is the quantity of studies and sketches Van Gogh made throughout his live in order to improve his skills. He wanted to paint so much that he kept practicing and put so much effort in improving that it eventually paid off: he is now remembered as the author of dozens of the most renown paintings of the history of art.

In an age where the reference point to define an accomplishment is starting a company at 16 and become a billionaire at 22, we risk underestimating the value of persistence. Sure, he did not reach fame and success while he was alive, and his life was not what you would define “happy”. But if he had quit because he was not an accomplished painter in his young age, art now would certainly be very different from what we know.

The works of Van Gogh are a proof that there is no such thing as being too late to accomplish something remarkable.

Prettier source code on WordPress.com

Mon, 21 Jan 2013 10:34:50 GMT

Posting source code on WordPress.com is quite simple: the platform already provides an extremely easy to use shortcode called sourcecode, based on a fairly flexible syntax highlighter plugin. By looking at the examples in the documentation page, however, it is evident that the default styling used to render sources is quite old-fashioned and does not fit most modern themes.

While the shortcode offers options to allow users to control many options of the rendering, it does not allow us to configure colors, fonts and size (the default size is so tiny that it is barely readable on high-resolution screens).

When I was writing the previous technical post, I did some investigations to figure out what options are available to post more readable sources if your blog is hosted on WordPress.com and I found out there are basically two alternatives.

Embedding Gists

The easiest option is to rely on Gist – GitHub’s tool for sharing snippets of code – which offers an extremely easy way to embed code in your blog. Just create a new snippet (gist) there and follow the instructions.

Unfortunately, the gist embed shortcode available on WordPress.com is less flexible than what you would get if you installed it as a plugin on your own instance of WordPress, but it will be enough for most cases.

Pros	Cons
Easy to embed source	Suitable for posts with a few (long) code snippets
Code looks good and is readable	Does not always work perfectly with search engines
Easy for readers to access raw code	Does not work with RSS and posts over email

Styling source code by customizing your CSS

While Gists work great most of the time, they are a pain to create and maintain if you are working on a post that should include multiple short snippets of code. In that case, the amount of bookkeeping you have to do is significant (you will have to create and link many small chunks of code) and you may want to be able to manage your code right within the post.

In that case, it may be more practical to fix the CSS theme used by the syntax highlight plugin to make it look post-2010. If you set your own custom CSS on WordPress.com, it will be supposed to be included as the last one to allow you to redefine the styles specified by the theme you are using.

Unfortunately, the CSS used by the syntax highlight module was clearly not written with extensibility in mind, but quite the opposite:

all the style declarations it includes make use of !important,
the plugin will dynamically include its own CSS as the last item in the head node, meaning that it will have preference on the custom one you define.

This makes sense in the original context – the original syntax highlighter offered several themes you could choose from by including different stylesheets, but that feature is not available on WordPress.com – but will make your life more difficult. You will need to add !important to all the CSS declarations you redefine and you will need to use CSS selectors that are more specific than the ones used by the plugin. You will be able to see the final result at the end of this post.

WordPress’s syntax highlight is not perfect, and some things are still quite annoying (e.g. line numbers get in the way if you try selecting and copying source code). Most issues could be addressed by upgrading the plugin to use version 3 of SyntaxHighlighter instead of the outdated version that is in use now, but it is something you will not be able to control unless the folks at Automattic decide to update it.

Pros	Cons
It is necessary to have access to Custom CSS (which is a paid feature)	Hard to copy sources without including line numbers (unless you disable them)
Access to advanced features (highlight lines, toggle line number display)	Search engines index sources with the post content
Source can be styled according to preference	Getting your CSS applied correctly can be difficult (but you can start from here)

What did I choose?

Here is the stylesheet (embedded as a Gist) I am currently using on this blog, based on the pygments theme used to style code at docs.python.org.

[Embedded content not available in RSS — view on site]

You can can see what the final result looks like ~~in this post about User authentication with webapp2 on Google App Engine~~ and in the image at the beginning of this post.

Update: I since migrated this blog to a new system, and am using a completely different way to render source code.

Alessandro Bahgat's Blog

The nearest hospital to every place on Earth, in a single S2 range query

The worst place to get injured

The hotels-to-cities problem, again

S2 in one paragraph

The data

Mapping every POI to every admin level, in one pass

Distance is just another kind of hierarchy

A booby-trap

See it

See the join structure itself

What the answer doesn’t (and can’t) say

Why this is an S2 post and not just a “use a spatial index” post

Reproduce

Same Agent, Different Score: The Problem With Testing Non-Deterministic AI

Two performance tiers

The variance is not noise

Three tries at honest measurement

v1: Too lenient

v2: Overcorrected

The telemetry payoff

v3: Stuck vs. thorough

A note on the hardware cliff

The Troll is still the ceiling

What I learned about evaluating agents

What’s next

Stuck in the Maze: Why AI Agents Can't Hold the Map

The setup

Day 1: why is my agent speaking Thai?

Day 2: the architecture pivot

Day 3: the maze

Why agents get lost

The microservices parallel

What’s next

Permission Structure

A different conversation

Two conversations, one explanation

A forge, not a filter

Capacity without clarity

Fortresses, Pipes, and Brains

Three responses to the same moment

The Fortress

The Pipe

The Brain

The uncomfortable middle

What this means

Visualizing Ukkonen's Suffix Tree Algorithm

Learning algorithms from books

Implementing from a paper

The visualization I wish I had

Adding more strings

Searching

Try it yourself

Beyond suffix trees

The Velocity Paradox

The Unhappy Middle

Bespoke Frameworks: from Asset to Dead Weight

Zero Slack

When Generation Outruns Verification

The Broken Inner Loop

The Slowing Submit Loop

From Craftspeople to Janitors

From Janitors to Gardeners

Building the Dark Factory

Crossing the Chasm

The Ghost in the Training Set

The Invisible Weight of Training Bias

From Prompting to Infrastructure

The Trap of “Contextual Debt”

Conclusion: Managing the Agent’s AI “Memory”

Footnotes

Receiving Feedback Is A Skill

Things I Stopped Doing

Taking It Personally

Arguing With Feedback

Things I Learned To Do Instead

Being Thankful

Following Up

Wrapping Up

Programming Machine Learning

Trial by fire: `OutOfMemoryError`

Testing HTML with `html-proofer`