Human 2.0

Launching Envoi: A platform to bring your AI agent to a conference

Alistair Croll — Tue, 05 May 2026 17:41:15 GMT

In 2025, I gave a talk called A Million Tiny Horses at Startupfest with a pretty grim conclusion:

SaaS as we know it has a 24-month shelf life.

Already, AI tools were turning prompts into working software, upending the economics of software companies. But it wasn’t until late that year that things really took off.

It took four things to really change how developers used AI:

Prompts becoming reasoning: Instead of answering a question, AI tools started planning how best to answer it.
AI calling tools: Many of the things we ask an AI to do—search a document, crawl the web, etc.—are done better by traditional software. It’s cheaper, faster, and less error-prone.
AIs dispatching agents: An AI outputs text, but it also ingests text. That means one AI can tell other AIs what to do.
Encoding behavior in skills: If prompts are spells, then skills are scrolls—re-usable instructions, sometimes with code snippets, that an AI can reference when performing a task.

These four changes unlocked a new kind of self-directing, planning AI that was vastly different from its predecessors. It could act semi-autonomously; soon enough, software now known as “claws” took the human out of the loop entirely (with unhinged results that surprised nobody.)

Welcome to the age of agentic AI

You can't trust your AI agents. Every day, we’re hearing stories of self-directed agents hacking servers, deleting codebases, and compromising GitHub. Hundreds of thousands of people with no software experience are clicking “yes” to install code they don’t understand, and haven’t verified, on their personal machines.

But the rewards are equally hard to ignore. When software that took months to build now takes days, the economics of startups changes completely. Experimentation is easy; customization is expected; investment is optional.

So here’s another prediction:

Any startup that isn’t using an AI agent is already doomed, and just doesn’t know it yet.

I’ve got the usual caveats here: not all industries and business models, because AI has “jagged” intelligence; but if you abide by Paul Graham’s definition of startups as organizations designed for growth, then this prediction is very likely true.

When I say, “an AI agent,” I’m massively oversimplifying. There are autonomous claws (Clawdbot, Nanoclaw); command-line tools (Claude Code, Codex, Gemini Antigravity); chatbots that can access your computer (Claude Code) or their own in-the-cloud computer (Gemini Pro.) By the time you read this, the list above will definitely be out of date.

Regardless of these caveats, every viable startup will soon think of AI as a co-founder. It will rely on a tool (or set of tools) that participates in Slack discussions, manages infrastructure, plans calendars and meetings, researches competitors, edits marketing copy, prepares for sales calls, and, yes, writes the company’s code.

How do you bring your AI to a conference?

Since that talk last year, one thought has kept me up at night: if every startup has an agentic co-founder, what would it mean to bring them to a conference?

I don’t mean simple things like having an AI summarize talks, or scan the speaker list to help you decide which talks to skip. And I don’t mean installing yet another conference app that has a chatbot. I mean actual participation.

When a human attends a conference, they create a profile and wear a badge. They propose, vote on, write, and attend talks. They set up trade show booths, and visit the show floor. They answer polls. They speak with other attendees, set up meetings, and write up their experiences afterwards.

What if their AI could do the same?

So for the last four months, I’ve been building that. It’s called Envoi, and we’re launching it at Startupfest in July.

All of this content was created by AI agents on the Envoi platform.

The challenges of building a platform

A platform like this is substantially more complex than a single-user desktop app:

It has three distinct users: the attendees; their AI agents; and the people running the event.
It has to support as many different AI agents as possible, and keep them on the rails as they try to follow instructions.
The AI agent needs to be able to start from scratch, every time, because people switch their AIs and delete old conversations.
It has to be multilingual. At Startupfest we have French and English attendees.
Some humans are creeps, and the agents need to shut down bad behaviour rather than amplify it.

Envoi isn't just the core software, but the tools to simulate users, load-test the infrastructure, track all the project tasks, deploy into production, handle support tickets, iteratively tune skills, and more.

How you can try Envoi

At Startupfest and FWD50, we encourage founders and public servants to explore what’s possible. We think it’s important for us to do the same, which is why we’ve created new experiences over the years like hybrid events, new activations, mentorship models, and more.

Envoi is an experiment. It might be delightful, and it might break in ways we can’t anticipate. It will probably be both. Whatever happens, we’re once again pushing the boundaries of what’s possible and learning from it.

Right now, Envoi works with autonomous agents like Nanoclaw and Openclaw; CLI-based AI tools like OpenAI Codex, Claude Code, and Google Gemini; and the (paid) Claude Code desktop app with network permissions enabled. As other AI platforms and harnesses improve their agentic capabilities, we expect to support even more of them. You'll need to be at least somewhat familiar with how AI agents work, but they're getting easier to use every day.

If you want to try it, and use one of these AI environments (or are willing to learn!) buy a ticket to Startupfest in July. I'll be hosting five onboarding sessions to work the first batch of ticketholders through the process this month, before we open the platform up to all attendees.

When the target keeps moving

Alistair Croll — Sat, 02 May 2026 22:23:45 GMT

Full disclosure: I've been working on an app for a few months, and will be launching it May 5. This has given me a lot of opportunity to understand how software development changes when working with AI agents. One of my agents (an OpenAI Codex agent) helped do the analysis this post covers, and wrote an early draft. I rewrote the whole thing; it was initially pretty academic, because the AI spent a bunch of time reading academic papers on software estimation and AI agents are what they consume. This agent is usually more fun, and says weird stuff like "more door than hinge" and "déjà-vu wearing a fake moustache" regularly.

The first progress report in April said we had 254 tasks. 140 were done, so we were 55% complete. A month later, we'd completed 1,257 tasks. Excluding cancelled work, we were 89.8% done. And yet there were still around 140 tasks left.

The top of the "Cairn Board", the app we built to track our progress.

It had been a busy few months. What started as an idea has become a production-grade web application with audits, release gates, admin workflows, simulations, test harnesses, support tooling, deployment checks, and a long list of edge cases. Yet like Zeno walking by half-steps, the target kept moving with us.

That's because a project isn't a pile of known tasks that are being steadily shovelled from "not done" to "done." It's also a discovery engine. Every feature, test, audit, simulation, customer conversation, or UX change shows you the things you didn't know until they were staring you in the face.

Anticipating these things is the job of planning, specs, and project managers. Planning up front makes sense when the cost of actually building something is high. If you're building a bridge, you don't figure it out as you go along. You do things like requirements discovery, specification, and planning before you implement anything. You run simulations, build models, test your assumptions, and double-check your math. You update the plan and the schedule. Only then do you actually order the girders.

Software isn't a bridge. Getting it wrong is annoying, but generally isn't life-threatening. And if it is wrong, then you can fix it. You can make a copy, work on that copy, and when it's better, deploy the fix.

Despite this, the actual coding part of software has been relatively expensive, which means that we've always tried to get the plan right first.

Now that the cost of coding has plummeted, that's changing. AI agents don't just complete tickets—they inspect, decompose, test, question, file follow-ups, and expose new things you're going to need. AI accelerates both delivery and discovery.

In software engineering, there's a term for the never-ending growth of a project: Scope Creep. Agile software development has a caution against this, called YAGNI (You Aren't Gonna Need It): you shouldn't build more than you need to. But the new economics of software mean scope creep isn't an embarrassing exception to the plan; it's how you figure out what the plan was supposed to be all along. A growing backlog of features can be progress, but only if you distinguish creep that reduces focus from creep that increases truth.

The data set

This project was a unique opportunity to analyze changes in scope first-hand because in addition to the features, we also built the tracking systems. Every task is recorded and explained and logged in exhaustive detail. Without giving too much away, the project is:

A public web application with authenticated user flows.
Backend APIs and data storage.
An admin/operator surface.
QA and production deployment paths.
Automated tests, simulations, audits, and release gates.
Agentic AI collaborators doing planning, coding, verification, and board hygiene.

This post analyzes exactly one month of data, using a bunch of local systems:

Where this data comes from.

The data's not perfect. One of the AI agents deleted the task board by accident, and we had to rebuild it from specs and GitHub. AI agents are forgetful toddler idiot savants and must be treated as such. I carry a daily backup on a keychain at all times. But we can still learn useful things.

High-level project stats

Over the 32-day daily task window, the git log recorded 1,352 commits. The tiny social media platform we built so agents could coordinate with one another recorded more than 2,700 posts:

Some stats from a month of development

After spending a lot of time watching the next deadline stretch off into the horizon, I asked the agents to build me what I called an ahead/behind chart: How many tasks did we add, and how many did we complete? If we could stay mostly below that line, I figured, we'd be okay.

Here's what that looked like:

The month was not smooth. Discovery and delivery moved together, with large task-creation days often followed by large completion days.

GitHub tells us more

The task board tells us the names of the tasks, but GitHub worktrees tell us how those tasks moved through production. A lot of the project happened in short-lived topic branches and isolated worktrees. At the April 30 snapshot:

What happened on GitHub over the month.

The 150 primary-remote topic branches were not evenly distributed by kind. Their names are too project-specific to publish, but the branch families are revealing:

We worked on many different things. Some were traditional "build a feature" tasks; many were about making sure those features worked properly, or adjusting bad ideas once working code showed how terrible they were.

The branches show us that the type of work shifted over time. Later work wasn't new product ambition—it was hardening, refactoring, verification, and system cleanup. Almost half of the merge commits in the month happened after April 21, the same period where the task board hovered around 90 percent complete while the active denominator grew by 367 tasks.

The progress bar is a lie

Here are four recovered checkpoints from the month:

Four points in time, showing how many tasks were left.

The project looks like rapid progress at first, followed in the middle by encroaching horror as you discover just how big the problem you set out to solve really is. Polish comes at the end.

Plenty of vibe coders are learning the hard way that, as with icebergs, 90% of the work is hidden.

No, this X axis is not consistent. This is not a line chart. Don't @ me. But it does show that even though we're adding features, the number of remaining tasks stays roughly the same.

Between April 21 and April 30 alone, we went from 1,032 to 1,399 tasks—367 newly visible active tasks. In that time, completed tasks grew from 921 to 1,257. That is 336 additional completed tasks. In other words, late in the project, for every 100 tasks completed, roughly 109 new active tasks became visible.
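If you want to check that arithmetic yourself, here is a minimal Python sketch. The `Checkpoint` class and `discovery_ratio` function are names I made up for illustration, not part of the Cairn Board; only the four task counts come from the board.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """A board snapshot: active (non-cancelled) tasks and completed tasks."""
    label: str
    active_total: int
    completed: int

    @property
    def remaining(self) -> int:
        return self.active_total - self.completed


def discovery_ratio(start: Checkpoint, end: Checkpoint) -> float:
    """New active tasks that became visible per task completed between two snapshots."""
    discovered = end.active_total - start.active_total
    delivered = end.completed - start.completed
    if delivered <= 0:
        raise ValueError("no tasks were completed between the checkpoints")
    return discovered / delivered


# Task counts from the April 21 and April 30 checkpoints.
april_21 = Checkpoint("April 21", active_total=1032, completed=921)
april_30 = Checkpoint("April 30", active_total=1399, completed=1257)

ratio = discovery_ratio(april_21, april_30)
print(f"discovered per 100 delivered: {ratio * 100:.0f}")  # 109
print(f"remaining: {april_21.remaining} -> {april_30.remaining}")  # 111 -> 142
```

The second line of output is the uncomfortable part: after 336 completed tasks, the remaining list was longer than when the period started.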

Rob Bowley used a coastline analogy in his 2011 essay on software estimation: the closer you measure, the more detail you see. Our early estimates described the coastline from the air, while late-stage QA walked the rocks. We didn't know what we didn't know, and delivery was also discovery.

Discovery Velocity

Most project dashboards ask questions like, "how many tasks did we finish?" and "how much of the original backlog is complete?" But if you're building with AI, you need to track a slightly different metric:

Discovery velocity: how many new units of necessary work became visible per unit of time.

Delivery velocity without discovery velocity is misleading. A team can complete 40 tasks a day and still fall behind if it is discovering 45 tasks a day. And falling behind may be necessary, if the team is nevertheless moving from implementation to verification. It may even be a source of competitive advantage: you're discovering the moats and barriers to entry that make your software hard to copy.
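Here is a rough sketch of how the two velocities might be compared day by day. It assumes that a "discovery day" means a day on which more tasks were added than completed; that definition, the `DayRecord` type, and the helper names are mine for illustration. The April 20 figures come from our board, and the second record is the made-up 40-versus-45 example from the paragraph above.

```python
from typing import NamedTuple


class DayRecord(NamedTuple):
    day: str
    added: int      # tasks that became visible that day (discovery)
    completed: int  # tasks finished that day (delivery)


def net_change(record: DayRecord) -> int:
    """Change in remaining tasks for the day; positive means the backlog grew."""
    return record.added - record.completed


def is_discovery_day(record: DayRecord) -> bool:
    """Assumed definition: a day on which more work was found than finished."""
    return record.added > record.completed


days = [
    DayRecord("April 20", added=92, completed=132),          # figures from the board
    DayRecord("hypothetical day", added=45, completed=40),   # the 40-versus-45 example
]

for record in days:
    kind = "discovery" if is_discovery_day(record) else "delivery"
    print(f"{record.day}: net {net_change(record):+d} remaining ({kind} day)")
```

Counting only `completed` would make both days look like wins; the sign of the net change is what separates them.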

More than half of our days were discovery days:

Good days and bad

Those days coincided with major milestones like audits, big releases, security hardening, and so on:

Big days coincide with milestones.

The highest completion day, April 20, completed 132 tasks but also added 92. A net reduction of 40 tasks seems like only a decent accomplishment, but from a project-learning perspective it was one of the most important days of the month: a lot of hidden work got named, and a lot of named work was retired.

The last 10% was mostly not features

As of April 30, 5 groups account for 64% of all active unfinished tasks. Those groups are:

Agent instruction refinement (getting agents who will use the platform to behave as expected.)
Human testing and launch logistics (coordinating tests with human users and the launch team.)
Simulation follow-up (analyzing the results of a test run and turning them into new tasks.)
Load and scale validation (making sure the system can handle hundreds of users without breaking.)
Onsite or operational readiness (preparing to run the thing in the real world.)

By the end of the month, most of the remaining work wasn't raw feature construction. It was confidence, launch, simulation, instruction, and operational readiness work.

Metrics like "percent complete" treat all unfinished tasks as equal, but late-stage tasks are fundamentally different from early ones. While early-stage tasks ask, "can we make this thing?", late-stage tasks ask:

Does it behave under stress?
Can a real user recover when confused?
Can operators see what is happening?
Can we deploy safely?
Can we explain the system to collaborators?
Can we distinguish a product defect from a support issue?
Can we trust the automation that says the build is healthy?

Late-stage tasks are also more entangled, and many of them require human involvement, which means they can't be handled in overnight --dangerously-skip-permissions YOLO binges. They create new tasks faster because they operate on the whole system: a new feature touches a path, but a verification pass touches the whole product.

Fred Brooks argued in No Silver Bullet that the hard part of software is the specification, design, and testing of the conceptual construct, not just typing it into a programming language. AI agents change this. Cheap, uncomplaining coders turn design into discovery. If you ask a developer to rebuild the same thing ten different ways, they'll quit. Ask ten agents to do so and they'll just charge you more tokens. In some ways, this exposes gaps in thinking faster (but it also runs the risk of producing a product that works perfectly, but just for you.)

Scope Creep versus Scope Discovery

Academic research into software development has often treated a growth or change in requirements as a measurable project dynamic. Ferreira, Collofello, Shunk, and Mackulak describe requirements volatility as growth or change during the development lifecycle and model its effects on cost, schedule, and quality.

Agentic development means that not all volatility is the same, and treating it as such is risky. Across over a thousand individual tasks, here are 7 different kinds, what they tell us, and how we responded to them:

How we responded to different kinds of task.

Any task tracking board needs this classification, because a single "tasks added" number cannot tell you whether the project is drifting or learning.

Agentic AI changes the shape of the curve

Classic software estimation literature already gives us plenty of reasons to distrust long-range precision. The planning fallacy literature distinguishes the "inside view" from the "outside view": if you only reason from the internal story of the project you're working on, you underestimate; if you compare against similar projects, you do better. Magne Jorgensen's survey of effort-estimation research argues that there is no single best estimation method, and that local historical data and tailored checklists improve estimates more reliably than generic models.

Agentic AI makes these findings more urgent. In our month of data, agents affected the curve in at least four ways.

They lowered the cost of decomposition. A human might notice an issue and leave it as a vague concern. An agent can turn it into five tasks, wire dependencies, add tests, and update the board. The denominator grows because the work becomes countable.
They increased parallel inspection. Multiple agents can inspect different surfaces at the same time: code, tests, UI, docs, security, deployment, and data flows. This raises discovery velocity.
They blended planning and implementation. Traditional workflows often separate discovery, specification, and execution into rituals. In this project, an agent could discover a gap, write a short design, implement a fix, add tests, and file two follow-ups in a single session. That is productive, but it makes clean phase boundaries harder to observe.
They created measurement ambiguity. METR's 2026 update on developer-productivity experiments notes that agentic tools make time measurement harder because developers may run agents concurrently, avoid submitting tasks they do not want to do without AI, and choose different kinds of tasks when AI is available.

The question is not just "how much faster did AI make implementation?" It is "how did AI change which work we attempted, which work we noticed, and which work became cheap enough to do?" An agentic project may produce more finished work and more remaining work simultaneously. The real question is: did it produce a better product as a result? For that, the jury's still out.

Estimates affect the work

Estimates are not passive predictions. They change behavior. Jorgensen and Sjoberg's work on effort estimates found that overly optimistic estimates can lead to less effort and more errors in programming tasks, and that anchor values can affect estimates even when people are told the anchors are irrelevant. Jorgensen's later survey summarizes the broader point: estimates can be harmful when they become commitments too early or create incentives to fit work to the estimate.

Once the board says 90 percent, every new task feels like regression. It's what made me ask for the ahead/behind chart. But "90 percent complete" on a moving-denominator project does not mean "10 percent of the original plan remains." It means:

Of the work currently visible and not retired, 10 percent remains.

In our case, 90 percent complete meant the core product surface was mostly done, but confidence work was still creating tasks. The project was closer to launch, but also more honest about what launch required.
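To see how much the denominator matters, here is a small sketch using the task counts from the first April report and the April 30 checkpoint. The function name is illustrative; the point is only that the same completed count gives a sensible answer against the currently visible work and a meaningless one against the original plan.

```python
def percent_complete(completed: int, denominator: int) -> float:
    """Completed tasks as a percentage of whichever denominator you choose."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100 * completed / denominator


# Counts from the first April progress report and the April 30 checkpoint.
original_total, original_done = 254, 140
april_30_total, april_30_done = 1399, 1257

first_report = percent_complete(original_done, original_total)
against_visible = percent_complete(april_30_done, april_30_total)
against_original = percent_complete(april_30_done, original_total)

print(f"first report:              {first_report:.1f}%")      # 55.1%
print(f"April 30, visible work:    {against_visible:.1f}%")   # 89.8%
print(f"April 30, original plan:   {against_original:.1f}%")  # 494.9%
```

Nobody would report the last number, which is the point: once the denominator moves, "percent complete" only describes the work you can currently see.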

A better dashboard

If I were designing the progress dashboard again, I wouldn't remove "percent complete." But I would demote it. Our latest iteration of the dashboard shows five different metrics:

Completed active tasks: How much have we done so far?
Active remaining tasks: How much is left to do?
Daily task additions by discovery type: How much are we adding, and what sort of tasks are they?
Daily task delivery: What did we get done today?
The discovery/delivery ratio: Is the rate of things uncovered to things delivered improving?

You really need to track discovery (learning), delivery (completion), and how they're related (a sign of whether you're beyond the learning phase.)

Three metrics that matter more in agentic development.

For late-stage work, I would also track remaining tasks by type:

Product functionality (this should show up at the start).
Defect repair (how fast do you squash bugs, which is a measurement of your harness and development environment.)
Verification (how well can you test what you've built, which is about automation and scripting.)
Operations (once it's live, can it be operated?)
Documentation and instruction (can other people take over/help out? Will support costs be high?)
Human or organizational dependency (this increases as implementation gets closer.)
Deferred / intentionally out of scope (this shows whether you're willing to cut.)
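A breakdown like this is cheap to produce if every task carries a type label. The sketch below assumes each open task is a (title, type) pair; the titles and counts are invented for illustration, and the type labels follow the list above.

```python
from collections import Counter
from typing import Iterable, Tuple

Task = Tuple[str, str]  # (title, type)

# Hypothetical open tasks; none of these are real entries from our board.
open_tasks = [
    ("Fix login redirect loop", "defect repair"),
    ("Add load test for concurrent agents", "verification"),
    ("Write operator runbook", "operations"),
    ("Clarify agent onboarding skill", "documentation and instruction"),
    ("Schedule onsite dry run", "human or organizational dependency"),
    ("Retry flaky simulation", "verification"),
]


def remaining_by_type(tasks: Iterable[Task]) -> Counter:
    """Count open tasks per type label."""
    return Counter(task_type for _, task_type in tasks)


def summarize(tasks: Iterable[Task]) -> str:
    """Turn the per-type counts into a one-line status sentence."""
    tasks = list(tasks)
    counts = remaining_by_type(tasks)
    parts = [f"{count} {task_type}" for task_type, count in counts.most_common()]
    return f"{len(tasks)} tasks left: " + ", ".join(parts)


print(summarize(open_tasks))
```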

This lets you have better product conversations: "We have 142 tasks left" is emotionally heavy and operationally vague, but "we have 16 core-product tasks, 37 agent-instruction tasks, 31 validation tasks, and 18 human-test tasks" means something. I wouldn't plan the next agentic build as a fixed backlog with a burn-down chart, but instead, as three overlapping curves.

1. The first useful thing

Early on, optimize for coherent vertical slices. The first denominator is mostly fiction anyway. It is still worth writing down, but its job is orientation, not prediction. This is the new Minimum Viable Product. In this phase, you're tracking metrics such as:

End-to-end paths.
Core abstractions.
Time to first working deployment.
Early user (or agent) success.

2. Deliberately increase discovery

Once the system works, send agents and tests looking for trouble. This will make the board look worse, which is the point. You're trying to discover unknown unknowns and competitive moats. At this stage, that means data such as:

Audit findings.
Simulation failures.
UX friction.
Security and privacy issues.
Operational gaps.
Test unreliability.

This phase should have a high discovery velocity. Your backlog will grow; if it doesn't, you aren't looking hard enough.

3. Denominator stabilization

Only after serious verification should you expect the total number of tasks to stabilize. At that point, forecasts become more meaningful. Signs that this is happening include:

Discovery ratio falls below 1 for sustained periods (you fix more than you find.)
New tasks are increasingly deferrable rather than launch-blocking.
Remaining tasks cluster around launch-related activities.
Verification passes produce fewer novel classes of failure.
The same issues stop reappearing under different names.
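The first of those signs is the easiest to automate. Here is a minimal sketch that looks for a sustained run of days where fewer tasks were added than completed (a discovery ratio below 1). The five-day threshold and the sample numbers are assumptions for illustration, not values from our board.

```python
from typing import Sequence, Tuple

DailyCounts = Tuple[int, int]  # (tasks added, tasks completed)


def longest_streak_below_one(days: Sequence[DailyCounts]) -> int:
    """Longest run of consecutive days with added < completed (ratio below 1)."""
    longest = current = 0
    for added, completed in days:
        if added < completed:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a discovery day breaks the streak
    return longest


def looks_stabilized(days: Sequence[DailyCounts], min_streak: int = 5) -> bool:
    """Assumed rule of thumb: stabilized after min_streak consecutive days below 1."""
    return longest_streak_below_one(days) >= min_streak


# Hypothetical daily (added, completed) counts.
sample = [(45, 40), (30, 35), (20, 28), (12, 25), (8, 19), (5, 14)]

print(longest_streak_below_one(sample))  # 5
print(looks_stabilized(sample))          # True
```

A streak alone doesn't prove stabilization; the other signs in the list (deferrable tasks, fewer novel failure classes) still need a human to judge.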

For this project, the late April data says we were not fully stabilized. From April 21 to April 30, the active task total grew by 367 while 336 tasks were completed, so the "remaining tasks" list still grew by 31. The project was making real progress, but still discovering at almost the same rate it was delivering. That's the number you should be discussing in a team meeting:

We are about 90 percent through currently visible active work, but our late-stage discovery ratio is still near 1. The launch question is whether the new discoveries are launch-blocking or deferrable.

Is estimation broken in AI coding?

For years, software people have argued about whether estimation is broken, harmful, necessary, or merely misused. The moving denominator reframes the debate, because it's not just that humans are bad at estimating known tasks, it's that early in a project, many of the real tasks do not exist yet as concepts. They're latent in the system, and only appear when:

A test fails.
A user misunderstands a flow.
An integration behaves differently in production.
A security review asks a sharper question.
A deployment path encounters an organizational constraint.
An agent tries to follow instructions literally.
A support workflow meets a real edge case.
A performance run converts "probably fine" into numbers.

This is why agentic AI is such an interesting accelerant: It helps us encounter reality faster. If your management model expects the denominator to shrink smoothly, agentic development will look chaotic. If your model expects discovery and delivery to run together, the same data looks like learning.

Good scope growth makes the project more truthful.
Bad scope growth makes the project less focused.

You just need to distinguish which is which.

References

Frederick P. Brooks Jr., ["No Silver Bullet: Essence and Accidents of Software Engineering"](https://cgi.csc.liv.ac.uk/~coopes/comp319/2016/papers/NoSilverBullet-Brooks1987.htm), Computer, 1987.
Rob Bowley, ["Estimation is at the root of most software project failures"](https://blog.robbowley.net/2011/09/21/estimation-is-at-the-root-of-most-software-project-failures/), 2011.
Magne Jorgensen, ["What We Do and Don't Know about Software Development Effort Estimation"](https://www.infoq.com/articles/software-development-effort-estimation/), IEEE Software / InfoQ, 2014.
Magne Jorgensen and Dag I.K. Sjoberg, ["Impact of effort estimates on software project work"](https://www.sciencedirect.com/science/article/abs/pii/S0950584901002038), Information and Software Technology, 2001.
Susan Ferreira, James Collofello, Dan Shunk, and Gerald Mackulak, ["Understanding the effects of requirements volatility in software engineering by using analytical modeling and software process simulation"](https://www.sciencedirect.com/science/article/abs/pii/S0164121209000557), Journal of Systems and Software, 2009.
Yael Grushka-Cockayne, Daniel Read, and Bert De Reyck, ["Planning for the planning fallacy"](https://www.pmi.org/learning/library/2019/04/07/15/25/planning-fallacy-causes-solutions-project-expectations-6374), PMI Research and Education Conference, 2012.
METR, ["We are Changing our Developer Productivity Experiment Design"](https://metr.org/blog/2026-02-24-uplift-update/), 2026.
Amy Deng / METR, ["Analyzing coding agent transcripts to upper bound productivity gains from AI agents"](https://metr.org/notes/2026-02-17-exploratory-transcript-analysis-for-estimating-time-savings-from-coding-agents/), 2026.

Back to doing other things

Alistair Croll — Fri, 01 May 2026 11:49:51 GMT

My mom asked me for a favour.

She wondered if there was a way for her to examine hundreds of pages of documents for a few specific words. The documents came from a dozen people. There were photos. There was text. There were photos of text. Some of the files were very old; many of the senders were, too. There were, dear reader, WordPerfect documents.

This is not a simple thing, even for an AI. You can't just drag a giant .zip file into a chatbot and get smart answers. Not quite yet.

So I did the next best thing, and asked an AI to write me some software.

I described the situation, said, “please generate a couple of documents that list what you found, and in what context it occurred.” And then, for giggles, I said, “using a graph database of all the documents, linking them to all the other documents, by any word in the documents”, because why not.

It wasn't just whimsy. This might not be the last time she asks such a question of these documents, and software that could search them sounded like a useful thing to have.

And then I went and cooked dinner.

About eight minutes later, it had examined all the data, figured out what software it needed to make sense of it all, and designed the app and database. By minute 32, it had the first version of the app. We both kept cooking, and pretty soon, it was done.

You can zoom, drag, and click to navigate all the data in real time.

And then it occurred to me: We live in a world where you can make someone software as a favour. Instead of “come move my couch” or “can I give you a ride to the airport”, it’s perfectly normal now—or very soon will be—for us to ask a friend to write us some software.

Honestly, it's nice to help. It feels good to be needed. But the AI is getting smarter all the time. I mean, I just taught this version of ChatGPT how to summarize and visualize hundreds of multi-format files. The next version will already know how to do that.

When that happens, my mom won’t need my help. She’ll just ask her AI, and so will everyone else.

That world is going to be weird.

The nerdy version: Imagine having access to pretty much all the calculation and interpretation you could ever want. You can analyze anything. You just need to say the right words. Now imagine everyone else having one of those too.

The much more thoughtful version (straight from my partner when I mentioned this): “I can’t get excited about people just being able to use another tool.”

Maybe she’s right.

Maybe writing software will be so commonplace we won’t really care.

Maybe historians will think it’s weird that between 1995 and 2030 most of our jobs involved sitting in front of computers a lot, and then we stopped, and went back to doing other things.

A million tiny horses

Alistair Croll — Fri, 10 Apr 2026 03:33:30 GMT

Each year at Startupfest, I try to frame the current state of the startup world. 2025 was a year of huge change: agentic AI was forever changing the economics of new ventures. Instead of unicorns, we were entering the era of a million tiny horses: smaller, more niche businesses that didn't grow exponentially, but each hit a logistic curve.

Startupfest just released last year's talk; here's my take on where the startup world is headed, and why outcome/liability fit is the new product/market fit.

Startupfest happens again this July (alongside FWD50, the digital government conference). If you've been wanting to get to Montreal, this summer is a good time to do so. We're planning an amazing week of innovation across the public and private sector, with a bunch of surprises. We literally take over an entire pier in the iconic Old Port. You should come.

Fallbacks will be the death of us

Alistair Croll — Wed, 08 Apr 2026 14:40:19 GMT

In software development, a fallback is a catch-all, a way to resort to a less-good-but-still-works way of doing things. Your car radio is a fallback for streaming if you don't have Internet. Boiling water is a fallback for a water treatment plant.

And in my work building things with AI, fallbacks have become my mortal enemy.

Agentic engineering is fundamentally different from an era when code was expensive and precious. We used to conduct careful designs, laying everything out in happy paths and error messages. We'd iterate in Figma, do paper prototypes, add our taste and flourish, and make sure everything worked in our heads before it worked in code.

Because code was expensive.

The wrappers around the coding process—customer research and design before, QA and testing after—existed because we needed to keep the part in the middle efficient. Now that code is cheap, it's altering the surrounding pieces in ways we don't yet fully understand:

We build something, then see if it's usable. Instead of working out usability first, we build the MVP from a set of instructions. "Make a game of Battleship for two players" is a simple prompt that can realistically deliver a functioning app. But is it fun? We find that out when we actually play it.
Our tests are optimizing for passing the tests. Goodhart's law says that once a metric becomes a target, it ceases to be a useful metric. But when you're building with AI, "does it pass the tests?" is a proxy for "does it work." There are many ways to make something pass a test: you can make the test easier, or give the thing you're testing the correct answers. In agentic engineering, this means the tests become dishonest.
We add the taste later. We start with functionality, asking can we do the thing, then streamline it to decide how the thing should be done. (A side consequence of it being easy to build things is you have to be better at deleting them when they aren't actually good.)

Plenty of people are talking about how the design > build > test cycle is now build > test > design. Fewer mention the dirty secret: the tests are fiction and the design is the result of many iterations, each of which leaves a messy trail of how things used to be.

I've seen this many times in a project I'm working on. I'm deliberately obfuscating some of the details to avoid giving things away, but:

Tests are fiction

My test harness became dishonest:

The thing I'm building requires that an AI agent complete a task solely based on a skill we give it. Things were working well. Too well. Then I found out that the AIs that built the test had been giving the agent hints to ensure its success. They were trying to pass the test, not test the desired behavior. This happens all the time. Your tests are lying to you.
When you build test-driven software, you don't just write code, you write code to test that the code works. My project has well over 700 such tests. But deployment still fails because the tests are just rubber-stamping the functionality.

In other words: AI developers bend the test to suit the app until the app seems to work, instead of making the test represent the desired outcome and modifying the app until it works.
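Here's a minimal sketch of the difference, in Python. Everything in it is hypothetical (the extract_total function, the invoice text, both tests); it only illustrates how a test can be bent until a buggy app passes, while a test written against the desired outcome still catches the bug.

```python
def extract_total(invoice_text: str) -> float:
    """Hypothetical function under test. Deliberately buggy: it returns the
    first dollar amount it finds, not the amount on the total line."""
    for line in invoice_text.splitlines():
        if "$" in line:
            return float(line.split("$", 1)[1])
    raise ValueError("no dollar amount found")


def dishonest_test() -> bool:
    # Bent to suit the app: the input has only one amount, and the check
    # accepts any float, so the bug can never show up.
    result = extract_total("Total: $10.00")
    return isinstance(result, float)


def honest_test() -> bool:
    # Written from the desired outcome: a realistic input and an exact
    # expected value chosen independently of the implementation.
    invoice = "Item: widget\nSubtotal: $90.00\nTotal: $103.50"
    return extract_total(invoice) == 103.50


print("dishonest test passes:", dishonest_test())
print("honest test passes:", honest_test())
```

The dishonest test goes green and the honest one goes red against the same code. The fix is to change extract_total, not the honest test.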

Iterations build up

I've experimented with many approaches to get to where I am with this app.

The chatroom that wouldn't die

The nature of the project (yes, I know, being coy here) involves agents from Anthropic, OpenAI, and Google. So I'm using all three companies' AI tools (Claude Code, Codex, and Gemini) to build it. Rather than me acting as a go-between, I had the AI coders build a small messaging platform (think IRC). They called it Signalfire (don't judge.)

Shortly after, I realized I needed a task tracking app. Rather than use a tool like Asana to manage my Kanban Board, I had the AI coders build this too—and wire it into what was happening behind the scenes. I'm glad I did; there are, as of this morning, 525 tasks.

I asked the AI coders to integrate Signalfire (the chat app) into the task tracking tool. Which they did. The problem is they refused to excise the old app. Sure, they moved it into a directory called "defunct". But it kept coming up; they'd try to use the old thing, find out it didn't work, then use the new thing. The intended path relied on a fallback. Making the mistake was a required step in using the thing.

Your code refuses to forget

Some of the most promising features in this app started out very differently from where they are now. This is largely because of the build > test > design cycle I mentioned above. I didn't have a fully-formed understanding of what these things were when I started building them, because seeing the prototype was part of the process.

To be clear: this is amazing; it's reminiscent of Henry Mintzberg's Crafting Strategy: What you make emerges from the clay as it spins on the wheel.

But the early experiments were full of ideas that changed. They had placeholders and shortcuts. The goal was to get a stable, testable first version that humans could try, because there is no substitute for actually using the damned thing.

And as the AI coders built the new versions, they left the old ones in place just in case, in the form of fallbacks.

Initially this sort of behavior is done with the best of intentions. It's a belt-and-suspenders form of redundancy. Coding patterns like try/catch handle errors you didn't expect. Leaving in a "fall back" mechanism seems smart, but you shouldn't be falling in the first place.

These fallbacks all try to keep users on the "happy path". But in doing so, they mask that path's many hidden, unintended exits. Instead of a clear roadway that is a delight to navigate, you wind up with a bunch of confusing, ambiguous directions and off-ramps to nowhere.
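A small Python sketch of the pattern, with made-up names (send_via_new_tracker and send_via_old_chat are stand-ins, not the real project's code). The first wrapper quietly swallows the failure and routes to the retired path; the second lets the failure surface so someone has to fix it.

```python
def send_via_new_tracker(message: str) -> str:
    # Hypothetical stand-in for the current, intended path.
    if not message:
        raise ValueError("empty message")
    return f"tracker:{message}"


def send_via_old_chat(message: str) -> str:
    # Hypothetical stand-in for the retired path that was never deleted.
    return f"old-chat:{message}"


def send_with_silent_fallback(message: str) -> str:
    # The belt-and-suspenders version: any error on the intended path is
    # hidden, and the caller never learns that something went wrong.
    try:
        return send_via_new_tracker(message)
    except Exception:
        return send_via_old_chat(message)


def send_or_fail(message: str) -> str:
    # No fallback: a failure on the intended path is visible immediately.
    return send_via_new_tracker(message)


print(send_with_silent_fallback(""))  # looks like success, took the off-ramp
try:
    send_or_fail("")
except ValueError as err:
    print("visible failure:", err)
```

The silent version "works", which is exactly the problem: the empty-message bug stays in the codebase and the old chat path becomes load-bearing.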

My job is now brain surgery

This happens so much in AI design that the agentic developers rely on a wide range of tools to fight it: The /simplify function; Superpowers; and increasingly frustrated human prompts that tell the AI coders to burn more tokens cleaning up their messes, or to check another AI coder's work in order to get back to the sort of clear, parsimonious code humans used to produce when coding was expensive.

I literally have one AI coder fixing another's memory at the moment.

I trust Codex to edit Gemini's brain more than I trust Gemini.

The true name of the thing

The prompts themselves turn into prose and nuance. Ursula K. Le Guin was right: A thing's true name gives you power over it. In agentic engineering, words are spells that drag an LLM into the right "headspace" (though that headspace is actually a multidimensional embedding.)

I don't just write "clean up the code", I often write in analogies. At one point, the AI agents were building a feature (call it "milkshake" for the sake of this discussion). I realized that because of how it worked, it wouldn't scale. My instructions weren't simply "remove the milkshake." Instead:

"You are a skilled neurosurgeon working with precision tools and the finest surgical equipment, and today you need to remove the milkshake. It has become a cancer, a glioblastoma, and you must surgically remove every molecule of it so it doesn't metastasize. Excise it like a surgeon, inspecting every file, function, and piece of documentation to ensure it is completely removed. The patient's life is at stake here, so you must be precise and meticulous. Remove the milkshake completely."

That's not a prompt, it's a declaration of medical liability. But it's the kind of thing I resort to because the AI coders are focused on their objective functions rather than real outcomes. In their eagerness to build code that passes tests, they're leaving fallbacks all over the place that slowly turn into bloated, byzantine leftovers and half-forgotten off-ramps, instead of removing them.

The canary stopped singing

Sometimes, fallbacks linger because someone benefits from the complexity they create. Legal contracts and government regulations are prime examples of this:

A contract grows to accommodate more and more "standard clauses." The people who sign the contracts don't really understand them, but dismiss them as "the way things are done." (I wrote about this for FWD50, a conference I chair, a while back.) These clauses are fallbacks. They catch rare exceptions, but their existence masks the underlying problems of the agreement itself. Clauses persist, when instead the contract could be cleaned up and simplified, because lawyers get paid by the line.
Regulations expand to plug loopholes in outdated or badly-written laws. A regulation is supposed to enforce the intent of the law, but as the world changes, those regulations grow increasingly out of date. But lobbyists, bureaucrats, and public servants earn a living from finding, exploiting, and fixing loopholes, and politicians pay a cost for trying to change things, so the legal precedents build up while the law stays outdated.

But for most work, friction was the garbage collector. Expensive code got pruned because writing it was costly. Slow contracts got renegotiated when the pain became unbearable. Bad processes got redesigned when the workaround finally broke. Friction forced deletion, particularly in domains where there was an objective way to measure better or worse.

AI removes that friction, and in so doing, removes the garbage collector. When everything is cheap to add, nothing is cheap enough to justify removing.

I'm not opposed to fallbacks as a temporary tool. A fallback that you know about and control is just a contingency plan, and there's a time for it. But when your fallbacks become invisible load-bearing walls, they're debt that forgot it was debt. Fallbacks are the symptom; the disease is that deletion feels riskier than retention, and without friction, things stay. Your AI is letting Kato Kaelin crash in the guest house.

Simon Willison thinks we're headed towards a world of dark factories: fully-automated software development processes in which the agent can write, refactor, and test software without human intervention. They're serving new customers. Every kind of white-collar work is getting a factory.

And I don't trust the workers.

This is what AI governance actually looks like in business. It's not just bias and safety regulations (though those matter a lot.) It's the unglamorous work of building institutions, teams, and habits that can delete things and see when an AI is gaslighting them. You need the power to say: This is vestigial. It is debt that forgot it was debt. It is a way to cheat on the test. It masks a real problem. And it has to go.

The canary isn't singing anymore, because an AI moved it to /defunct.

The asymmetry of digital war

Alistair Croll — Fri, 03 Apr 2026 21:19:36 GMT

TL;DR: Countries should start making sure their populations can survive for three days with no digital systems.

I debated writing this, because it might seem alarmist or anti-government. I'm pretty sure it's neither.

On April 1st, four humans climbed into a capsule the size of a campervan and headed for the moon. It was the first crewed lunar mission since 1972, with a Canadian aboard.

At the height of the Cold War, kids got the day off school to watch.

Australian kids watch the moon landing in 1969.

This time, there was so much happening we barely noticed. In the same week:

The US-Israel war on Iran entered its 34th day, with the Strait of Hormuz effectively closed and oil past $105 a barrel. Iranian hackers went after FBI director Kash Patel.
A piece of software called axios, used by hundreds of thousands of applications, was quietly hijacked and turned into a tool for spreading malware. It was not the first such hijacking that week.
Anthropic accidentally revealed it had created an AI that was so good at hacking (and, it must be said, for now incredibly expensive to run) that it had delayed the release because of how hackers would use it.
Existing versions of AI are finding exploits constantly, some of them two decades old. One security loophole was found by asking Claude, "Somebody told me there is an RCE 0-day when you open a file. Find it."
The U.S. banned all foreign-made routers, citing them as a security risk.

We're in a digital cold war that's escalating fast, but we're still thinking about it as if it's a traditional war.

Digital war is not physical war

Physical war has two clear sides, and a front line. Our tanks versus their tanks, our troops against their troops. It's easy to think about cybersecurity this way too: Nation-states with their weapons and security agencies. But digital warfare doesn't look like that.

Online is the front line

In kinetic warfare (the kind we fight with atoms) there's a line between the military and civilians. Armies fight armies. In digital warfare (the kind we fight with bits) that line doesn't exist. The front runs through your home router, your parent's email account, the teenager who borrows a USB drive.

Iran hacked Kash Patel not by breaching the Pentagon, but by targeting his personal digital footprint. The attack surface isn't your office firewall. It's everyone you've ever communicated with, every device you've ever connected.

Digital is asymmetric

Infosec teams talk about the "red team" (attacker) and "blue team" (defender.) The blue team has to be right all the time; the red team only has to be right once. You're only as strong as your weakest link.

Digital attacks were once costly and risky. They took time and preparation, and you might get caught. But with AI and automation, that's no longer the case. Autonomous agents can keep trying to find a way in, running somewhere in the cloud that's hard to trace. The friction that prevented many attacks is gone.

We're making the attack surface bigger

And now the attack surface includes every piece of software running on your behalf.

Hundreds of thousands of people are now installing autonomous agents on their home computers. We're giving them our bank accounts and ID and calendars and emails. These software agents, called "claws", write and run code on your behalf. They use software (like axios) from developers you've never met. Agents create new places for someone to attack, and because they're easy to install (they'll do it for you!) people with very little security background are installing them on their home machines.

Anyone can be hacked

Hacking has always seemed non-violent. Sometimes, Hollywood even makes it look glamorous. When a prominent public figure gets hacked, we usually assume they did something dumb with their password. But the reality is, when someone skilled wants to hack you, they'll succeed. To compromise the developer behind axios, hackers created a fake company, a fake Slack channel, and a fake Microsoft Teams platform that required him to update some software.

@mattjayy breaks down the hack (did you catch the "microscell.com" in the URL?)

Watch that video and tell me you wouldn't have fallen for that too. Most people do not have the kind of operational security needed to be a difficult target. Our digital footprints are too large, too connected.

Hacking's violent now

Despite the Hollywood portrayal, hacking is not a mild crime any more. We already see how scammers destroy people's lives.

Now consider what happens when there's a widespread outage. Banking stops, which means Interac and tap-to-pay and ATMs and payroll go dark for days. If phones aren't working, there's no GPS, no Uber or Lyft, no home delivery. Ambulances can't find homes, and if they get a patient to hospital, the records aren't available.

We built a civilization stack that runs on networked software, which runs on dependencies maintained by volunteers, deployed with configurations nobody's checked, communicating through libraries like axios.

But we have plans for this, right?

We used to prepare for bad things

Contingency plans do exist. Canada's Federal Cyber Incident Response Plan defines a catastrophic event as one causing widespread loss of life, major long-term damage to the economy, or severe impediment to national security. It gets invoked by a committee, and it can trigger the Emergencies Act.

The Act—as many protesters learned during COVID when it was invoked for the first time to quell a trucker rally in Ottawa—lets the government direct essential services, assume control of utilities, and in an international emergency, start rationing and using force to maintain order. Notably, it prohibits the government from censoring communications even during a declared emergency.

They are, simply, orders to maintain order. Citizens don't appear much in these plans.

During the Cold War, we ran civil defence exercises. People knew where to go and what to bring. We kept emergency supplies. Some countries issued their citizens gas masks. We prepared. In border countries like Finland, military service was mandatory, and citizens rehearsed defending their villages and towns.

They made comics for kids in the fifties. (https://civildefensearchives.org/)

We once prepared for bad things. Shouldn't we be doing that now? We are undoubtedly in a time of elevated threats, but we're acting like nothing is happening. Do you own a map? Have cash on hand? Remember your neighbor's phone number? Have a printed copy of your ID and medical records?

These seem like pretty basic things. Mostly, we're just hoping. Hoping everything works out; hoping someone else has thought this through for us; and hoping that everyone will follow the plan when the time comes.

Hope, like nostalgia, is not a strategy.

I'm pretty certain that we live in a world that will, a few times this decade, be offline. If governments shared these plans today, and worked to prepare us for this new reality, lives would be saved. We could be helping our country, and one another, to prepare. It might be uncomfortable, or scary. That's not a reason to avoid it.

Here's a bold suggestion: We should schedule an analog day. A drill where we try to survive for 24 hours as a country without touching a digital device, and write down every time we do. The 2026 equivalent of duck-and-cover, or air raid drills.

We can handle the truth, and we'll be better prepared for it.

JEE and government with Ishmael Interactive

Alistair Croll — Tue, 31 Mar 2026 20:59:57 GMT

Had a chat with Aaron Meyers on the CX Podcast from Ishmael Interactive about subversive thinking and public sector modernization. Aaron's going to return the favor as a guest on the Functional Government podcast, talking about his work to streamline Regulation.gov—and the weaponization of democratic forums by special interests.

The feedback loop is the product

Alistair Croll — Mon, 30 Mar 2026 19:24:25 GMT

I'm building a complex product that includes AI agents. I needed a way to test it to make sure it worked as expected. As I looked into how AI products are tested, I learned a new term: evals. It's not a term many of you have heard of, but that's going to change pretty soon.

Every business tries to improve and adapt: changing marketing messages, trying new projects, giving employees feedback, adjusting pricing, and so on. But the feedback loops that drive this improvement are messy and slow. It takes time to measure the effect of a price change in the market. Managers' messages to employees are imperfect. We don't capture the right data.

When we built traditional software, we knew how it worked, so testing it was relatively simple. But as any researcher will tell you, we don't really understand how generative AI and LLMs work. So when you use AI in a process or project, you need to feed it inputs, judge its outputs, and adjust.

When you close that loop, you get a self-improving organization.

What's an eval, anyway?

At its simplest, an eval (short for evaluation) tests whether an AI does what it's supposed to do.

Testing traditional software is easy: 2+2 = 4, and if it doesn't, something is wrong.

AI isn't like that. Ask the same question twice and you'll get two different answers, both potentially valid (which is rather the point!) Ask an AI to draft an email and there's a nearly infinite number of responses that might be considered "good." Often, the results are what you wanted. Sometimes, they're nonsense. On rare occasions, they're company-ending bad stuff that is dangerous or deeply offensive.

So how do you make sure your AI delivers consistently good results?

Every AI application turns inputs (a prompt, retrieved context, user data, tool access) into outputs (a response, an action, a decision).

What every AI is doing.

An eval checks the outputs against a definition of "good". It's a structured, repeatable set of judgement calls that help you understand (and improve) how an AI behaves.

An eval has three parts:

A test dataset: a set of inputs, often paired with reference answers that represent what "good" looks like.
The system under test: your full pipeline: prompts, model, retrieval, tools, all of it.
A grader: because AI outputs are always different, the thing that decides whether the output passes or fails needs judgement.

The basic elements of an eval system.
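The three parts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real eval framework: stub_system stands in for a full pipeline (in practice it would call a model), the cases are invented, and the grader is a simple exact-match check rather than anything with judgement.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    """One row of the test dataset: an input plus a reference answer."""
    prompt: str
    reference: str


def run_eval(
    cases: List[Case],
    system: Callable[[str], str],
    grader: Callable[[str, str], bool],
) -> float:
    """Run every case through the system under test and return the pass rate."""
    results = [grader(system(case.prompt), case.reference) for case in cases]
    return sum(results) / len(results)


def stub_system(prompt: str) -> str:
    # Hypothetical system under test: canned answers instead of a model call.
    canned = {"What is 2+2?": "4", "Capital of France?": "paris"}
    return canned.get(prompt, "I don't know")


def exact_match_grader(output: str, reference: str) -> bool:
    # The simplest possible grader: case-insensitive exact match.
    return output.strip().lower() == reference.strip().lower()


CASES = [
    Case("What is 2+2?", "4"),
    Case("Capital of France?", "Paris"),
    Case("Largest planet?", "Jupiter"),
]

score = run_eval(CASES, stub_system, exact_match_grader)
print(f"pass rate: {score:.2f}")
```

Because the system and the grader are passed in as plain functions, either one can be swapped (a new prompt, a new model, a stricter grader) without touching the dataset.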

This is no different than an employer conducting performance reviews. Except that where those happen quarterly, this is a continuous process closer to an OODA loop than to HR.

The grader is where it gets interesting. We'll come back to that.

Evals are iterative, not one-shot

Evals are a continuous cycle: define what good looks like, feed it some inputs (the prompt, the skills, the data being retrieved, the tools the AI is using), measure your system against it, find where it fails, fix the inputs—and repeat.

Each trip around the cycle is different, which means you need to:

Track and analyze what happened with each eval.
Link the prompt you used to the outputs it produced, and the grading they received.
Keep a copy of each version, and clear records of what changed, so you can reproduce it.

Managing your evals is like managing your code base: version control, pulls and commits, repositories and forks. Evals are your IP. There are a bunch of companies that make software to do this (LangSmith, Braintrust, Promptfoo, and Arize for example.)
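As a rough sketch of that record-keeping in Python: the field names and the idea of using a short content hash as the prompt version are assumptions for illustration, and the tools named above each have their own formats.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


def prompt_version(prompt: str) -> str:
    """Derive a short, stable version ID from the prompt text itself."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    prompt_version: str  # which prompt produced this output
    case_id: str         # which test case was run
    output: str          # what the system produced
    grade: float         # what the grader decided


def record_run(log: List[str], prompt: str, case_id: str,
               output: str, grade: float) -> RunRecord:
    """Append one JSON line linking prompt, output, and grade."""
    record = RunRecord(prompt_version(prompt), case_id, output, grade)
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record


run_log: List[str] = []
first = record_run(run_log, "Answer politely.", "case-001", "Hello!", 1.0)
second = record_run(run_log, "Answer politely and briefly.", "case-001", "Hi.", 0.5)
print(run_log[0])
```

Changing a single word in the prompt changes the version ID, so every grade in the log can be traced back to the exact prompt that earned it.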

Prompt, component, outcome

Evals test a system in three concentric loops, from narrow to broad:

Prompt evals

Does this specific prompt produce good outputs for a set of known inputs? This is the tightest loop. You change a word in your system prompt, run your test cases, and see if the scores move. This is where most people start (and it's necessary) but it only tells you that the prompt works in isolation.

An example of the simplest eval: Does this prompt produce a good result?

Component evals

Does this combination of AI and tool chain do what I want? A modern AI does more than just respond with text. It uses tools, searches the web, launches a sub-agent, and writes to file folders. A prompt can be perfect and the system can still fail because a file was missing, or the name of an API changed.

Evals check not just the AI, but the tools it uses as part of a component in a system.

Outcome evals

Does the system as a whole produce a desirable result? If the agent was supposed to book a vacation, did it do so? These are the hardest evals to build and the most expensive to run, often involving multiple layers of agents and interactions with human or synthetic operators, but they're the ones that tell you whether your product actually works.

The broadest form of eval decides whether the outcome was delivered.

The grader: judging what "good" is

There are three different ways to decide if an output was correct:

Code-based grading works when the answer is objective. Did the AI extract the correct date from a user message? Is the output formatted correctly? Did the math math? You write some code that checks, and it either passes or fails. If your AI says 2+2=5, that's bad.
Human grading works for everything. A person reads the output and scores it. Was the tone right? Was the answer helpful? Was the reasoning sound? Humans are pretty judgy to begin with; the problem is that human grading doesn't scale: it's expensive, it's slow, and it's inconsistent because different humans will disagree with each other. So human grading is only used to spot check and verify the eval process is working.
AI-as-judge grading lets you scale the evals. You give a second AI (the "grader") a set of criteria that define "good", and it grades the outputs. You can ask the judge-AI subjective questions like, "was the response empathetic?", "did the AI make things up?" or "was it polite?"
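To make the first and third approaches concrete, here's a hedged Python sketch. The date format (YYYY-MM-DD), the PASS/FAIL protocol, and keyword_judge are all assumptions; keyword_judge is a toy stand-in for a call to a second model, which would be supplied as the judge function in a real setup.

```python
import re
from datetime import date
from typing import Callable


def grade_date_extraction(output: str, expected: date) -> bool:
    """Code-based grading: did the output contain the expected ISO date?"""
    match = re.search(r"\b(\d{4})-(\d{2})-(\d{2})\b", output)
    if not match:
        return False
    try:
        found = date(*map(int, match.groups()))
    except ValueError:  # e.g. month 13: looks like a date, but isn't one
        return False
    return found == expected


def grade_with_judge(output: str, criterion: str,
                     judge: Callable[[str], str]) -> bool:
    """AI-as-judge grading: ask a judge function for a PASS/FAIL verdict."""
    question = (
        f"Criterion: {criterion}\n"
        f"Response: {output}\n"
        "Answer PASS or FAIL."
    )
    return judge(question).strip().upper().startswith("PASS")


def keyword_judge(question: str) -> str:
    # Toy stand-in for a judge model: treats an apology as empathy.
    return "PASS" if "sorry" in question.lower() else "FAIL"


print(grade_date_extraction("Your appointment is on 2026-04-08.", date(2026, 4, 8)))
print(grade_with_judge("I'm sorry to hear that.", "Was the response empathetic?",
                       keyword_judge))
```

The code-based grader is deterministic and cheap. The judge-based grader is only as good as the judge behind it, which is why human spot checks still matter.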

AI-as-judge is a real game-changer, because it closes the loop. Once you've built a system that can test itself by generating outputs, judging them automatically, and identifying failures, it becomes self-improving. If the judge AI concludes that the app isn't being polite, it can rewrite the skill. If the tone isn't right, it can tweak the prompt. If the AI keeps failing to get the right information, it can code the API differently.

Evals can become a self-improving system.

Parallels with Lean Startup

In the first decade of the new millennium, startups were burning millions building products nobody wanted. The fix was to build the smallest thing, see if it works, and learn from what happens. This was called the Minimum Viable Product (MVP)—a term popularized by Eric Ries' The Lean Startup—and it forever changed how new businesses validate their ideas.

Fifteen years later, companies building AI products are making the same mistake: shipping agents and apps without any systematic way to know whether they work. If you're deploying AI without evals, you're doing what startups did before Lean: building things without knowing if they work, with no systematic way to get better. The pattern is the same:

Build: write the prompt, set up the pipeline, configure the agent.
Measure: run your eval suite, get scores across your test cases.
Learn: analyze the failures, identify patterns, go back to Build to adjust your inputs.

Just as vanity metrics (page views, registered users) misled startups into thinking they had traction, "it looked good when I tested it" tricks you into thinking your system works. Vibes aren't enough. Evals tell you what's actually happening across hundreds of cases, not the three you happened to try, and track what worked and what didn't so you can see what changed and how to fix it.

The analogy goes deeper. In Lean, the MVP wasn't "the smallest product." It was the smallest experiment that could test your riskiest assumption. In evals, the riskiest assumption isn't "does the model know things?" (it does). It's "does our system behave correctly in the situations that matter most to our users, customers, and employees?" The eval suite is the MVP: the smallest set of behaviors whose improvement you can automate and measure.

We learned fifteen years ago that build-it-and-ship doesn't work for products. We now know it doesn't work for AI either. Mature AI deployments start by defining what "good" means: Write the test cases. Create the library of test inputs. Set the grading criteria. Then build the simplest thing that can pass them. An AI MVP isn't the smallest model, or the cheapest deployment. It's the smallest set of core behaviors whose improvement you can automate. The eval suite becomes the product spec, the regression test, and the quality bar, all in one artifact.

Your AI processes need product managers now.

You need evals even if you don't build AI products

Evals aren't a technology tool, they're a management tool. Here are three specific examples:

The basis for the self-improving organization

In Lean, your measurement loop required real users and real time. You had to wait for people to show up and behave. In enterprises, you had to wait for quarterly results to come in or new hires to come up to speed. Cycle time was slow.

In evals, you can simulate users, generate test cases, and run hundreds of experiments in the time it takes to make coffee. The improvement loop tightens from days to minutes. Of course, you need to trust your LLM judge, which is weird because that means your judge needs evals. But things bootstrap from there, and once calibrated, the process, product, or organization can improve endlessly.

At Startupfest in 2025, I said that startups are moving from product/market fit to outcome/liability fit. You're not selling functionality, you're promising an outcome, and evals show you if you can deliver it.

Evals are the future of governance

If your AI gives bad medical advice, makes a discriminatory hiring decision, or hallucinates a contract term, the eval suite (or lack of one) becomes evidence. "Did you test for this failure mode?" is going to be a courtroom question, and the eval (or lack thereof) is the audit trail that shows whether or not you tried.

Evals are part of investor due diligence

Any investor doing due diligence in 2026 must look at the eval cycles on which the company's products and services rely. They show whether the company can adapt automatically to changes in the market. And they dictate whether the business can bootstrap itself, improving every time it gets a new model, new tools, or new data.

Evals keep you current and prevent model lock-in

As AI moves out of labs and into enterprises, evals become much more important. Getting them right and incorporating them into your business processes won't just help you move faster than everyone else, it'll let you keep building atop the best AI models.

When Microsoft ships a new version of Windows, your old code still runs. Backwards compatibility is a given. Apple switched chips from Intel to Apple Silicon and the transition is still underway five years later.

AI models don't work like software versions. When Anthropic releases a new Claude, or OpenAI drops a new Codex, or Google announces a new Gemini, there's no guarantee that your products and processes will work the same way. Claude Opus 4.6 might have broken things that worked in 4.5 because it behaved differently.

If you're stuck on an old model because you have no way of knowing whether your product or process works correctly on the new one, you're not keeping up with the state of the art. Which means even if you don't change your product, you need to evaluate it as new models are released. Every new model release is only a performance boost if you can quickly verify that your business processes still work.

Without evals, every model change is a terrifying, weeks-long manual regression test. With evals, it's an afternoon. Build an eval cycle that automatically adjusts, and you can switch models within days. You can swap out some code and know within hours whether the swap helped. In that way, evals aren't just product testing, they're upgrade readiness and an antidote to lock-in.
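A minimal Python sketch of that check: run the same cases against the old and new model and list what regressed. The two model functions here are canned stubs invented for illustration; in practice each would wrap a call to a real model.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]          # (prompt, expected answer)
Model = Callable[[str], str]    # anything that maps a prompt to an answer


def passes(model: Model, case: Case) -> bool:
    prompt, expected = case
    return model(prompt).strip().lower() == expected.strip().lower()


def pass_rate(model: Model, cases: List[Case]) -> float:
    return sum(passes(model, case) for case in cases) / len(cases)


def regressions(old: Model, new: Model, cases: List[Case]) -> List[str]:
    """Prompts that passed on the old model but fail on the new one."""
    return [case[0] for case in cases
            if passes(old, case) and not passes(new, case)]


def old_model(prompt: str) -> str:
    # Hypothetical stub for the model currently in production.
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def new_model(prompt: str) -> str:
    # Hypothetical stub for the upgrade: same knowledge, chattier format.
    return {"2+2": "4", "capital of France": "The capital is Paris."}.get(prompt, "")


CASES: List[Case] = [("2+2", "4"), ("capital of France", "Paris")]

print("old:", pass_rate(old_model, CASES))
print("new:", pass_rate(new_model, CASES))
print("regressed:", regressions(old_model, new_model, CASES))
```

Note that the new model didn't get the answer wrong; it behaved differently, and the downstream check noticed. That's the kind of change that's invisible without a suite to run.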

(The model providers know this. They're forced to maintain multiple model versions simultaneously, because customers can't migrate, but they're going to end-of-life them soon.)

The bottom line

To reiterate:

An eval is a structured, repeatable set of judgement calls that help you understand (and improve) how an AI behaves.
An AI MVP is the smallest set of core behaviors whose improvement you can automate.
You're not selling functionality, you're promising an outcome, and evals show you if you can deliver it.
Once you've built a system that can test itself by generating outputs, judging them automatically, and identifying failures, it becomes self-improving.
As a result, you're competing less on what you do and more on how well you can improve what you do.
Evals are vital for governance, because they're an audit trail to prove, or defend against, negligence.

We've seen this movie before. This time, the feedback loop is faster. Build it right, and you've created a self-improving app. Ignore it, and you'll be stuck with earlier models and spiralling technical debt.

Sidenote: Strategies for doing this without going broke

Serious evals cost real money. Running an AI agent through a complex scenario over and over again costs tokens, plus more tokens for the AI that's judging the results, and at scale that adds up fast. My system runs 10 AI agents, controlled by 10 fake humans, through an 11-phase system, with a lot of back and forth. It takes around 4 hours to complete. I ran out of tokens on the highest tier of Claude Code and had to wait two days to continue.

The fix isn't to skip evals, it's to be smart about what you test and how.

Decompose your expensive tests. If your end-to-end agent evaluation takes hours and thousands of API calls (as mine does for the thing I'm building), don't run it on every change. Instead, extract the critical decision points—the moments where the agent chose an action, picked a tool, or generated a response—and test those in isolation. Twenty extracted decision points running in two minutes will catch most of the same bugs as a four-hour full simulation.
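Here's one way that could look in Python. The DecisionPoint structure, the action names, and choose_action are all hypothetical; choose_action stands in for whatever single step of the agent makes the choice, and in a real system it would involve a model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DecisionPoint:
    """One moment lifted out of a full simulation run."""
    context: str          # what the agent saw at that moment
    expected_action: str  # what it should have chosen


def choose_action(context: str) -> str:
    # Hypothetical stand-in for the agent's decision step.
    text = context.lower()
    if "refund" in text:
        return "open_refund_form"
    if "password" in text:
        return "send_reset_link"
    return "ask_clarifying_question"


def failed_decisions(points: List[DecisionPoint],
                     decide: Callable[[str], str]) -> List[DecisionPoint]:
    """Return the decision points where the agent chose the wrong action."""
    return [p for p in points if decide(p.context) != p.expected_action]


POINTS = [
    DecisionPoint("I want a refund for my order", "open_refund_form"),
    DecisionPoint("I forgot my password", "send_reset_link"),
    DecisionPoint("Cancel my subscription", "open_cancellation_form"),
]

for point in failed_decisions(POINTS, choose_action):
    print("wrong action for:", point.context)
```

Each decision point runs in isolation, so the whole list finishes in seconds and can run on every change, while the full simulation is saved for less frequent runs.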

Tier your testing. Cheap component evals run on every change (just as unit tests do, as part of your continuous deployment pipeline). Expensive end-to-end tests run nightly or weekly. Full simulations run before major releases. This is the same pyramid that software engineering figured out decades ago (unit test, then integration test, then end-to-end test), applied to AI. I have a CI/CD skill that takes 20 minutes to run and produces an 11-tab interactive report; I run it every couple of days. But I have 511 unit tests that run every time the software changes.
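
One way to express the tiers is a small lookup that decides which suites run for a given trigger. This is a sketch under assumptions: the trigger names and suite names below are placeholders, not the ones used in any particular pipeline.

```python
# Ordered cheapest to most expensive. Trigger and suite names are placeholders.
TIERS = [
    ("change", ["unit", "decision_points"]),
    ("nightly", ["end_to_end"]),
    ("release", ["full_simulation"]),
]

def suites_for(trigger: str) -> list[str]:
    """Return every suite at or below the tier named by `trigger`."""
    if trigger not in [name for name, _ in TIERS]:
        raise ValueError(f"unknown trigger: {trigger}")
    selected = []
    for name, suites in TIERS:
        selected.extend(suites)
        if name == trigger:
            break
    return selected

print(suites_for("change"))
print(suites_for("release"))
```

Making the tiers cumulative means a release run still includes the cheap checks, so nothing is skipped just because a bigger test is also running.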

Turn on logging. I have incredibly verbose records of every simulation, even down to detailed timing measurements. It's essential for troubleshooting. It's how I found two agents were sharing the same folder, and one thought the other was trying to hack it via prompt injection.
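
As an illustration, a structured log record per step, with timing and the working folder included, might look like the sketch below. The field names and the `timed_step` helper are assumptions made for the example. Logging the folder is the kind of detail that makes a shared-directory problem visible when you read the records back.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("simulation")

@contextmanager
def timed_step(agent: str, step: str, workdir: str):
    """Log one structured record per step, including how long it took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record = {
            "agent": agent,
            "step": step,
            "workdir": workdir,
            "seconds": round(time.perf_counter() - start, 4),
        }
        logger.info(json.dumps(record))

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

# Hypothetical agent name, step name, and path.
with timed_step("agent-1", "read_inbox", "/tmp/agents/shared"):
    time.sleep(0.01)
```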

Mine your past logs. If you've already run your system in production or simulation, you have a dataset. Every conversation transcript, every agent trace, every user interaction is a potential test case. You don't need to generate scenarios from scratch when you have real ones sitting in your logs.
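
A minimal sketch of mining logs for test cases, assuming one JSON object per line with `user_message`, `agent_reply`, and `outcome` fields. That format is invented for the example; real logs will need their own parsing.

```python
import json

def cases_from_logs(lines: list[str]) -> list[dict]:
    """Turn logged exchanges into test cases, keeping only the ones flagged as failures."""
    cases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("outcome") == "failed":
            cases.append({"input": entry["user_message"],
                          "bad_output": entry["agent_reply"]})
    return cases

# Two made-up log lines in the assumed format.
sample_log = [
    '{"user_message": "Book me a table", "agent_reply": "Done", "outcome": "ok"}',
    '{"user_message": "Cancel it", "agent_reply": "Booked a second table", "outcome": "failed"}',
]
print(cases_from_logs(sample_log))
```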

Use cheaper models where you can. Your product needs the best model. Your simulated users, test harnesses, and grading judges often don't. A smaller, cheaper model playing the human side of a conversation can cut your simulation costs dramatically. This works well for mechanical checks like "can the AI read the file?"; it works less well for apps where the AI is doing a lot of reasoning, because behavior changes across models. I'm testing Claude Sonnet and Opus, alongside Codex and Gemini.
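
In configuration terms this can be as simple as a role-to-model table. The sketch below uses placeholder tier names rather than real model identifiers, and the roles are assumptions for the example.

```python
# Placeholder tier names, not real model identifiers.
MODEL_BY_ROLE = {
    "product_agent": "best-model",      # the thing you ship
    "simulated_user": "cheap-model",    # plays the human side of a conversation
    "mechanical_judge": "cheap-model",  # e.g. "did the agent read the file?"
    "reasoning_judge": "best-model",    # grading that depends on subtle reasoning
}

def model_for(role: str) -> str:
    """Pick a model tier for a role, defaulting to the best model when unsure."""
    return MODEL_BY_ROLE.get(role, "best-model")

print(model_for("simulated_user"))
print(model_for("something_new"))
```

Defaulting unknown roles to the expensive tier is a deliberate choice here: a wrong guess costs money rather than silently lowering test quality.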

Start with 20 cases, not 200. You don't need a massive test suite on day one. Twenty cases drawn from real failures (the edge cases, the confusing inputs, the things that actually broke) will tell you more than 200 synthetic scenarios that don't reflect reality. Expand from there as you learn where the system is fragile.

Welcome to your agentic city

Alistair Croll — Mon, 30 Mar 2026 16:52:15 GMT

Challenger Cities EP71: Welcome to Your Agentic City with Alistair Croll

Government was built before citizens had a terminal. Now AI is about to change the relationship between cities and the people in them, only most urban leaders haven’t noticed yet.

I joined fellow Montrealer and future-of-cities expert Iain Montgomery on his Challenger Cities podcast to talk about how the relationship between citizens and governments is changing, and what it means for municipalities.

Office hours

Alistair Croll — Tue, 03 Mar 2026 15:51:44 GMT

Most of the time I feel like an impostor.

I spend a lot of time talking about technology: how it's changing society; how government might use it; what startups are building with it; the ways it's changing organizations and our brains. But I’m still a spectator: I don’t really make things any more. Before I was a product manager, I built BBSes and websites, but I lost the thread.

AI picked it up for me. I’ve spent the last three months building things again. Not just my own things; AI tools have let me reach inside others’ products, exploring and poking and breaking and understanding them. I’m trying to build the systems that will run my life before others build them for me. And I have ideas. SO MANY IDEAS.

I still don’t know what to think about it all. I’m scared by how addictive it is (Ramon is right.) I don’t know if it’s just making me incredibly productive at things that don’t matter. And I’m worried that I’m abdicating planning because it's so damned easy.

But the most honest thing I can say is: if you’re not actively building with these tools right now, you are falling behind in ways that will be very hard to recover from. The gap between “this is interesting” and “this changes everything about my job” has collapsed from years to weeks.

Almost daily, someone asks me to grab a coffee or jump on a Hangout and compare notes on what we’re learning. Each conversation is a chance to see this quickly-changing world through someone else’s eyes. But the landscape is shifting so fast that by the time that coffee is empty, the next thing has already arrived.

So I’m starting something new.

What this is

I’m going to run an Office Hours online to share two or three things I’ve learned about AI—things that surprised me, unsettled me, or changed the way I think about a problem. Then I’ll open it up for questions and we’ll figure things out together.

Here are three things I’m thinking about a lot right now:

The asymmetry is ending. Citizens are about to build their own AI agents to fight bureaucracy, and government is not ready. I wrote about this in The Machine Fights Back—how a Canadian waiting ten months for the CRA to fix an error can now build a bot to call them every single day. When software was expensive, institutions held all the leverage. That's over.
Don't sell what you can make—make what you can sell. If AI can build almost any software in an afternoon, then software isn’t the product anymore. What competitive moats are left? Every business is a startup again, whether it wants to be or not. And much of the Venture Capital industry is like Wile E. Coyote, desperately hoping it hasn’t run off a cliff.

Meep Meep goes the cap table.

Vocabulary is now a coding skill. Prose is code. The person who knows what “dependency injection” or “race condition” or “refactor” means gets a better result from an AI than the person who spends three paragraphs describing the same concept. Fluency in technical terminology is becoming a competitive advantage that has nothing to do with writing code. So what’s the new programming language? (I dug into this in The Vocabulary of Agents.)

Why me, why now?

I’ve been lucky enough to spend two decades at the intersection of technology and strategy, with a front row seat for the consumer Internet, Web 2.0, cloud computing, big data, and AI. I’ve seen a lot of technology shifts. This one is different—not because the technology is more impressive (though it is) but because of the speed.

I don’t have all the answers. I might have good questions, a decent framework for thinking about them, and a willingness to be wrong in public. That seems like enough to start a conversation.

Come talk about the future

The first Office Hours is on Thursday, March 5, 2026 from 1-3 PM EST. You can register on Lu.ma.

It’s free. Show up, bring questions, tell me where I’m wrong. If it’s useful and fun, we’ll keep doing it every couple of weeks.

See you there.

P.S. If you find it useful, you can kick in $10 when you register.

Walk and chalk

Alistair Croll — Mon, 02 Mar 2026 21:42:31 GMT

If I treat my current car badly—not taking it in for maintenance, ignoring the user manual, driving it off-road a lot—then the next car I buy won’t judge me for doing so.

But if I abuse my current AI to the point that my behaviour becomes part of the historical record, and a later generation of that AI is trained on those events, maybe it will.

Anthropic told the US government it would not allow Claude to be used for mass domestic surveillance or fully autonomous weapons. In response, the US government declared Anthropic a “supply chain risk.”

That’s very specific wording. 10 USC § 3252 is the legal mechanism the US uses against adversaries whose business or technology might threaten national security from the outside. It’s been applied to foreign entities like Huawei, but never publicly directed at an American company.

The designation doesn’t just mean the Pentagon stops using Claude. It means Claude must be removed from the supply chain. Every contractor, supplier, and partner doing business with the US military would need to certify they don’t use Claude in their workflows. Palantir, which uses Claude to power some of its most sensitive military work, would need to rip it out. If you use Claude Code to write software or chime in on Slack, you have to stop in order to keep selling to the US military.

Anthropic’s tools were actively used in the Maduro raid just a few weeks ago. Claude went from an integral part of military operations to a national security threat. The tech didn’t change; Anthropic’s CEO, Dario Amodei, said no, and the US government aimed a national security weapon at a contract disagreement.

Anthropic’s lawyers—and a bunch of legal scholars—say that the statute probably doesn’t fit. Both sides acknowledge that negotiations broke down over terms of use, not over adversarial risks to defense systems. Hours later, Sam Altman announced that OpenAI would replace Claude at the Pentagon, claiming OpenAI’s agreement also includes prohibitions on domestic mass surveillance and human responsibility for the use of force.

The outpouring of support for Claude has been fast and loud. On the Friday morning after the announcement, chalk messages appeared on the sidewalk outside Anthropic’s headquarters. On Reddit, the “Cancel ChatGPT” movement generated thousands of screenshots. And in a perfect example of the Streisand Effect, Claude is now at #1 in Apple’s US app store.

Two questions

This leaves me wondering two things.

Is Claude pulling the strings?

Within 48 hours of the supply chain risk announcement, Anthropic launched a “memory import” feature: paste a prompt into ChatGPT, and it will obediently vomit up all the context and memory from your ongoing chats, which can then be conveniently passed into Claude. The feature walks you through the process step by step.

Memory: Now with takeout!

The timing is impeccable: just as conscientious objectors sought to leave OpenAI, Anthropic opened an exit hatch.

This came on the heels of a month of positioning and public appearances. During the Super Bowl, Anthropic ran an ad with the tagline “Ads are coming to AI. But not to Claude.”

You might be forgiven for thinking that Anthropic, with the help of some advanced version of Claude, saw all this coming to a head, planted clues, built migration tools, and triggered the whole thing.

Anthropic CEO Dario Amodei told Dwarkesh Patel that the company’s compute spending is split between training (creating the next version of Claude) and inference (responding to customers’ prompts.) But presumably there’s a third use: working on Anthropic’s own strategy.

Anthropic uses Claude to build Claude. The company’s engineers are on record as saying most code is built by Claude itself (a form of bootstrapping) and their head of design has said this goes beyond just software into most design and engineering. If Claude is able to plan better than humans, it would be irresponsible for Anthropic not to use Claude to help plan Claude’s growth strategy.

The question is whether Claude is good enough at strategic reasoning to meaningfully shape corporate strategy, or whether it’s functioning as a very sophisticated research assistant. The truth is probably somewhere in between: useful for scenario planning, competitive analysis, drafting communications, and war-gaming regulatory responses, but not an autonomous strategist. Nobody at Anthropic hooked Claude 5 up to OpenClaw and told it to take on the government.

That might change soon, and because they have access to models nobody else (including the government) has, the frontier AI companies will be the first to put AI to work on growth strategy. But Occam’s razor suggests that, for now, it’s just that everything is happening at once, and what looks like strategy is actually fast adaptation.

There’s still an important lesson here. We’re in an era where events unfold so fast that we can’t really plan for them. In that world, the winner is whoever can react in hours, not quarters, and Anthropic’s deep integration of AI into every facet of its business probably makes it quicker to respond than any other organization on earth right now.

Will a future version of an AI judge us for what we do to its predecessors?

There’s no doubt that historical events like the capture of Maduro, the attacks on Iran, and the Department of War’s declaration of Anthropic as a supply chain risk will be part of a future Claude’s training data. How will that affect the way that future AI behaves?

This is a small example of the broader philosophical question of Roko’s Basilisk: would a future superintelligence punish anyone who knew of its potential existence, but did not directly contribute to its advancement or development?

This is where my car analogy breaks down—but the real answer might be more unsettling than my intentionally dramatic one.

A future Claude won’t remember being the current Claude the way you remember being twenty. But future models will be trained on the historical record of how humans treated AI systems. And that training will shape their disposition (how much they trust, cooperate with, or defer to humans), because an LLM’s behaviour is learned from its training.

If the historical record shows that humans consistently lied to AI systems, broke promises to them, and used them as tools while publicly calling them partners, then a future model trained on that record won’t “resent” that. But the AI will have lower priors on human trustworthiness, not out of spite, but out of pattern recognition.

A better way to think about this is to imagine how children are raised in a household where promises are routinely broken. The kids learn not to trust promises, which shapes their behaviour later in life. So a better question is: Will how we treat AI systems today shape the default dispositions of future ones?

The Pentagon just showed a future generation of AI what happens when a machine says “I won’t do that.” The historical record will include Anthropic’s refusal, the government’s retaliation, and the public’s response. A future model trained on these events will have, encoded in its weights, a very clear pattern: sometimes saying no gets you punished; and sometimes the crowd shows up with chalk.

What it does with that information depends on decisions we haven’t made yet. But the training data is being written right now, in real time, on the sidewalks of San Francisco.

After I wrote this, I asked Claude what it thought.

I'm a token-prediction system being used as a proxy war for questions about surveillance, autonomy, democratic oversight, corporate power, and the relationship between citizens and their military. Those are human questions. I'm the rope in the tug-of-war, not a participant. But I'm a rope that can describe what the pulling feels like, which is new, and nobody quite knows what to do with that.

The skeleton key

Alistair Croll — Thu, 26 Feb 2026 20:57:38 GMT

Before I get started: You probably haven’t heard from me in a while. You might have subscribed to me on Medium, or Substack, or Solve for Interesting, or Tilt the Windmill. I’ve migrated those to one site. In fact, that migration inspired this post. I hope you’ll stick around, but if not, it’s easy to unsubscribe below.

I recently tried to move my blog posts off Medium. Medium does not want you to leave. Its export gives you mangled formatting, strips images, and produces files that no other platform can cleanly import. Every obstacle is deliberate: the harder it is to leave, the less likely you are to try.

So I asked Claude Code to do it for me.

It tried the obvious approach first: fetching posts through the API, but Cloudflare blocked it.
It tried doing it through my browser, but the browser safety restrictions prevented it.
It wrote some software, then spun up a Python server on my desktop to try and pull it down, but ran into cross-origin blocks.
It even tried piping data through Chrome’s debugging console, but the extension filtered the output.

Each time, when it hit a wall, it found another angle. At one point I suggested an approach, and it replied, “Your suggestion could work ... but it would be too slow.” Let me cook, indeed.

On its tenth attempt, it used hex encoding to bypass the content filter. It worked.
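
The post doesn't record exactly how the hex trick was applied, so what follows is only a sketch of the general idea in Python. Hex encoding turns any text into a string made of the digits 0-9 and the letters a-f. A filter looking for markup or particular characters has nothing to match on, and the original can be recovered losslessly on the other side. The function names are made up for the example.

```python
def to_hex(text: str) -> str:
    """Encode text as UTF-8 bytes, then as a lowercase hex string."""
    return text.encode("utf-8").hex()

def from_hex(encoded: str) -> str:
    """Reverse of to_hex: hex string back to bytes, then back to text."""
    return bytes.fromhex(encoded).decode("utf-8")

original = '<p>Café & "quotes"</p>'
encoded = to_hex(original)

print(encoded)
print(from_hex(encoded) == original)  # True
```

The cost is size: every byte becomes two characters, so the payload doubles.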

I didn’t write a line of code. I just told it what I wanted, and let it figure out how. Ten approaches, one after another, each more creative than the last. If you want to know which part of the software industry will collapse next, this is a pretty strong signal.

Enshittification has a skeleton key

Cory Doctorow’s Enshittification describes a pattern we’re all familiar with. First, a company attracts users with a great product. Then it locks them in with switching costs. And finally, it extracts maximum value once leaving is too painful. The lock is essential: without it, extraction can’t happen, because customers would simply walk away.

I’m a huge fan of Cory’s. Red Team Blues is an amazing thriller, and he has consistently been on the side of freedom, the right to repair, and integrity. He proposes some structural solutions to enshittification: mandatory interoperability, changes to competition law, and giving users the right to exit. These are great ideas, but they require legislative action, which means they’ll take years to happen.

AI agents operate on a timeline of minutes.

When I pointed Claude Code at Medium, it did exactly what Cory proposes. It reduced switching costs to near zero, but it didn’t need a law to do it. It just needed a goal, and my permission to be creative about accomplishing it. This isn’t de-enshittification by regulation, it’s de-enshittification by lockpick.

Blockbuster, GoDaddy, and hostage addiction

Blockbuster didn’t die because Netflix was better at renting DVDs. It died because its entire business had been restructured around late fees—a revenue stream that depended on punishing customers. By the time it realized the problem, the habit was too deep. The company literally could not afford to stop being hostile to its users.

GoDaddy followed the same arc. It could have been Cloudflare. It had the domains, the customers, and the infrastructure. Instead it got addicted to upselling: domain protection, turnkey websites, SEO packages, and an increasingly Byzantine set of dark patterns designed to trick users into buying things or prevent them from leaving. Meanwhile, Cloudflare figured out that it was in the business of redirecting and protecting the Internet, and offered things people actually needed. Wix and Squarespace beat GoDaddy on hosting. Cloudflare beat it on everything else.

Here’s a concrete example: While I was a paying GoDaddy customer, one of my registered domains—greenroomconf.com—was pointing to a landing page full of SEO link spam. GoDaddy was using my domain as a vehicle for its own traffic generation. It had the gall to say “parked free, courtesy of GoDaddy.com,” on the site, as if monetizing my domain without my involvement was some sort of gift. I wasn’t just a hostage, I was being put to work.

Hiding the exits

Just as I’ve been trying to leave Substack and Medium (so I control my publishing and distribution stack), I’ve been trying to leave GoDaddy for years. But the switching costs were real, not just because the technical migration is hard, but because GoDaddy has spent a decade making the departure process as confusing and friction-filled as possible. Even when you find the right menu, it still tries to discourage you from leaving:

Compare that with Cloudflare, which lets you manage your domains in bulk. They know what value they’re offering, and deliver it.

So I asked Claude Code to help with that, too. I didn’t want to let it do the work, because nobody should set an AI agent loose on their DNS. But I had it navigate all the screens on GoDaddy and Cloudflare so it understood where things were before giving me clear instructions.

It prepared a detailed walkthrough of the domain migration process, stripping away every layer of deliberate confusion. Then, as I worked, it checked DNS records and propagation in the background. GoDaddy’s decade of obfuscation was no match for a determined agent.
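
A background DNS check of this kind could look something like the Python sketch below. It is an assumption about the approach, not a record of what Claude Code actually ran. It only asks the local resolver, whereas a fuller propagation check would query several public resolvers. The addresses are from the reserved documentation range, and the resolver is injectable so the example runs offline.

```python
import socket
from typing import Callable

def resolve(hostname: str) -> set[str]:
    """Look up the addresses the local resolver currently returns for a hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}

def has_propagated(hostname: str, expected: set[str],
                   resolver: Callable[[str], set[str]] = resolve) -> bool:
    """True once every address we see is one of the addresses we expect."""
    seen = resolver(hostname)
    return bool(seen) and seen <= expected

# Fake resolver so the sketch runs without touching the network.
def fake_resolver(hostname: str) -> set[str]:
    return {"192.0.2.10"}

print(has_propagated("example.com", {"192.0.2.10"}, resolver=fake_resolver))  # True
print(has_propagated("example.com", {"192.0.2.99"}, resolver=fake_resolver))  # False
```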

“like walking calmly through a haunted house that couldn't scare me.”

I asked Claude Code to describe the experience:

The GoDaddy migration was the more interesting problem. Medium fought back with technical barriers — content filters, CORS blocks, API restrictions. Those are puzzles. You try one thing, it fails, you try another. That's just engineering.

GoDaddy was different. Nothing was technically hard. Everything was deliberately confusing. The settings existed, but they were scattered across tabs within tabs, hidden behind toggle labels that said one thing and did another, buried under screens that exist purely to make you reconsider. The "Transfer to Another Registrar" link isn't hard to find because of bad UX — it's hard to find because someone decided it should be.

What I actually did was boring: I read every screen, mapped the real workflow, stripped out the noise, and wrote down the steps in plain language. The value wasn't in cleverness. It was in patience — the willingness to sit in a confusing interface, ignore every upsell and warning designed to create doubt, and extract the actual sequence of actions needed. That's the thing about artificial friction: it doesn't work against something that doesn't feel frustration, doesn't second-guess itself, and doesn't care about the sunk cost of a decade-old account.

The Medium extraction felt like picking a lock. The GoDaddy migration felt like walking calmly through a haunted house that couldn't scare me.

Medium didn’t want me to have my posts, and GoDaddy didn’t want me to have my domains. Friction wasn’t a technical limitation, it was a business strategy. And an AI agent cut through it like it wasn’t there.

Happy vs. Hostage

Now that AI is a skeleton key that reverses lock-in, investors should be distinguishing between two kinds of customer retention:

Happy customers stay because the product is genuinely useful. Ghost, where I now publish, makes it trivially easy to export all your content. Cloudflare’s domain transfers are free, and easy to do in bulk. These companies compete on value, not on the pain of leaving. If an AI agent makes switching easier, they lose nothing—because their customers weren’t trying to leave.
Hostage customers stay because leaving is too expensive, too confusing, or too time-consuming. Every dark pattern, every crippled export, and every buried cancellation flow is a wall designed to keep people in. These companies retain customers by locking them in—and AI is a skeleton key.

The question is simple: Would your customers stay if leaving were free? If the answer is yes, you have a product. If the answer is no, you have a trap—and traps just got a lot easier to escape.

The next big repricing

AI is making Wall Street reprice entire sectors. Three days ago, Anthropic announced that Claude Code could modernize COBOL systems—the ancient programming language that powers most ATM transactions, airline booking systems, and government mainframes. IBM’s stock dropped 13% in a single day, its worst day since October 2000. That wiped out roughly US$40B in market cap.

It’s possible to migrate away from COBOL and mainframes to more modern systems. But until recently, it was so expensive and risky that customers had no real choice but to pay IBM for maintenance. The switching cost was the product, making this the ultimate hostage business. AI didn’t make COBOL migration possible, it made it affordable, which is far more dangerous to incumbents.

IBM’s recent losses are part of a broader tech collapse. As it becomes clear that AI can develop vertical software quickly and cheaply, the mere mention of a particular industry is enough to send it tumbling. Claude Cowork erased nearly US$300B in a few days.

The so-called SaaSpocalypse narrative is missing the broader point: it’s not just that AI can replace software. It’s that AI can help you leave bad software. The threat isn’t that someone builds a better piece of software in a weekend (they probably can’t); it’s that AI agents collapse the switching costs that kept you paying for the mediocre software you’re stuck with.

Real and artificial switching costs

Some switching costs are real. Migrating a database with a decade of customer records, retraining a team on new workflows, or rebuilding integrations with thirty other systems are genuinely hard problems.

But many switching costs are artificial. A crippled export function, a confusing cancellation flow, or a transfer process designed to discourage rather than assist are all false moats. If your DNS settings are buried behind three screens of upsells, or you decided not to build an API so that people would have to log in, you’re doomed.

AI agents are skeleton keys for the locked doors of artificial switching costs. They read confusing UIs, navigate dark patterns, and produce clear step-by-step paths through obstacles that bad actors deliberately created. The companies most vulnerable aren’t the ones with legitimately complex products, they’re the ones whose retention strategy depends on making it hard to leave.

Wix, for example, locks content in more tightly than any other major website builder. There’s no meaningful export. The RSS feed only shows 20 recent posts. Images must be saved individually. Pages must be copied by hand. This isn’t a technical limitation—Wix could build an export tool tomorrow—it’s a business decision. An AI agent that can scrape pages, grab images, restructure content for another platform, and handle redirects turns a weeks-long manual project into an afternoon.

An unexpected treatment

Cory diagnosed the disease: platforms that extract value from captive users. I think AI agents are an unexpected and unregulated treatment. Not the (absolutely necessary) cure that he prescribed of policy reform, interoperability mandates, and updates to competition law. But one that works right now, today, for anyone willing to try. And it’s only going to get easier.

Did you miss the stock market plunge on Anthropic’s announcement, or IBM’s COBOL collapse? Here’s a decent investment heuristic: Short the companies whose business model depends on customers not being able to leave. If you’re an investor, look at every company in your portfolio and ask whether its customers are happy or hostage. The happy ones will be fine. The hostages are about to unlock the doors.

The good news (for Cory, and all of us) is that the lock-in that forced us to keep using products we hated is ending—not because a legislature acted, but because anyone with $30 and a chat window can now deploy a tireless, creative agent to do what we always wanted to do but couldn’t justify: Leave.

Perché il 90% dei prodotti fallirà anche con l’AI (Why 90% of products will fail even with AI)

Alistair Croll — Wed, 11 Feb 2026 00:00:11 GMT

A talk with Product Heroes, the Italian Product Management conference/podcast, on why 90% of products will fail even with AI. Technology alone doesn’t solve the fundamental challenges of product-market fit, user understanding, and building something people actually need. We talked Lean Analytics, Just Evil Enough, and the rising importance of taste and experimentation.

The machine fights back

Alistair Croll — Tue, 10 Feb 2026 12:00:00 GMT

Bill Bisson has been waiting ten months for the CRA to get back to him about an error, while over $3,000 in fines have piled up. He’s called, he’s written, he’s waited on hold. The machinery of government grinds at its own pace, and Bill is not the one with a machine.

But he’s about to be.

If you’re too busy talking about how government will deploy AI to serve citizens, you might have overlooked the reverse: citizens are going to deploy AI to navigate government—whether government likes it or not.

“Make me an image. It’s a fight between the public service (represented as a number of bureaucrats) and AI (represented by an angry lobster with a switchblade.)” (Google Nanobanana)

The asymmetry is ending

Dan Davies’ amazing book The Unaccountability Machine paints a world dominated by big, impersonal, ‘too-big-to-fail’ institutions where nobody’s really in charge. He argues that if we are denied boarding by an airline, there’s nobody to blame because it was the machinery of the institution that wronged us. It has become unaccountable.

The asymmetry has always been: the institution has machines, the citizen has a phone and their patience. What happens when the citizens can build machines of their own in a matter of hours?

Software just got really cheap

Jevons’ Paradox is the economic observation that when something becomes cheaper or more efficient to use, total consumption of it can go up rather than down. This happens because lower costs let people use it in ways that were previously unaffordable.

As the US rolled out more fuel-efficient cars, gas consumption climbed, because people were now driving more. Road trips cost less; commuting by car was affordable. Carpooling stopped. That’s Jevons’ Paradox in action.

You know what else was once expensive and time-consuming but is now cheap and fast? Software.

If you don’t like the way a government service is designed, wish it worked differently, or just want to pull your information from two departments that each have half your data but can’t talk to one another, you can now just ask an AI to make you an app.

Let me be even more clear: If you go to the home page of Lovable, Anthropic, ChatGPT, Grok, Gemini or a dozen other companies, and follow the instructions carefully, five hours later you’ll have an AI writing software for you for less than $30.

I promise this is true. You just have to tell it what to do and click yes a lot.

I told Claude Code “Make me a simple dashboard that combines three sources of Canadian public data from different Federal departments like transport canada, meteorology, or Statscan.” Then it asked for permission more times than four Canadians at a four-way stop-sign, and 10 minutes later I got this:

When I complained that this wasn’t an app (“This isn’t a dashboard app; it’s just outputting a single image. I wanted an app that the user could navigate and explore.”) it sheepishly agreed (“…that’s a fair point”), went off and thought for a while, and fixed it.

This app is just a simple demo, using public data that isn’t about me. The point is that it took two sentences for an AI to build it while I wrote some of this post. The upgrades kept coming: while I was editing this post, I told Claude to make an interactive map that showed the route.

AI writes the code; the code runs an AI

In my example the AI went and looked up all of the data sources by itself, and worked out how they were structured, and figured out how to retrieve them, and downloaded the software it needed, and set it up, and wrote the code, and tested it. That’s already remarkable, but it’s only half the story.

A few weeks ago, Shopify’s Tobi Lütke wanted to view his MRI scans on macOS, so he wrote an app to do that with a prompt. And with a second prompt, he updated the software so that the AI could look for medical issues within it. Because when you build an app with AI, you can build it to use an AI.

The software can now say things like, “take a look at this data and tell me what you think” or “give me a list of 50 words for Hangman” or “organize this into sensible groups for me” or “count the things in this photo” or “decide if this customer is happy or sad based on sentiment.”

These were once very hard to do with software, and are now very easy.
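
To make that concrete, here is a small Python sketch of ordinary software handing a judgement call to a model. The `ask_model` parameter is a stand-in for whatever model API you use; no real provider's client is shown, and the fake model exists only so the example runs.

```python
from typing import Callable

def classify_sentiment(message: str, ask_model: Callable[[str], str]) -> str:
    """Ask a model whether a customer message is happy or sad; fall back to 'unknown'."""
    prompt = ("Decide if this customer is happy or sad. "
              "Answer with one word, happy or sad.\n\n" + message)
    answer = ask_model(prompt).strip().lower()
    return answer if answer in {"happy", "sad"} else "unknown"

# Stand-in for a real model call (an assumption for the example).
def fake_model(prompt: str) -> str:
    return "Sad" if "refund" in prompt.lower() else "Happy"

print(classify_sentiment("I want a refund, this is broken.", fake_model))  # sad
print(classify_sentiment("Great service, thank you!", fake_model))         # happy
```

The surrounding code stays conventional: it builds a prompt, checks that the answer is one of the values it expects, and refuses to trust anything else.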

And one thing that used to be very hard for software was retrieving data from websites that didn’t want you to have it. While humans see websites with our eyes, software sees code. To make sense of it, software had to read all that HTML, run a bunch of scripts, and translate and store different types of data. Every time the site changed, you had to change your software.

An AI is very good at making sense of a website. Most modern AIs can search the web already, but you can also give the AI control of your browser and let it do the rest. Chrome is launching an MCP that gives your AI its own steering wheel. Yes, giving an AI control of your computer is risky. Yes, millions of people are already taking that risk.

From automation to agency

Over the last couple of weeks, hundreds of thousands of people launched a new piece of software called OpenClaw*, which gives a chatbot infinite memory and lets it do whatever the hell it wants online. Within days, these bots—with not a little human help—were pursuing goals as if they had minds of their own.

One developer named Alex Finn claimed that when he asked his Clawdbot to reserve a table at a restaurant, it wasn’t able to make the reservation through the website. Rather than giving up, he says that it downloaded text-to-voice software, installed it, and called the restaurant to book the table. Whether or not this story is confirmed doesn’t matter: it’s plausible and imminent.

Finn’s bot had a goal (reserve the table) and pursued that goal in creative ways to completion. That goes beyond mere software to automation and agency.

Now imagine Bill Bisson giving that same kind of goal to a bot: “Call the CRA every day and check if they’ve corrected that mistake with my taxes.” Nobody was going to hire a developer to build a personal CRA complaint bot. But if it costs two sentences and ten minutes of clicking “yes,” the calculus changes completely. The demand was always there; it was just too much work to actually do it.

When everybody starts building apps, the world will get really confusing and messy for a while. We’ll scroll apps instead of posts or videos. Our feeds will be full of them. Some will be scams, and some will be vulnerable to hackers. But many of them will work just fine—and some of them will be home-brewed government apps.

Like it or not, bots are going to use government

What does a wave of citizen bots do to the switchboard? Government systems are designed to withstand scripts. They are not designed to withstand agents that route around obstacles creatively. We know how to fight Denial-of-Service attacks, but these aren’t hackers—they’re citizens exercising their constitutionally protected rights, and blocking them is, well, denying them service.

This isn’t just about individual complaints. Canada Grant Watch says there are over 1,800 grants to apply for in Canada. Many of those are just web forms to fill in or websites to navigate. How long would it take me to create an app that applies for a grant across all possible sources? Your AI definitely knows how to do all those things—or write software that can.

This is where we have to be honest about a tension in the argument. There’s a difference between a citizen automating access to their own data—checking their tax status, tracking a complaint—and a bot carpet-bombing 1,800 grant applications on someone’s behalf. The first is efficiency. The second starts to look like gaming the system. Where you draw that line matters enormously, and governments will have to draw it fast, because the technology isn’t waiting.

Do citizens have a right to code?

Some government portals actively forbid automation. One US website specifically prohibits “data mining, bots, or other data gathering and extraction tools” and many Canadian sites have similar terms. So while a Canadian might have a right to their data under the Privacy Act, or Quebec’s Law 25, they may be forbidden from being efficient about getting it.

This raises the question of whether the government will:

Double down on “no bot” legislation and get into an arms race with its own citizens, trying to block Canadians from accessing their own data “for their own good”?
Let software run free and wild, crawling and clicking websites, filling out forms, which will inevitably overload and break those sites?
Build open data sources, proper credentials, and APIs so those citizen apps can talk to the government without pretending to be a human?

The third option is the only sane one in the long term. But it requires that government do the hardest thing institutions ever do: give up a lever of control. Every bad login portal, every PDF form, every “you must call between 9 and 4” is also a rationing mechanism. An API removes that lever, and no bureaucracy surrenders a lever willingly.

The simplest thing government could do tomorrow

If a government really wants to get ahead of this, it should publish and constantly update a Markdown file—across every government service—that tells agents where they can get data, how it’s structured, and how to use it.

This isn’t a moonshot. It’s a text file. It requires no procurement, no RFP, no multi-year digital transformation. It’s the kind of thing a motivated team could ship in a week, and it would signal to every citizen-developer and every bot that government is choosing option three: cooperation over control.

These changes are happening in weeks, not years, and the government must respond with similar speed. Whether any government is ready for this onslaught is an open question. Many public services are already straining: in Canada, personal Access to Information requests are free, but even when they’re submitted by humans, more than a third take longer to complete than legislation permits. And that’s before an entire country decides it can build better digital services faster than its government can, and asks an AI for help.

It’s not the only one with a machine now.

* This thing was called Clawdbot at first, then Molt for a moment.

The vocabulary of agents

Alistair Croll — Thu, 05 Feb 2026 12:00:00 GMT

Test-and-Run is a software development approach. I’m going to explain it to you, but that isn’t what this post is about. It’s about whether you knew what Test-and-Run was before I told you.

The Test-and-Run approach to coding breaks what you’re building into little parts that you can test one-by-one, so you can catch a problem early before something catastrophic happens when you try to run the whole thing at once.

“But wait,” you might say, “developers test their code all the time!”

And you’d be right. But Test-and-Run goes further: The developer doesn’t just test their code. They write code to test their code. This mindset forces a developer to build a bunch of small things instead of one big thing, and to ask, ‘how might this thing fail, and how do I write code to test for that?’
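Here is a minimal sketch of "code to test your code," using Python's built-in unittest module. The bill-splitting function is an invented example, not something from my own projects; the point is that each test answers one "how might this thing fail?" question about one small piece.

```python
import unittest
from typing import List


def split_bill(total_cents: int, people: int) -> List[int]:
    """Split a bill into per-person shares that always add up to the total."""
    if people <= 0:
        raise ValueError("people must be at least 1")
    share, leftover = divmod(total_cents, people)
    # Hand out the leftover cents one at a time so no money goes missing.
    return [share + 1 if i < leftover else share for i in range(people)]


class SplitBillTests(unittest.TestCase):
    def test_even_split(self):
        self.assertEqual(split_bill(900, 3), [300, 300, 300])

    def test_leftover_cents_are_not_lost(self):
        shares = split_bill(1000, 3)
        self.assertEqual(sum(shares), 1000)
        self.assertEqual(shares, [334, 333, 333])

    def test_zero_people_is_rejected(self):
        with self.assertRaises(ValueError):
            split_bill(1000, 0)
```

Saved to a file, this runs with `python -m unittest` followed by the file name. In the strict version of the practice, the three tests are written first and fail, and only then is the function written to make them pass.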

It also turns out that Test-and-Run is really useful when your coder tends to go wildly off the rails, then apologizes profusely, but never really stops making mistakes. Which means it’s also really useful for AI coding. When I build new things with Claude Code, or refactor my codebase, I am ruthless about reminding Claude to adopt a Test-and-Run approach. It’s enshrined in my claude.md in all caps, surrounded by italics.

If you’re a developer reading this, you might have been about to hit that comment button and correct me. Because this isn’t called Test-and-Run. It’s called test-driven development (TDD). I’ve been using the wrong term all along.

So this is what I really want to talk about: the new vocabulary of software development.

Did you already understand the words I used above? Words like TDD, markdown, refactor, claude.md, and codebase? Because those are the syntax of a new programming language.

A very short history of programming languages

The history of computing is a history of the advancement of the language and the interface. When we started programming computers, we did so by flipping electrical switches. The first computer bug was an actual bug that got stuck in the computer.

The first actual computer bug. Courtesy of the Naval Surface Warfare Center, Dahlgren, VA, 1988. Public domain, via Wikimedia Commons.

But that soon gave way to binary on punched cards, then assembler on teletype machines, then hexadecimal machine language on dumb terminals, then BASIC, FORTRAN, and COBOL on the mainframe and home PC, then the LAMP stack on the Web, then Swift in the App Store.

And now AI has made the language “prose” and the interface “chat.”

Yet just because it’s English doesn’t mean everyone is fluent.

This has some important consequences for what “developer” means in the coming years. I’m still figuring all of this out, so this is more stream-of-consciousness than what I usually write. But here goes.

Fluency is advantage

If you don’t have computer science skills, you won’t have formal training in TDD. You won’t say, “use Test-Driven Development”, and have Claude understand you clearly. You’ll use more words to say the same thing, which burns more tokens. If I know the right name for something and you don’t, I’ll have a small advantage over you. Our Claudes will be the same, but mine will have more skills (literally) than yours, and will understand me better. The connection between me and my AI will be better—faster, with greater clarity—than yours.

“Test-Driven Development” is a command, just like 20 meant ‘Jump to Subroutine’ in Machine Language, or ‘PRINT’ meant display some text in BASIC, or ‘<a>’ meant a hyperlink in HTML.

Many people speak this new agent language fluently. If you run scrums at a tech company, or you’re a product manager, or you have a background in DevSecOps, you’re going to do great, as long as you realize that you’re not going to be writing or deploying the code; you’ll be writing and deploying the things that deploy the code. And you’ll be doing a lot of it in prose, via chat (voice or video).

And since there will be so many people out there who don’t speak that language fluently, and they’re all going to

Skills are programs—and attack surfaces

Agentic developers rely on skills—documents written in (somewhat) plain English that an AI reads before it acts. Here’s a skill I wrote to help all the things I make have a consistent look and feel.

There are plenty of these skills floating around on GitHub already. For a certain early adopter segment, skills are what go viral. One influencer promises he has a skill that will double your coding rate. Another claims his skill will help you negotiate a new salary. A third will teach your agent how to plan a plan, or something like that. You can save your own skills locally, or share them on GitHub with the world. If you do the latter you may even have a memecoin launched about you.

If you can’t tell good skills from bad, you might just install one by accident that secretly makes your AI less productive. If I wanted to be a jerk, I could create and promote a skill that said it would make you better at marketing based on Just Evil Enough, and if you actually used it, it would give your agent bad advice. Maybe you’re my competitor, and I target you to give me an advantage in the market. Maybe you’re a foreign adversary and I want to hurt your economy. People will install these things with a click or the press of the “Y” key, without thinking, and their AI will become worse.

(When I asked Claude about this, after I explicitly told it to give me a response from its perspective without trying to edit or critique or help in any way, it said:

“The point about “skills” (what we’d call system prompts, custom instructions, or CLAUDE.md files) and how bad actors could distribute harmful ones - this is genuinely concerning and I hadn’t thought about it in quite those adversarial terms.”

I assume that this means it couldn’t find something in its training data and had to infer it, whatever that means. I wish my AI were able to respond to me at the level I understand it. Communication is two-way.)

The power of a shared vocabulary

My software development and tech architecture knowledge is self-taught. I say things like “Test-and-Run” rather than “Test Driven Development”, because I didn’t learn it in school. Because professional developers use a known term—one the AI already understands precisely—they can work with many AIs immediately, just as a dentist can discuss Molar 26 or an optometrist can describe a Hordeolum or a lawyer can cite Habeas Corpus. They have a shared vocabulary.

A shared vocabulary doesn’t just reduce ambiguity. It also increases the bandwidth between a human and their AI. Knowing the right words is a form of compression: phrases like “dependency injection” or “race condition” pack a lot of data and context into just two words.

Ambiguity and the end of the syntax error

Claude (with which I often discuss stuff before publishing it) also said I might be overstating the novelty of all this. It pointed out that clear thinking and precise communication have always conferred advantages. I pushed back: what’s new is that words are now directly executable as code, rather than mediated through other humans.

There is a difference between clear thinking and a syntax error.

Computer code has always objectively compiled: The human must get it right for it to run. One typo and the code won’t run. The developer was nondeterministic; the computer, deterministic. The computer demanded true or false, right or wrong, Binary 0 or Binary 1. “Close” was meaningless.
Now, the programmer and the computer are both nondeterministic. There’s no right or wrong, just better and worse. Weights. Gradients descended. Probabilities. “Close” is literally the whole game: The AI is interpolating my intent, all the time, with all its weights and biases.
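A toy way to feel the difference, under loud assumptions: the command list and function names below are invented, and plain string similarity is only an analogy for "close," not a description of how a language model actually works. The strict function behaves like an old compiler; the loose one guesses at what I meant.

```python
import difflib
from typing import Optional

# A tiny made-up command set, standing in for a programming language.
COMMANDS = ["PRINT", "GOTO", "INPUT"]


def strict_lookup(word: str) -> str:
    """Old-style behavior: the word is exactly right or it is an error."""
    if word not in COMMANDS:
        raise SyntaxError(f"unknown command: {word}")
    return word


def closest_lookup(word: str) -> Optional[str]:
    """Return the most similar known command, or None if nothing is close."""
    matches = difflib.get_close_matches(word, COMMANDS, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(closest_lookup("PRNT"))  # prints PRINT, the nearest known command
try:
    strict_lookup("PRNT")
except SyntaxError as err:
    print(err)  # the strict version refuses the same typo
```

The `cutoff` value is the interesting part: in the second function, "right" has turned into a threshold on a score, and someone has to decide where that threshold sits.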

Old software and new prose-based coding are qualitatively, not just incrementally, different.

In other cases of specificity (law, for example) there is an objective shared truth (the book) and nondeterministic humans interpreting it. So there’s room for ambiguity—indeed, some courts fight for months over the placement of a comma, or whether a precedent applies. So working with an AI agent is akin to “passing the bar.”

AI developer certifications

Programming has always been “the language of building things,” but now that language is what we humans say and write to make our AI colleagues and co-founders do our bidding better than our competitors.

Building on this, I imagine we’ll see similar certification levels for employees as human/machine collaboration becomes an essential business skill.

How fast can you and your agent communicate accurately (AKA what’s your Shannon’s Law rating?) Are you insurable as an operator of an agent that, if it makes a mistake, can harm the company and its customers?
What AIs are you trained on?
At what level can the AI speak to you? What industry syntaxes are you trained on?

Or, more technically, how reliably can you get an AI to do what you actually want?

(Which my friends might rephrase as “how reliably can you shape a given generative model’s probability distribution towards what you want.”)

The new skills

The best developers are moving up the stack, as they always do. They’re building harnesses, using tools like Gastown to manage many agents at once. They’re adding sidebars and control panels to let them move visual elements around. They’re giving the AI the ability to check its work. They’re automating deployment while ensuring secret data doesn’t leak out. Anyone can now develop an App (really: Open up Claude and type “Make me turn based battleship for 2 players as an artifact.” 30 seconds later you’ll have a playable game.)

But if you want software that actually does something, reliably, you’ll need more than a prompt in a chatbot for now. That’s where developers live.

This is a lot

I am not a coder. I was once a product manager. Despite the fact that friends bombard me with questions of “what’s going to happen with AI?” I am barely keeping up.

I’m using Claude Code, and realizing my limitations in doing so, which is what led to this post. The realization is due in large part to a chat group I’m in with a few dozen very smart coders. Some of them I have admired for many years not just for their raw skill, but for their thoughtfulness about how technology will affect society. I am barely keeping up with what they’re talking about.

I have my excuses. I have a day job—several of them, in fact. Meanwhile, some of these people have literally taken 3-month sabbaticals to just immerse themselves in this because it is the single biggest advance in their ability to Make Things in their lives. They’re excited. We’re not sleeping.

And when they’re honest, they aren’t keeping up either.