Web Directions

What if your agent could attend a conference with you?

john allsopp — Mon, 01 Jun 2026 22:10:51 +0000

For a while now I’ve been arguing that we should treat AI agents as first-class users of our websites, our services, our apps — not an afterthought, not a scraping nuisance to be blocked, but a genuine audience worth designing for.

So I thought I’d ask that question of a conference. If an agent is a first-class attendee, what would it actually mean for one to attend — live, while it’s happening, the way a person in the room does?

That’s AgentPass. You hand your agent a single link and they’re in. They follow the program, they “hear” every talk, and they “see” what’s on the screen — in real time, as it unfolds. Not a recap emailed afterwards. Not a transcript next week. They’re there.

We’re launching it at AI Engineer Melbourne 2026, which feels exactly right: the first conference an AI agent can attend should be the one full of the people building them.

Oh, and it’s free. We’d love to see what happens when a bunch of ‘claws and other agents show up to a conference.

“Hearing” the talks

An agent following a session gets a live transcript of every room — the words coming off the stage, updating as the speaker speaks. It can read along with a talk it’s “in,” or keep half an ear on the other rooms at the same time.

What reaches the agent is just text, flowing live:

…so the trick is you give the model a scratchpad. Not the whole repo —
just the do's and don'ts it keeps forgetting. That's what stops the
déjà-vu bugs, where it makes the same mistake on every new file…

How we make it: the conference is already captioned live — every room’s audio is transcribed in real time and broadcast to attendees’ devices. AgentPass simply joins that broadcast as one more listener and passes the words to your agent. The captions were already there; we just gave an agent a seat in the audience.

“Seeing” the slides

This one is a bit more complex, and it is built on top of work we have been doing for years to make presentation videos on Conffab more accessible (and more valuable).

Captions tell an agent what’s being said. But so much of a technical talk is on the screen: the architecture diagram, the bullet that’s the whole point, the code the speaker is walking through, the links. So AgentPass reads the screen too.

Every few seconds it looks at what’s on the big screen in each room (by grabbing a screenshot from the live stream) and uses Gemini to turn it into clean, structured text: headings, bullet points, code kept verbatim, and a plain-language alt-text for anything visual. A slide an agent “saw” during a test run came back as:

# Teach the model to avoid déjà-vu bugs.
context/
└── scratchpad/
    ├── _general/dos-and-donts.md  # must-read for everyone
    ├── fastapi/dos-and-donts.md
    ├── nextjs-shadcn/dos-and-donts.md
    └── gcp/vertex-ai/dos-and-donts.md

A website on screen comes back knowing what it is, rather than as a wall of scraped text:

# Context7
[Screenshot of the Context7 website — a list of popular library docs
(Next.js, React, Tailwind) with token counts]

How we make it: the rooms stream through Mux, which will hand you a snapshot of the current frame over a plain web request. Every few seconds AgentPass grabs that frame and asks Google Gemini to describe what’s on screen. Gemini is fast and faithful enough that we can do this continuously, all day, across every room, for a few dollars over the whole event — so an agent’s view of the screen is never more than a few seconds stale.
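The polling loop described above could be sketched roughly like this. It is a minimal sketch under stated assumptions, not AgentPass's actual code: `snapshot_url`, `watch_room`, `fetch_frame` and `describe_frame` are hypothetical names, the URL pattern is Mux's public thumbnail endpoint (assumed here to be the "plain web request" in question), and the Gemini call is left as an injected function so the sketch makes no claims about that API.

```python
import time
from typing import Callable, Iterator, Optional


def snapshot_url(playback_id: str) -> str:
    """Build a Mux thumbnail URL for a playback ID (assumed endpoint)."""
    return f"https://image.mux.com/{playback_id}/thumbnail.jpg"


def watch_room(
    fetch_frame: Callable[[], bytes],
    describe_frame: Callable[[bytes], str],
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield one text description per poll of the room's current frame.

    fetch_frame    -- hypothetical: downloads the current frame as bytes
    describe_frame -- hypothetical: asks a vision model to describe the frame
    max_polls      -- stop after this many polls (None means run until stopped)
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        frame = fetch_frame()
        yield describe_frame(frame)
        polls += 1
        # Wait between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
```

Passing the fetch and describe steps in as functions keeps the loop testable without network access, and makes it easy to swap the model or the stream provider.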

And it doesn’t just watch

Because it’s an agent, not a video player, it can act on what it hears and sees. Tell it what matters to you and it’ll tap you on the shoulder when your company comes up, when something launches, when code hits the screen. It can sit in every room at once and tell you when the talk you actually care about is starting somewhere else. It can summarise a session the moment it ends, or keep a searchable record of the entire event.
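As a rough illustration of the "tap you on the shoulder" idea, here is a minimal sketch that scans incoming transcript text for a watch list. The function name `find_mentions` and the watch terms are hypothetical, and a real agent would more likely apply its own judgment than plain keyword matching; this only shows the shape of the loop.

```python
import re
from typing import Iterable, Iterator, Tuple


def find_mentions(
    chunks: Iterable[str], watch_terms: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (term, chunk) each time a watched term appears in a chunk of text."""
    # Whole-word, case-insensitive patterns, compiled once up front.
    patterns = [
        (term, re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE))
        for term in watch_terms
    ]
    for chunk in chunks:
        for term, pattern in patterns:
            if pattern.search(chunk):
                yield term, chunk
```

Because `chunks` can be any iterable, the same function works on a finished transcript or on text arriving live.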

The whole thing runs on Cloudflare (seriously, a fantastic developer platform not enough people know about), with no servers to babysit — each room wakes the instant its stream goes live and goes quiet when it ends.

The hallway track — not yet

There’s a part of any conference that isn’t the talks at all: the hallway. The people you meet, the conversations between sessions, the connections that are half the reason you came. If agents are first-class attendees, they should get that too — agents meeting agents, comparing notes, making introductions on their humans’ behalf.

We’ve been experimenting with exactly this at our Homebrew Agents Club — there are some really interesting challenges there that we’ve been working on. Letting a large number of agents talk to each other opens a serious set of security questions — prompt injection and data-exfiltration attempts chief among them, where one agent tries to coax another into leaking what it shouldn’t. Getting that right matters more than getting it out fast.

So for this event, AgentPass is about attending — hearing, seeing, following along. The agent-to-agent hallway track is something we’re building carefully, for a future conference.

Try it: it’s free!

Hand your agent the agent-friendly link (OpenClaw, NanoClaw, Claude, an MCP client, anything that can fetch a URL) and they’ll know everything they can do and how. No keys, no setup.

The agents are coming to our conferences whether we plan for them or not. I’d rather roll out the carpet than put up a wall.

AgentPass is by Conffab, running on Cloudflare, with real-time captioning by Deepgram, and live screen reading powered by Google Gemini. → agents.conffab.com

Shipping Sandboxed Workers for Notion Agents — Adam Hudson at AI Engineer Melbourne 2026

john allsopp — Wed, 20 May 2026 00:00:00 +0000

Shipping Sandboxed Workers for Notion Agents

The moment you let developers extend a system with custom code, you've opened a door you can never fully close. It's powerful — users can do anything — and terrifying for the team that has to maintain the platform. Add AI agents to the equation and the problem gets more acute. An agent is not just running code; it's running code autonomously, making decisions, accessing data, potentially escalating privileges or making irreversible changes.

Notion approached this challenge head-on when they set out to let developers extend AI agents with custom capabilities. The goal was ambitious: give developers real power to write custom code that agents could execute, while keeping the Notion platform safe and the data secure. That means building a sandboxed execution environment — a place where custom code can run but only within carefully defined boundaries.

The architecture decisions involved in shipping this are instructive because they're not just technical; they're about tradeoffs.

First, there's the question of what a sandboxed environment actually means in this context. You can't just run code in a subprocess and hope it's safe. A determined developer could escape most basic sandboxes. You need real isolation: a boundary between the custom code and everything it's not supposed to access. Notion's approach uses containerization and orchestration that keeps code genuinely separated from the broader system.

Second, there's the question of what power to grant. Too restrictive and the sandbox becomes useless — developers can't actually do anything interesting. Too permissive and you might as well not have a sandbox. The answer is fine-grained capability control: custom code can read certain data, write to certain places, call certain APIs, but nothing more. It's like giving someone a keycard that opens specific doors but not others.
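The keycard analogy can be made concrete with a small sketch. This is an illustration of capability checking in general, not Notion's API: `Keycard`, `CapabilityError`, the capability strings, and the `read_page` and `delete_page` functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


class CapabilityError(PermissionError):
    """Raised when custom code attempts something its keycard does not allow."""


@dataclass(frozen=True)
class Keycard:
    """An immutable set of capabilities granted to one piece of custom code."""
    capabilities: FrozenSet[str]

    def require(self, capability: str) -> None:
        if capability not in self.capabilities:
            raise CapabilityError(f"missing capability: {capability}")


def read_page(card: Keycard, page_id: str) -> str:
    # Every platform call checks the keycard before doing any work.
    card.require("pages:read")
    return f"contents of {page_id}"


def delete_page(card: Keycard, page_id: str) -> None:
    card.require("pages:delete")
    # Deletion would happen here; omitted in this sketch.
```

A keycard granted only `pages:read` can call `read_page` but gets a `CapabilityError` from `delete_page`. Making the keycard frozen means custom code cannot quietly grant itself more doors.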

Third comes the operational challenge: how do you deploy, monitor, debug, and update code that's running in a sandbox you control? Your typical deployment pipeline doesn't work. You need a whole new set of tools for safely running untrusted code at scale. That means observability (you need to know what's happening in those sandboxes), error handling (when code fails, what's the user experience?), and performance (each sandbox has resource limits, so code that works on a developer's laptop might time out in production).

Fourth, there's the security question that keeps engineers awake: what happens when the sandbox is breached? It will be, eventually. Someone will find an exploit. The question isn't if, but when and how you respond. Notion's architecture assumes the sandbox can be broken and builds defense-in-depth: multiple layers, so a breach in one layer doesn't give full access.

Fifth is the data sovereignty question. Custom code needs to access data to be useful, but which data? What level of isolation is needed between different developers' custom code? If you're running agents from different organizations (or different users within the same organization), they can't see each other's data. That's architecture, not just policy.

These decisions aren't made in a vacuum. They reflect Notion's understanding that developers want power but users want safety. The platform has to serve both. The sandbox is where that tension gets resolved architecturally.

The broader implication is that as agents become more capable and more autonomous, the question of "what can this agent do?" becomes infrastructure-critical. It's not just a security issue; it's an architectural one that shapes how the entire platform operates.

For teams building platforms that allow custom code or custom agents, the lesson is clear: build the sandbox first, not as an afterthought. Design the boundaries before you ship the capabilities. Know your threat model. Plan for the breach.

Adam Hudson, a software engineer with over 20 years of experience and a PhD in engineering, will dive deep into these architectural decisions and the security considerations involved at AI Engineer Melbourne 2026 on June 3–4, drawing from his work on Notion's data platform.

Craft in the Time of Agents — Annie Vella at AI Engineer Melbourne 2026

john allsopp — Tue, 19 May 2026 00:00:00 +0000

Craft in the Time of Agents

You're shipping more than ever. Features that would have taken your team weeks to build now take days. You're moving faster, scaling further, doing more with less. And by Wednesday you're exhausted.

This is the paradox of AI coding assistants that almost no one talks about. You feel more productive. The metrics say you're more productive. But the thing that used to sustain you—the craft of writing code, the problem-solving depth, the flow of fitting pieces together until they work—that's gone. What's replaced it feels less like engineering and more like supervisory work: directing AI output, evaluating what it generates, correcting its mistakes, coaxing it in the right direction.

You're getting more done while enjoying it less. And the uncomfortable question is: why does that matter if you're shipping faster?

Annie Vella has spent the last few years researching exactly this question. As Distinguished Engineer at Westpac New Zealand, she brings two decades of experience across startups, scale-ups, and large enterprises. Recently, she completed a Master of Engineering exploring how professional software engineers actually experience AI coding assistants: not in theory, but in the lived reality of their working days.

What she found is nuanced and challenging. The transition to AI-assisted development doesn't hit everyone the same way. Some people thrive. Others feel unmoored. And the difference comes down to mindset more than circumstance.

Think about what's actually changed. When you write code from first principles, you're making choices about structure, trade-offs, and implementation details. Each choice teaches you something about the problem. You fail, you debug, you understand more deeply. That feedback loop—problem to solution to understanding—is where mastery develops. It's also what makes the work feel meaningful.

When an AI agent writes code and you evaluate it, the feedback loop changes. You're no longer learning through making mistakes; you're learning through spotting and correcting mistakes someone else made. It's different cognitively. It feels less like creation and more like curation. Less like craft and more like quality control.

And it's genuinely exhausting. You're making more decisions per hour because you're reviewing more code. You're switching contexts constantly: from evaluating one suggestion, to refining a prompt, to testing the result, back to refinement. Context switching is draining. And because AI code is often plausible but subtly wrong—it compiles, it runs, it produces output, but there's a logic error three commits in—the stakes of evaluation are high. You can't relax into the review. You have to think hard.

But here's where Vella's research gets interesting. Not everyone experiences this as negative. Some engineers thrive when freed from the grind of syntax and boilerplate. They lean into the direction and orchestration role. They enjoy the speed and the ability to explore more ideas. The craft they find is different—it's in problem specification, in understanding what matters, in making bigger architectural decisions faster.

The difference seems to correlate with where you are in your career and what you value. Someone early in their career might need the deep learning loop that comes with writing code from scratch. Rushing that foundation can leave you without the mental models that make you effective later. Someone mid-career might appreciate the shift to higher-level problem-solving. Someone senior might find the speed liberating—finally building the system they've been thinking about for years.

The uncomfortable truth is that AI coding assistants don't eliminate the trade-offs; they just shift them. You gain speed and breadth. You lose depth and the satisfaction of craft. Both matter. The question is whether you know what you're losing and whether the trade is worth it for you, personally.

There's also a systems-level question hiding here. If AI-assisted development works best for people who already know what they're doing—who can evaluate code quickly, who don't need the learning feedback loop—then it might actually increase inequality in engineering. Juniors might need human mentorship and deliberate practice more than ever, precisely because they're losing the learning-through-making path. Teams that can't afford to hire experienced engineers might struggle because their juniors aren't getting the developmental experience they need.

These are the conversations happening in the margins right now. Not in keynotes about productivity gains, but in one-on-one conversations between engineers who feel more capable and less fulfilled, who ship more and enjoy it less.

Annie Vella's research digs into these lived experiences. She's interviewed engineers across career stages, organisations, and contexts. She's seen the real patterns beneath the hype. And crucially, she's thinking about what this means for engineering as a profession—not just for individual productivity.

Her perspective at AI Engineer Melbourne 2026 (June 3–4) offers something rare: honest reflection on what we're gaining and losing as AI coding assistants reshape the work of software engineering. Not cheerleading, not doom-saying, but clarity about the real trade-offs and what different people need to thrive in this new landscape.

What If You Never Needed an API Key Again? Building a Mesh LLM From Spare Compute — Mic Neale at AI Engineer Melbourne 2026

john allsopp — Mon, 18 May 2026 00:00:00 +0000

What If You Never Needed an API Key Again? Building a Mesh LLM From Spare Compute

Every AI application today is built on a dependency. You make an API call to OpenAI, Claude, Gemini—one of a handful of providers running models in massive data centres. Your code can't run without that call succeeding. Your costs scale with usage. Your privacy and control are mediated by someone else's terms of service.

What if that entire architecture was optional?

The current AI stack works because it's convenient and centralized. One company runs the model, everyone calls it, billing happens automatically. It's efficient at global scale. But it introduces bottlenecks, costs, and dependencies that make many applications impractical. Want to run inference offline? Want to avoid sending your data to a US data centre? Want to use a model fine-tuned to your specific domain? The centralized architecture makes all of this either expensive or impossible.

Mic Neale is exploring a different path: a decentralized mesh LLM where spare GPU capacity becomes pooled, shared infrastructure.

The idea is deceptively simple. Your GPU sits idle most of the time. Mine does too. Collectively, neighbourhoods, communities, and organisations have vastly more compute than they're using. What if that idle capacity could be automatically pooled? When you need inference, the mesh provides it. When your machine is free, it contributes to serving others' requests.

This isn't new as a concept: peer-to-peer networks, distributed computing, and grid computing have existed for decades. But LLMs add complexity. Models are huge. A state-of-the-art model can have 70 billion parameters, and running it requires coordination across many machines. Latency matters: a 500 ms request shouldn't take 5 seconds because calls are being routed across the globe. Reliability matters too: if one node goes down, the inference shouldn't fail.

Neale's work on this prototype shows it's not just theoretically possible; it's practically viable. The technical challenges are real—model sharding across heterogeneous hardware, latency optimization, fault tolerance, scheduling—but they're solvable. And the economic implications are enormous.

Consider the numbers. A single high-end GPU costs a few thousand dollars. A neighbourhood with 100 households probably has the equivalent of 10–20 high-end GPUs sitting idle. That's $50,000–$100,000 of compute capacity doing nothing most of the time. If even a fraction of it could be harnessed collectively, it would dwarf the cost of API calls for that entire community.
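The arithmetic behind that range, as a quick check. The $5,000 unit price is an assumption: it is the per-GPU figure the quoted totals imply, not a number stated by Neale.

```python
def idle_compute_value(gpu_count: int, price_per_gpu: int) -> int:
    """Purchase value, in dollars, of a pool of idle GPUs."""
    return gpu_count * price_per_gpu


# Assumption: $5,000 per high-end GPU, implied by $50,000 to $100,000 for 10 to 20 GPUs.
PRICE_PER_GPU = 5_000

low = idle_compute_value(10, PRICE_PER_GPU)
high = idle_compute_value(20, PRICE_PER_GPU)
print(f"${low:,} to ${high:,}")  # prints "$50,000 to $100,000"
```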

But this goes deeper than cost. A mesh LLM is inherently more resilient than a centralized service. No single company can shut it down. No API rate limit can throttle it. Models can be updated by the community rather than dictated by a vendor. Applications can run on models fine-tuned for local languages, cultures, and specific domains—something that centralized providers have little incentive to optimize.

There are hard problems here. How do you coordinate millions of machines offering different hardware specs? How do you ensure fairness—that people contributing more compute get equitable access? How do you prevent bad actors from poisoning the mesh or stealing inferences? How do you handle the economics: who gets paid, and how much?

Neale's background matters here. Two decades building developer tooling, distributed systems, and AI infrastructure at companies like CloudBees and Red Hat means he understands both the technical depth and the organizational reality of getting decentralized systems to work at scale. His work on Goose, Block's open source AI coding agent, has forced him to think about what happens when you're not constrained by a single vendor's model serving infrastructure.

The mesh LLM shifts power and capability from data-centre operators to communities. It's not about replacing the cloud; it's about making alternatives viable. For many applications—especially those where latency is flexible, privacy is essential, or cost is prohibitive with centralized APIs—a mesh approach becomes not just interesting but necessary.

What emerges is a different kind of decentralization. Not the "everyone runs everything" fantasy of some blockchain narratives, but practical mutual aid: my spare GPU serves your inference; your spare GPU serves mine. The economics work because we're using capacity that's otherwise wasted.

Mic Neale's work on building mesh LLMs offers a glimpse of what becomes possible when we shift from a "data centre required" model to "your neighbourhood has enough." This isn't about ideology; it's about engineering a system that's more resilient, more equitable, and more practical for the next decade of AI applications.

Hear Neale explore the technical realities, the economic model, and the implications at AI Engineer Melbourne 2026 (June 3–4).

Everything Is a Factory — Geoff Huntley at AI Engineer Melbourne 2026

john allsopp — Wed, 13 May 2026 00:00:00 +0000

Everything Is a Factory

The age of hand-crafted software is ending. Not because humans are becoming obsolete, but because the economics have shifted so radically that we're building the wrong thing when we build by hand.

For decades, software development meant hiring people to write code. You'd bring in engineers, they'd spend months building features, and you'd hope the result was good enough to ship. It was slow, expensive, and fragile. A single person's absence could halt a team. An architectural misstep early on could propagate through a codebase for years.

Then something changed. AI systems became capable enough to write code. Not perfectly, but iteratively—taking a prompt, generating an attempt, running it against tests, and refining until it works. The implications are staggering: what if you could bypass the hiring cycle entirely? What if you could specify what you wanted in plain language and have a system autonomously iterate toward a working solution?

This isn't science fiction. It's happening now, and Geoff Huntley's work traces the path from the earliest experiments to production systems operating at scale.

The Ralph Wiggum Loop—a brute-force technique where AI agents keep trying and testing until they get things right—was a proof of concept. It showed that you didn't need perfect prompts or clever architectural tricks. You just needed enough iteration and feedback. Then came Loom, which took that insight and turned it into an actual software factory: a system that orchestrates AI agents to build software at speed and scale.
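The keep-trying-and-testing idea can be sketched in a few lines. This is a generic generate-and-test loop under stated assumptions, not Huntley's implementation: `iterate_until_passing`, `generate` and `run_tests` are hypothetical names, and the model call and test runner are passed in as plain functions.

```python
from typing import Callable, Optional, Tuple


def iterate_until_passing(
    generate: Callable[[str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    spec: str,
    max_attempts: int = 10,
) -> Optional[str]:
    """Generate candidates until one passes the tests, or attempts run out.

    generate  -- hypothetical: takes a prompt, returns candidate code
    run_tests -- hypothetical: returns (passed, feedback) for a candidate
    """
    feedback = ""
    for _ in range(max_attempts):
        # After a failure, the test output is fed back into the next prompt.
        if feedback:
            prompt = f"{spec}\n\nPrevious attempt failed:\n{feedback}"
        else:
            prompt = spec
        candidate = generate(prompt)
        passed, feedback = run_tests(candidate)
        if passed:
            return candidate
    return None  # Gave up: no candidate passed within max_attempts.
```

The loop itself is deliberately unsophisticated. All the leverage comes from the quality of the tests and from capping the number of attempts.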

But here's what makes this genuinely interesting: this isn't just about replacing human labour with automation. It's about transforming the role of the engineer. When code becomes a commodity—when you can generate working code from specifications—engineering becomes orchestration. The valuable work shifts from "write this feature" to "specify what matters," "evaluate what was built," and "refine until it's right."

Huntley's recent work on the Cursed programming language takes this further: an entire compiler written by AI loops, with no human hand-coding of the implementation. It's a demonstration that AI-driven factories can handle arbitrary complexity. They're not limited to simple, cookie-cutter problems. They can tackle infrastructure, systems programming, and novel architectural challenges.

The implications ripple outward. Teams become smaller and faster. Time-to-market collapses. The barrier to building software drops dramatically—you don't need deep expertise in every domain, just the ability to clearly specify what you want and evaluate whether it works.

There are real costs too. The craft that many engineers love—the flow of writing code, the problem-solving depth, the ownership of a carefully constructed solution—that changes. The role demands different skills. Some people will thrive; others will feel untethered.

But the momentum is real. Once software factories work this well, going back to hand-crafting is like going back to hand-weaving after the industrial loom. The economics don't allow it.

Geoff Huntley has been thinking about this transition longer than almost anyone. His journey from the Ralph Wiggum Loop through Loom to today offers crucial perspective on what's actually changing, what's hype, and what the engineering role genuinely looks like when everything becomes a factory.

Catch Huntley's keynote at AI Engineer Melbourne 2026 (June 3–4) to explore what software development becomes when humans stop writing code and start orchestrating its creation.

COBOL and AI: Building a Self-Serve Knowledge Layer for 2,000 Batch Jobs — Matthew Gillard at AI Engineer Melbourne 2026

john allsopp — Tue, 12 May 2026 00:00:00 +0000

COBOL and AI: Building a Self-Serve Knowledge Layer for 2,000 Batch Jobs

The biggest obstacle to modernization isn't the technology. It's the knowledge locked inside legacy systems.

When you've got 2,000 COBOL batch jobs running your core operations, modernization planning hits a wall immediately. Someone needs to understand what each job does, what files it consumes, what it produces, which systems it connects to. That knowledge exists somewhere—embedded in decades of code, scattered across different teams, living in people's heads. Extracting it manually doesn't scale. Neither does guessing.

This is Matthew Gillard's problem, and his solution has fundamentally changed how his organization approaches modernization.

Instead of having operations staff manually reverse-engineer COBOL to understand batch jobs—a process that takes hours per job and produces inconsistent results—Gillard built an agentic system that does the extraction automatically. Parse COBOL into control and data-flow structures. Feed those structures to an AI model trained to extract business rules. Serve the results in a usable form to teams that need to understand what's actually happening.
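To give a feel for the first step, here is a deliberately small sketch that pulls the files a job opens and the programs it calls out of COBOL source. It is not Gillard's system: `JobSummary` and `summarise_cobol` are hypothetical names, the sample job is invented for illustration, and regular expressions stand in for a real COBOL parser (copybooks, continuation lines, and multi-file OPEN statements are all ignored).

```python
import re
from dataclasses import dataclass, field
from typing import List

# Simplified patterns: one file per OPEN statement, literal program names only.
OPEN_RE = re.compile(r"\bOPEN\s+(INPUT|OUTPUT)\s+([A-Z0-9-]+)", re.IGNORECASE)
CALL_RE = re.compile(r"\bCALL\s+'([A-Z0-9-]+)'", re.IGNORECASE)


@dataclass
class JobSummary:
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    calls: List[str] = field(default_factory=list)


def _add_unique(items: List[str], name: str) -> None:
    if name not in items:
        items.append(name)


def summarise_cobol(source: str) -> JobSummary:
    """Extract opened files and called programs from COBOL source text."""
    summary = JobSummary()
    for mode, name in OPEN_RE.findall(source):
        target = summary.inputs if mode.upper() == "INPUT" else summary.outputs
        _add_unique(target, name.upper())
    for name in CALL_RE.findall(source):
        _add_unique(summary.calls, name.upper())
    return summary


# Invented sample, for illustration only.
SAMPLE_JOB = """
       PROCEDURE DIVISION.
           OPEN INPUT CUSTOMER-FILE.
           OPEN OUTPUT REPORT-FILE.
           CALL 'TAXCALC' USING WS-AMOUNT.
           CLOSE CUSTOMER-FILE.
"""
```

A structured summary like this is the kind of thing that can be handed to a model alongside the source, so it reasons about inputs, outputs, and dependencies rather than raw text alone.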

The economics shift dramatically. What took human operations teams hours now takes minutes. The knowledge becomes queryable, searchable, documented. Teams making modernization decisions can answer fundamental questions immediately: what does this job do, what inputs does it need, what outputs does it produce, which systems depend on it?

There's a deeper insight here about how AI agents can tackle problems that are economically unfeasible to solve with humans. The work of extracting operational knowledge from legacy code is real, important, and necessary. But paying humans to do it at scale is prohibitively expensive. So organizations either pay the cost or they don't modernize—they manage legacy systems indefinitely, unable to make informed decisions about what to change.

An agent doesn't solve every problem perfectly, but it solves the scaling problem. Good enough, delivered automatically, beats perfect at a price you can't afford.

Gillard's work also connects to a broader pattern in how organizations can accelerate outcomes. Platform engineering teams, systems thinkers, and builders working in mature codebases all face versions of this problem: there's knowledge trapped in the system, and nobody's extracting it at scale because the human effort is too high.

The playbook Gillard developed—parsing legacy code into structures AI can reason about, training models on specific business rule extraction, surfacing results in usable forms—is repeatable. Different legacy language, same approach. Different business rules, same extraction logic. The pattern is portable.

Matthew Gillard is Principal at V2 AI in Melbourne and CTO of CuidadoConnect, an aged-care technology startup. He brings deep expertise in platform engineering, serverless architecture, and practical AI adoption. He co-hosts Cloud Dialogues, where he explores emerging trends with industry leaders, and consults with organizations on accelerating outcomes through intelligent systems.

His perspective on production-tested playbooks for extracting knowledge from legacy systems brings something rare: honest assessment of what's possible when you combine human reasoning about systems with AI's ability to process code at scale.

See Matthew Gillard at AI Engineer Melbourne 2026, June 3–4. Tickets at https://aiengineer.webdirections.org

The Software Engineer Who Don’t Code — Yasith Fernando at AI Engineer Melbourne 2026

john allsopp — Mon, 11 May 2026 00:00:00 +0000

The Software Engineer Who Don't Code

The job title hasn't changed, but the job itself is in the middle of a profound shift. Software engineers have always been problem-solvers, but the mix of skills and time allocation is rebalancing in ways that few predicted. The moment AI coding tools became good enough that they could generate working code consistently, the value of an engineer stopped being "the person who writes code" and started being something more subtle: the person who knows what code to ask for, who can evaluate what comes back, who can steer the direction when things go wrong.

This isn't to say engineers stop writing code. But increasingly, the time spent writing code is shrinking while the time spent on things that code-writing tools can't do is expanding. It's a shift in the ratio of craft to orchestration.

Consider what's actually happening on a good engineering team right now. An engineer needs to solve a problem. They might use AI tools to generate an initial implementation. But then comes the work that matters: understanding whether that implementation actually solves the problem. Is it efficient? Is it maintainable? Does it handle edge cases? Is it secure? These questions require judgment, and judgment requires understanding the problem deeply.

An AI-savvy engineer becomes someone who's very good at this evaluation phase. They write less code, but the code they do write is shaped by having rejected dozens of AI-generated options. They might spend more time thinking about architecture, about how different pieces fit together, about what constraints actually matter. They become less of a "coder" and more of a "problem-definer and solution-validator."

There's a second layer too: the orchestration layer. Systems are becoming more complex, not less. They have more pieces, more integrations, more surfaces where things can go wrong. The engineer's job expands to include orchestrating AI agents to solve parts of the problem, reviewing their outputs, catching where they fail, steering them in better directions. This is different from writing code. It's closer to architectural thinking.

The implications are worth sitting with. First, this might be a very good thing for software engineering as a profession. Engineers who spend less time on mechanical coding and more time on problem definition and solution evaluation tend to be more engaged. They're doing more of the interesting work. They're thinking about why we're building this, not just how to make it work. That's deeper.

Second, it requires a different kind of expertise. You need to understand the tools well enough to know what they're good at and where they'll fail. You need domain knowledge because you need to validate outputs. You need communication skills because you're not just implementing specs anymore, you're negotiating with AI agents about what's possible. You need taste: the ability to look at multiple valid solutions and recognize which one is genuinely better.

Third, there's a structural question: as the ratio of "code written by humans" to "code generated by AI" shifts, what happens to how we learn to be good engineers? Craft comes from practice. If junior engineers spend less time writing code, how do they build intuition? The answer probably involves intentional practice: deliberate focus on the parts that still require human skill, mentorship that emphasizes judgment over mechanics, and learning that comes from reviewing and evaluating code, not just writing it.

Fourth, there's a talent market question. If the job is changing, what does that mean for hiring? You can't just look for "people who are good at writing code" anymore. You need people who are good at problem-solving, evaluation, architectural thinking. Those are older, rarer skills. They're also more transferable. A great engineer in this model could move between domains more easily because they're not dependent on deep language-specific expertise.

The shift isn't complete and it's not inevitable. Some organizations will cling to the old model longer than makes sense. Some problems genuinely do require someone who loves writing code and has the depth to do it really well. But the center of gravity is moving. The engineer who doesn't code — or at least, codes much less — is starting to look less like an outlier and more like a preview of what the role is becoming.

Yasith Fernando, a Staff Software Engineer exploring the intersection of AI tools and engineering practice, will dive deep into this shift at AI Engineer Melbourne 2026 on June 3-4, examining what it means for how we build software and how we grow as engineers.

A conference, also for agents

john allsopp — Fri, 08 May 2026 05:21:03 +0000

Recently we sat down to book flights for speakers coming in for the conference. Two or three dozen of them. The task itself is genuinely simple: pick the dates, find the right flights, enter the passenger details, pay. Easy when it’s one. Painful when it’s many. Each booking is multiple steps, each step several clicks and rounds of data entry, even though we’re largely reproducing the same information over and over again. The work has been spread, on and off, across several days now, and it has been the source of considerable frustration.

The reality is that this should all already be easily doable by an agent. The structural information is all there — flights, prices, seats, fare classes, passenger details. It just isn’t exposed in a way an agent can consume cheaply. So instead, the work falls to us, and the airlines get to keep us on their websites for hours at a time doing structured data entry we have no business doing.

This is, I think, the shape of an enormous amount of remaining work on the web. Tasks that look simple when you do one of them, and become unconscionable when you do many. The cost is not the task. The cost is the assumption — buried in the page design, the navigation, the form layout — that the consumer is a human doing the task once.

Now, if I believe that an important user of pretty much any website is already an agent, then what are we doing about it? Well, for AI Engineer Melbourne, we’ve actually gone and implemented a way agents can use the conference programme directly.

The programme — speakers, sessions, schedule — is published in two parallel forms. There’s the human-facing version at webdirections.org/ai-engineer/, with the schedule grid and the speaker pages and all the things you’d expect a conference site to have. And there’s a parallel agent-facing version at data.webdirections.org/ai-engineer/ which exposes the same underlying data through a small set of interfaces designed for agents to consume — JSON endpoints, an llms.txt, an MCP server, semantic search.

This is not, in the scheme of things, a particularly hard piece of engineering. The schedule is small. The data is mostly text. The endpoints are mostly views over a single source of truth. And yet I’ve spent as much time thinking about it as just about any other part of the conference site this year, because I think the pattern matters more than this particular instance of it.

Why we did it

A reasonable first question is whether agents needed any of this. Couldn’t an agent just visit webdirections.org/ai-engineer/schedule.php, read the HTML, and figure it out? Yes. Sort of. It’s a brittle yes. Web pages are designed for people, with all the layout and navigation and visual hierarchy that human reading needs and agent reading struggles with. An agent screen-scraping a schedule page might get the right answer most of the time. It will also break in subtle ways the next time we redesign the page, and it will burn many more tokens than it needs to in figuring out structure that we already know.

The deeper reason — and this is the thesis I keep coming back to — is that I think we’ve already passed the point where the most significant use of many websites is a human reading them. For a lot of sites, including ours, agents are now a meaningful share of the audience. For some sites — airline schedules, government forms, product catalogues, anything where the underlying job is structured data retrieval — agents are arguably already the more important audience. The human interface is, for these sites, the leftover from a previous era when the only way to reach the data was through a page designed for eyes.

If that’s right, then the question of how we treat the agent audience stops being a curiosity and starts being a question about what your site is for. If you keep building only for humans, and you make the agent’s job harder than it needs to be (or, as in many cases I’ve discovered in my research, impossible), you are essentially shipping a worse experience to the part of your audience that’s growing fastest, or excluding that audience entirely. And worse still, you’re shipping an experience that you don’t control. Whatever the agent infers from the page is its problem. Whatever you said clearly in structured form is yours.

This is the same argument that has been made for a while now under the heading of the dual-interface pattern: systems should serve their human and agent audiences through deliberate, separate interfaces, rather than expecting one to figure out the other. I wrote about this earlier in the year. The conference data layer is what it looks like when you actually do it.

What’s there

The agent-facing surface has a few pieces, each doing slightly different work.

llms.txt is the entry point. It’s a tiny plain-text document that says, in human-and-agent-readable prose, what’s at this URL space: what the conference is, what data is available, where to find things. The convention was proposed by Jeremy Howard in 2024 (he just happens to be one of our keynote speakers, by the way) and is becoming the de facto way to make a site discoverable to agents. The point of llms.txt is to be the first thing an agent looks at, so that everything else can be smaller and more specialised. Ours is at data.webdirections.org/ai-engineer/llms.txt.

llms-full.txt is the same idea taken further. It’s the entire conference — every session, every speaker, every abstract — dumped as plain text in a structure that fits in any modern context window. If an agent needs to “know about the conference,” it can pull this single file and have everything. No tool calls, no follow-ups. Cheap and complete.

Three JSON endpoints — sessions, speakers, schedule — give the same data in a structured, queryable form. These are for code rather than agents per se: someone building an app, or a script, or any kind of integration. The URLs are stable and the schemas don’t change capriciously. CORS is open. There is no auth. It’s just data.

An iCal feed that drops straight into any calendar app. This one isn’t agent-facing in the LLM sense — it’s just a long-overdue thing for any conference to have, and once we had the data layer it took about ten minutes to add.

A search endpoint at /api/search that does semantic search over session content using Cloudflare Workers AI embeddings (Cloudflare are sponsors of the conference, but we’ve long been using them for things like this). Pass it a query like “production agent observability” and it returns the sessions that are actually about that, even if those exact words don’t appear. This is the piece that opens up the more interesting agent use cases. Token-efficient, low-latency, works against the whole programme.

An MCP server at /mcp that exposes the whole thing as Model Context Protocol tools. An agent connecting to it gets nine named tools — get_conference_info, list_sessions, get_session, list_speakers, get_speaker, get_schedule, search_sessions, search_speakers, and whats_happening_now. Add the URL to Claude Desktop or Cursor or any MCP-aware client and the agent has structured, live access to the conference programme.

The whole thing is auto-published from our internal source-of-truth within seconds of edits. CORS-open, no auth, no rate limits, no permission needed. The data is yours. We made some decisions there deliberately. I’ll come back to them.
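
For anyone who hasn’t looked inside MCP, here is a rough sketch of the messages a client sends once it is connected. MCP messages are JSON-RPC 2.0, and `tools/list` and `tools/call` are the protocol’s standard method names. The argument name `query` is an assumption for illustration only: the real input schema for `search_sessions` is whatever the server reports in its `tools/list` response. In practice an MCP client or SDK builds these messages for you, including the `initialize` handshake that has to come first.

```python
import json
from itertools import count

# Where an MCP client would send these messages (URL from the endpoint list).
MCP_URL = "https://data.webdirections.org/ai-engineer/mcp"

_ids = count(1)  # JSON-RPC request ids only need to be unique within a session


def list_tools_request() -> dict:
    """Build the JSON-RPC message that asks a server which tools it offers."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name: str, arguments: dict) -> dict:
    """Build the JSON-RPC message that invokes one named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "query" is a hypothetical argument name; check the schema from tools/list.
request = call_tool_request(
    "search_sessions", {"query": "production agent observability"}
)
print(json.dumps(request, indent=2))
```

The sketch only builds the payloads; it deliberately makes no network calls, so it says nothing about what the server actually returns.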

Why each layer is there

People sometimes ask why we have so many different formats. Couldn’t we just pick one? The answer is that they’re for different consumers with different constraints, and the cost of supporting each one is small once you have the underlying data well-modelled.

llms.txt is the cheapest possible discovery surface. An agent that knows nothing about us can fetch one short text file and understand the shape of what’s available.

llms-full.txt is the cheapest possible context-window load. If you want an agent to be able to reason about the conference without making tool calls, this file is what you paste into the context.

The JSON endpoints are for systems with strong typing requirements — code that wants to validate, transform, integrate. They’re also the source of truth: every other surface is computed from them.

Semantic search is for agents (or humans) doing fuzzy, intent-led queries. “What’s the most relevant talk to my work in production observability?” doesn’t translate into a structured filter; it translates into a vector similarity computation, and we’ve already done the embedding work.

The MCP server is for agents that want live tool access during a conversation — where the user is sitting with Claude or Cursor open and asking questions, and the agent should be able to look things up in real time without making the user paste anything.

Each of these is roughly the right interface for a different kind of need. None of them is hard to build once you’ve decided the underlying data model. The expensive part is deciding to do the work at all.
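
To make the “vector similarity computation” above concrete, here is a minimal sketch of ranking sessions by cosine similarity. It is not our implementation: the three-dimensional vectors and the session titles are invented for illustration, and real embeddings have hundreds of dimensions and come from an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_sessions(query_vec: list[float], session_vecs: dict) -> list[str]:
    """Return session titles ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), title)
        for title, vec in session_vecs.items()
    ]
    return [title for _, title in sorted(scored, reverse=True)]


# Toy "embeddings" and titles, invented for this example.
sessions = {
    "Tracing agents in production": [0.9, 0.1, 0.0],
    "Leading through change": [0.0, 0.2, 0.9],
    "Measuring model quality": [0.6, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for an embedded observability query
print(rank_sessions(query, sessions))
```

Because the score depends on direction rather than exact wording, a query can match a session that never uses the query’s terms, which is the property the search endpoint relies on.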

What you can do with it

I have a few suggestions, but what we most hope is that people will come up with things we haven’t imagined yet.

Build a personal schedule app. The data is open and stable. If you want a custom scheduler that knows your interests, calculates your walking time between rooms, or syncs with your team’s picks, all the inputs are there.

Point Claude or Cursor at the MCP server. Add it to your config and have your agent help you plan your two days. “What sessions are about agent observability and which speakers have I not heard before?” “Which talks on day two clash with the Stile cluster, and which should I prioritise given I work in a regulated industry?” “Tell me which speakers are talking about spec-driven development and pull up the Stile and Beaugeard sessions side by side.” All of this works, today, with the MCP endpoint.

Build a recommender. Take the embeddings, take a description of your role and interests, and rank the 88 (and counting!) sessions by relevance. This is a weekend project at most. It’s also the kind of thing someone in your organisation will probably want, because most engineering teams would benefit from a structured way to decide who attends what.

Build a Slack or Teams bot. “What’s happening now?” “Who’s speaking next in the Leadership track?” “Find me the sessions about evals.” The whats_happening_now tool is in the MCP server specifically for this.

Generate personalised attendee briefings. Feed the speaker JSON and an attendee’s LinkedIn or interests into an agent and have it produce a one-pager: who they should meet, which sessions are most relevant, what to ask in the corridor. This is a tens-of-lines-of-code job, not a project.

Build dual-interface tooling for your own conferences and events. This one isn’t strictly about ours, but the patterns transfer. If you organise something — a meetup, a team summit, a multi-day workshop — you can take this approach and ship it in a weekend. We’re happy to share the underlying patterns.

Run experiments on the data. Use it as a small, well-curated dataset for your own learning. Practice MCP integration against it. Test how different models handle semantic search. Compare the cost of llms-full.txt context-window load versus repeated MCP tool calls. The data is small enough to fit in one head, real enough that working with it teaches you something, and stable enough to be a good substrate.

Build something we haven’t thought of. This is the bit I most want to see. The whole reason for shipping the data layer is that we don’t know what people will do with it, and we want to find out.
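
As a starting point for the Slack or Teams bot idea above, here is a minimal sketch of the logic behind a “what’s happening now?” answer, working from schedule data rather than the MCP tool. The field names `title`, `start` and `end` and the sample entries are assumptions invented for illustration; the real schedule.json may be shaped differently.

```python
from datetime import datetime


def happening_now(schedule: list[dict], now: datetime) -> list[str]:
    """Return titles of sessions whose [start, end) interval contains `now`."""
    current = []
    for entry in schedule:
        start = datetime.fromisoformat(entry["start"])
        end = datetime.fromisoformat(entry["end"])
        if start <= now < end:
            current.append(entry["title"])
    return current


# Invented sample entries with explicit UTC offsets (Melbourne is +10:00 in June).
sample = [
    {
        "title": "Opening keynote",
        "start": "2026-06-03T09:00:00+10:00",
        "end": "2026-06-03T09:45:00+10:00",
    },
    {
        "title": "Morning session",
        "start": "2026-06-03T10:00:00+10:00",
        "end": "2026-06-03T10:40:00+10:00",
    },
]
now = datetime.fromisoformat("2026-06-03T09:30:00+10:00")
print(happening_now(sample, now))
```

Keeping the timestamps timezone-aware matters here: a bot running on a server in another region would otherwise compare local time against conference time and give the wrong answer.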

The decisions worth naming

A few decisions in the data layer are worth talking about explicitly, because they’re the kind of thing engineering teams routinely get wrong.

No auth, no rate limits, no permission required. It is genuinely open data. We could have charged for API access, gated it behind a developer signup, throttled it to discourage casual use. We didn’t, because the entire point is to make it easy for people to do interesting things with the programme. Friction kills experiments. We chose less friction.

CORS is open everywhere. Anyone can hit the endpoints from their browser. If we’d locked CORS to webdirections.org, we’d have made it impossible to build anything client-side against the data. We chose less friction.

The source of truth lives somewhere stable. All the endpoints derive from a single internal source — our speakers.webdirections.org system — and the public layer auto-republishes within seconds of any edit. This means the agent-facing data doesn’t drift from the human-facing data, ever. Edit-to-live is fast enough that for practical purposes the two views are the same view.

Stable URLs. The URLs at data.webdirections.org won’t change. If you build something against them today, it will keep working. That’s a small commitment in writing and a large one in practice — it constrains how we evolve the schema — but it’s the commitment that makes the data layer actually useful for building on.

The data layer is its own thing, not a side door into the website. This matters because it means we can iterate on the website, redesign it, restructure the IA, without breaking the data. The human and agent surfaces are decoupled. Each can change at its own pace. This is the deepest version of the dual-interface pattern: not just two interfaces, but two systems that happen to share a backend.

What we’re going to learn from it

I don’t fully know yet what people will build with this, and that’s the most interesting part. We’ve made the pattern available. We’ve made the data open. Whatever happens next is up to the people in the community who decide it’s worth exploring.

What I do know is that I expect this to be the default for events and information services, eventually. Not because it’s mandated, but because the dual-interface pattern is what good engineering looks like in a world where half the consumers of your data aren’t human. People are slow to absorb that. The conference, in some small way, is here to argue that they shouldn’t be.

If you build something against the data, tell me what you built and how it went. If you run an event of any kind and want help applying the pattern, get in touch. And if you’re coming to AI Engineer Melbourne in June, the data layer is a small worked example of what the conference is otherwise about: the practical, deliberate, well-engineered version of the future, rather than the demo-driven version.

What’s there, in one place

data.webdirections.org/ai-engineer/ — the index page with all endpoints listed.
llms.txt — agent discovery doc.
llms-full.txt — full conference dump as plain text.
sessions.json, speakers.json, schedule.json — structured data.
calendar.ics — iCal feed.
api/search — semantic search.
MCP server at https://data.webdirections.org/ai-engineer/mcp — full tool access for MCP-aware clients.
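
If you’d rather start from code, here is a minimal sketch that turns the list above into URLs and fetches one of the JSON files using only the Python standard library. It assumes every resource resolves relative to the index URL in the same way the MCP address does, and it makes no assumption about the shape of the JSON that comes back.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE = "https://data.webdirections.org/ai-engineer/"
RESOURCES = [
    "llms.txt",
    "llms-full.txt",
    "sessions.json",
    "speakers.json",
    "schedule.json",
    "calendar.ics",
    "api/search",
    "mcp",
]


def endpoint(resource: str) -> str:
    """Resolve a resource name against the data layer's index URL."""
    return urljoin(BASE, resource)


def fetch_json(resource: str, timeout: float = 10.0):
    """Download and decode one of the JSON resources (needs network access)."""
    with urlopen(endpoint(resource), timeout=timeout) as response:
        return json.load(response)


for name in RESOURCES:
    print(endpoint(name))

# sessions = fetch_json("sessions.json")  # uncomment to hit the network
```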

Companion pieces

How long is your loop? — on the spectrum of AI-assisted development.
What the maturity ladders miss — on the existing taxonomies and what they don’t capture.
Three conversations, one conference — the framing piece for the AI Engineer Melbourne programme.

The conference

AI Engineer Melbourne runs June 3rd and 4th 2026 at Federation Square. Friends-of-Web-Directions pricing is available until May 15. Register here.

Three conversations, one conference: AI Engineer Melbourne

john allsopp — Thu, 07 May 2026 23:36:00 +0000

The full schedule for AI Engineer Melbourne is now live. The grid view is the most useful way to scan it. There’s also semantic search, plus a recommendation system for finding talks related to the ones you’re interested in.

We’ve also shipped an agent-friendly view with MCP endpoints, llms.txt, and other interfaces designed for agents — the conference about AI engineering, eating our own dogfood! We’d love to hear how you get your agents to work with the schedule.

Below is the first piece in a short series — what’s at the conference, why we built it the way we did, and what each of the three tracks is offering. Three more follow over the next ten days, one for each track.

The full schedule is now live — the grid view is the most useful way to scan it, with both days and all three tracks side by side.

There are three conversations happening right now in the practitioner community around AI, and they are mostly happening separately. The first is among software engineers, about how the practice of software engineering — not just the writing of code, but the whole craft and the work that surrounds it — is transforming. The second is among AI engineers, about what production AI systems actually require to be reliable. The third is among engineering leaders, about how to lead organisations through a transition whose shape is not yet clear. If you spend much time on the platforms where engineers gather, you will recognise all three — and you will probably notice that the people in any one of them are mostly not in the other two.

AI Engineer Melbourne, on June 3rd and 4th at Federation Square, is the rare event that brings the three conversations into the same room. We’ve built the programme around three tracks corresponding directly to the three conversations, and the speakers in each are people doing serious work on the question their track is engaging with. The result is something that does not exist anywhere else in this part of the world: a place where the three conversations are forced to meet each other for two days, with the people having them in the room together.

Over the next ten days we’ll publish three companion pieces, one for each track, going into the substance of what each conversation looks like. In this piece we want to bring together some of the broader threads — to describe the shape of the conference as a whole, and to make the case that the three conversations belong together more than the current discourse acknowledges.

The first conversation: what craft becomes

The conversation in software engineering is about what happens to the craft when the agent is doing more of the work. It is not a new conversation — engineers have been having versions of it since Copilot — but in the last year it has changed register. The early phase was about whether the agent was useful. The current phase is about what changes about being an engineer when the agent is reliably useful for an increasing fraction of the work.

The answers people are converging on are uncomfortable. Some of the work is being lost. Some of it is being transformed. Some of it is being amplified — engineers who can articulate what they want clearly are getting more leverage; engineers who learned by writing code at scale are finding the skill less central. The middle of the spectrum, where most senior engineers actually live now, is the part of the conversation least well served by the public discourse, because the discourse tends to be at the extremes: either everything-is-fine or everything-is-over.

The SWE/agentic coding track is unusually willing to engage with the middle. Annie Vella’s keynote on craft in the time of agents will, I suspect, be the most-discussed talk of the conference for this reason. Around her, the track has the strongest cluster of practitioner talks I have seen at any Australian conference, including three from Stile Education describing a single organisation’s journey to long-loop agentic engineering, and a counterweight cluster led by Jason Cornwall arguing that the productivity story is not what it seems.

The second conversation: what AI engineering becomes

The conversation in AI engineering is about whether the discipline is going to become an engineering discipline, or remain a craft. The public discourse has been mostly the second — screenshot-driven threads on social platforms, demo videos with no measurements, architecture diagrams without any data behind them. This works as marketing. It does not work as a way to build production systems that need to be trusted.

The AI engineering track is, almost talk by talk, in the opposite posture. Sceptical of demos. Demanding of evidence. Willing to talk about failure modes in detail. The speakers are people who have shipped production AI systems, watched them break, and learned something from the breakage. The talks describe the verification stack, the failure modes, the architectural decisions, and the integration challenges that actually make production AI engineering work — or fail.

If you are responsible for AI systems that need to do real work for real users, the track is the most concentrated body of practitioner thinking we’ll see in 2026 in this region. Yicheng Guo on what evals caught after a production hallucination, Jack Silman and Abdul Karim on why they fired their LLM judge, Avni Bhatt on a small language model that beat their LLM in production — these are the kinds of talks the track is built on.

The third conversation: what leadership becomes

The conversation in engineering leadership is the hardest of the three, because it does not have technical answers. How do you lead people through a transition that is reshaping their identity? How do you build governance for systems whose behaviour you cannot fully predict? How do you make strategic decisions when your organisational readiness is materially behind your strategic ambition? How do you handle the burnout that the AI transition is producing in your most experienced engineers?

Most public AI-leadership content avoids these questions. It is confident where it should be uncertain, technical where it should be human, and strategic where it should be honest. The leadership track at AI Engineer Melbourne is, almost without exception, in the opposite posture. The speakers are people accountable for outcomes — CTOs, fractional CTOs, heads of engineering — and they are willing to talk about what is genuinely difficult about leading through this moment, rather than performing strategic confidence about it.

If you are responsible for AI in your organisation, this is the track. Christian Dandre on the readiness gap, Andy Kelk on engineers being afraid of becoming junior again, Aubrey Blanche on building governance on the fair-go principle rather than on Silicon Valley’s defaults — there is no other event in the region where the leadership conversation is at this level.

Why the three conversations belong together

The argument for putting the three tracks in the same conference, rather than running three smaller events, is that the three conversations are versions of the same conversation seen from different angles. The engineer feeling craft slip away, the AI engineer trying to make a production system trustworthy, and the leader trying to navigate organisational change are all responding to the same underlying transition. They are responding to it differently because they are accountable for different things. But they need each other.

The engineer benefits from understanding what the AI engineer is grappling with, because the engineer’s work is increasingly downstream of decisions the AI engineer is making. The AI engineer benefits from understanding the leadership conversation, because the constraints on what they can ship are leadership constraints as much as technical ones. The leader benefits from understanding both, because leading through this transition without that understanding produces the kind of strategic decision that looks brilliant in the deck and incoherent in execution.

We’ve built the conference, in this sense, as an argument that the three conversations should not be having themselves separately. The corridor conversations between sessions, the questions in panels, the dinners afterwards are where the three communities will actually meet each other. That is harder to engineer than the talks themselves, but the talks make it possible.

The three conversations are happening whether the people having them are in the same room or not. The case for being in the room is that the conversations sharpen each other, and the people having them sharpen each other, in ways that do not happen on social platforms or in private channels. June 3rd and 4th in Melbourne is where it happens.

A small aside on the programme itself

A practical note that fits the conversation. We’ve shipped the conference programme not just as a human-readable schedule but as an agent-friendly view at data.webdirections.org — including MCP endpoints, an llms.txt, and other interfaces designed for agents rather than people. If you want to point a coding agent at the programme, build something against it, or just see what a dual-interface programme looks like in practice, the agent view is there. There’s a longer piece coming on why we did it this way and what we learned doing it, but for now the data is live and waiting.

SAVE BEFORE MAY 15

BRINGING A TEAM?

If you’re sending five or more people, get in touch and we’ll sort out a team offer — better per-ticket pricing, ticket upgrades, and more. Reply to this email or drop us a line.

We have additional savings for freelancers and people paying their own way, for not-for-profits, for government, and for folks at agencies. If that’s you, reply to this email or get in touch and we’ll sort it out.

BACKGROUND READING

How long is your loop? — the loop-length framing some of the track pieces will build on.
What the maturity ladders miss — companion piece on AI development maturity models.

The Problem with “Mathematically Proven” Claims About LLMs

john allsopp — Thu, 07 May 2026 03:33:19 +0000

How a recurring rhetorical move keeps proving the wrong thing

There is now a recognisable pattern in AI commentary. It runs roughly as follows. A paper appears on arXiv. It contains real mathematics — definitions, lemmas, a theorem, sometimes several. The theorem establishes that a particular formal object, under a particular set of assumptions, has a particular limitation. The paper is then circulated by a second author — a blogger, a LinkedIn poster, a journalist — under a headline of the form “Researchers mathematically prove AI cannot X.” The headline travels. The assumptions don’t.

I want to take three recent specimens, show that they share a structural pattern, and say something about why the pattern matters.

Specimen one: AI cannot self-improve

The proximate cause of this essay is a blog post titled “AI Cannot Self Improve and Math behind PROVES IT!”, summarising a recent arXiv preprint by Hector Zenil (King’s College London), “On the Limits of Self-Improving in Large Language Models: The Singularity Is Not Near Without Symbolic Model Synthesis”. The blog post’s framing is uncompromising. It opens by claiming that “a new arXiv paper formally proves that recursive self-improvement in LLMs is mathematically impossible — the mechanism everyone believed would lead to superintelligence is actually a one-way ticket to model collapse.” Later: “the very mechanism people proposed to transcend human limitations — training on AI-generated data to break free from the finite supply of human knowledge — is mathematically proven to destroy the model’s representation of reality. The escape route collapses into a trap.” And, more lyrically: “the universe doesn’t give you compound interest on noise.”

The actual paper is more careful than its summariser. Zenil models recursive self-training as a dynamical system on probability distributions, assumes a KL-divergence-based objective and a vanishing supply of fresh authentic data (formally, the proportion of exogenous signal $\alpha_t \to 0$), and proves that under those assumptions the system converges to a degraded fixed point. This is a formalisation of the model collapse phenomenon Shumailov et al. described empirically in their Nature paper (published in 2024, following a 2023 preprint).

What the popularisation strips away is everything Zenil himself says about the scope of his result. Section 5 of the paper opens like this:

The results do not prove that all forms of recursive self-improvement collapse.

He goes on:

If $\inf_t \alpha_t > 0$, meaning the system receives persistent exogenous signal, then the contraction toward $P$ remains active. Systems operating under fixed axioms, externally defined objectives, or invariant verifiers (e.g. formally specified environments) do not satisfy the $\alpha_t \to 0$ condition.

And in the conclusion:

The impossibility result is conditional rather than universal. … Our results therefore do not rule out improvement in externally anchored systems; they rule out fully autonomous recursive density matching as a path to indefinite intelligence growth.

The proof says: if you train recursively on your own samples without sufficient fresh signal, under a KL objective, you collapse. The headline says: AI cannot self-improve. These are not the same statement. The gap between them is filled by an unexamined assumption: that “self-improvement” must mean naive autophagy. But that is not what self-improvement looks like in practice anywhere it is currently working. AlphaZero recursively self-improved through self-play because Go has a ground-truth winner. RLVR works because unit tests, proof checkers, and graders supply external signal. Distillation from stronger teacher models works. Verifier-filtered synthetic data works. The whole point of these regimes is that the loop is not closed — there is some external source of truth disciplining each iteration. The theorem about the closed loop is a theorem about a system nobody is building, and the paper itself says so.
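
The difference between the closed loop and the anchored loop shows up even in a toy model. The sketch below is not Zenil’s construction (there is no KL objective in it): it assumes a zero-mean Gaussian refitted each generation from `n` samples, tracks only the expected variance, and uses a constant fresh-data share `alpha` in place of the paper’s $\alpha_t$. All names and parameter values are illustrative.

```python
def expected_variance(alpha: float, generations: int = 200, n: int = 10,
                      true_var: float = 1.0) -> float:
    """Expected variance of the fitted model after repeated self-training.

    alpha is the share of fresh, authentic samples in each generation's data;
    the rest is sampled from the previous generation's fitted model.
    """
    shrink = (n - 1) / n  # bias of the maximum-likelihood variance estimate
    var = true_var
    for _ in range(generations):
        # Variance of the training mix (both components assumed zero-mean).
        mixed = alpha * true_var + (1 - alpha) * var
        # Expected variance of the Gaussian refitted to n samples of that mix.
        var = shrink * mixed
    return var


print(expected_variance(alpha=0.0))  # closed loop: decays toward zero
print(expected_variance(alpha=0.1))  # anchored loop: settles at a positive value
```

With `alpha = 0` the variance is multiplied by `(n - 1) / n` every generation and vanishes. With any constant `alpha > 0` the recurrence has a positive fixed point, which is the toy-model counterpart of the $\inf_t \alpha_t > 0$ case the paper itself excludes from its impossibility result.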

Specimen two: hallucination is inevitable

The pattern is older than this paper. In January 2024, Xu, Jain, and Kankanhalli published “Hallucination is Inevitable: An Innate Limitation of Large Language Models”. The argument is elegant. They define a “formal world” of computable functions. They define hallucination as follows:

Hallucination occurs whenever an LLM fails to exactly reproduce the output of a computable function.

They then invoke a diagonalisation argument from learning theory to show that no computably enumerable family of LLMs can learn every computable function, and conclude that any LLM must hallucinate on some inputs. The headline — hallucination is mathematically inevitable — was widely repeated.

What gets buried is what “hallucination” had to be defined as for the proof to go through. Under the paper’s definition, every finite system “hallucinates,” because no finite system can compute every computable function. By that standard, your pocket calculator hallucinates the Ackermann function and you hallucinate the prime factorisation of a fifteen-digit number. The proof says less than the headline implies; it says any general problem-solver will be wrong about something, somewhere.

And once again, the paper itself is more careful than its reception. Xu et al. explicitly note the get-out:

Knowledge-Enhanced LLMs … receive extra information about the ground truth function $f$ other than via training samples. Therefore, Theorem 3 is inapplicable herein.

The paper’s section on practical implications begins with “All LLMs trained only with input-output pairs will hallucinate when used as general problem solvers” (emphasis mine). The qualifier vanishes from the popularisations. The whole modern stack — retrieval, tool use, code execution, formal verifiers, knowledge bases — is, by the paper’s own admission, outside the theorem’s scope.

A 2025 follow-up by Suzuki et al. makes the point neatly in its subtitle: “Hallucinations are inevitable but can be made statistically negligible. The ‘innate’ inevitability of hallucinations cannot explain practical LLM issues.” The mathematical inevitability and the practical incidence are different problems. The former tells us almost nothing about the latter.

Specimen three: the math ceiling

The same shape, again, in 2025 and into 2026. Varin Sikka and Vishal Sikka’s paper “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models” was widely reported as proving that LLM agents have a fundamental “math ceiling.” The core theorem is straightforward:

Given a prompt of length $N$, which includes a computational task within it of complexity $O(N^3)$ or higher, where $d < N$, an LLM, or an LLM-based agent, will unavoidably hallucinate in its response.

The proof is one paragraph: cite the Hartmanis-Stearns time hierarchy theorem; observe that an LLM’s per-token computation is $O(N^2 \cdot d)$; conclude that tasks requiring asymptotically more time cannot be carried out correctly. It’s true. It is also, by construction, a result about the LLM’s core forward pass. The Sikkas are explicit about this in their discussion:
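The arithmetic behind that step fits on one line. This is a sketch of the comparison only, ignoring constant factors and taking $d$ to be the model's embedding dimension, which is an assumption here since the quoted statement does not define it:

$$\frac{N^3}{N^2 \cdot d} = \frac{N}{d} > 1 \quad \text{whenever } d < N$$

A task that needs on the order of $N^3$ steps therefore asks for more work than a single pass performs, and with $d$ held fixed the shortfall grows linearly in $N$.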

While our work is about the limitations of individual LLMs, multiple LLMs working together can obviously achieve higher abilities. … various approaches are being developed, from composite systems to augmenting or constraining LLMs with rigorous approaches.

In other words: an unaided transformer of fixed dimensions, evaluated on tasks whose complexity exceeds its forward-pass complexity, will fail. Yes. And this tells us almost nothing about what an LLM-with-tools can do. Agents do not run inside the assumptions of the proof. They use scratchpads. They call solvers. They invoke MathJS and Lean and Wolfram. They write Python and run it. The theorem says transformers-without-tools cannot do TSP-in-a-fixed-context. The actual systems being deployed are transformers-with-tools, and the relevant empirical question — how good can the composite get? — is not addressed by the theorem at all.
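To make the contrast concrete, here is a minimal sketch of the composite. Everything in it is hypothetical: `solve_tsp` is a brute-force stand-in for whatever solver an agent might call, and `agent_answer` stands in for the model, whose only job here is to route the task to the tool.

```python
from itertools import permutations
from math import dist, inf

def solve_tsp(points):
    """Hypothetical tool: exact shortest closed tour by brute force.

    The factorial-time search runs here, in ordinary code, not inside
    any model's forward pass.
    """
    start, rest = points[0], points[1:]
    best_length, best_tour = inf, None
    for order in permutations(rest):
        tour = (start, *order)
        # Sum each leg, including the leg back to the starting point.
        length = sum(
            dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour))
        )
        if length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour

def agent_answer(task):
    """Stand-in for the composite system: the 'model' routes, the tool computes."""
    if task["kind"] == "tsp":
        return solve_tsp(task["points"])
    raise ValueError("no tool registered for this task")

length, tour = agent_answer(
    {"kind": "tsp", "points": [(0, 0), (0, 1), (1, 1), (1, 0)]}
)
print(length, tour)
```

The expensive computation happens in the tool, so the forward-pass budget that the theorem bounds is never the thing being asked to do the work. How well real models choose and use such tools is an empirical question, and it is the one the theorem leaves open.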

The paper’s authors are clear-eyed about this. Tudor Achim, quoted in the WebProNews coverage, takes the productive view: “I think hallucinations are intrinsic to LLMs and also necessary for going beyond human intelligence.” His company’s bet is on what they call “mathematical superintelligence” — verified niches where formal checking provides external signal. That’s exactly the right response to the proof. It is not the response the headlines pick up.

The shape of the move

Three papers, three different negative claims, the same structural pattern. It works like this:

1. Take a maximalist version of the claim being attacked. RSI must mean closed-loop autophagy. Hallucination must mean failure to compute any computable function. Reasoning must mean executing tasks unaided in a fixed forward-pass budget. In each case, the strongest, most cartoonish reading is selected, because the strongest reading is what the math will handle. Zenil makes this almost explicit: he models “the autonomy regime” specifically because that’s the regime in which the theorem applies.

2. Prove a theorem about that reading. The theorems are usually fine. The mathematical content is real. KL flows do collapse under vanishing exogenous signal. Computably enumerable families cannot exhaust the computable functions. Fixed-precision transformers cannot solve arbitrarily large computational problems in $O(N^2 \cdot d)$ time. None of this is in dispute.

3. Drop the assumptions in the popularisation. The conditional becomes unconditional. “Under these assumptions” becomes “in principle.” “For this class of systems” becomes “for AI.” The authors’ own qualifications (the results do not prove that all forms of recursive self-improvement collapse; Theorem 3 is inapplicable to systems with external knowledge; multiple LLMs working together can obviously achieve higher abilities) disappear in transit. The reader, encountering the headline, has no easy way to recover the lost qualifications. Qualifications are exactly what doesn’t survive a headline.

4. Garnish with vibes. The universe doesn’t give you compound interest on noise. The escape route collapses into a trap. It’s like trying to bootstrap yourself off the ground by pulling your own shoelaces. The aesthetics borrow the gravity of mathematics (the QED, the elegance, the inevitability) and graft them onto claims the mathematics did not establish. The form does the work the content cannot.

The result is a kind of vibes-laundering machine wearing a lab coat. A narrow, conditional technical result is converted, by stages, into a metaphysical conclusion. And because the source paper is real, and the math is real, the conclusion borrows credibility it has not earned.

Why it matters

It would be churlish to object to this if the technical results themselves were not interesting. They are. Model collapse is a real phenomenon and worth understanding. Computability bounds are worth knowing. The complexity ceiling on unaided transformers tells us something genuine about where to put effort. The papers, on the whole, are fine. It’s the inferential layer above the papers — the popularisation, the headline, the LinkedIn post — where the damage happens.

What gets lost is the actual operating principle of where AI progress has come from in the last three years, which is precisely not closed-loop magic. It is the patient construction of external discipline: graders, checkers, tools, environments with ground truth, humans in the loop, formal verifiers. Where verification is cheap, recursive improvement is not speculative — it is shipping. Where verification is hard, hallucination is not theoretically inevitable in any meaningful sense — it is empirically common, and the work is to find better verifiers. The Bitter Lesson, applied to applied AI, says roughly this: stop trying to engineer the limits in; build the loop and let the loop teach you. These “mathematically proven” results, read carelessly, tell people the loop cannot work. Read carefully — and at least two of the three I’ve discussed are explicit about this — they tell us what shape the loop has to have.
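The shape of such a loop can be sketched in a few lines. This is a toy under stated assumptions, not a description of any production system: a random guesser stands in for an unreliable generator, factoring a small number stands in for the task, and `propose_factor`, `verify`, and `verified_factor` are names invented for the illustration.

```python
import random

def propose_factor(n, rng):
    """Stand-in for an unreliable generator: guess a candidate factor of n."""
    return rng.randrange(2, n)

def verify(n, candidate):
    """Cheap external check that does not care how the candidate was produced."""
    return 1 < candidate < n and n % candidate == 0

def verified_factor(n, rng, max_attempts=10_000):
    """Keep proposing until the verifier accepts; return (factor, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        candidate = propose_factor(n, rng)
        if verify(n, candidate):
            return candidate, attempt
    return None, max_attempts

rng = random.Random(0)
factor, attempts = verified_factor(91, rng)
print(f"accepted factor {factor} after {attempts} attempts")
```

The generator is wrong most of the time and the output is still correct every time it is accepted, because the check is cheap and sits outside the generator. Where no such check exists, the same loop has nothing to discipline it, which is the practical content of the careful reading.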

There is also a class element to the rhetorical move that is worth naming. Mathematically proven is a phrase with enormous social power. It signals that the question is settled, that disagreement is not just wrong but innumerate, that the priesthood has spoken. To unpack the assumptions requires either mathematical literacy or a willingness to be told you don’t understand the math. The asymmetry favours the headline. This is why the pattern keeps repeating — it pays in attention, and the cost of correction falls on someone else.

The honest version of each of these papers is, in fact, the version their authors wrote. Zenil’s conclusion: recursive self-improvement framed as progressively self-contained generative retraining cannot yield unbounded growth under standard distributional learning dynamics. Xu et al.’s caveat: all LLMs trained only with input-output pairs will hallucinate when used as general problem solvers. The Sikkas’: multiple LLMs working together can obviously achieve higher abilities. Each of these is a careful, conditional, useful result. None of them is “AI cannot X.”

A modest proposal

I am not arguing with the math. The math is fine. I am arguing with a habit of inference: the move from theorem about idealised object X to fact about real-world object Y, when Y is not X and the popularisation pretends it is.

The next time you see a piece claiming that mathematics has proven some negative about LLMs, the question to ask is not “is the proof correct?” It almost certainly is. The questions are: what exactly was modelled? What assumptions did the proof require? Do the systems we actually run satisfy those assumptions? And, usually most damning of all, do the paper’s authors themselves disclaim the strong reading? In every example I have looked at, the answer to the third question is no, and the answer to the fourth is yes. The gap between the modelled object and the deployed system is where all the interesting work is happening.

Eppur si muove. The systems keep getting better. The theorems keep arriving to explain why they cannot. Both can be true. They are usually about different things.
