Deciphering Glyph

Adversarial Communication

2026-06-23T13:06:00-07:00

As I have discussed in previous posts, “AIs” can make mistakes. In fact, they do make mistakes, and their mistake-making patterns are such that where and how they will make mistakes is both uncertain and constantly changing.

Thus, in any scenario where you want to attempt to make “productive” use of “AI”, you must have a system in place for checking every result. Not checking some results; checking every result. If each result might have a consequence for you (and if it didn’t have a consequence, why bother automating it?) and you cannot predict in advance which kinds of results will need verification, then verification is always required.

The verification often ends up being just as expensive as doing the work in the first place, which means that if you want your usage of “AI” to be personally profitable, you have to find someone else to externalize the cost of verification onto. This person becomes your adversary, and, if you are successful, your “AI’s” victim.

The Ladder-Climber And Their Reverse-Centaur Rungs

One way that this constellation of facts can straightforwardly assemble themselves into a dystopian nightmare is the phenomenon, described by Cory Doctorow, of the reverse centaur. This is when your employer non-consensually turns you into the verification system. The “AI” does the fun part of initially performing the work, and then you do the boring part where you check if the robot is right and clean up its messes, even if everyone already knows that it would, in aggregate, be cheaper for you to do the work in the first place.

Reverse centaurs can be made from any automation, not only “AI” automation. I think that there is a reason that this term happens to have emerged in the “age of AI”, though, and not with earlier automation technologies (even those which were considerably more viscerally horrific). That reason is: the wrongness of “AI” output is not merely a technical feature that must be compensated for, it is a generalized externality.

As I mentioned above, if you are responsible for the entirety of the work, both extruding the “AI” output and checking it, it’s usually cheaper to have humans do the entirety of the work to begin with. When humans do the writing directly, we can check as we go, and thus verification doesn’t need to be as comprehensive.

When “AI” coding advocates say “code review is the bottleneck”, what they are observing is that the LLM is still rolling the dice for each PR, and a human is still necessary to verify that each of those rolls is a winner. But calling this process “code review” is a bit of a misnomer; it’s not really “code review” in the traditional sense, it’s human understanding.

Before the advent of “AI”, the human understanding was implicit in the process of writing the code in the first place¹, and the code review was a way of diffusing and extending that understanding. Now that the code can be authored with no initial understanding taking place, that cost has not gone away, it has moved.

Human understanding was always the bottleneck.

However, this is taking a collaborative view of a software project, where satisfying the needs and solving the problems of your customers are the goals. We can see that “AI” is a bad tool to satisfy those goals, because all it’s doing is converting the first half of the work, that of understanding the code as you write it, to understanding the agent’s output as you read it.

What if, instead, we were to take the view that every software company is a Hobbesian nightmare, red in tooth and claw? In this view, the only goal of a software project is for the individual developers to make their promo cycles and get their bonuses. Given that there is only a certain amount of money to go around, this is a zero-sum game where each programmer wants to look more productive than their colleagues.

Pretty much every organization finds it easy to reward “productivity” as expressed by lines of code emitted, but the benefits of doing thorough and thoughtful design, analysis, and code review very difficult to reward. In this world, an LLM is an invaluable tool for the sociopathic ladder-climber, particularly if your legacy organization is still structuring their workflows as if the person prompting the bot is “writing” the code, and then they get to foist off the act of “reviewing” the code onto someone else.

Here, the prompter effectively externalizes the cost of the LLM’s failures but internalizes any benefits. The prompter will vibe-code a big feature, so large that the assigned reviewer can’t possibly comprehend it all effectively. When this happens, the reviewer will, eventually, be pressured to approve it, even if they can try to spot a few problems along the way. The reviewer has their own work to get back to, after all, the obligation to review the prompter’s (read: the bot’s) code is a drain on their time that they are not going to get rewarded for.

If this feature is a big success, the prompter gets a promotion. If it causes a big issue, well, the reviewer must not have been careful enough.

This is why LLMs are “good for coding”, and also why their biggest promoters keep having outages.

The Generative Gish Galloper

Coding is the biggest “success story” of this type of adversarial communication, but it is by far not the only instance of such a thing. LLMs create a new form of leverage that can turn Brandolini’s law from a linear advantage into an exponential one. If you are engaged in a political debate where you want to overwhelm the other side in nonsense, an LLM can generate bullshit faster than it is physically possible for a human being to type, let alone respond thoughtfully. There is an asymmetry to the utility of this weapon as well: only one side of the political spectrum wants to flood the zone and destroy trust in institutions and the concept of truth. There’s a good reason that the fascists love it.

Straightforward Spam and Fraud

This is kind of obvious, but LLMs can generate lightly-customized, plausible-looking text much more quickly than any human being. This facilitates their use in fraud, spam, and scams. In a spamming or fraudulent interaction, once again, the costs are externalized onto the victim: the recipient of a spam message has to do all the work of “checking” the LLM’s output. Spammers already expect very low hit rates from boilerplate, and if the LLM can increase those percentages from 1% to 5% the technology will pay for itself; they don’t need anything like reliable accuracy.

Customer “Support”

If you have any kind of commercial relationship with a company, I probably don’t even need to mention this: customer “support” bots are a misery. Everybody knows it at this point. But customer support is usually conceptualized by businesses as an adversarial interaction, because it is a cost center. They maintain internal metrics on time-to-resolution and try to optimize them. Implicitly, this creates a dynamic where the goal of the customer service agent’s job is not to solve your problem, but to emit noise that will cause you to think your problem is resolved, or to give up, as fast as possible. Unsurprisingly, LLMs can emit this noise faster than humans can, getting those customers off the phone. But those customers will remember those interactions, and the story outside the TTR metrics is horrible.

Similarly to the situation in software development, LLMs can look very good on paper for customer support, but mostly what they are doing is illuminating the problems with the industry’s existing metrics, by turning “winning the metrics battle against the customer” into a more obvious and immediate defeat for the company’s long term reputation.

“Education”

In 2026 it is sadly a fact of life that students cheat all the time using “AI”, and that this cheating is very successful, in that the teachers find it very hard to detect.

LLMs are great for cheating on schoolwork because the student is externalizing the work of the checking onto the teachers, who are often starting at a disadvantage to begin with, at least in the US.

My view is that this is happening because of a divergence in the way that students vs. teachers (or, more accurately, “the broader educational system”) view grading.

When a student is asked to write an essay, the teachers see the effort as both intrinsically worthwhile for the student, as well as useful as a pedagogical tool to evaluate and react to the student’s progress. The student, by contrast, sees a stumbling block designed to knock them off the path to success and into a permanent underclass. It is no wonder that the student sees “AI” as useful to their own goals and has no compunction about deploying it.

There is a bitter irony that the ability to understand the inherent value of actually writing the essay on their own is the sort of thing that students can really only learn by writing a bunch of essays. There’s no way that I can think of which makes the benefit legible as long as a shortcut is available.

The net effect here is a downward spiral, where the already-wobbling educational system is sustaining an attack that it doesn’t have the resources to recover from. The individual students’ attacks against their teachers and their schools’ grading systems might appear to momentarily succeed, but they will win the battle and lose the war.

Spamming “For Good”?

Usually when we talk about someone unilaterally choosing to enter into an adversarial relationship, that’s an “attack” and for good reasons we have a negative impression of the attacker. However, I would be remiss if I did not point out that there are some cases where the relationship was already adversarial; just because you’re the attacker doesn’t mean that you are evil.

For example we might imagine use-cases like automatically filing appeals for prior authorizations against health insurance. It’s relatively well-known at this point that the main way for-profit insurers maintain their margins is by denying claims right up to the line of the policies themselves being fraud, so using a spamming tool to fight them might be entirely justifiable² in that case.

Similarly, using an LLM could be justified in a fight against a company refusing to honor a warranty. One could imagine using an LLM to immediately generate replies and escalations.

However, even in imagined cases like these, the underlying problem is that the insurers and the vendors already have a tremendous amount of structural power, so it is more likely that they will have the advantage in deploying a communications weapon like an LLM, as well as enacting policies to simply ignore any LLM-based communication that you might submit. Worse, if these strategies were to become widespread, they might provide an excuse to reject any communications by feeding them into an unreliable “LLM detector” and issuing an automated “computer says no” even to hand-written correspondence.

It is also worth stressing that these cases are imagined, as compared to the very real coworker-abuse, spam, scam, fraud, and disinformation campaigns being waged in real life today.

Therefore, while legitimate uses might exist, it’s hard to imagine that there’s anywhere they would be genuinely valuable and sustainable. In the best case “AI” will provide a temporary advantage for underdogs that will provoke an arms race which the resource-advantaged adversaries will win in the long run, in the worst case the arms race itself will cement permanent structural change that will make things worse.

“Search” By Stealing

Most of the adversarial utility of “AI” is on the “write” side, since write-amplification is more obviously aggressive than reading. But the “read” side of LLMs — summarization and question-answering — can be a form of attack as well.

To begin with, the act of reading itself is currently enormously destructive, but that’s arguably not a fundamental aspect of this technology. They could set reasonable rate-limits and respect things like robots.txt, as search engines have for decades now. They could also refrain from committing criminal levels of copyright infringement. But, today, using “AI” tools does suborn this sort of out-of-control crawling.

More insidiously, consider the scenario described in this YouTube video. The LTT Bros decided to try Linux again, and in the course of so doing, they had problems. When trying to solve these problems, they were faced with a choice: they could consult Reddit, or they could ask an LLM. Asking an LLM would “gaslight the heck out of” them, but they still found it preferable, because they would at least get an answer without getting yelled at.

Initially this sounds great. But it also means that you want to extract knowledge from a community, while mechanically eliding any values or norms that the community may want to impart as part of offering that knowledge. As someone who spent many years in a community tech support role, this is worrying. Many requests for support are people asking how to do things that will momentarily solve a superficial problem but create a long-term reliability problem or even an immediate security risk, that the question-asker doesn’t want to hear about. Consider the question “I’m tired of entering my password so much, how do I make it so my laptop unlocks automatically”. An obsequious chatbot will helpfully tell you how to do this without pushback.

But, this is also a sort of ethically murky area. The Linux community is somewhat famously, for many years now, a toxic cesspool of general hostility, misogyny, etc. It is certainly a good thing that people can get access to this knowledge without subjecting themselves to abuse. But it also means that the people with the power and the privilege to change the community for the better can just quietly withdraw, rather than fixing the problems. It also means that the positive elements of culture cannot be transmitted, and people will have no opportunity to learn about unknown unknowns.

In this case, the “adversarial” communication is with society. The thing that using an LLM for search lets you do is withdraw from society and avoid forming any personal connections. There are some personal connections which are painful and annoying, and so that can feel like a momentary balm. But the need to make connections in general is, like, the concept of society itself.

Who Am I Hurting?

LLMs are good at adversarial communication. They are so good at it, relative to their other benefits, that they will tend to make communications adversarial if you are not remaining vigilant about the possibility that it might do so. My request to you, dear reader, if you are going to use such tools, is to always ask yourself, “who might I be hurting, if I use an LLM for this?”

If you’re using an “AI”, who is its adversary? If you haven’t given it one yet, who might the “AI” turn into an adversary? Who might you overwhelm with an asymmetric amount of output, or, if you’re receiving information and not sending it, who are you taking that information from without consulting?

Figure out the answers to these questions and conduct yourself accordingly; the answer might be “yourself”.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor!

One of the reasons that software developers tend to prefer greenfield development is that when you are given a blank page, you can project your own specific understanding onto it. You can structure the codebase in a way that works for your brain, down to the variable naming conventions and the module layouts. LLM-assisted development makes everything into instant brownfield work, which makes developers instantly miserable; even those who are excited about the technology will frequently complain about how it feels like their agency has been stolen and their joy in the work has been diminished. But I digress. ↩
Modulo the massive amount of other externalities involved in using LLMs, of course, but I don’t have the time or energy to get into those here. ↩

Opaque Types in Python

2026-05-21T17:33:00-07:00

Let’s say you’re writing a Python library.

In this library, you have some collection of state that represents “options” or “configuration” for a bunch of operations. Such a set of options is a bundle of potentially ever-increasing complexity. Thus, you will want it to have an extremely minimal compatibility surface, with a very carefully chosen public interface, that is either small, or perhaps nothing at all. Such an object conveys state and might have some private behavior, but all you want consumers to be able to do is build it in very constrained, specific ways, and then pass it along as a parameter to your own APIs.

By way of example, imagine that you’re wrapping a library that handles shipping physical packages.

There are a zillion ways to do it ship a package. There are different carriers who can ship it for you. There’s air freight, and ground freight, and sea freight. There’s overnight shipping. There’s the option to require a signature. There’s package tracking and certified mail. Suffice it to say, lots of stuff.

If you are starting out to implement such a library, you might need an object called something like ShippingOptions that encapsulates some of this. At the core of your library you might have a function like this:

async def shipPackage(
        how: ShippingOptions,
        where: Address,
    ) -> ShippingStatus:
    ...

If you are starting out implementing such a library, you know that you’re going to get the initial implementation of ShippingOptions wrong; or, at the very least, if not “wrong”, then “incomplete”. You should not want to commit to an expansive public API with a ton of different attributes until you really understand the problem domain pretty well.

Yet, ShippingOptions is absolutely vital to the rest of your library. You’ll need to construct it and pass it to various methods like estimateShippingCost and shipPackage. So you’re not going to want a ton of complexity and churn as you evolve it to be more complex.

Worse yet, this object has to hold a ton of state. It’s got attributes, maybe even quite complex internal attributes that relate to different shipping services.

Right now, today, you need to add something so you can have “no rush”, “standard” and “expedited” options. You can’t just put off implementing that indefinitely until you can come up with the perfect shape. What to do?

The tool you want here is the opaque data type design pattern. C is lousy with such things (FILE, pthread_*_t, fd_set, etc). A typedef in a header file can easily achieve this.

But in Python, if you expose a dataclass — or any class, really — even if you keep all your fields private, the constructor is still, inherently, public. You can make it raise an exception or something, but your type checker still won’t help your users; it’ll still look like it’s a normal class.

Luckily, Python typing provides a tool for this: typing.NewType.

Let’s review our requirements:

We need a type that our client code can use in its type annotations; it needs to be public.
They need to be able to consruct it somehow, even if they shouldn’t be able to see its attributes or its internal constructor arguments.
To express high-level things (like “ship fast”) that should stay supported as we add more nuanced and complex configurations in the future (like “ship with the fastest possible option provided by the lowest-cost carrier that supports signature verification”).

In order to solve these problems respectively, we will use:

a public NewType, which gives us our public name...
which wraps a private class with entirely private attributes, to give us an actual data structure, while not exposing the constructor,
a set of public constructor functions, which returns our NewType.

When we put that all together, it looks like this:

from dataclasses import dataclass
from typing import Literal, NewType

@dataclass
class _RealShipOpts:
    _speed: Literal["fast", "normal", "slow"]

ShippingOptions = NewType("ShippingOptions", _RealShipOpts)

def shipFast() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts("fast"))

def shipNormal() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts("normal"))

def shipSlow() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts("slow"))

As a snapshot in time, this is not all that interesting; we could have just exposed _RealShipOpts as a public class and saved ourselves some time. The fact that this exposes a constructor that takes a string is not a big deal for the present moment. For an initial quick and dirty implementation, we can just do checks like if options._speed == "fast" in our shipping and estimation code.

However, the main thing we are doing here is preserving our flexibility to evolve the related APIs into the future, so let’s see how we might do that. For example, let’s allow the shipping options to contain a concrete and specific carrier and freight method:

from dataclasses import dataclass
from enum import Enum, auto
from typing import NewType

class Carrier(Enum):
    FedEx = auto()
    USPS = auto()
    DHL = auto()
    UPS = auto()

class Conveyance(Enum):
    air = auto()
    truck = auto()
    train = auto()

@dataclass
class _RealShipOpts:
    _carrier: Carrier
    _freight: Conveyance

ShippingOptions = NewType("ShippingOptions", _RealShipOpts)

def shipFast() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(Carrier.FedEx, Conveyance.air))

def shipNormal() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(Carrier.UPS, Conveyance.truck))

def shipSlow() -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(Carrier.USPS, Conveyance.train))

def shippingDetailed(
    carrier: Carrier, conveyance: Conveyance
) -> ShippingOptions:
    return ShippingOptions(_RealShipOpts(carrier, conveyance))

As a NewType, our public ShippingOptions type doesn’t have a constructor. Since _RealShipOpts is private, and all its attributes are private, we can completely remove the old versions.

Anything within our shipping library can still access the private variables on ShippingOptions; as a NewType, it’s the same type as its base at runtime, so it presents minimal¹ overhead.

Clients outside our shipping library can still call all of our public constructors: shipFast, shipNormal, and shipSlow all still work with the same (as far as calling code knows) signature and behavior.

If you need to build and convey some state within your public API, while avoiding breakages associated with compatibility churn, hopefully this technique can help you do that!

Acknowledgments

Thanks for reading, and thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor.

The overhead is minimal, but it is not completely zero. The suggested idiom for converting to a NewType is to call it like a function, as I’ve done in these examples, but if you are wanting to use this pattern inside of a hot loop, you can use # type: ignore[return-value] comments to avoid that small cost. ↩

What Is Code Review For?

2026-03-03T21:24:00-08:00

Humans Are Bad At Perceiving

Humans are not particularly good at catching bugs. For one thing, we get tired easily. There is some science on this, indicating that humans can’t even maintain enough concentration to review more than about 400 lines of code at a time..

We have existing terms of art, in various fields, for the ways in which the human perceptual system fails to register stimuli. Perception fails when humans are distracted, tired, overloaded, or merely improperly engaged.

Each of these has implications for the fundamental limitations of code review as an engineering practice:

Inattentional Blindness: you won’t be able to reliably find bugs that you’re not looking for.
Repetition Blindness: you won’t be able to reliably find bugs that you are looking for, if they keep occurring.
Vigilance Fatigue: you won’t be able to reliably find either kind of bugs, if you have to keep being alert to the presence of bugs all the time.
and, of course, the distinct but related Alert Fatigue: you won’t even be able to reliably evaluate reports of possible bugs, if there are too many false positives.

Never Send A Human To Do A Machine’s Job

When you need to catch a category of error in your code reliably, you will need a deterministic tool to evaluate — and, thanks to our old friend “alert fatigue” above — ideally, to also remedy that type of error. These tools will relieve the need for a human to make the same repetitive checks over and over. None of them are perfect, but:

to catch logical errors, use automated tests.
to catch formatting errors, use autoformatters.
to catch common mistakes, use linters.
to catch common security problems, use a security scanner.

Don’t blame reviewers for missing these things.

Code review should not be how you catch bugs.

What Is Code Review For, Then?

Code review is for three things.

First, code review is for catching process failures. If a reviewer has noticed a few bugs of the same type in code review, that’s a sign that that type of bug is probably getting through review more often than it’s getting caught. Which means it’s time to figure out a way to deploy a tool or a test into CI that will reliably prevent that class of error, without requiring reviewers to be vigilant to it any more.

Second — and this is actually its more important purpose — code review is a tool for acculturation. Even if you already have good tools, good processes, and good documentation, new members of the team won’t necessarily know about those things. Code review is an opportunity for older members of the team to introduce newer ones to existing tools, patterns, or areas of responsibility. If you’re building an observer pattern, you might not realize that the codebase you’re working in already has an existing idiom for doing that, so you wouldn’t even think to search for it, but someone else who has worked more with the code might know about it and help you avoid repetition.

You will notice that I carefully avoided saying “junior” or “senior” in that paragraph. Sometimes the newer team member is actually more senior. But also, the acculturation goes both ways. This is the third thing that code review is for: disrupting your team’s culture and avoiding stagnation. If you have new talent, a fresh perspective can also be an extremely valuable tool for building a healthy culture. If you’re new to a team and trying to build something with an observer pattern, and this codebase has no tools for that, but your last job did, and it used one from an open source library, that is a good thing to point out in a review as well. It’s an opportunity to spot areas for improvement to culture, as much as it is to spot areas for improvement to process.

Thus, code review should be as hierarchically flat as possible. If the goal of code review were to spot bugs, it would make sense to reserve the ability to review code to only the most senior, detail-oriented, rigorous engineers in the organization. But most teams already know that that’s a recipe for brittleness, stagnation and bottlenecks. Thus, even though we know that not everyone on the team will be equally good at spotting bugs, it is very common in most teams to allow anyone past some fairly low minimum seniority bar to do reviews, often as low as “everyone on the team who has finished onboarding”.

Oops, Surprise, This Post Is Actually About LLMs Again

Sigh. I’m as disappointed as you are, but there are no two ways about it: LLM code generators are everywhere now, and we need to talk about how to deal with them. Thus, an important corollary of this understanding that code review is a social activity, is that LLMs are not social actors, thus you cannot rely on code review to inspect their output.

My own personal preference would be to eschew their use entirely, but in the spirit of harm reduction, if you’re going to use LLMs to generate code, you need to remember the ways in which LLMs are not like human beings.

When you relate to a human colleague, you will expect that:

you can make decisions about what to focus on based on their level of experience and areas of expertise to know what problems to focus on; from a late-career colleague you might be looking for bad habits held over from legacy programming languages; from an earlier-career colleague you might be focused more on logical test-coverage gaps,
and, they will learn from repeated interactions so that you can gradually focus less on a specific type of problem once you have seen that they’ve learned how to address it,

With an LLM, by contrast, while errors can certainly be biased a bit by the prompt from the engineer and pre-prompts that might exist in the repository, the types of errors that the LLM will make are somewhat more uniformly distributed across the experience range.

You will still find supposedly extremely sophisticated LLMs making extremely common mistakes, specifically because they are common, and thus appear frequently in the training data.

The LLM also can’t really learn. An intuitive response to this problem is to simply continue adding more and more instructions to its pre-prompt, treating that text file as its “memory”, but that just doesn’t work, and probably never will. The problem — “context rot” is somewhat fundamental to the nature of the technology.

Thus, code-generators must be treated more adversarially than you would a human code review partner. When you notice it making errors, you always have to add tests to a mechanical, deterministic harness that will evaluates the code, because the LLM cannot meaningfully learn from its mistakes outside a very small context window in the way that a human would, so giving it feedback is unhelpful. Asking it to just generate the code again still requires you to review it all again, and as we have previously learned, you, a human, cannot review more than 400 lines at once.

To Sum Up

Code review is a social process, and you should treat it as such. When you’re reviewing code from humans, share knowledge and encouragement as much as you share bugs or unmet technical requirements.

If you must reviewing code from an LLM, strengthen your automated code-quality verification tooling and make sure that its agentic loop will fail on its own when those quality checks fail immediately next time. Do not fall into the trap of appealing to its feelings, knowledge, or experience, because it doesn’t have any of those things.

But for both humans and LLMs, do not fall into the trap of thinking that your code review process is catching your bugs. That’s not its job.

Acknowledgments

How To Argue With Me About AI, If You Must

2026-01-04T21:22:00-08:00

As you already know if you’ve read any of this blog in the last few years, I am a somewhat reluctant — but nevertheless quite staunch — critic of LLMs. This means that I have enthusiasts of varying degrees sometimes taking issue with my stance.

It seems that I am not going to get away from discussions, and, let’s be honest, pretty intense arguments about “AI” any time soon. These arguments are starting to make me quite upset. So it might be time to set some rules of engagement.

I’ve written about all of these before at greater length, but this is a short post because it’s not about the technology or making a broader point, it’s about me. These are rules for engaging with me, personally, on this topic. Others are welcome to adopt these rules if they so wish but I am not encouraging anyone to do so.

Thus, I’ve made this post as short as I can so everyone interested in engaging can read the whole thing. If you can’t make it through to the end, then please just follow Rule Zero.

Rule Zero: Maybe Don’t

You are welcome to ignore me. You can think my take is stupid and I can think yours is. We don’t have to get into an Internet Fight about it; we can even remain friends. You do not need to instigate an argument with me at all, if you think that my analysis is so bad that it doesn’t require rebutting.

Rule One: No ‘Just’

As I explained in a post with perhaps the least-predictive title I’ve ever written, “I Think I’m Done Thinking About genAI For Now”, I’ve already heard a bunch of bad arguments. Don’t tell me to ‘just’ use a better model, use an agentic tool, use a more recent version, or use some prompting trick that you personally believe works better. If you skim my work and think that I must not have deeply researched anything or read about it because you don’t like my conclusion, that is wrong.

Rule Two: No ‘Look At This Cool Thing’

Purely as a productivity tool, I have had a terrible experience with genAI. Perhaps you have had a great one. Neat. That’s great for you. As I explained at great length in “The Futzing Fraction”, my concern with generative AI is that I believe it is probably a net negative impact on productivity, based on both my experience and plenty of citations. Go check out the copious footnotes if you’re interested in more detail.

Therefore, I have already acknowledged that you can get an LLM to do various impressive, cool things, sometimes. If I tell you that you will, on average, lose money betting on a slot machine, a picture of a slot machine hitting a jackpot is not evidence against my position.

Rule Two And A Half: Engage In Metacognition

I specifically didn’t title the previous rule “no anecdotes” because data beyond anecdotes may be extremely expensive to produce. I don’t want to say you can never talk to me unless you’re doing a randomized controlled trial. However, if you are going to tell me an anecdote about the way that you’re using an LLM, I am interested in hearing how you are compensating for the well-documented biases that LLM use tends to induce. Try to measure what you can.

Rule Three: Do Not Cite The Deep Magic To Me

As I explained in “A Grand Unified Theory of the AI Hype Cycle”, I already know quite a bit of history of the “AI” label. If you are tempted to tell me something about how “AI” is really such a broad field, and it doesn’t just mean LLMs, especially if you are trying to launder the reputation of LLMs under the banner of jumbling them together with other things that have been called “AI”, I assure you that this will not be convincing to me.

Rule Four: Ethics Are Not Optional

I have made several arguments in my previous writing: there are ethical arguments, efficacy arguments, structuralist arguments, efficiency arguments and aesthetic arguments.

I am happy to, for the purposes of a good-faith discussion, focus on a specific set of concerns or an individual point that you want to make where you think I got something wrong. If you convince me that I am entirely incorrect about the effectiveness or predictability of LLMs in general or as specific LLM product, you don’t need to make a comprehensive argument about whether one should use the technology overall. I will even assume that you have your own ethical arguments.

However, if you scoff at the idea that one should have any ethical boundaries at all, and think that there’s no reason to care about the overall utilitarian impact of this technology, that it’s worth using no matter what else it does as long as it makes you 5% better at your job, that’s sociopath behavior.

This includes extreme whataboutism regarding things like the water use of datacenters, other elements of the surveillance technology stack, and so on.

Consequences

These are rules, once again, just for engaging with me. I have no particular power to enact broader sanctions upon you, nor would I be inclined to do so if I could. However, if you can’t stay within these basic parameters and you insist upon continuing to direct messages to me about this topic, I will summarily block you with no warning, on mastodon, email, GitHub, IRC, or wherever else you’re choosing to do that. This is for your benefit as well: such a discussion will not be a productive use of either of our time.

The Next Thing Will Not Be Big

2026-01-01T17:59:00-08:00

The dawning of a new year is an opportune moment to contemplate what has transpired in the old year, and consider what is likely to happen in the new one.

Today, I’d like to contemplate that contemplation itself.

The 20th century was an era characterized by rapidly accelerating change in technology and industry, creating shorter and shorter cultural cycles of changes in lifestyles. Thus far, the 21st century seems to be following that trend, at least in its recently concluded first quarter.

The early half of the twentieth century saw the massive disruption caused by electrification, radio, motion pictures, and then television.

In 1971, Intel poured gasoline on that fire by releasing the 4004, a microchip generally recognized as the first general-purpose microprocessor. Popular innovations rapidly followed: the computerized cash register, the personal computer, credit cards, cellular phones, text messaging, the Internet, the web, online games, mass surveillance, app stores, social media.

These innovations have arrived faster than previous generations, but also, they have crossed a crucial threshold: that of the human lifespan.

While the entire second millennium A.D. has been characterized by a gradually accelerating rate of technological and social change — the printing press and the industrial revolution were no slouches, in terms of changing society, and those predate the 20th century — most of those changes had the benefit of unfolding throughout the course of a generation or so.

Which means that any individual person in any given century up to the 20th might remember one major world-altering social shift within their lifetime, not five to ten of them. The diversity of human experience is vast, but most people would not expect that the defining technology of their lifetime was merely the latest in a progression of predictable civilization-shattering marvels.

Along with each of these successive generations of technology, we minted a new generation of industry titans. Westinghouse, Carnegie, Sarnoff, Edison, Ford, Hughes, Gates, Jobs, Zuckerberg, Musk. Not just individual rich people, but entire new classes of rich people that did not exist before. “Radio DJ”, “Movie Star”, “Rock Star”, “Dot Com Founder”, were all new paths to wealth opened (and closed) by specific technologies. While most of these people did come from at least some level of generational wealth, they no longer came from a literal hereditary aristocracy.

To describe this new feeling of constant acceleration, a new phrase was coined: “The Next Big Thing”. In addition to denoting that some Thing was coming and that it would be Big (i.e.: that it would change a lot about our lives), this phrase also carries the strong implication that such a Thing would be a product. Not a development in social relationships or a shift in cultural values, but some new and amazing form of conveying salted meat paste or what-have-you, that would make whatever lucky tinkerer who stumbled into it into a billionaire — along with any friends and family lucky enough to believe in their vision and get in on the ground floor with an investment.

In the latter part of the 20th century, our entire model of capital allocation shifted to account for this widespread belief. No longer were mega-businesses built by bank loans, stock issuances, and reinvestment of profit, the new model was “Venture Capital”. Venture capital is a model of capital allocation explicitly predicated on the idea that carefully considering each bet on a likely-to-succeed business and reducing one’s risk was a waste of time, because the return on the equity from the Next Big Thing would be so disproportionately huge — 10x, 100x, 1000x – that one could afford to make at least 10 bad bets for each good one, and still come out ahead.

The biggest risk was in missing the deal, not in giving a bunch of money to a scam. Thus, value investing and focus on fundamentals have been broadly disregarded in favor of the pursuit of the Next Big Thing.

If Americans of the twentieth century were temporarily embarrassed millionaires, those of the twenty-first are all temporarily embarrassed FAANG CEOs.

The predicament that this tendency leaves us in today is that the world is increasingly run by generations — GenX and Millennials — with the shared experience that the computer industry, either hardware or software, would produce some radical innovation every few years. We assume that to be true.

But all things change, even change itself, and that industry is beginning to slow down. Physically, transistor density is starting to brush up against physical limits. Economically, most people are drowning in more compute power than they know what to do with anyway. Users already have most of what they need from the Internet.

The big new feature in every operating system is a bunch of useless junk nobody really wants and is seeing remarkably little uptake. Social media and smartphones changed the world, true, but… those are both innovations from 2008. They’re just not new any more.

So we are all — collectively, culturally — looking for the Next Big Thing, and we keep not finding it.

It wasn’t 3D printing. It wasn’t crowdfunding. It wasn’t smart watches. It wasn’t VR. It wasn’t the Metaverse, it wasn’t Bitcoin, it wasn’t NFTs¹.

It’s also not AI, but this is why so many people assume that it will be AI. Because it’s got to be something, right? If it’s got to be something then AI is as good a guess as anything else right now.

The fact is, our lifetimes have been an extreme anomaly. Things like the Internet used to come along every thousand years or so, and while we might expect that the pace will stay a bit higher than that, it is not reasonable to expect that something new like “personal computers” or “the Internet”³ will arrive again.

We are not going to get rich by getting in on the ground floor of the next Apple or the next Google because the next Apple and the next Google are Apple and Google. The industry is maturing. Software technology, computer technology, and internet technology are all maturing.

There Will Be Next Things

Research and development is happening in all fields all the time. Amazing new developments quietly and regularly occur in pharmaceuticals and in materials science. But these are not predictable. They do not inhabit the public consciousness until they’ve already happened, and they are rarely so profound and transformative that they change everybody’s life.

There will even be new things in the computer industry, both software and hardware. Foldable phones do address a real problem (I wish the screen were even bigger but I don’t want to carry around such a big device), and would probably be more popular if they got the costs under control. One day somebody’s going to crack the problem of volumetric displays, probably. Some VR product will probably, eventually, hit a more realistic price/performance ratio where the niche will expand at least a little more.

Maybe there will even be something genuinely useful, which is recognizably adjacent to the current “AI” fad, but if it is, it will be some new development that we haven’t seen yet. If current AI technology were sufficient to drive some interesting product, it would already be doing it, not using marketing disguised as science to conceal diminishing returns on current investments.

But They Will Not Be Big

The impulse to find the One Big Thing that will dominate the next five years is a fool’s errand. Incremental gains are diminishing across the board. The markets for time and attention² are largely saturated. There’s no need for another streaming service if 100% of your leisure time is already committed to TikTok, YouTube and Netflix; famously, Netflix has already considered sleep its primary competitor for close to a decade - years before the pandemic.

Those rare tech markets which aren’t saturated are suffering from pedestrian economic problems like wealth inequality, not technological bottlenecks.

For example, the thing preventing the development of a robot that can do your laundry and your dishes without your input is not necessarily that we couldn’t build something like that, but that most households just can’t afford it without wage growth catching up to productivity growth. It doesn’t make sense for anyone to commit to the substantial R&D investment that such a thing would take, if the market doesn’t exist because the average worker isn’t paid enough to afford it on top of all the other tech which is already required to exist in society.

The projected income from the tiny, wealthy sliver of the population who could pay for the hardware, cannot justify an investment in the software past a fake version remotely operated by workers in the global south, only made possible by Internet wage arbitrage, i.e. a more palatable, modern version of indentured servitude.

Even if we were to accept the premise of an actually-“AI” version of this, that is still just a wish that ChatGPT could somehow improve enough behind the scenes to replace that worker, not any substantive investment in a novel, proprietary-to-the-chores-robot software system which could reliably perform specific functions.

What, Then?

The expectation for, and lack of, a “big thing” is a big problem. There are others who could describe its economic, political, and financial dimensions better than I can. So then let me speak to my expertise and my audience: open source software developers.

When I began my own involvement with open source, a big part of the draw for me was participating in a low-cost (to the corporate developer) but high-value (to society at large) positive externality. None of my employers would ever have cared about many of the applications for which Twisted forms a core bit of infrastructure; nor would I have been able to predict those applications’ existence. Yet, it is nice to have contributed to their development, even a little bit.

However, it’s not actually a positive externality if the public at large can’t directly benefit from it.

When real world-changing, disruptive developments are occurring, the bean-counters are not watching positive externalities too closely. As we discovered with many of the other benefits that temporarily accrued to labor in the tech economy, Open Source that is usable by individuals and small companies may have been a ZIRP. If you know you’re gonna make a billion dollars you’re not going to worry about giving away a few hundred thousand here and there.

When gains are smaller and harder to realize, and margins are starting to get squeezed, it’s harder to justify the investment in vaguely good vibes.

But this, itself, is not a call to action. I doubt very much that anyone reading this can do anything about the macroeconomic reality of higher interest rates. The technological reality of “development is happening slower” is inherently something that you can’t change on purpose.

However, what we can do is to be aware of this trend in our own work.

Fight Scale Creep

It seems to me that more and more open source infrastructure projects are tools for hyper-scale application development, only relevant to massive cloud companies. This is just a subjective assessment on my part — I’m not sure what tools even exist today to measure this empirically — but I remember a big part of the open source community when I was younger being things like Inkscape, Themes.Org and Slashdot, not React, Docker Hub and Hacker News.

This is not to say that the hobbyist world no longer exists. There is of course a ton of stuff going on with Raspberry Pi, Home Assistant, OwnCloud, and so on. If anything there’s a bit of a resurgence of self-hosting. But the interests of self-hosters and corporate developers are growing apart; there seems to be far less of a beneficial overflow from corporate infrastructure projects into these enthusiast or prosumer communities.

This is the concrete call to action: if you are employed in any capacity as an open source maintainer, dedicate more energy to medium- or small-scale open source projects.

If your assumption is that you will eventually reach a hyper-scale inflection point, then mimicking Facebook and Netflix is likely to be a good idea. However, if we can all admit to ourselves that we’re not going to achieve a trillion-dollar valuation and a hundred thousand engineer headcount, we can begin to consider ways to make our Next Thing a bit smaller, and to accommodate the world as it is rather than as we wish it would be.

Be Prepared to Scale Down

Here are some design guidelines you might consider, for just about any open source project, particularly infrastructure ones:

Don’t assume that your software can sustain an arbitrarily large fixed overhead because “you just pay that cost once” and you’re going to be running a billion instances so it will always amortize; maybe you’re only going to be running ten.
Remember that such fixed overhead includes not just CPU, RAM, and filesystem storage, but also the learning curve for developers. Front-loading a massive amount of conceptual complexity to accommodate the problems of hyper-scalers is a common mistake. Try to smooth out these complexities and introduce them only when necessary.
Test your code on edge devices. This means supporting Windows and macOS, and even Android and iOS. If you want your tool to help empower individual users, you will need to meet them where they are, which is not on an EC2 instance.
This includes considering Desktop Linux as a platform, as opposed to Server Linux as a platform, which (while they certainly have plenty in common) they are also distinct in some details. Consider the highly specific example of secret storage: if you are writing something that intends to live in a cloud environment, and you need to configure it with a secret, you will probably want to provide it via a text file or an environment variable. By contrast, if you want this same code to run on a desktop system, your users will expect you to support the Secret Service. This will likely only require a few lines of code to accommodate, but it is a massive difference to the user experience.
Don’t rely on LLMs remaining cheap or free. If you have LLM-related features⁴, make sure that they are sufficiently severable from the rest of your offering that if ChatGPT starts costing $1000 a month, your tool doesn’t break completely. Similarly, do not require that your users have easy access to half a terabyte of VRAM and a rack full of 5090s in order to run a local model.

Even if you were going to scale up to infinity, the ability to scale down and consider smaller deployments means that you can run more comfortably on, for example, a developer’s laptop. So even if you can’t convince your employer that this is where the economy and the future of technology in our lifetimes is going, it can be easy enough to justify this sort of design shift, particularly as individual choices. Make your onboarding cheaper, your development feedback loops tighter, and your systems generally more resilient to economic headwinds.

So, please design your open source libraries, applications, and services to run on smaller devices, with less complexity. It will be worth your time as well as your users’.

But if you can fix the whole wealth inequality thing, do that first.

Acknowledgments

These sorts of lists are pretty funny reads, in retrospect. ↩
Which is to say, “distraction”. ↩
... or even their lesser-but-still-profound aftershocks like “Social Media”, “Smartphones”, or “On-Demand Streaming Video” ... secondary manifestations of the underlying innovation of a packet-switched global digital network ... ↩
My preference would of course be that you just didn’t have such features at all, but perhaps even if you agree with me, you are part of an organization with some mandate to implement LLM stuff. Just try not to wrap the chain of this anchor all the way around your code’s neck. ↩

The “Dependency Cutout” Workflow Pattern, Part I

2025-11-10T17:44:00-08:00

Tell me if you’ve heard this one before.

You’re working on an application. Let’s call it “FooApp”. FooApp has a dependency on an open source library, let’s call it “LibBar”. You find a bug in LibBar that affects FooApp.

To envisage the best possible version of this scenario, let’s say you actively like LibBar, both technically and socially. You’ve contributed to it in the past. But this bug is causing production issues in FooApp today, and LibBar’s release schedule is quarterly. FooApp is your job; LibBar is (at best) your hobby. Blocking on the full upstream contribution cycle and waiting for a release is an absolute non-starter.

What do you do?

There are a few common reactions to this type of scenario, all of which are bad options.

I will enumerate them specifically here, because I suspect that some of them may resonate with many readers:

Find an alternative to LibBar, and switch to it.

This is a bad idea because a transition to a core infrastructure component could be extremely expensive.
Vendor LibBar into your codebase and fix your vendored version.

This is a bad idea because carrying this one fix now requires you to maintain all the tooling associated with a monorepo¹: you have to be able to start pulling in new versions from LibBar regularly, reconcile your changes even though you now have a separate version history on your imported version, and so on.
Monkey-patch LibBar to include your fix.

This is a bad idea because you are now extremely tightly coupled to a specific version of LibBar. By modifying LibBar internally like this, you’re inherently violating its compatibility contract, in a way which is going to be extremely difficult to test. You can test this change, of course, but as LibBar changes, you will need to replicate any relevant portions of its test suite (which may be its entire test suite) in FooApp. Lots of potential duplication of effort there.
Implement a workaround in your own code, rather than fixing it.

This is a bad idea because you are distorting the responsibility for correct behavior. LibBar is supposed to do LibBar’s job, and unless you have a full wrapper for it in your own codebase, other engineers (including “yourself, personally”) might later forget to go through the alternate, workaround codepath, and invoke the buggy LibBar behavior again in some new place.
Implement the fix upstream in LibBar anyway, because that’s the Right Thing To Do, and burn credibility with management while you anxiously wait for a release with the bug in production.

This is a bad idea because you are betraying your users — by allowing the buggy behavior to persist — for the workflow convenience of your dependency providers. Your users are probably giving you money, and trusting you with their data. This means you have both ethical and economic obligations to consider their interests.

As much as it’s nice to participate in the open source community and take on an appropriate level of burden to maintain the commons, this cannot sustainably be at the explicit expense of the population you serve directly.

Even if we only care about the open source maintainers here, there’s still a problem: as you are likely to come under immediate pressure to ship your changes, you will inevitably relay at least a bit of that stress to the maintainers. Even if you try to be exceedingly polite, the maintainers will know that you are coming under fire for not having shipped the fix yet, and are likely to feel an even greater burden of obligation to ship your code fast.

Much as it’s good to contribute the fix, it’s not great to put this on the maintainers.

The respective incentive structures of software development — specifically, of corporate application development and open source infrastructure development — make options 1-4 very common.

On the corporate / application side, these issues are:

it’s difficult for corporate developers to get clearance to spend even small amounts of their work hours on upstream open source projects, but clearance to spend time on the project they actually work on is implicit. If it takes 3 hours of wrangling with Legal² and 3 hours of implementation work to fix the issue in LibBar, but 0 hours of wrangling with Legal and 40 hours of implementation work in FooApp, a FooApp developer will often perceive it as “easier” to fix the issue downstream.
it’s difficult for corporate developers to get clearance from management to spend even small amounts of money sponsoring upstream reviewers, so even if they can find the time to contribute the fix, chances are high that it will remain stuck in review unless they are personally well-integrated members of the LibBar development team already.
even assuming there’s zero pressure whatsoever to avoid open sourcing the upstream changes, there’s still the fact inherent to any development team that FooApp’s developers will be more familiar with FooApp’s codebase and development processes than they are with LibBar’s. It’s just easier to work there, even if all other things are equal.
systems for tracking risk from open source dependencies often lack visibility into vendoring, particularly if you’re doing a hybrid approach and only vendoring a few things to address work in progress, rather than a comprehensive and disciplined approach to a monorepo. If you fully absorb a vendored dependency and then modify it, Dependabot isn’t going to tell you that a new version is available any more, because it won’t be present in your dependency list. Organizationally this is bad of course but from the perspective of an individual developer this manifests mostly as fewer annoying emails.

But there are problems on the open source side as well. Those problems are all derived from one big issue: because we’re often working with relatively small sums of money, it’s hard for upstream open source developers to consume either money or patches from application developers. It’s nice to say that you should contribute money to your dependencies, and you absolutely should, but the cost-benefit function is discontinuous. Before a project reaches the fiscal threshold where it can be at least one person’s full-time job to worry about this stuff, there’s often no-one responsible in the first place. Developers will therefore gravitate to the issues that are either fun, or relevant to their own job.

These mutually-reinforcing incentive structures are a big reason that users of open source infrastructure, even teams who work at corporate users with zillions of dollars, don’t reliably contribute back.

The Answer We Want

All those options are bad. If we had a good option, what would it look like?

It is both practically necessary³ and morally required⁴ for you to have a way to temporarily rely on a modified version of an open source dependency, without permanently diverging.

Below, I will describe a desirable abstract workflow for achieving this goal.

Step 0: Report the Problem

Before you get started with any of these other steps, write up a clear description of the problem and report it to the project as an issue; specifically, in contrast to writing it up as a pull request. Describe the problem before submitting a solution.

You may not be able to wait for a volunteer-run open source project to respond to your request, but you should at least tell the project what you’re planning on doing.

If you don’t hear back from them at all, you will have at least made sure to comprehensively describe your issue and strategy beforehand, which will provide some clarity and focus to your changes.

If you do hear back from them, in the worst case scenario, you may discover that a hard fork will be necessary because they don’t consider your issue valid, but even that information will save you time, if you know it before you get started. In the best case, you may get a reply from the project telling you that you’ve misunderstood its functionality and that there is already a configuration parameter or usage pattern that will resolve your problems with no new code. But in all cases, you will benefit from early coordination on what needs fixing before you get to how to fix it.

Step 1: Source Code and CI Setup

Fork the source code for your upstream dependency to a writable location where it can live at least for the duration of this one bug-fix, and possibly for the duration of your application’s use of the dependency. After all, you might want to fix more than one bug in LibBar.

You want to have a place where you can put your edits, that will be version controlled and code reviewed according to your normal development process. This probably means you’ll need to have your own main branch that diverges from your upstream’s main branch.

Remember: you’re going to need to deploy this to your production, so testing gates that your upstream only applies to final releases of LibBar will need to be applied to every commit here.

Depending on your LibBar’s own development process, this may result in slightly unusual configurations where, for example, your fixes are written against the last LibBar release tag, rather than its current⁵ main; if the project has a branch-freshness requirement, you might need two branches, one for your upstream PR (based on main) and one for your own use (based on the release branch with your changes).

Ideally for projects with really good CI and a strong “keep main release-ready at all times” policy, you can deploy straight from a development branch, but it’s good to take a moment to consider this before you get started. It’s usually easier to rebase changes from an older HEAD onto a newer one than it is to go backwards.

Speaking of CI, you will want to have your own CI system. The fact that GitHub Actions has become a de-facto lingua franca of continuous integration means that this step may be quite simple, and your forked repo can just run its own instance.

Optional Bonus Step 1a: Artifact Management

If you have an in-house artifact repository, you should set that up for your dependency too, and upload your own build artifacts to it. You can often treat your modified dependency as an extension of your own source tree and install from a GitHub URL, but if you’ve already gone to the trouble of having an in-house package repository, you can pretend you’ve taken over maintenance of the upstream package temporarily (which you kind of have) and leverage those workflows for caching and build-time savings as you would with any other internal repo.

Step 2: Do The Fix

Now that you’ve got somewhere to edit LibBar’s code, you will want to actually fix the bug.

Step 2a: Local Filesystem Setup

Before you have a production version on your own deployed branch, you’ll want to test locally, which means having both repositories in a single integrated development environment.

At this point, you will want to have a local filesystem reference to your LibBar dependency, so that you can make real-time edits, without going through a slow cycle of pushing to a branch in your LibBar fork, pushing to a FooApp branch, and waiting for all of CI to run on both.

This is useful in both directions: as you prepare the FooApp branch that makes any necessary updates on that end, you’ll want to make sure that FooApp can exercise the LibBar fix in any integration tests. As you work on the LibBar fix itself, you’ll also want to be able to use FooApp to exercise the code and see if you’ve missed anything - and this, you wouldn’t get in CI, since LibBar can’t depend on FooApp itself.

In short, you want to be able to treat both projects as an integrated development environment, with support from your usual testing and debugging tools, just as much as you want your deployment output to be an integrated artifact.

Step 2b: Branch Setup for PR

However, for continuous integration to work, you will also need to have a remote resource reference of some kind from FooApp’s branch to LibBar. You will need 2 pull requests: the first to land your LibBar changes to your internal LibBar fork and make sure it’s passing its own tests, and then a second PR to switch your LibBar dependency from the public repository to your internal fork.

At this step it is very important to ensure that there is an issue filed on your own internal backlog to drop your LibBar fork. You do not want to lose track of this work; it is technical debt that must be addressed.

Until it’s addressed, automated tools like Dependabot will not be able to apply security updates to LibBar for you; you’re going to need to manually integrate every upstream change. This type of work is itself very easy to drop or lose track of, so you might just end up stuck on a vulnerable version.

Step 3: Deploy Internally

Now that you’re confident that the fix will work, and that your temporarily-internally-maintained version of LibBar isn’t going to break anything on your site, it’s time to deploy.

Some deployment heritage should help to provide some evidence that your fix is ready to land in LibBar, but at the next step, please remember that your production environment isn’t necessarily emblematic of that of all LibBar users.

Step 4: Propose Externally

You’ve got the fix, you’ve tested the fix, you’ve got the fix in your own production, you’ve told upstream you want to send them some changes. Now, it’s time to make the pull request.

You’re likely going to get some feedback on the PR, even if you think it’s already ready to go; as I said, despite having been proven in your production environment, you may get feedback about additional concerns from other users that you’ll need to address before LibBar’s maintainers can land it.

As you process the feedback, make sure that each new iteration of your branch gets re-deployed to your own production. It would be a huge bummer to go through all this trouble, and then end up unable to deploy the next publicly released version of LibBar within FooApp because you forgot to test that your responses to feedback still worked on your own environment.

Step 4a: Hurry Up And Wait

If you’re lucky, upstream will land your changes to LibBar. But, there’s still no release version available. Here, you’ll have to stay in a holding pattern until upstream can finalize the release on their end.

Depending on some particulars, it might make sense at this point to archive your internal LibBar repository and move your pinned release version to a git hash of the LibBar version where your fix landed, in their repository.

Before you do this, check in with the LibBar core team and make sure that they understand that’s what you’re doing and they don’t have any wacky workflows which may involve rebasing or eliding that commit as part of their release process.

Step 5: Unwind Everything

Finally, you eventually want to stop carrying any patches and move back to an official released version that integrates your fix.

You want to do this because this is what the upstream will expect when you are reporting bugs. Part of the benefit of using open source is benefiting from the collective work to do bug-fixes and such, so you don’t want to be stuck off on a pinned git hash that the developers do not support for anyone else.

As I said in step 2b⁶, make sure to maintain a tracking task for doing this work, because leaving this sort of relatively easy-to-clean-up technical debt lying around is something that can potentially create a lot of aggravation for no particular benefit. Make sure to put your internal LibBar repository into an appropriate state at this point as well.

Up Next

This is part 1 of a 2-part series. In part 2, I will explore in depth how to execute this workflow specifically for Python packages, using some popular tools. I’ll discuss my own workflow, standards like PEP 517 and pyproject.toml, and of course, by the popular demand that I just know will come, uv.

Acknowledgments

if you already have all the tooling associated with a monorepo, including the ability to manage divergence and reintegrate patches with upstream, you already have the higher-overhead version of the workflow I am going to propose, so, never mind. but chances are you don’t have that, very few companies do. ↩
In any business where one must wrangle with Legal, 3 hours is a wildly optimistic estimate. ↩
c.f. @mcc@mastodon.social ↩
c.f. @geofft@mastodon.social ↩
In an ideal world every project would keep its main branch ready to release at all times, no matter what but we do not live in an ideal world. ↩
In this case, there is no question. It’s 2b only, no not-2b. ↩

The Futzing Fraction

2025-08-15T00:51:00-07:00

The most optimistic vision of generative AI¹ is that it will relieve us of the tedious, repetitive elements of knowledge work so that we can get to work on the really interesting problems that such tedium stands in the way of. Even if you fully believe in this vision, it’s hard to deny that today, some tedium is associated with the process of using generative AI itself.

Generative AI also isn’t free, and so, as responsible consumers, we need to ask: is it worth it? What’s the ROI of genAI, and how can we tell? In this post, I’d like to explore a logical framework for evaluating genAI expenditures, to determine if your organization is getting its money’s worth.

Perpetually Proffering Permuted Prompts

I think most LLM users would agree with me that a typical workflow with an LLM rarely involves prompting it only one time and getting a perfectly useful answer that solves the whole problem.

Generative AI best practices, even from the most optimistic vendors all suggest that you should continuously evaluate everything. ChatGPT, which is really the only genAI product with significantly scaled adoption, still says at the bottom of every interaction:

ChatGPT can make mistakes. Check important info.

If we have to “check important info” on every interaction, it stands to reason that even if we think it’s useful, some of those checks will find an error. Again, if we think it’s useful, presumably the next thing to do is to perturb our prompt somehow, and issue it again, in the hopes that the next invocation will, by dint of either:

better luck this time with the stochastic aspect of the inference process,
enhanced application of our skill to engineer a better prompt based on the deficiencies of the current inference, or
better performance of the model by populating additional context in subsequent chained prompts.

Unfortunately, given the relative lack of reliable methods to re-generate the prompt and receive a better answer², checking the output and re-prompting the model can feel like just kinda futzing around with it. You try, you get a wrong answer, you try a few more times, eventually you get the right answer that you wanted in the first place. It’s a somewhat unsatisfying process, but if you get the right answer eventually, it does feel like progress, and you didn’t need to use up another human’s time.

In fact, the hottest buzzword of the last hype cycle is “agentic”. While I have my own feelings about this particular word³, its current practical definition is “a generative AI system which automates the process of re-prompting itself, by having a deterministic program evaluate its outputs for correctness”.

A better term for an “agentic” system would be a “self-futzing system”.

However, the ability to automate some level of checking and re-prompting does not mean that you can fully delegate tasks to an agentic tool, either. It is, plainly put, not safe. If you leave the AI on its own, you will get terrible results that will at best make for a funny story⁴⁵ and at worst might end up causing serious damage⁶⁷.

Taken together, this all means that for any consequential task that you want to accomplish with genAI, you need an expert human in the loop. The human must be capable of independently doing the job that the genAI system is being asked to accomplish.

When the genAI guesses correctly and produces usable output, some of the human’s time will be saved. When the genAI guesses wrong and produces hallucinatory gibberish or even “correct” output that nevertheless fails to account for some unstated but necessary property such as security or scale, some of the human’s time will be wasted evaluating it and re-trying it.

Income from Investment in Inference

Let’s evaluate an abstract, hypothetical genAI system that can automate some work for our organization. To avoid implicating any specific vendor, let’s call the system “Mallory”.

Is Mallory worth the money? How can we know?

Logically, there are only two outcomes that might result from using Mallory to do our work.

We prompt Mallory to do some work; we check its work, it is correct, and some time is saved.
We prompt Mallory to do some work; we check its work, it fails, and we futz around with the result; this time is wasted.

As a logical framework, this makes sense, but ROI is an arithmetical concept, not a logical one. So let’s translate this into some terms.

In order to evaluate Mallory, let’s define the Futzing Fraction, “ $FF$ ”, in terms of the following variables:

$H$: the average amount of time a Human worker would take to do a task, unaided by Mallory
$I$: the amount of time that Mallory takes to run one Inference⁸
$C$: the amount of time that a human has to spend Checking Mallory’s output for each inference
$P$: the Probability that Mallory will produce a correct inference for each prompt
$W$: the average amount of time that it takes for a human to Write one prompt for Mallory
$E$: since we are normalizing everything to time, rather than money, we do also have to account for the dollar of Mallory as as a product, so we will include the Equivalent amount of human time we could purchase for the marginal cost of one⁹ inference.

As in last week’s example of simple ROI arithmetic, we will put our costs in the numerator, and our benefits in the denominator.

FF = \frac{W + I + C + E}{P H}

The idea here is that for each prompt, the minimum amount of time-equivalent cost possible is $W + I + C + E$ . The user must, at least once, write a prompt, wait for inference to run, then check the output; and, of course, pay any costs to Mallory’s vendor.

If the probability of a correct answer is $P = \frac{1}{3}$ , then they will do this entire process 3 times¹⁰, so we put $P$ in the denominator. Finally, we divide everything by $H$ , because we are trying to determine if we are actually saving any time or money, versus just letting our existing human, who has to be driving this process anyway, do the whole thing.

If the Futzing Fraction evaluates to a number greater than 1, as previously discussed, you are a bozo; you’re spending more time futzing with Mallory than getting value out of it.

Figuring out the Fraction is Frustrating

In order to even evaluate the value of the Futzing Fraction though, you have to have a sound method to even get a vague sense of all the terms.

If you are a business leader, a lot of this is relatively easy to measure. You vaguely know what $H$ is, because you know what your payroll costs, and similarly, you can figure out $E$ with some pretty trivial arithmetic based on Mallory’s pricing table. There are endless YouTube channels, spec sheets and benchmarks to give you $I$ . $W$ is probably going to be so small compared to $H$ that it hardly merits consideration¹¹.

But, are you measuring $C$ ? If your employees are not checking the outputs of the AI, you’re on a path to catastrophe that no ROI calculation can capture, so it had better be greater than zero.

Are you measuring $P$ ? How often does the AI get it right on the first try?

Challenges to Computing Checking Costs

In the fraction defined above, the term $C$ is going to be large. Larger than you think.

Measuring $P$ and $C$ with a high degree of precision is probably going to be very hard; possibly unreasonably so, or too expensive¹² to bother with in practice. So you will undoubtedly need to work with estimates and proxy metrics. But you have to be aware that this is a problem domain where your normal method of estimating is going to be extremely vulnerable to inherent cognitive bias, and find ways to measure.

Margins, Money, and Metacognition

First let’s discuss cognitive and metacognitive bias.

My favorite cognitive bias is the availability heuristic and a close second is its cousin salience bias. Humans are empirically predisposed towards noticing and remembering things that are more striking, and to overestimate their frequency.

If you are estimating the variables above based on the vibe that you’re getting from the experience of using an LLM, you may be overestimating its utility.

Consider a slot machine.

If you put a dollar in to a slot machine, and you lose that dollar, this is an unremarkable event. Expected, even. It doesn’t seem interesting. You can repeat this over and over again, a thousand times, and each time it will seem equally unremarkable. If you do it a thousand times, you will probably get gradually more anxious as your sense of your dwindling bank account becomes slowly more salient, but losing one more dollar still seems unremarkable.

If you put a dollar in a slot machine and it gives you a thousand dollars, that will probably seem pretty cool. Interesting. Memorable. You might tell a story about this happening, but you definitely wouldn’t really remember any particular time you lost one dollar.

Luckily, when you arrive at a casino with slot machines, you probably know well enough to set a hard budget in the form of some amount of physical currency you will have available to you. The odds are against you, you’ll probably lose it all, but any responsible gambler will have an immediate, physical representation of their balance in front of them, so when they have lost it all, they can see that their hands are empty, and can try to resist the “just one more pull” temptation, after hitting that limit.

Now, consider Mallory.

If you put ten minutes into writing a prompt, and Mallory gives a completely off-the-rails, useless answer, and you lose ten minutes, well, that’s just what using a computer is like sometimes. Mallory malfunctioned, or hallucinated, but it does that sometimes, everybody knows that. You only wasted ten minutes. It’s fine. Not a big deal. Let’s try it a few more times. Just ten more minutes. It’ll probably work this time.

If you put ten minutes into writing a prompt, and it completes a task that would have otherwise taken you 4 hours, that feels amazing. Like the computer is magic! An absolute endorphin rush.

Very memorable. When it happens, it feels like $P = 1$ .

But... did you have a time budget before you started? Did you have a specified N such that “I will give up on Mallory as soon as I have spent N minutes attempting to solve this problem with it”? When the jackpot finally pays out that 4 hours, did you notice that you put 6 hours worth of 10-minute prompt coins into it in?

If you are attempting to use the same sort of heuristic intuition that probably works pretty well for other business leadership decisions, Mallory’s slot-machine chat-prompt user interface is practically designed to subvert those sensibilities. Most business activities do not have nearly such an emotionally variable, intermittent reward schedule. They’re not going to trick you with this sort of cognitive illusion.

Thus far we have been talking about cognitive bias, but there is a metacognitive bias at play too: while Dunning-Kruger, everybody’s favorite metacognitive bias does have some problems with it, the main underlying metacognitive bias is that we tend to believe our own thoughts and perceptions, and it requires active effort to distance ourselves from them, even if we know they might be wrong.

This means you must assume any intuitive estimate of $C$ is going to be biased low; similarly $P$ is going to be biased high. You will forget the time you spent checking, and you will underestimate the number of times you had to re-check.

To avoid this, you will need to decide on a Ulysses pact to provide some inputs to a calculation for these factors that you will not be able to able to fudge if they seem wrong to you.

Problematically Plausible Presentation

Another nasty little cognitive-bias landmine for you to watch out for is the authority bias, for two reasons:

People will tend to see Mallory as an unbiased, external authority, and thereby see it as more of an authority than a similarly-situated human¹³.
Being an LLM, Mallory will be overconfident in its answers¹⁴.

The nature of LLM training is also such that commonly co-occurring tokens in the training corpus produce higher likelihood of co-occurring in the output; they’re just going to be closer together in the vector-space of the weights; that’s, like, what training a model is, establishing those relationships.

If you’ve ever used an heuristic to informally evaluate someone’s credibility by listening for industry-specific shibboleths or ways of describing a particular issue, that skill is now useless. Having ingested every industry’s expert literature, commonly-occurring phrases will always be present in Mallory’s output. Mallory will usually sound like an expert, but then make mistakes at random.¹⁵.

While you might intuitively estimate $C$ by thinking “well, if I asked a person, how could I check that they were correct, and how long would that take?” that estimate will be extremely optimistic, because the heuristic techniques you would use to quickly evaluate incorrect information from other humans will fail with Mallory. You need to go all the way back to primary sources and actually fully verify the output every time, or you will likely fall into one of these traps.

Mallory Mangling Mentorship

So far, I’ve been describing the effect Mallory will have in the context of an individual attempting to get some work done. If we are considering organization-wide adoption of Mallory, however, we must also consider the impact on team dynamics. There are a number of possible potential side effects that one might consider when looking at, but here I will focus on just one that I have observed.

I have a cohort of friends in the software industry, most of whom are individual contributors. I’m a programmer who likes programming, so are most of my friends, and we are also (sigh), charitably, pretty solidly middle-aged at this point, so we tend to have a lot of experience.

As such, we are often the folks that the team — or, in my case, the community — goes to when less-experienced folks need answers.

On its own, this is actually pretty great. Answering questions from more junior folks is one of the best parts of a software development job. It’s an opportunity to be helpful, mostly just by knowing a thing we already knew. And it’s an opportunity to help someone else improve their own agency by giving them knowledge that they can use in the future.

However, generative AI throws a bit of a wrench into the mix.

Let’s imagine a scenario where we have 2 developers: Alice, a staff engineer who has a good understanding of the system being built, and Bob, a relatively junior engineer who is still onboarding.

The traditional interaction between Alice and Bob, when Bob has a question, goes like this:

Bob gets confused about something in the system being developed, because Bob’s understanding of the system is incorrect.
Bob formulates a question based on this confusion.
Bob asks Alice that question.
Alice knows the system, so she gives an answer which accurately reflects the state of the system to Bob.
Bob’s understanding of the system improves, and thus he will have fewer and better-informed questions going forward.

You can imagine how repeating this simple 5-step process will eventually transform Bob into a senior developer, and then he can start answering questions on his own. Making sufficient time for regularly iterating this loop is the heart of any good mentorship process.

Now, though, with Mallory in the mix, the process now has a new decision point, changing it from a linear sequence to a flow chart.

We begin the same way, with steps 1 and 2. Bob’s confused, Bob formulates a question, but then:

Bob asks Mallory that question.

Here, our path then diverges into a “happy” path, a “meh” path, and a “sad” path.

The “happy” path proceeds like so:

Mallory happens to formulate a correct answer.
Bob’s understanding of the system improves, and thus he will have fewer and better-informed questions going forward.

Great. Problem solved. We just saved some of Alice’s time. But as we learned earlier,

Mallory can make mistakes. When that happens, we will need to check important info. So let’s get checking:

Mallory happens to formulate an incorrect answer.
Bob investigates this answer.
Bob realizes that this answer is incorrect because it is inconsistent with some of his prior, correct knowledge of the system, or his investigation.
Bob asks Alice the same question; GOTO traditional interaction step 4.

On this path, Bob spent a while futzing around with Mallory, to no particular benefit. This wastes some of Bob’s time, but then again, Bob could have ended up on the happy path, so perhaps it was worth the risk; at least Bob wasn’t wasting any of Alice’s much more valuable time in the process.¹⁶

Notice that beginning at the start of step 4, we must begin allocating all of Bob’s time to $C$ , so $C$ already starts getting a bit bigger than if it were just Bob checking Mallory’s output specifically on tasks that Bob is doing.

That brings us to the “sad” path.

Mallory happens to formulate an incorrect answer.
Bob investigates this answer.
Bob does not realize that this answer is incorrect because he is unable to recognize any inconsistencies with his existing, incomplete knowledge of the system.
Bob integrates Mallory’s incorrect information of the system into his mental model.
Bob proceeds to make a larger and larger mess of his work, based on an incorrect mental model.
Eventually, Bob asks Alice a new, worse question, based on this incorrect understanding.
Sadly we cannot return to the happy path at this point, because now Alice must unravel the complex series of confusing misunderstandings that Mallory has unfortunately conveyed to Bob at this point. In the really sad case, Bob actually doesn’t believe Alice for a while, because Mallory seems unbiased¹⁷, and Alice has to waste even more time convincing Bob before she can simply explain to him.

Now, we have wasted some of Bob’s time, and some of Alice’s time. Everything from step 5-10 is $C$ , and as soon as Alice gets involved, we are now adding to $C$ at double real-time. If more team members are pulled in to the investigation, you are now multiplying $C$ by the number of investigators, potentially running at triple or quadruple real time.

But That’s Not All

Here I’ve presented a brief selection reasons why $C$ will be both large, and larger than you expect. To review:

Gambling-style mechanics of the user interface will interfere with your own self-monitoring and developing a good estimate.
You can’t use human heuristics for quickly spotting bad answers.
Wrong answers given to junior people who can’t evaluate them will waste more time from your more senior employees.

But this is a small selection of ways that Mallory’s output can cost you money and time. It’s harder to simplistically model second-order effects like this, but there’s also a broad range of possibilities for ways that, rather than simply checking and catching errors, an error slips through and starts doing damage. Or ways in which the output isn’t exactly wrong, but still sub-optimal in ways which can be difficult to notice in the short term.

For example, you might successfully vibe-code your way to launch a series of applications, successfully “checking” the output along the way, but then discover that the resulting code is unmaintainable garbage that prevents future feature delivery, and needs to be re-written¹⁸. But this kind of intellectual debt isn’t even specific to technical debt while coding; it can even affect such apparently genAI-amenable fields as LinkedIn content marketing¹⁹.

Problems with the Prediction of $P$

$C$ isn’t the only challenging term though. $P$ , is just as, if not more important, and just as hard to measure.

LLM marketing materials love to phrase their accuracy in terms of a percentage. Accuracy claims for LLMs in general tend to hover around 70%²⁰. But these scores vary per field, and when you aggregate them across multiple topic areas, they start to trend down. This is exactly why “agentic” approaches for more immediately-verifiable LLM outputs (with checks like “did the code work”) got popular in the first place: you need to try more than once.

Independently measured claims about accuracy tend to be quite a bit lower²¹. The field of AI benchmarks is exploding, but it probably goes without saying that LLM vendors game those benchmarks²², because of course every incentive would encourage them to do that. Regardless of what their arbitrary scoring on some benchmark might say, all that matters to your business is whether it is accurate for the problems you are solving, for the way that you use it. Which is not necessarily going to correspond to any benchmark. You will need to measure it for yourself.

With that goal in mind, our formulation of $P$ must be a somewhat harsher standard than “accuracy”. It’s not merely “was the factual information contained in any generated output accurate”, but, “is the output good enough that some given real knowledge-work task is done and the human does not need to issue another prompt”?

Surprisingly Small Space for Slip-Ups

The problem with reporting these things as percentages at all, however, is that our actual definition for $P$ is $\frac{1}{attempts}$ , where $attempts$ for any given attempt, at least, must be an integer greater than or equal to 1.

Taken in aggregate, if we succeed on the first prompt more often than not, we could end up with a $P > \frac{1}{2}$ , but combined with the previous observation that you almost always have to prompt it more than once, the practical reality is that $P$ will start at 50% and go down from there.

If we plug in some numbers, trying to be as extremely optimistic as we can, and say that we have a uniform stream of tasks, every one of which can be addressed by Mallory, every one of which:

we can measure perfectly, with no overhead
would take a human 45 minutes
takes Mallory only a single minute to generate a response
Mallory will require only 1 re-prompt, so “good enough” half the time
takes a human only 5 minutes to write a prompt for
takes a human only 5 minutes to check the result of
has a per-prompt cost of the equivalent of a single second of a human’s time

Thought experiments are a dicey basis for reasoning in the face of disagreements, so I have tried to formulate something here that is absolutely, comically, over-the-top stacked in favor of the AI optimist here.

Would that be a profitable? It sure seems like it, given that we are trading off 45 minutes of human time for 1 minute of Mallory-time and 10 minutes of human time. If we ask Python:

>>> def FF(H, I, C, P, W, E):
...     return (W + I + C + E) / (P * H)
... FF(H=45.0, I=1.0, C=5.0, P=1/2, W=5.0, E=0.01)
...
0.48933333333333334

We get a futzing fraction of about 0.4893. Not bad! Sounds like, at least under these conditions, it would indeed be cost-effective to deploy Mallory. But… realistically, do you reliably get useful, done-with-the-task quality output on the second prompt? Let’s bump up the denominator on $P$ just a little bit there, and see how we fare:

>>> FF(H=45.0, I=1.0, C=5.0, P=1/3, W=5.0, E=0.01)
0.734

Oof. Still cost-effective at 0.734, but not quite as good. Where do we cap out, exactly?

>>> from itertools import count
... for A in count(start=4):
...     print(A, result := FF(H=45.0, I=1.0, C=5.0, P=1 / A, W=5.0, E=1/60.))
...     if result > 1:
...         break
...
4 0.9792592592592594
5 1.224074074074074
>>>

With this little test, we can see that at our next iteration we are already at 0.9792, and by 5 tries per prompt, even in this absolute fever-dream of an over-optimistic scenario, with a futzing fraction of 1.2240, Mallory is now a net detriment to our bottom line.

Harm to the Humans

We are treating $H$ as functionally constant so far, an average around some hypothetical Gaussian distribution, but the distribution itself can also change over time.

Formally speaking, an increase to $H$ would be good for our fraction. Maybe it would even be a good thing; it could mean we’re taking on harder and harder tasks due to the superpowers that Mallory has given us.

But an observed increase to $H$ would probably not be good. An increase could also mean your humans are getting worse at solving problems, because using Mallory has atrophied their skills²³ and sabotaged learning opportunities²⁴²⁵. It could also go up because your senior, experienced people now hate their jobs²⁶.

For some more vulnerable folks, Mallory might just take a shortcut to all these complex interactions and drive them completely insane²⁷ directly. Employees experiencing an intense psychotic episode are famously less productive than those who are not.

This could all be very bad, if our futzing fraction eventually does head north of 1 and you need to reconsider introducing human-only workflows, without Mallory.

Abridging the Artificial Arithmetic (Alliteratively)

To reiterate, I have proposed this fraction:

FF = \frac{W + I + C + E}{P H}

which shows us positive ROI when FF is less than 1, and negative ROI when it is more than 1.

This model is heavily simplified. A comprehensive measurement program that tests the efficacy of any technology, let alone one as complex and rapidly changing as LLMs, is more complex than could be captured in a single blog post.

Real-world work might be insufficiently uniform to fit into a closed-form solution like this. Perhaps an iterated simulation with variables based on the range of values seem from your team’s metrics would give better results.

However, in this post, I want to illustrate that if you are going to try to evaluate an LLM-based tool, you need to at least include some representation of each of these terms somewhere. They are all fundamental to the way the technology works, and if you’re not measuring them somehow, then you are flying blind into the genAI storm.

I also hope to show that a lot of existing assumptions about how benefits might be demonstrated, for example with user surveys about general impressions, or by evaluating artificial benchmark scores, are deeply flawed.

Even making what I consider to be wildly, unrealistically optimistic assumptions about these measurements, I hope I’ve shown:

in the numerator, $C$ might be a lot higher than you expect,
in the denominator, $P$ might be a lot lower than you expect,
repeated use of an LLM might make $H$ go up, but despite the fact that it's in the denominator, that will ultimately be quite bad for your business.

Personally, I don’t have all that many concerns about $E$ and $I$ . $E$ is still seeing significant loss-leader pricing, and $I$ might not be coming down as fast as vendors would like us to believe, if the other numbers work out I don’t think they make a huge difference. However, there might still be surprises lurking in there, and if you want to rationally evaluate the effectiveness of a model, you need to be able to measure them and incorporate them as well.

In particular, I really want to stress the importance of the influence of LLMs on your team dynamic, as that can cause massive, hidden increases to $C$ . LLMs present opportunities for junior employees to generate an endless stream of chaff that will simultaneously:

wreck your performance review process by making them look much more productive than they are,
increase stress and load on senior employees who need to clean up unforeseen messes created by their LLM output,
and ruin their own opportunities for career development by skipping over learning opportunities.

If you’ve already deployed LLM tooling without measuring these things and without updating your performance management processes to account for the strange distortions that these tools make possible, your Futzing Fraction may be much, much greater than 1, creating hidden costs and technical debt that your organization will not notice until a lot of damage has already been done.

If you got all the way here, particularly if you’re someone who is enthusiastic about these technologies, thank you for reading. I appreciate your attention and I am hopeful that if we can start paying attention to these details, perhaps we can all stop futzing around so much with this stuff and get back to doing real work.

Acknowledgments

I do not share this optimism, but I want to try very hard in this particular piece to take it as a given that genAI is in fact helpful. ↩
If we could have a better prompt on demand via some repeatable and automatable process, surely we would have used a prompt that got the answer we wanted in the first place. ↩
The software idea of a “user agent” straightforwardly comes from the legal principle of an agent, which has deep roots in common law, jurisprudence, philosophy, and math. When we think of an agent (some software) acting on behalf of a principal (a human user), this historical baggage imputes some important ethical obligations to the developer of the agent software. genAI vendors have been as eager as any software vendor to dodge responsibility for faithfully representing the user’s interests even as there are some indications that at least some courts are not persuaded by this dodge, at least by the consumers of genAI attempting to pass on the responsibility all the way to end users. Perhaps it goes without saying, but I’ll say it anyway: I don’t like this newer interpretation of “agent”. ↩
“Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents”, Axel Backlund, Lukas Petersson, Feb 20, 2025 ↩
“random thing are happening, maxed out usage on api keys”, @leojr94 on Twitter, Mar 17, 2025 ↩
“New study sheds light on ChatGPT’s alarming interactions with teens” ↩
“Lawyers submitted bogus case law created by ChatGPT. A judge fined them $5,000”, by Larry Neumeister for the Associated Press, June 22, 2023 ↩
During which a human will be busy-waiting on an answer. ↩
Given the fluctuating pricing of these products, and fixed subscription overhead, this will obviously need to be amortized; including all the additional terms to actually convert this from your inputs is left as an exercise for the reader. ↩
I feel like I should emphasize explicitly here that everything is an average over repeated interactions. For example, you might observe that a particular LLM has a low probability of outputting acceptable work on the first prompt, but higher probability on subsequent prompts in the same context, such that it usually takes 4 prompts. For the purposes of this extremely simple closed-form model, we’d still consider that a $P$ of 25%, even though a more sophisticated model, or a monte carlo simulation that sets progressive bounds on the probability, might produce more accurate values. ↩
No it isn’t, actually, but for the sake of argument let’s grant that it is. ↩
It’s worth noting that all this expensive measuring itself must be included in $C$ until you have a solid grounding for all your metrics, but let’s optimistically leave all of that out for the sake of simplicity. ↩
“AI Company Poll Finds 45% of Workers Trust the Tech More Than Their Peers”, by Suzanne Blake for Newsweek, Aug 13, 2025 ↩
AI Chatbots Remain Overconfident — Even When They’re Wrong by Jason Bittel for the Dietrich College of Humanities and Social Sciences at Carnegie Mellon University, July 22, 2025 ↩
AI Mistakes Are Very Different From Human Mistakes by Bruce Schneier and Nathan E. Sanders for IEEE Spectrum, Jan 13, 2025 ↩
Foreshadowing is a narrative device in which a storyteller gives an advance hint of an upcoming event later in the story. ↩
“People are worried about the misuse of AI, but they trust it more than humans” ↩
“Why I stopped using AI (as a Senior Software Engineer)”, theSeniorDev YouTube channel, Jun 17, 2025 ↩
“I was an AI evangelist. Now I’m an AI vegan. Here’s why.”, Joe McKay for the greatchatlinkedin YouTube channel, Aug 8, 2025 ↩
“What LLM is The Most Accurate?” ↩
“Study Finds That 52 Percent Of ChatGPT Answers to Programming Questions are Wrong”, by Sharon Adarlo for Futurism, May 23, 2024 ↩
“Off the Mark: The Pitfalls of Metrics Gaming in AI Progress Races”, by Tabrez Syed on BoxCars AI, Dec 14, 2023 ↩
“I tried coding with AI, I became lazy and stupid”, by Thomasorus, Aug 8, 2025 ↩
“How AI Changes Student Thinking: The Hidden Cognitive Risks” by Timothy Cook for Psychology Today, May 10, 2025 ↩
“Increased AI use linked to eroding critical thinking skills” by Justin Jackson for Phys.org, Jan 13, 2025 ↩
“AI could end my job — Just not the way I expected” by Manuel Artero Anguita on dev.to, Jan 27, 2025 ↩
“The Emerging Problem of “AI Psychosis”” by Gary Drevitch for Psychology Today, July 21, 2025. ↩

R0ML’s Ratio

2025-08-08T21:41:00-07:00

My father, also known as “R0ML” once described a methodology for evaluating volume purchases that I think needs to be more popular.

If you are a hardcore fan, you might know that he has already described this concept publicly in a talk at OSCON in 2005, among other places, but it has never found its way to the public Internet, so I’m giving it a home here, and in the process, appropriating some of his words.¹

Let’s say you’re running a circus. The circus has many clowns. Ten thousand clowns, to be precise. They require bright red clown noses. Therefore, you must acquire a significant volume of clown noses. An enterprise licensing agreement for clown noses, if you will.

If the nose plays, it can really make the act. In order to make sure you’re getting quality noses, you go with a quality vendor. You select a vendor who can supply noses for $100 each, at retail.

Do you want to buy retail? Ten thousand clowns, ten thousand noses, one hundred dollars: that’s a million bucks worth of noses, so it’s worth your while to get a good deal.

As a conscientious executive, you go to the golf course with your favorite clown accessories vendor and negotiate yourself a 50% discount, with a commitment to buy all ten thousand noses.

Is this a good deal? Should you take it?

To determine this, we will use an analytical tool called R0ML’s Ratio (RR).

The ratio has 2 terms:

the Full Undiscounted Retail List Price of Units Used (FURLPoUU), which can of course be computed by the individual retail list price of a single unit (in our case, $100) multiplied by the number of units used
the Total Price of the Entire Enterprise Volume Licensing Agreement (TPotEEVLA), which in our case is $500,000.

It is expressed as:

RR = \frac{TPotEEVLA}{FURLPoUU}

Crucially, you must be able to compute the number of units used in order to complete this ratio. If, as expected, every single clown wears their nose at least once during the period of the license agreement, then our Units Used is 10,000, our FURLPoUU is $1,000,000 and our TPotEEVLA is $500,000, which makes our RR 0.5.

Congratulations. If R0ML’s Ratio is less than 1, it’s a good deal. Proceed.

But… maybe the nose doesn’t play. Not every clown’s costume is an exact clone of the traditional, stereotypical image of a clown. Many are avant-garde. Perhaps this plentiful proboscis pledge was premature. Here, I must quote the originator of this theoretical framework directly:

What if the wheeze doesn’t please?

What if the schnozz gives some pause?

In other words: what if some clowns don’t wear their noses?

If we were to do this deal, and then ask around afterwards to find out that only 200 of our 10,000 clowns were to use their noses, then FURLPoUU comes out to 200 * $100, for a total of $20,000. In that scenario, RR is 25, which you may observe is substantially greater than 1.

If you do a deal where R0ML’s ratio is greater than 1, then you are the bozo.

I apologize if I have belabored this point. As R0ML expressed in the email we exchanged about this many years ago,

I do not mind if you blog about it — and I don't mind getting the credit — although one would think it would be obvious.

And yeah, one would think this would be obvious? But I have belabored it because many discounted enterprise volume purchasing agreements still fail the R0ML’s Ratio Bozo Test.²

In the case of clown noses, if you pay the discounted price, at least you get to keep the nose; maybe lightly-used clown noses have some resale value. But in software licensing or SaaS deals, once you’ve purchased the “discounted” software or service, once you have provisioned the “seats”, the money is gone, and if your employees don’t use it, then no value for your organization will ever result.

Measuring number of units used is very important. Without this number, you have no idea if you are a bozo or not.

It is often better to give your individual employees a corporate card and allow them to make arbitrary individual purchases of software licenses and SaaS tools, with minimal expense-reporting overhead; this will always keep R0ML’s Ratio at 1.0, and thus, you will never be a bozo.

It is always better to do that the first time you are purchasing a new software tool, because the first time making such a purchase you (almost by definition) have no information about “units used” yet. You have no idea — you cannot have any idea — if you are a bozo or not.

If you don’t know who the bozo is, it’s probably you.

Acknowledgments

Thank you for reading, and especially thank you to my patrons who are supporting my writing on this blog. Of course, extra thanks to dad for, like, having this idea and doing most of the work here beyond my transcription. If you like my dad’s ideas and you’d like to post more of them, or you’d like to support my various open-source endeavors, you can support my work as a sponsor!

One of my other favorite posts on this blog was just stealing another one of his ideas, so hopefully this one will be good too. ↩
This concept was first developed in 2001, but it has some implications for extremely recent developments in the software industry; but that’s a post for another day. ↩

The Best Line Length

2025-08-07T22:37:00-07:00

What’s a good maximum line length for your coding standard?

This is, of course, a trick question. By posing it as a question, I have created the misleading impression that it is a question, but Black has selected the correct number for you; it’s 90 characters or so, about 10% wider than the width of a standard VT100 hardware terminal.

Thanks for reading my blog.

OK, OK. Clearly, there’s more to it than that. This is an age-old debate on the level of “tabs versus spaces”. So contentious, in fact, that even the famously opinionated Black does in fact let you change it.

Ancient History

One argument that certain silly people¹ like to make is “why are we wrapping at 80 characters like we are using 80 character teletypes, it’s the 2020s! I have an ultrawide monitor!”. The implication here is that the width of 80-character terminals is an antiquated relic, based entirely around the hardware limitations of a bygone era, and modern displays can put tons of stuff on one line, so why not use that capability?

This feels intuitively true, given the huge disparity between ancient times and now: on my own display, I can comfortably fit about 350 characters on a line. What a shame, to have so much room for so many characters in each line, and to waste it all on blank space!

But... is that true?

I stretched out my editor window all the way to measure that ‘350’ number, but I did not continue editing at that window width. In order to have a more comfortable editing experience, I switched back into writeroom mode, a mode which emulates a considerably more writerly application, which limits each line length to 92 characters, regardless of frame width.

You’ve probably noticed this too. Almost all sites that display prose of any kind limit their width, even on very wide screens.

As silly as that tiny little ribbon of text running down the middle of your monitor might look with a full-screened stereotypical news site or blog, if you full-screen a site that doesn’t set that width-limit, although it makes sense that you can now use all that space up, it will look extremely, almost unreadably bad.

Blogging software does not set a column width limit on your text because of some 80-character-wide accident of history in the form of a hardware terminal.

Similarly, if you really try to use that screen real estate to its fullest for coding, and start editing 200-300 character lines, you’ll quickly notice it starts to feel just a bit weird and confusing. It gets surprisingly easy to lose your place. Rhetorically the “80 characters is just because of dinosaur technology! Use all those ultrawide pixels!” talking point is quite popular, but practically people usually just want a few more characters worth of breathing room, maxing out at 100 characters, far narrower than even the most svelte widescreen.

So maybe those 80 character terminals are holding us back a little bit, but... wait a second. Why were the terminals 80 characters wide in the first place?

Ancienter History

As this lovely Software Engineering Stack Exchange post summarizes, terminals were probably 80 characters because teletypes were 80 characters, and teletypes were probably 80 characters because punch cards were 80 characters, and punch cards were probably 80 characters because that’s just about how many typewritten characters fit onto one line of a US-Letter piece of paper.

Even before typewriters, consider the average newspaper: why do we call a regularly-occurring featured article in a newspaper a “column”? Because broadsheet papers were too wide to have only a single column; they would always be broken into multiple! Far more aggressive than 80 characters, columns in newspapers typically have 30 characters per line.

The first newspaper printing machines were custom designed and could have used whatever width they wanted, so why standardize on something so narrow?³

Science!

There has been a surprising amount of scientific research around this issue, but in brief, there’s a reason here rooted in human physiology: when you read a block of text, you are not consciously moving your eyes from word to word like you’re dragging a mouse cursor, repositioning continuously. Human eyes reading text move in quick bursts of rotation called “saccades”. In order to quickly and accurately move from one line of text to another, the start of the next line needs to be clearly visible in the reader’s peripheral vision in order for them to accurately target it. This limits the angle of rotation that the reader can perform in a single saccade, and, thus, the length of a line that they can comfortably read without hunting around for the start of the next line every time they get to the end.

So, 80 (or 80 plus about 10%) characters isn’t too unreasonable for a limit. It’s longer than 30 characters, that’s for sure!

But, surely that’s not all, or this wouldn’t be so contentious in the first place?

Caveats

The screen is wide, though.

The ultrawide aficionados do have a point, even if it’s not really the simple one about “old terminals” they originally thought. Our modern wide-screen displays are criminally underutilized, particularly for text. Even adding in the big chunky file, class, and method tree browser over on the left and the source code preview on the right, a brief survey of a Google Image search for “vs code” shows a lot of editors open with huge, blank areas on the right side of the window.

Big screens are super useful as they allow us to leverage our spatial memories to keep more relevant code around and simply glance around as we think, rather than navigate interactively. But it only works if you remember to do it.

Newspapers allowed us to read a ton of information in one sitting with minimum shuffling by packing in as much as 6 columns of text. You could read a column to the bottom of the page, back to the top, and down again, several times.

Similarly, books fill both of their opposed pages with text at the same time, doubling the amount of stuff you can read at once before needing to turn the page.

You may notice that reading text in a book, even in an ebook app, is more comfortable than reading a ton of text by scrolling around in a web browser. That’s because our eyes are built for saccades, and repeatedly tracking the continuous smooth motion of the page as it scrolls to a stop, then re-targeting the new fixed location to start saccading around from, is literally more physically strenuous on your eye’s muscles!

There’s a reason that the codex was a big technological innovation over the scroll. This is a regression!

Today, the right thing to do here is to make use of horizontally split panes in your text editor or IDE, and just make a bit of conscious effort to set up the appropriate code on screen for the problem you’re working on. However, this is a potential area for different IDEs to really differentiate themselves, and build multi-column continuous-code-reading layouts that allow for buffers to wrap and be navigable newspaper-style.

Similar, modern CSS has shockingly good support for multi-column layouts, and it’s a shame that true multi-column, page-turning layouts are so rare. If I ever figure out a way to deploy this here that isn’t horribly clunky and fighting modern platform conventions like “scrolling horizontally is substantially more annoying and inconsistent than scrolling vertically” maybe I will experiment with such a layout on this blog one day. Until then… just make the browser window narrower so other useful stuff can be in the other parts of the screen, I guess.

Code Isn’t Prose

But, I digress. While I think that columnar layouts for reading prose are an interesting thing more people should experiment with, code isn’t prose.

The metric used for ideal line width, which you may have noticed if you clicked through some of those Wikipedia links earlier, is not “character cells in your editor window”, it is characters per line, or “CPL”.

With an optimal CPL somewhere between 45 and 95, a code-line-width of somewhere around 90 might actually be the best idea, because whitespace uses up your line-width budget. In a typical object-oriented Python program², most of your code ends up indented by at least 8 spaces: 4 for the class scope, 4 for the method scope. Most likely a lot of it is 12, because any interesting code will have at least one conditional or loop. So, by the time you’re done wasting all that horizontal space, a max line length of 90 actually looks more like a maximum of 78... right about that sweet spot from the US-Letter page in the typewriter that we started with.

What about soft-wrap?

In principle, source code is structured information, whose presentation could be fully decoupled from its serialized representation. Everyone could configure their preferred line width appropriate to their custom preferences and the specific physiological characteristics of their eyes, and the code could be formatted according to the language it was expressed in, and “hard wrapping” could be a silly antiquated thing.

The problem with this argument is the same as the argument against “but tabs are semantic indentation”, to wit: nope, no it isn’t. What “in principle” means in the previous paragraph is actually “in a fantasy world which we do not inhabit”. I’d love it if editors treated code this way and we had a rich history and tradition of structured manipulations rather than typing in strings of symbols to construct source code textually. But that is not the world we live in. Hard wrapping is unfortunately necessary to integrate with diff tools.

So what’s the optimal line width?

The exact, specific number here is still ultimately a matter of personal preference.

Hopefully, understanding the long history, science, and underlying physical constraints can lead you to select a contextually appropriate value for your own purposes that will balance ease of reading, integration with the relevant tools in your ecosystem, diff size, presentation in the editors and IDEs that your contributors tend to use, reasonable display in web contexts, on presentation slides, and so on.

But — and this is important — counterpoint:

No it isn’t, you don’t need to select an optimal width, because it’s already been selected for you. It is the full width of a standard VT100 terminal plus 10%.

Acknowledgments

Thank you for reading, and especially thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support my work as a sponsor!

I love the fact that this message is, itself, hard-wrapped to 77 characters. ↩
Let’s be honest; we’re all object-oriented python programmers here, aren’t we? ↩
Unsurprisingly, there are also financial reasons. More, narrower columns meant it was easier to fix typesetting errors and to insert more advertisements as necessary. But readability really did have a lot to do with it, too; scientists were looking at ease of reading as far back as the 1800s. ↩

I Think I’m Done Thinking About genAI For Now

2025-06-04T22:22:00-07:00

The Problem

Like many other self-styled thinky programmer guys, I like to imagine myself as a sort of Holmesian genius, making trenchant observations, collecting them, and then synergizing them into brilliant deductions with the keen application of my powerful mind.

However, several years ago, I had an epiphany in my self-concept. I finally understood that, to the extent that I am usefully clever, it is less in a Holmesian idiom, and more, shall we say, Monkesque.

For those unfamiliar with either of the respective franchises:

Holmes is a towering intellect honed by years of training, who catalogues intentional, systematic observations and deduces logical, factual conclusions from those observations.
Monk, on the other hand, while also a reasonably intelligent guy, is highly neurotic, wracked by unresolved trauma and profound grief. As both a consulting job and a coping mechanism, he makes a habit of erratically wandering into crime scenes, and, driven by a carefully managed jenga tower of mental illnesses, leverages his dual inabilities to solve crimes. First, he is unable to filter out apparently inconsequential details, building up a mental rat’s nest of trivia about the problem; second, he is unable to let go of any minor incongruity, obsessively ruminating on the collection of facts until they all make sense in a consistent timeline.

Perhaps surprisingly, this tendency serves both this fictional wretch of a detective, and myself, reasonably well. I find annoying incongruities in abstractions and I fidget and fiddle with them until I end up building something that a lot of people like, or perhaps something that a smaller number of people get really excited about. At worst, at least I eventually understand what’s going on. This is a self-soothing activity but it turns out that, managed properly, it can very effectively soothe others as well.

All that brings us to today’s topic, which is an incongruity I cannot smooth out or fit into a logical framework to make sense. I am, somewhat reluctantly, a genAI skeptic. However, I am, even more reluctantly, exposed to genAI Discourse every damn minute of every damn day. It is relentless, inescapable, and exhausting.

This preamble about personality should hopefully help you, dear reader, to understand how I usually address problematical ideas by thinking and thinking and fidgeting with them until I manage to write some words — or perhaps a new open source package — that logically orders the ideas around it in a way which allows my brain to calm down and let it go, and how that process is important to me.

In this particular instance, however, genAI has defeated me. I cannot make it make sense, but I need to stop thinking about it anyway. It is too much and I need to give up.

My goal with this post is not to convince anyone of anything in particular — and we’ll get to why that is a bit later — but rather:

to set out my current understanding in one place, including all the various negative feelings which are still bothering me, so I can stop repeating it elsewhere,
to explain why I cannot build a case that I think should be particularly convincing to anyone else, particularly to someone who actively disagrees with me,
in so doing, to illustrate why I think the discourse is so fractious and unresolvable, and finally
to give myself, and hopefully by proxy to give others in the same situation, permission to just peace out of this nightmare quagmire corner of the noosphere.

But first, just because I can’t prove that my interlocutors are Wrong On The Internet, doesn’t mean I won’t explain why I feel like they are wrong.

The Anti-Antis

Most recently, at time of writing, there have been a spate of “the genAI discourse is bad” articles, almost exclusively written from the perspective of, not boosters exactly, but pragmatically minded (albeit concerned) genAI users, wishing for the skeptics to be more pointed and accurate in our critiques. This is anti-anti-genAI content.

I am not going to link to any of these, because, as part of their self-fulfilling prophecy about the “genAI discourse”, they’re also all bad.

Mostly, however, they had very little worthwhile to respond to because they were straw-manning their erstwhile interlocutors. They are all getting annoyed at “bad genAI criticism” while failing to engage with — and often failing to even mention — most of the actual substance of any serious genAI criticism. At least, any of the criticism that I’ve personally read.

I understand wanting to avoid a callout or Gish-gallop culture and just express your own ideas. So, I understand that they didn’t link directly to particular sources or go point-by-point on anyone else’s writing. Obviously I get it, since that’s exactly what this post is doing too.

But if you’re going to talk about how bad the genAI conversation is, without even mentioning huge categories of problem like “climate impact” or “disinformation”¹ even once, I honestly don’t know what conversation you’re even talking about. This is peak “make up a guy to get mad at” behavior, which is especially confusing in this circumstance, because there’s an absolutely huge crowd of actual people that you could already be mad at.

The people writing these pieces have historically seemed very thoughtful to me. Some of them I know personally. It is worrying to me that their critical thinking skills appear to have substantially degraded specifically after spending a bunch of time intensely using this technology which I believe has a scary risk of degrading one’s critical thinking skills. Correlation is not causation or whatever, and sure, from a rhetorical perspective this is “post hoc ergo propter hoc” and maybe a little “ad hominem” for good measure, but correlation can still be concerning.

Yet, I cannot effectively respond to these folks, because they are making a practical argument that I cannot, despite my best efforts, find compelling evidence to refute categorically. My experiences of genAI are all extremely bad, but that is barely even anecdata. Their experiences are neutral-to-positive. Little scientific data exists. How to resolve this?²

The Aesthetics

As I begin to state my own position, let me lead with this: my factual analysis of genAI is hopelessly negatively biased. I find the vast majority of the aesthetic properties of genAI to be intensely unpleasant.

I have been trying very hard to correct for this bias, to try to pay attention to the facts and to have a clear-eyed view of these systems’ capabilities. But the feelings are visceral, and the effort to compensate is tiring. It is, in fact, the desire to stop making this particular kind of effort that has me writing up this piece and trying to take an intentional break from the subject, despite its intense relevance.

When I say its “aesthetic qualities” are unpleasant, I don’t just mean the aesthetic elements of output of genAIs themselves. The aesthetic quality of genAI writing, visual design, animation and so on, while mostly atrocious, is also highly variable. There are cherry-picked examples which look… fine. Maybe even good. For years now, there have been, famously, literally award-winning aesthetic outputs of genAI³.

While I am ideologically predisposed to see any “good” genAI art as accruing the benefits of either a survivorship bias from thousands of terrible outputs or simple plagiarism rather than its own inherent quality, I cannot deny that in many cases it is “good”.

However, I am not just talking about the product, but the process; the aesthetic experience of interfacing with the genAI system itself, rather than the aesthetic experience of the outputs of that system.

I am not a visual artist and I am not really a writer⁴, particularly not a writer of fiction or anything else whose experience is primarily aesthetic. So I will speak directly to the experience of software development.

I have seen very few successful examples of using genAI to produce whole, working systems. There are no shortage of highly public miserable failures, particularly from the vendors of these systems themselves, where the outputs are confused, self-contradictory, full of subtle errors and generally unusable. While few studies exist, it sure looks like this is an automated way of producing a Net Negative Productivity Programmer, throwing out chaff to slow down the rest of the team.⁵

Juxtapose this with my aforementioned psychological motivations, to wit, I want to have everything in the computer be orderly and make sense, I’m sure most of you would have no trouble imagining that sitting through this sort of practice would make me extremely unhappy.

Despite this plethora of negative experiences, executives are aggressively mandating the use of AI⁶. It looks like without such mandates, most people will not bother to use such tools, so the executives will need muscular policies to enforce its use.⁷

Being forced to sit and argue with a robot while it struggles and fails to produce a working output, while you have to rewrite the code at the end anyway, is incredibly demoralizing. This is the kind of activity that activates every single major cause of burnout at once.

But, at least in that scenario, the thing ultimately doesn’t work, so there’s a hope that after a very stressful six month pilot program, you can go to management with a pile of meticulously collected evidence, and shut the whole thing down.

I am inclined to believe that, in fact, it doesn’t work well enough to be used this way, and that we are going to see a big crash. But that is not the most aesthetically distressing thing. The most distressing thing is that maybe it does work; if not well enough to actually do the work, at least ambiguously enough to fool the executives long-term.

This project, in particular, stood out to me as an example. Its author, a self-professed “AI skeptic” who “thought LLMs were glorified Markov chain generators that didn’t actually understand code and couldn’t produce anything novel”, did a green-field project to test this hypothesis.

Now, this particular project is not totally inconsistent with a world in which LLMs cannot produce anything novel. One could imagine that, out in the world of open source, perhaps there is enough “OAuth provider written in TypeScript” blended up into the slurry of “borrowed⁸” training data that the minor constraint of “make it work on Cloudflare Workers” is a small tweak⁹. It is not fully dispositive of the question of the viability of “genAI coding”.

But it is a data point related to that question, and thus it did make me contend with what might happen if it were actually a fully demonstrative example. I reviewed the commit history, as the author suggested. For the sake of argument, I tried to ask myself if I would like working this way. Just for clarity on this question, I wanted to suspend judgement about everything else; assuming:

the model could be created with ethically, legally, voluntarily sourced training data
its usage involved consent from labor rather than authoritarian mandates
sensible levels of energy expenditure, with minimal CO2 impact
it is substantially more efficient to work this way than to just write the code yourself

and so on, and so on… would I like to use this magic robot that could mostly just emit working code for me? Would I use it if it were free, in all senses of the word?

No. I absolutely would not.

I found the experience of reading this commit history and imagining myself using such a tool — without exaggeration — nauseating.

Unlike many programmers, I love code review. I find that it is one of the best parts of the process of programming. I can help people learn, and develop their skills, and learn from them, and appreciate the decisions they made, develop an impression of a fellow programmer’s style. It’s a great way to build a mutual theory of mind.

Of course, it can still be really annoying; people make mistakes, often can’t see things I find obvious, and in particular when you’re reviewing a lot of code from a lot of different people, you often end up having to repeat explanations of the same mistakes. So I can see why many programmers, particularly those more introverted than I am, hate it.

But, ultimately, when I review their code and work hard to provide clear and actionable feedback, people learn and grow and it’s worth that investment in inconvenience.

The process of coding with an “agentic” LLM appears to be the process of carefully distilling all the worst parts of code review, and removing and discarding all of its benefits.

The lazy, dumb, lying robot asshole keeps making the same mistakes over and over again, never improving, never genuinely reacting, always obsequiously pretending to take your feedback on board.

Even when it “does” actually “understand” and manages to load your instructions into its context window, 200K tokens later it will slide cleanly out of its memory and you will have to say it again.

All the while, it is attempting to trick you. It gets most things right, but it consistently makes mistakes in the places that you are least likely to notice. In places where a person wouldn’t make a mistake. Your brain keeps trying to develop a theory of mind to predict its behavior but there’s no mind there, so it always behaves infuriatingly randomly.

I don’t think I am the only one who feels this way.

The Affordances

Whatever our environments afford, we tend to do more of. Whatever they resist, we tend to do less of. So in a world where we were all writing all of our code and emails and blog posts and texts to each other with LLMs, what do they afford that existing tools do not?

As a weirdo who enjoys code review, I also enjoy process engineering. The central question of almost all process engineering is to continuously ask: how shall we shape our tools, to better shape ourselves?

LLMs are an affordance for producing more text, faster. How is that going to shape us?

Again arguing in the alternative here, assuming the text is free from errors and hallucinations and whatever, it’s all correct and fit for purpose, that means it reduces the pain of circumstances where you have to repeat yourself. Less pain! Sounds great; I don’t like pain.

Every codebase has places where you need boilerplate. Every organization has defects in its information architecture that require repetition of certain information rather than a link back to the authoritative source of truth. Often, these problems persist for a very long time, because it is difficult to overcome the institutional inertia required to make real progress rather than going along with the status quo. But this is often where the highest-value projects can be found. Where there’s muck, there’s brass.

The process-engineering function of an LLM, therefore, is to prevent fundamental problems from ever getting fixed, to reward the rapid-fire overwhelm of infrastructure teams with an immediate, catastrophic cascade of legacy code that is now much harder to delete than it is to write.

There is a scene in Game of Thrones where Khal Drogo kills himself. He does so by replacing a stinging, burning, therapeutic antiseptic wound dressing with some cool, soothing mud. The mud felt nice, addressed the immediate pain, removed the discomfort of the antiseptic, and immediately gave him a lethal infection.

The pleasing feeling of immediate progress when one prompts an LLM to solve some problem feels like cool mud on my brain.

The Economics

We are in the middle of a mania around this technology. As I have written about before, I believe the mania will end. There will then be a crash, and a “winter”. But, as I may not have stressed sufficiently, this crash will be the biggest of its kind — so big, that it is arguably not of a kind at all. The level of investment in these technologies is bananas and the possibility that the investors will recoup their investment seems close to zero. Meanwhile, that cost keeps going up, and up, and up.

Others have reported on this in detail¹⁰, and I will not reiterate that all here, but in addition to being a looming and scary industry-wide (if we are lucky; more likely it’s probably “world-wide”) economic threat, it is also going to drive some panicked behavior from management.

Panicky behavior from management stressed that their idea is not panning out is, famously, the cause of much human misery. I expect that even in the “good” scenario, where some profit is ultimately achieved, will still involve mass layoffs rocking the industry, panicked re-hiring, destruction of large amounts of wealth.

It feels bad to think about this.

The Energy Usage

For a long time I believed that the energy impact was overstated. I am even on record, about a year ago, saying I didn’t think the energy usage was a big deal. I think I was wrong about that.

It initially seemed like it was letting regular old data centers off the hook. But recently I have learned that, while the numbers are incomplete because the vendors aren’t sharing information, they’re also extremely bad.¹¹

I think there’s probably a version of this technology that isn’t a climate emergency nightmare, but that’s not the version that the general public has access to today.

The Educational Impact

LLMs are making academic cheating incredibly rampant.¹²

Not only is it so common as to be nearly universal, it’s also extremely harmful to learning.¹³

For learning, genAI is a forklift at the gym.

To some extent, LLMs are simply revealing a structural rot within education and academia that has been building for decades if not centuries. But it was within those inefficiencies and the inconveniences of the academic experience that real learning was, against all odds, still happening in schools.

LLMs produce a frictionless, streamlined process where students can effortlessly glide through the entire credential, learning nothing. Once again, they dull the pain without regard to its cause.

This is not good.

The Invasion of Privacy

This is obviously only a problem with the big cloud models, but then, the big cloud models are the only ones that people actually use. If you are having conversations about anything private with ChatGPT, you are sending all of that private information directly to Sam Altman, to do with as he wishes.

Even if you don’t think he is a particularly bad guy, maybe he won’t even create the privacy nightmare on purpose. Maybe he will be forced to do so as a result of some bizarre kafkaesque accident.¹⁴

Imagine the scenario, for example, where a woman is tracking her cycle and uploading the logs to ChatGPT so she can chat with it about a health concern. Except, surprise, you don’t have to imagine, you can just search for it, as I have personally, organically, seen three separate women on YouTube, at least one of whom lives in Texas, not only do this on camera but recommend doing this to their audiences.

Citation links withheld on this particular claim for hopefully obvious reasons.

I assure you that I am neither particularly interested in menstrual products nor genAI content, and if I am seeing this more than once, it is probably a distressingly large trend.

The Stealing

The training data for LLMs is stolen. I don’t mean like “pirated” in the sense where someone illicitly shares a copy they obtained legitimately; I mean their scrapers are ignoring both norms¹⁵ and laws¹⁶ to obtain copies under false pretenses, destroying other people’s infrastructure¹⁷ in the process.

The Fatigue

I have provided references to numerous articles outlining rhetorical and sometimes data-driven cases for the existence of certain properties and consequences of genAI tools. But I can’t prove any of these properties, either at a point in time or as a durable ongoing problem.

The LLMs themselves are simply too large to model with the usual kind of heuristics one would use to think about software. I’d sooner be able to predict the physics of dice in a casino than a 2 trillion parameter neural network. They resist scientific understanding, not just because of their size and complexity, but because unlike a natural phenomenon (which could of course be considerably larger and more complex) they resist experimentation.

The first form of genAI resistance to experiment is that every discussion is a motte-and-bailey. If I use a free model and get a bad result I’m told it’s because I should have used the paid model. If I get a bad result with ChatGPT I should have used Claude. If I get a bad result with a chatbot I need to start using an agentic tool. If an agentic tool deletes my hard drive by putting os.system(“rm -rf ~/”) into sitecustomize.py then I guess I should have built my own MCP integration with a completely novel heretofore never even considered security sandbox or something?

What configuration, exactly, would let me make a categorical claim about these things? What specific methodological approach should I stick to, to get reliably adequate prompts?

For the record though, if the idea of the free models is that they are going to be provocative demonstrations of the impressive capabilities of the commercial models, and the results are consistently dogshit, I am finding it increasingly hard to care how much better the paid ones are supposed to be, especially since the “better”-ness cannot really be quantified in any meaningful way.

The motte-and-bailey doesn’t stop there though. It’s a war on all fronts. Concerned about energy usage? That’s OK, you can use a local model. Concerned about infringement? That’s okay, somewhere, somebody, maybe, has figured out how to train models consensually¹⁸. Worried about the politics of enriching the richest monsters in the world? Don’t worry, you can always download an “open source” model from Hugging Face. It doesn’t matter that many of these properties are mutually exclusive and attempting to fix one breaks two others; there’s always an answer, the field is so abuzz with so many people trying to pull in so many directions at once that it is legitimately difficult to understand what’s going on.

Even here though, I can see that characterizing everything this way is unfair to a hypothetical sort of person. If there is someone working at one of these thousands of AI companies that have been springing up like toadstools after a rain, and they really are solving one of these extremely difficult problems, how can I handwave that away? We need people working on problems, that’s like, the whole point of having an economy. And I really don’t like shitting on other people’s earnest efforts, so I try not to dismiss whole fields. Given how AI has gotten into everything, in a way that e.g. cryptocurrency never did, painting with that broad a brush inevitably ends up tarring a bunch of stuff that isn’t even really AI at all.

The second form of genAI resistance to experiment is the inherent obfuscation of productization. The models themselves are already complicated enough, but the products that are built around the models are evolving extremely rapidly. ChatGPT is not just a “model”, and with the rapid¹⁹ deployment of Model Context Protocol tools, the edges of all these things will blur even further. Every LLM is now just an enormous unbounded soup of arbitrary software doing arbitrary whatever. How could I possibly get my arms around that to understand it?

The Challenge

I have woefully little experience with these tools.

I’ve tried them out a little bit, and almost every single time the result has been a disaster that has not made me curious to push further. Yet, I keep hearing from all over the industry that I should.

To some extent, I feel like the motte-and-bailey characterization above is fair; if the technology itself can really do real software development, it ought to be able to do it in multiple modalities, and there’s nothing anyone can articulate to me about GPT-4o which puts it in a fundamentally different class than GPT-3.5.

But, also, I consistently hear that the subjective experience of using the premium versions of the tools is actually good, and the free ones are actually bad.

I keep struggling to find ways to try them “the right way”, the way that people I know and otherwise respect claim to be using them, but I haven’t managed to do so in any meaningful way yet.

I do not want to be using the cloud versions of these models with their potentially hideous energy demands; I’d like to use a local model. But there is obviously not a nicely composed way to use local models like this.

Since there are apparently zero models with ethically-sourced training data, and litigation is ongoing²⁰ to determine the legal relationships of training data and outputs, even if I can be comfortable with some level of plagiarism on a project, I don’t feel that I can introduce the existential legal risk into other people’s infrastructure, so I would need to make a new project.

Others have differing opinions of course, including some within my dependency chain, which does worry me, but I still don’t feel like I can freely contribute further to the problem; it’s going to be bad enough to unwind any impact upstream. Even just for my own sake, I don’t want to make it worse.

This especially presents a problem because I have way too much stuff going on already. A new project is not practical.

Finally, even if I did manage to satisfy all of my quirky²¹ constraints, would this experiment really be worth anything? The models and tools that people are raving about are the big, expensive, harmful ones. If I proved to myself yet again that a small model with bad tools was unpleasant to use, I wouldn’t really be addressing my opponents’ views.

I’m stuck.

The Surrender

I am writing this piece to make my peace with giving up on this topic, at least for a while. While I do idly hope that some folks might find bits of it convincing, and perhaps find ways to be more mindful with their own usage of genAI tools, and consider the harm they may be causing, that’s not actually the goal. And that is not the goal because it is just so much goddamn work to prove.

Here, I must return to my philosophical hobbyhorse of sprachspiel. In this case, specifically to use it as an analytical tool, not just to understand what I am trying to say, but what the purpose for my speech is.

The concept of sprachspiel is most frequently deployed to describe the goal of the language game being played, but in game theory, that’s only half the story. Speech — particularly rigorously justified speech — has a cost, as well as a benefit. I can make shit up pretty easily, but if I want to do anything remotely like scientific or academic rigor, that cost can be astronomical. In the case of developing an abstract understanding of LLMs, the cost is just too high.

So what is my goal, then? To be king Canute, standing astride the shore of “tech”, whatever that is, commanding the LLM tide not to rise? This is a multi-trillion dollar juggernaut.

Even the rump, loser, also-ran fragment of it has the power to literally suffocate us in our homes²² if they so choose, completely insulated from any consequence. If the power curve starts there, imagine what the winners in this industry are going to be capable of, irrespective of the technology they’re building - just with the resources they have to hand. Am I going to write a blog post that can rival their propaganda apparatus? Doubtful.

Instead, I will just have to concede that maybe I’m wrong. I don’t have the skill, or the knowledge, or the energy, to demonstrate with any level of rigor that LLMs are generally, in fact, hot garbage. Intellectually, I will have to acknowledge that maybe the boosters are right. Maybe it’ll be OK.

Maybe the carbon emissions aren’t so bad. Maybe everybody is keeping them secret in ways that they don’t for other types of datacenter for perfectly legitimate reasons. Maybe the tools really can write novel and correct code, and with a little more tweaking, it won’t be so difficult to get them to do it. Maybe by the time they become a mandatory condition of access to developer tools, they won’t be miserable.

Sure, I even sincerely agree, intellectual property really has been a pretty bad idea from the beginning. Maybe it’s OK that we’ve made an exception to those rules. The rules were stupid anyway, so what does it matter if we let a few billionaires break them? Really, everybody should be able to break them (although of course, regular people can’t, because we can’t afford the lawyers to fight off the MPAA and RIAA, but that’s a problem with the legal system, not tech).

I come not to praise “AI skepticism”, but to bury it.

Maybe it really is all going to be fine. Perhaps I am simply catastrophizing; I have been known to do that from time to time. I can even sort of believe it, in my head. Still, even after writing all this out, I can’t quite manage to believe it in the pit of my stomach.

Unfortunately, that feeling is not something that you, or I, can argue with.

Acknowledgments

Thank you to my patrons. Normally, I would say, “who are supporting my writing on this blog”, but in the case of this piece, I feel more like I should apologize to them for this than to thank them; these thoughts have been preventing me from thinking more productive, useful things that I actually have relevant skill and expertise in; this felt more like a creative blockage that I just needed to expel than a deliberately written article. If you like what you’ve read here and you’d like to read more of it, well, too bad; I am sincerely determined to stop writing about this topic. But, if you’d like to read more stuff like other things I have written, or you’d like to support my various open-source endeavors, you can support my work as a sponsor!

And yes, disinformation is still an issue even if you’re “just” using it for coding. Even sidestepping the practical matter that technology is inherently political, validation and propagation of poor technique is a form of disinformation. ↩
I can’t resolve it, that’s the whole tragedy here, but I guess we have to pretend I will to maintain narrative momentum here. ↩
The story in Creative Bloq, or the NYT, if you must ↩
although it’s not for lack of trying, Jesus, look at the word count on this ↩
These are sometimes referred to as “10x” programmers, because they make everyone around them 10x slower. ↩
Douglas B. Laney at Forbes, Viral Shopify CEO Manifesto Says AI Now Mandatory For All Employees ↩
The National CIO Review, AI Mandates, Minimal Use: Closing the Workplace Readiness Gap ↩
Matt O’Brien at the AP, Reddit sues AI company Anthropic for allegedly ‘scraping’ user comments to train chatbot Claude ↩
Using the usual tricks to find plagiarism like searching for literal transcriptions of snippets of training data did not pull up anything when I tried, but then, that’s not how LLMs work these days, is it? If it didn’t obfuscate the plagiarism it wouldn’t be a very good plagiarism-obfuscator. ↩
David Gerard at Pivot to AI, “Microsoft and AI: spending billions to make millions”, Edward Zitron at Where’s Your Ed At, “The Era Of The Business Idiot”, both sobering reads ↩
James O’Donnell and Casey Crownhart at the MIT Technology Review, We did the math on AI’s energy footprint. Here’s the story you haven’t heard. ↩
Lucas Ropek at Gizmodo, AI Cheating Is So Out of Hand In America’s Schools That the Blue Books Are Coming Back ↩
James D. Walsh at the New York Magazine Intelligencer, Everyone Is Cheating Their Way Through College ↩
Ashley Belanger at Ars Technica, OpenAI slams court order to save all ChatGPT logs, including deleted chats ↩
Ashley Belanger at Ars Technica, AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt ↩
Blake Brittain at Reuters, Judge in Meta case warns AI could ‘obliterate’ market for original works ↩
Xkeeper, TCRF has been getting DDoSed ↩
Kate Knibbs at Wired, Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content ↩
and, I should note, extremely irresponsible ↩
Porter Anderson at Publishing Perspectives, Meta AI Lawsuit: US Publishers File Amicus Brief ↩
It feels bizarre to characterize what feel like baseline ethical concerns this way, but the fact remains that within the “genAI community”, this places me into a tiny and obscure minority. ↩
Ariel Wittenberg for Politico, ‘How come I can’t breathe?’: Musk’s data company draws a backlash in Memphis ↩

Deciphering Glyph

Adversarial Communication

The Ladder-Climber And Their Reverse-Centaur Rungs

The Generative Gish Galloper

Straightforward Spam and Fraud

Customer “Support”

“Education”

Spamming “For Good”?

“Search” By Stealing

Who Am I Hurting?

Acknowledgments

Opaque Types in Python

Acknowledgments

What Is Code Review For?

Humans Are Bad At Perceiving

Never Send A Human To Do A Machine’s Job

What Is Code Review For, Then?

Oops, Surprise, This Post Is Actually About LLMs Again

To Sum Up

Acknowledgments

How To Argue With Me About AI, If You Must

Rule Zero: Maybe Don’t

Rule One: No ‘Just’

Rule Two: No ‘Look At This Cool Thing’

Rule Two And A Half: Engage In Metacognition

Rule Three: Do Not Cite The Deep Magic To Me

Rule Four: Ethics Are Not Optional

Consequences

The Next Thing Will Not Be Big

There Will Be Next Things

But They Will Not Be Big

What, Then?

Fight Scale Creep

Be Prepared to Scale Down

Acknowledgments

The “Dependency Cutout” Workflow Pattern, Part I

The Answer We Want

Step 0: Report the Problem

Step 1: Source Code and CI Setup

Optional Bonus Step 1a: Artifact Management

Step 2: Do The Fix

Step 2a: Local Filesystem Setup

Step 2b: Branch Setup for PR

Step 3: Deploy Internally

Step 4: Propose Externally

Step 4a: Hurry Up And Wait

Step 5: Unwind Everything

Up Next

Acknowledgments

The Futzing Fraction

Perpetually Proffering Permuted Prompts

Income from Investment in Inference

Figuring out the Fraction is Frustrating

Challenges to Computing Checking Costs

Margins, Money, and Metacognition

Problematically Plausible Presentation

Mallory Mangling Mentorship

But That’s Not All

Problems with the Prediction of P

Surprisingly Small Space for Slip-Ups

Harm to the Humans

Abridging the Artificial Arithmetic (Alliteratively)

Acknowledgments

R0ML’s Ratio

Acknowledgments

The Best Line Length

Ancient History

Ancienter History

Science!

Caveats

The screen is wide, though.

Code Isn’t Prose

What about soft-wrap?

So what’s the optimal line width?

Acknowledgments

I Think I’m Done Thinking About genAI For Now

The Problem

The Anti-Antis

The Aesthetics

The Affordances

Problems with the Prediction of $P$