Giant Robots Smashing Into Other Giant Robots

The Four Signals of AI Observability

2026-06-01T00:00:00+00:00

A few months ago we shipped a chat experience to production. Users ask a question, our app routes it through an LLM, the model calls a few internal tools, and an answer comes back.

It worked. Sort of.

When the model answered well, we had no idea why. When it answered badly, we had no idea either. The model was a black box attached to our app, and our best debugging tool was reading logs and guessing.

We realized our app could not answer a very normal operational question:

Show us every chat where the user said the answer was bad, group them by which version of the system prompt was loaded, and let us read the whole conversation, including which tools the model called.

It’s the AI equivalent of “show me every 500 error on this endpoint after deploy X.” But our app couldn’t answer it.

That was the trigger to stop looking for a smarter model and start adding an observability layer. We ended up using Langfuse, but the specific vendor matters less than the capabilities. Helicone, Arize Phoenix, LangSmith, and Braintrust all solve versions of the same problem.

After a couple of months of iteration, we noticed that the things we needed came in four flavors. I call them the four signals that every AI feature needs to emit about itself.

A version on every prompt. Which exact words did the model see today?
A trace shaped like the actual work. What did it call, in what order, with what arguments?
A score from the user. Did the human like the result?
A score from another model. When the human is quiet, who is grading?

Of course we can build an AI feature without all four. We just can’t improve it on purpose.

A version on every prompt

The first thing we did was move every prompt out of the code and into a versioned store the app fetches at runtime.

# The code never references a version. It asks for a label.
template = PromptRepo.compile(name: "classify_question", label: "production")

# A human moves "production" between versions in the Langfuse UI.
# Promotion is a click. Rollback is a click. No deploy.

The first time we rolled back a bad prompt by clicking a button instead of reverting a PR and waiting for CI, we knew this was the right shape.
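To make the label idea concrete, here is a minimal in-memory sketch of a store where code asks for a label and a human moves that label between versions. The PromptStore class and its publish and move_label methods are hypothetical stand-ins for illustration; they are not the Langfuse API or our PromptRepo implementation.

```ruby
# Hypothetical in-memory stand-in for a versioned prompt store.
# None of these names come from a real vendor SDK.
class PromptStore
  def initialize
    @versions = Hash.new { |hash, name| hash[name] = [] } # name => [template, ...]
    @labels = {}                                           # [name, label] => version number
  end

  # Saves a new version and returns its 1-based version number.
  def publish(name, template)
    @versions[name] << template
    @versions[name].size
  end

  # Points a label such as "production" at an existing version.
  def move_label(name, label, version)
    unless (1..@versions[name].size).cover?(version)
      raise ArgumentError, "unknown version #{version} for #{name}"
    end
    @labels[[name, label]] = version
  end

  # Resolves the label at call time and fills in %{placeholders}.
  def compile(name:, label:, variables: {})
    version = @labels.fetch([name, label])
    { version: version, text: format(@versions[name][version - 1], variables) }
  end
end

store = PromptStore.new
v1 = store.publish("classify_question", "Classify this question: %{question}")
v2 = store.publish("classify_question", "Pick one category for: %{question}")

store.move_label("classify_question", "production", v2) # promote
store.move_label("classify_question", "production", v1) # roll back, no deploy

prompt = store.compile(name: "classify_question", label: "production",
                       variables: { question: "Where is my invoice?" })
```

The calling code never changes between the promotion and the rollback. Only the label moves, and the resolved version number can be recorded on the trace.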

Once prompts became content, the people closest to the problem became the people writing the prompts. The feedback loop got much shorter, and the quality went up.

A trace shaped like the actual work

A chat is not a single call. It is a small program. Classify the question, load the right prompt, call a tool or two, then compose an answer.

If your trace is one row, you only know that something happened. If your trace is a tree of calls, you know what actually happened, and in what order.

# Before: one log line, no shape
[INFO] chat_completed user_id=123 duration_ms=4200 tokens=1840

# After: a tree of decisions
trace: "chat"
  span:       load-prompt                  (version=production:v12)
  generation: classify-question            (model=haiku, category="billing")
  generation: compose-answer
    span:       tool-call.lookup_invoice   (200ms)
    span:       tool-call.lookup_customer  (180ms)
  generation: final-response               (model=sonnet, 1.2k tokens)

Each node carries the prompt name and version, the model id, token usage, and a set of metadata fields we control. The customer it ran for, the category the question was classified as, which tools ran, whether the conversation was new.

That metadata is the part that turned out to matter most.

The first time we filtered traces to “every chat in scope X where a particular tool ran and the user said the answer was bad”, we had a small realization. The trace list was not a log anymore. It was a queryable database of decisions the model made.

The rule we would write on a sticky note: tag your traces with the dimensions you will want to filter on later. It is cheap up front and impossible to backfill once you wish you had it.
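As a rough sketch of why the tags matter, the filter below only works because each trace already carries its dimensions. The field names (category, customer_id, new_conversation) and the find_traces helper are assumptions for illustration, not a vendor schema or query API.

```ruby
# Hypothetical trace records, tagged at write time with filterable dimensions.
traces = [
  { id: "t1", tools: ["lookup_invoice", "lookup_customer"],
    metadata: { category: "billing", customer_id: 42, new_conversation: true } },
  { id: "t2", tools: ["lookup_customer"],
    metadata: { category: "billing", customer_id: 42, new_conversation: false } },
  { id: "t3", tools: ["lookup_invoice"],
    metadata: { category: "shipping", customer_id: 7, new_conversation: true } },
]

# Returns traces that ran the given tool and match every metadata dimension.
def find_traces(traces, tool:, **dimensions)
  traces.select do |trace|
    trace[:tools].include?(tool) &&
      dimensions.all? { |key, value| trace[:metadata][key] == value }
  end
end

matches = find_traces(traces, tool: "lookup_invoice", category: "billing")
```

A dimension that was never written to metadata cannot be passed to a filter like this one, which is the whole argument for tagging early.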

A score from the user

Every assistant message in the UI has a thumbs up and a thumbs down. When a user clicks one, we save a row and post it back to the observability tool as a score on the trace.

A thumbs-down on its own isn’t actionable. A thumbs-down attached to a trace tells you what the model saw, what it called, which prompt version produced it, and what category the question fell into. Now you can ask: are downvotes concentrated in one category? On one prompt version? After one specific tool call?

You should review downvoted traces. It takes time, and sometimes they’re noise: the user wanted something we don’t support, or hit thumbs-down by accident. But maybe one in ten is a real signal, and that’s the one that turns into a prompt change, a new tool, or a bug fix.

The point of all this plumbing is one new query.

Show us every trace a user labeled bad.

Once you can run that query and read the entire conversation that produced it (prompt version, tool calls, model, latency, everything), you stop the guessing game.
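Here is a small sketch of that query over in-memory rows, assuming each trace already carries its prompt version, category, and the user's score. The row shape and values are made up for illustration; a real observability tool would expose this through its own filters.

```ruby
# Hypothetical rows: one per trace, with the user's score if they clicked.
traces = [
  { id: "t1", prompt_version: "v11", category: "billing",  user_score: :down },
  { id: "t2", prompt_version: "v12", category: "billing",  user_score: :up },
  { id: "t3", prompt_version: "v12", category: "shipping", user_score: :down },
  { id: "t4", prompt_version: "v12", category: "billing",  user_score: :down },
  { id: "t5", prompt_version: "v12", category: "billing",  user_score: nil }, # no click
]

# Every trace a user labeled bad.
downvoted = traces.select { |trace| trace[:user_score] == :down }

# Are downvotes concentrated on one prompt version?
by_version = downvoted.group_by { |trace| trace[:prompt_version] }
                      .transform_values { |rows| rows.map { |row| row[:id] } }

# Or in one category?
by_category = downvoted.map { |trace| trace[:category] }.tally
```

The grouping keys are the same dimensions the earlier questions ask about: version, category, and so on.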

A score from another model

Human feedback is useful but rare. Most users do not click anything.

So we added a second model to grade the first one. A background job pulls finished chats, runs them through a separate “judge” prompt (versioned and labeled in the same store as the production prompts), and writes the result back as a score on the same trace.

Now the trace carries two streams of judgment. When the user and the judge agree, our judge is in sync with real users. When they disagree, that is the most interesting trace in the system. Either way, the judge runs on every chat, so a regression shows up the same day we ship the prompt that caused it, not a week later when somebody complains.
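A sketch of how the two streams can be compared once both scores sit on the same trace. The 0.0 to 1.0 judge scale and the 0.7 pass threshold are assumptions for illustration, not values from our setup.

```ruby
# Hypothetical traces carrying both a user score and a judge score.
traces = [
  { id: "t1", user_score: :up,   judge_score: 0.9 },
  { id: "t2", user_score: :down, judge_score: 0.8 }, # judge liked it, user did not
  { id: "t3", user_score: nil,   judge_score: 0.3 }, # user silent, judge flags it
  { id: "t4", user_score: :up,   judge_score: 0.4 }, # user liked it, judge did not
]

# Collapses the judge's numeric score into the same up/down scale as the user.
def judge_verdict(trace, pass_threshold: 0.7)
  trace[:judge_score] >= pass_threshold ? :up : :down
end

rated = traces.reject { |trace| trace[:user_score].nil? }

# The most interesting traces: human and judge disagree.
disagreements = rated.select { |trace| trace[:user_score] != judge_verdict(trace) }

# How well the judge tracks real users, over traces that have both scores.
agreement_rate = (rated.size - disagreements.size).fdiv(rated.size)

# Coverage the thumbs alone would miss: nobody clicked, but the judge flagged it.
judge_only_flags = traces.select do |trace|
  trace[:user_score].nil? && judge_verdict(trace) == :down
end
```

A falling agreement rate is a hint that the judge prompt itself needs a new version.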

Our judge scores things like factuality, instruction-following, completeness, hallucination, and whether the assistant actually used the right internal context.

We underestimated this one. A judge that catches a regression before it ships is worth more than a faster or smarter model. It is the only signal that scales when nobody is clicking thumbs.

The lesson we had to learn: the judge is just a prompt. It can be wrong. It needs versioning and a Playground and a rollback button, exactly like a user-facing prompt.

Each signal writes back to the same trace. That’s the whole trick.

Four signals, one idea

The four signals overlap, and that’s on purpose. The prompt version shows up on the trace. The user score attaches to the trace. The judge score attaches to the trace too. They are not really four separate things. They are the same idea viewed from four different angles.

Make the AI feature observable, then you can change it on purpose.

For a while I treated AI features like a different category of software: less debuggable, less testable, less under our control. That was a mistake. An AI feature is software. It has inputs, makes decisions, produces outputs, and can be observed like anything else.

What changes once you have all four isn’t that the model gets smarter. It’s that you stop hoping. You ship a prompt change knowing the judge will tell you if it regressed. You read a downvote knowing you can replay the exact conversation that produced it. You promote a new prompt to production knowing you can roll it back in one click if it breaks.

The model is the engine. The observability layer is the dashboard. You can drive without one. You just can’t drive on purpose.

Can you really launch a tech business with a no-code app builder?

2026-05-29T00:00:00+00:00

Sometimes AI really does feel like magic, and right now just about any no-code app builder provides that magical experience for entrepreneurs with big ideas but no coding skills.

You can spend half an hour typing a few details about your app into Lovable and create a simple, functional app with a polished user interface. Then add six new features over the weekend without ever hiring a developer or talking to a real user.

Besides Lovable, there’s Bolt, Base44, Bubble, Replit and others. These AI app generators allow just about anyone to build software by describing it through text-based prompts. No technical knowledge, design sprint or VC funding required. Many even allow you to export the code, so you can continue your project outside the tool.

It’s exciting to mock up ideas so quickly, but can you launch a real business with a no-code app builder? Are non-developers really creating high quality, production-ready code?

Making money with AI-built apps

Startup culture loves to celebrate the outliers, the big, against-the-odds success stories. Lovable’s own ad campaign hypes ShiftNex, a healthcare staffing platform that hit $1 million annual recurring revenue (ARR) in five months, and Plinq, a background check app for dating built in 45 days that achieved $465,000 in ARR.

Not to be left out, Replit publicizes GEN AIPI, an AI education platform with training courses, payments, certifications and an admin system that was originally built in just three days with no dev team. The business achieved $180,000 in revenue in the first six weeks.

But the ad campaigns don’t include all the entrepreneurs who encountered big—or even impassable—roadblocks attempting similar business results. There’s often a point of diminishing returns with no-code app builders. At first, there’s instant gratification, but eventually, it becomes slower, harder and sometimes impossible to refine existing features or add new ones.

Are AI-generated apps production ready?

Unless you’re in a highly regulated industry, you can push software created with a no-code app builder to production. But that’s just the first challenge on the road to a stable and profitable tech business. Will performance hold up when you hit 1,000 or 10,000 users? Can you easily add new features? When do you add an engineering team?

AI app generators allow you to bring an MVP to life quickly, but scaling a successful, long-term tech business is much tougher. It’s nearly impossible for non-technical founders to evaluate potential risks in a code base or to even know what risks to look for in the first place. A big one is data security: How hard would it be for a bad actor to tap into sensitive data or user information?

And even when a founder does identify a bug or issue, it can be hard for an AI app builder to solve. If you already have users and things start breaking, it can easily become a hair-on-fire emergency. One that might leave traditional developers trying to get up to speed on thousands of lines of code with no context.

Then there’s an even bigger issue. Entrepreneurs are often so focused on what they can build that they forget to think about what they should build. With traditional development barriers gone, it’s easy to skip the strategic work of identifying a real problem to solve for real users. We’ve honed an entire Shaping Sprint process to work with founders on solidifying a product strategy and direction.

Despite these concerns, no-code app builders may be a fit for small businesses in industries with relatively low volume, regulation and security risk. We’re just at the beginning of AI app builders evolving and growing in capability, so it’s impossible to know exactly what the future holds.

No-code app builders 2.0

Right now, most software created by no-code app builders amounts to a prototype: useful for learning, but often difficult to scale into a successful long-term business. The current generation of tools is optimized for speed and instant gratification, but not necessarily for helping founders build the right product or make thoughtful product decisions along the way.

We think there’s an opportunity for the next generation of AI product tools to evolve beyond pure “vibe coding.”

Not just generating interfaces and features faster, but helping founders and teams:

Think through product direction
Validate assumptions
Prioritize the right problems
And move from idea to real product more intentionally

That’s part of why our team has been experimenting publicly with new workflows and AI-assisted product design approaches through our ReadySetGo initiative and weekly AI in Focus livestream series.

We’re still early in exploring what this category could become, but one thing already feels clear: as AI lowers the barrier to building software, the ability to identify the right thing to build may become even more important.

Giant Robots Podcast Ep 612: Do fish drink?

2026-05-28T00:00:00+00:00

The Giant Robots trio are back to discuss the development of thoughtbot’s ReadySetGo app, and whether AI might be causing developers to go backwards.

This week in #dev (May 15, 2026)

2026-05-28T00:00:00+00:00

Welcome to another edition of This Week in #dev, a series of posts where we bring some of our most interesting Slack conversations to the public.

Alternative Text for CSS-Generated Content

Matheus Richard learned that the CSS content property accepts alternative text for screen readers, separated by a /:

.warning::before {
  content: "⚠️" / "Warning";
}

Without the alt text, assistive technology either reads out the emoji name or skips it entirely. More details in Stefan Judis’ article.

A Faster UI for Large GitHub Diffs

Matheus Richard shares diffshub, a tool that renders PR diffs GitHub struggles with. It’s a drop-in replacement: swap github.com for diffshub.com in any PR URL, like https://diffshub.com/oven-sh/bun/pull/30412.

Aube, a New JavaScript Package Manager

Jared Turner shares Aube, a JavaScript package manager from the creator of Mise. It’s pitched as fast, compatible with existing lockfiles, and security-focused, including a 24-hour cooldown before newly published versions can be installed.

Thanks

This edition was brought to you by Jared Turner and Matheus Richard. Thanks to all contributors! 🎉

Lost, forgotten, and unfamiliar HTML

2026-05-27T00:00:00+00:00

I ran HTML-validate and Axe core and a Claude prompt[^1] against a new website I’m building, and they caught a bunch of stuff I missed! This gave me a chance to remember the easily overlooked bits of building a website. And I visited a few dark corners of the HTML spec I hadn’t been to yet!

Data attributes should be lowercase

data-dialogOpen is invalid - it should be data-dialogopen. But did you know that all HTML attribute names get automatically lowercased? I didn’t.

HTTP header names are also case-insensitive, except in HTTP/2 where they MUST be lowercase.

Invalid id attributes

I learned that in HTML5, an id can be anything as long as it’s at least 1 character with no whitespace (and it’s unique). id="_0$!11" is totally valid and I think even emojis are ok!

However, in HTML4, ids need to start with a letter and can only contain letters, numbers, and a few punctuation symbols. So it’s probably best not to go too wild. Backwards compatibility is nice.
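The two rule sets are small enough to write down as patterns. This is a sketch based on my reading of the specs: HTML5 requires at least one character and no ASCII whitespace, while HTML4 requires a leading letter followed by letters, digits, hyphens, underscores, colons, or periods. The method names are made up for illustration, and neither check covers uniqueness.

```ruby
# HTML5: at least one character, none of them ASCII whitespace.
def valid_html5_id?(id)
  id.match?(/\A[^ \t\n\f\r]+\z/)
end

# HTML4: starts with a letter, then letters, digits, "-", "_", ":" or ".".
def valid_html4_id?(id)
  id.match?(/\A[A-Za-z][A-Za-z0-9\-_:.]*\z/)
end

ids = ["_0$!11", "main-nav", "has space", "🎉"]

report = ids.to_h do |id|
  [id, { html5: valid_html5_id?(id), html4: valid_html4_id?(id) }]
end
```

An id like main-nav passes both checks, which is the backwards-compatible middle ground.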

Oh, and the uniqueness requirement? ids inside iFrames only need to be unique within their document. Otherwise, imagine how tricky it would be to iFrame in an arbitrary page.

Redundant for attributes

A bit of a nitpick: when you label an input by putting it inside a label, the for attribute is redundant. When the input is outside the label, you definitely need that for!

<!-- Rails-style: no `for=""` needed -->
<label>
  Username <input type="text" name="username" />
</label>

<!-- non-Rails-style: don't forget the `for=""`! -->
<label for="username">Username</label>
<input type="text" name="username" id="username" />

Some reasons that thoughtbot prefers inputs inside labels:

it reduces the need for an extra wrapper div
since the label is clickable, this often results in a bigger click/tap area
you don’t need to generate unique IDs for inputs

Extra whitespace in a textarea

Claude spotted this one: I accidentally had a blank space inside a textarea.

<textarea name="explain"> </textarea>

An easy mistake to make and kind of annoying to an end user, especially because the field is no longer empty, so the required validation passes without the user typing anything. I wish one of my automated scanners had caught it.

False positive: aria-label misuse

HTML-validate told me that using the aria-label attribute on <search> is invalid. Nope - I was using it correctly!

W3C explicitly recommends it:

If a page includes more than one search landmark, each should have a unique label.

<search aria-label="Site-wide">
  <form>
    ...
  </form>
</search>

I filed a bug report.

iFrames with unique names

I had trouble with this one, but I’m glad Axe caught it because it’s genuinely useful for screen reader users.

Every iFrame needs a title, and those titles should be unique so they can be differentiated. But also, landmarks INSIDE the iFrames must be unique across the entire page, including the parent document.

I had 3 iFrames on a page, all with <main aria-label="Component Example">. Sure enough, when I opened VoiceOver it read out 3 of the same landmark:

Component Example main
Component Example main
Component Example main

That’s not a great experience.

First, I tried to fix it by removing the aria-labels, but Axe warned me that the document had multiple <main>s without unique labels. I had to refactor how the iFrames were generated so that each one had both a unique title and a unique <main> label.

Color contrast issues

Automated scanners are the best at finding contrast issues. I happened to have a link state that used a slightly-too-light purple on white. It didn’t pass WCAG’s minimum contrast levels. Easy for me to miss, but troublesome for someone with reduced vision.
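For reference, the check a scanner runs here is a short formula. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio, with 4.5:1 as the AA minimum for normal-size text. The method names and the sample purple (#b39ddb) are my own illustration, not the color from my site.

```ruby
# Converts one 0..255 sRGB channel to its linear value (WCAG 2.x formula).
def linear_channel(value)
  c = value / 255.0
  c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055)**2.4
end

# Relative luminance of a "#rrggbb" color, from 0.0 (black) to 1.0 (white).
def relative_luminance(hex)
  r, g, b = hex.delete("#").scan(/../).map { |pair| linear_channel(pair.to_i(16)) }
  0.2126 * r + 0.7152 * g + 0.0722 * b
end

# Contrast ratio between two colors, from 1.0 (identical) to 21.0 (black on white).
def contrast_ratio(foreground, background)
  lighter, darker = [relative_luminance(foreground), relative_luminance(background)].sort.reverse
  (lighter + 0.05) / (darker + 0.05)
end

# WCAG AA minimum for normal-size text.
def passes_aa_normal_text?(foreground, background)
  contrast_ratio(foreground, background) >= 4.5
end

light_purple_ok = passes_aa_normal_text?("#b39ddb", "#ffffff")
black_ok = passes_aa_normal_text?("#000000", "#ffffff")
```

Large text has a lower AA threshold (3:1), which this sketch does not handle.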

Keyboard-accessible overflow scrolling

This was a new one for me! Axe tells me that when a region scrolls using overflow: scroll or similar, it must contain a focusable element. This seems to be a Safari-specific bug.

I tested with Safari and confirmed that it’s true: using the keyboard I was unable to scroll down to see the cut-off content.

The simplest solution is to add tabindex="0" to an element inside the scrolling region.

Forgotten SVGs

I’m constantly forgetting to check that SVGs have the right label and role. With images it’s easy: just make sure you’ve got an alt attribute. But inline SVGs can be either decorative or meaningful.

Decorative SVGs must use aria-hidden="true" to keep them out of the accessibility tree.

Meaningful ones must use role="img" and NEED a <title> tag to serve the same function as alt text. And since not all screen readers catch the <title> tag, you usually want to associate it with the <svg> tag using aria-labelledby. And if the SVG contains multiple images, text blocks, or interactivity, there’s even more to consider.

I dug into the WAI-ARIA rabbit hole and learned that maybe some of my SVGs could be role="graphics-symbol":

A graphical object used to convey a simple meaning or category, where the meaning is more important than the particular visual appearance.

Axe missed all this, but Claude caught it. I wonder if there’s an automated scanner that could help me out.

Explain your asterisks

If you’re going to denote required inputs using an asterisk * in the label, you’d better provide a legend that explains it. Even better, replace the asterisk with (required).

Oops, thanks for the reminder, Claude. I added an explainer to the form:

<small>* asterisks denote required fields</small>

Punctuation as labels

I built a pagination component that looked like this:

< 1 … 45 46 47 … 104 >

Claude reminded me that when a screen reader reads out those angle brackets and ellipses, it’s going to sound weird. I opened VoiceOver and sure enough - it sounds weird.

I followed Pagy’s example: the ellipses get role="separator" and the buttons get aria-label="Next"/aria-label="Previous".

Table header cell scopes

A blind spot for me: I didn’t know about the scope attribute. WCAG recommends using scope="col" on table header <th> cells to associate them with their column. And also using <th scope="row"> for table body cells that identify the subject of the row.

Probably more useful for complex tables than simple ones. I’ll have to remember this.

Thank goodness for automated scanners and the people who maintain them[^2]! The stuff I build is better for it. I was impressed by the bugs Claude caught, even though it surely wasn’t comparable to an accessibility audit by a real person.

[^1] My prompt: “You are an accessibility expert. Please review all the pages on this site and create a table of accessibility and WCAG violations”

[^2] By the way: thoughtbot maintains CapybaraAccessibilityAudit which uses Axe under the hood!

The Bike Shed Ep 500: Celebrating with past hosts

2026-05-26T00:00:00+00:00

The Bike Shed celebrates its 500th episode with hosts new and old as they reflect on the show’s history and ask, what’s new in your world?

Why Duck Typer?

2026-05-26T00:00:00+00:00

Duck Typer is a Ruby gem that validates interface compatibility across polymorphic classes sharing the same role, so they can be used interchangeably. It detects and clearly reports interface drift directly in your test suite.

Since Duck Typer launched, there’s been some discussion about the validity of interface testing. In this post, I want to make the case for it.

“Interface tests are fragile, so you shouldn’t write them”

That’s not true without context. How is your test suite structured? What do you test? Obviously, if you write only interface tests like this:

def test_interfaces_match
  assert_interfaces_match [StripeProcessor, PaypalProcessor]
end

With no behavior tests to accompany it, that quote will be true. Why? Because you’re not testing actual code behavior. Alone, Duck Typer tests are fragile. So why should you still write them?

“But I already have behavior tests that catch mismatches”

You do, and they will catch mismatches eventually, assuming you have good test coverage. The problem is how they catch them. A behavior test will blow up with a NoMethodError or an ArgumentError, but nothing about that tells you it’s an interface problem across a group of classes. You have to figure that out yourself, then work backwards to find which class drifted and what changed.

Duck Typer short-circuits that investigation. It tells you what drifted and where, in a single message, before you ever hit a behavioral failure:

Expected StripeProcessor and PaypalProcessor to implement compatible
interfaces, but the following method signatures differ:

StripeProcessor: refund(transaction_id)
PaypalProcessor: refund not defined

There’s also a sharper version of this objection: “You can remove the implementation and the test still passes, so it’s not a good test.” That’s true, and it’s by design. Duck Typer checks shape, not behavior. It explicitly marks that a set of classes is expected to evolve together, and when one changes, the failure makes it clear. That’s a different job than verifying correctness, and both are worth doing.
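To show what “shape, not behavior” means, here is a plain-Ruby sketch of a shape comparison built on Module#instance_method and UnboundMethod#parameters. It is my own illustration of the idea, not Duck Typer’s implementation or API. The interface_of and interface_drift helpers are hypothetical names.

```ruby
class StripeProcessor
  def charge(amount, currency:); end
  def refund(transaction_id); end
end

class PaypalProcessor
  def charge(amount, currency:); end
end

# Maps each public instance method defined directly on the class to its parameters.
def interface_of(klass)
  klass.public_instance_methods(false).sort.to_h do |name|
    [name, klass.instance_method(name).parameters]
  end
end

# Returns { method_name => { Class => parameters_or_nil } } for every method
# whose signature is not identical across all given classes.
# Note: parameter names are part of the comparison in this sketch.
def interface_drift(*classes)
  interfaces = classes.to_h { |klass| [klass, interface_of(klass)] }
  method_names = interfaces.values.flat_map(&:keys).uniq

  method_names.each_with_object({}) do |name, drift|
    signatures = interfaces.transform_values { |interface| interface[name] }
    drift[name] = signatures if signatures.values.uniq.size > 1
  end
end

drift = interface_drift(StripeProcessor, PaypalProcessor)
```

The method bodies above are empty, and the comparison still works. That is the point of the objection, and also the point of the tool: nothing here executes behavior.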

It’s about quality of life

At thoughtbot, we’ve always valued testing UX and clear error reporting. We care about the details. For example, this is a style of test generally not encouraged here:

expect(objects).to eq([post_1, post_2, post_3])

Assume that the post objects are complex Active Record instances. Can you imagine what the error message will look like if one object has differences? It will dump a huge blob of text that incurs overhead to parse. What are we really testing there? That we’re getting the right objects! Instead, we can use named identifiers to make error reporting more actionable and crystal clear:

expect(objects.map(&:title)).to eq(["Post 1", "Post 2", "Post 3"])

Duck Typer applies the same principle to interface errors. Without it, you only get generic Ruby errors that say nothing about interface drift across classes. With Duck Typer, you also get a clear, targeted failure:

Expected StripeProcessor and BraintreeProcessor to implement compatible
interfaces, but the following method signatures differ:

StripeProcessor: charge(amount, currency:)
BraintreeProcessor: charge(amount, currency:, description:)

StripeProcessor: refund(transaction_id)
BraintreeProcessor: refund(transaction_id, amount)

It communicates design intent as actionable errors

I wish Ruby had interfaces. As I said in the introductory post, I want to be alerted of interface drift because it’s a great developer experience feature.

It’s not always obvious when classes are supposed to be used interchangeably. A clear error message communicates which classes share a role and what shape their interfaces should have.

What if you join a legacy project where the original developers left a long time ago? Duck Typer would be super helpful there too.

A concrete example: Null Objects. You add a deactivate method to User, and your behavior tests for User pass. But NullUser, which is supposed to be interchangeable with User, silently drifts because nobody remembered to update it. Behavior tests on User won’t catch that. Duck Typer will, immediately, because it treats those classes as a group that must stay in sync. It also reminds you to write the actual behavior test for NullUser#deactivate.
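The scenario can be sketched in a few lines of plain Ruby. The classes below are simplified stand-ins, and the final set difference is only an illustration of a shape check, not how Duck Typer reports drift.

```ruby
class User
  def initialize
    @active = true
  end

  def name
    "Ada"
  end

  def active?
    @active
  end

  # The newly added method.
  def deactivate
    @active = false
  end
end

class NullUser
  def name
    "Guest"
  end

  def active?
    false
  end
  # deactivate was never added here.
end

# The behavior test for User passes.
user = User.new
user.deactivate
user_test_passes = (user.active? == false)

# The drift only surfaces when a NullUser reaches the new call site.
runtime_error = begin
  NullUser.new.deactivate
  nil
rescue NoMethodError => error
  error
end

# A shape check reports the gap without executing any behavior.
missing_on_null_user =
  User.public_instance_methods(false) - NullUser.public_instance_methods(false)
```

Once the check names the missing method, the next step is still to decide what NullUser#deactivate should do and to write a behavior test for it.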

As a developer who loves targeted feedback, that is right up my alley.

It helps you think about design

Let’s say that introducing a do_stuff public method in StripeProcessor is the easiest way to accomplish a goal. You add it, but get a test failure like the following:

Expected StripeProcessor and PaypalProcessor to implement compatible
interfaces, but the following method signatures differ:

StripeProcessor: do_stuff(data)
PaypalProcessor: do_stuff not defined

That message doesn’t just report interface drift. It actually asks:

Why are you doing that? A public method in StripeProcessor should also exist in the other processors.

Most likely, your do_stuff method is not in the right place. Maybe it belongs in a collaborator object, or maybe it should be a private method that isn’t part of the public interface at all.

The same applies to differing method parameters; if you introduce a parameter in one class but it is not needed in another class from the same interface, you are probably doing something wrong.

“But that’s just like shoulda-matchers”

Not quite. Shoulda Matchers are great for shortening TDD feedback loops when working with Rails conventions. They verify a single object’s declarations: does this model have_many :posts? Does it validate_presence_of :email? That’s inward-facing: one object, one declaration.

In fact, once the code has enough behavior coverage, you could delete the shoulda-matchers tests entirely. They’ve done their job.

Duck Typer is cross-cutting. It checks whether a group of objects agrees on a shared interface. The question isn’t “does StripeProcessor have a charge method?” but “do StripeProcessor, PaypalProcessor, and BraintreeProcessor all define charge with the same signature?” That’s a fundamentally different concern, and one that single-object matchers can’t express.

“But it doesn’t catch errors in production”

Some prefer an approach where the interface is validated at class load time: declare the contract, and if a class doesn’t conform, raise a RuntimeError immediately. That way, mismatches surface as errors in production rather than only in tests.

That’s a valid approach, although not exactly great. In a typed language, an interface mismatch would never be deployed because the code wouldn’t compile. Ruby doesn’t have a compiler, but it has its own equivalent: the test suite. And guess what inhibits bad deployments in Ruby projects? In all my years working with Ruby, I’ve never seen a project without a CI pipeline. If tests fail, your code doesn’t get deployed. In practice, the safety net is the same.

On top of that, runtime checks add metaprogramming to your production code, and you’d still need tests to verify the setup is correct.
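For comparison, a load-time check might look roughly like the sketch below. The PaymentProcessorContract module, its verify! method, and the arity-based contract are all hypothetical; they are not taken from any particular gem.

```ruby
# Hypothetical load-time contract check that lives in production code.
module PaymentProcessorContract
  REQUIRED = { charge: 2, refund: 1 }.freeze # method name => arity (assumed)

  def self.verify!(klass)
    REQUIRED.each do |name, arity|
      unless klass.method_defined?(name) && klass.instance_method(name).arity == arity
        raise "#{klass} must define #{name} with #{arity} argument(s)"
      end
    end
  end
end

class CardProcessor
  def charge(amount, currency); end
  def refund(transaction_id); end

  PaymentProcessorContract.verify!(self) # runs while the class body is evaluated
end

# A class that drifted fails as soon as it is loaded.
load_error = begin
  class WalletProcessor
    def charge(amount, currency); end

    PaymentProcessorContract.verify!(self)
  end
  nil
rescue RuntimeError => error
  error
end
```

The verify! call has to come after the method definitions, and the contract itself is one more piece of production code that needs its own tests.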

That’s why Duck Typer deliberately stays in the test suite: it’s Ruby’s natural place to enforce constraints like this, and your implementation stays clean, without workarounds that try to mimic static typing at runtime.

If you want compile-time or runtime guarantees, tools like Sorbet or RBS take a fundamentally different approach to the same problem and you wouldn’t need Duck Typer. That said, Duck Typer gives you some of those benefits with a fraction of the effort, at least when it comes to interfaces.

Wrapping up

Duck Typer won’t replace your behavior tests, and it was never meant to. It’s a small, focused tool that gives you targeted feedback when interfaces drift. It’s usually a one-liner to add, has no runtime dependencies, and lives only in your test environment. If you value clear error messages and care about keeping polymorphic classes in sync, give it a try.