Echoes of the machine

DNA, lineage, and provenance: the genetic metaphor for AI artifacts

Sid Smith — Tue, 09 Jun 2026 13:00:00 GMT

The first time somebody asked me, in earnest, "where did this model come from?" I had a good answer for about thirty seconds. Checkpoint 4 of the second fine-tune of the third base release. Q2 corpus. The eval suite we'd been running since March. Then came the follow-ups (which version of Q2, which prompts shaped the SFT pass, which downstream embeddings inherited from this checkpoint), and I was stitching answers from four systems and a Slack thread. The confidence drained out of the conversation in the way that tells you the architecture has a hole in it.

The hole isn't unusual. Every AI pipeline I've worked on has had some version of it. The vocabulary almost everyone reaches for is the manifest, a JSON file alongside the artifact saying "this model was trained on this dataset, with this code, at this time." Manifests are useful. They are not enough. A manifest is a snapshot. What you actually need is a graph.

I keep reaching for the genetic metaphor here. Not because biology has anything technical to teach a pipeline engineer, but because biology already invented the right vocabulary for "what did this thing inherit from where, and what shares its lineage." Descent. Inheritance. Mutation. The genome is the snapshot; the family tree is the graph; the population is the system you're actually trying to reason about. The genome of an individual organism is useless without the lineage: you can sequence it perfectly and still not know whether a trait is novel, inherited, or convergent.

This is the fourth metaphor in the series, complementing atoms-and-molecules (composition), the periodic table (layout), and cosmology (containment). Where those three are about how things relate at a moment, this one is about how things relate across time.

What the manifest framing hides

A manifest tells you what an artifact is made of right now: dataset hash, code commit, hyperparameters, base model. A static parts list. For build artifacts where inputs are small and the lifecycle is short, that's enough.

AI artifacts don't behave like build artifacts. Three reasons.

First, the inputs aren't small. A training corpus is itself a derived artifact (scraped, filtered, deduplicated, labeled, augmented) with a lineage of its own that crosses pipelines and probably team boundaries. The "dataset" in the manifest is one node in a tree of datasets. The manifest captures only the leaf.

Second, the lifecycle is long. A checkpoint gets fine-tuned, distilled, quantized, served, and re-trained on logs that include its own outputs. Each operation produces a new artifact whose manifest references the prior one, but the chain isn't navigable from any single manifest. To answer "is this production model downstream of the corpus we now know was poisoned?" you traverse backwards through every link, and a manifest doesn't know it's a link.

Third, artifacts share foundations. The same base model spawns dozens of fine-tunes, hundreds of adapters, thousands of prompts. Manifests describe each in isolation; they don't describe the population. "Which prompts in production are affected by the change to this base model" is a question you'll eventually have to answer, and the manifest is silent because the question is across artifacts.

The graph is the thing. The manifest is a node in the graph. Treating the manifest as the unit of provenance is the analog of doing biology by sequencing one organism perfectly and ignoring the family tree.

What the genetic metaphor demands

The metaphor earns its keep by forcing four properties into the design.

Every artifact has parents. Not "inputs": parents. Inputs are what the build consumed; parents are what the artifact descends from. They overlap, but parents include things the build didn't directly consume yet the artifact still inherited from: the prompt template that shaped the SFT data three steps back, the eval set whose failures drove the curriculum, the base model whose tokenizer is now baked in. Every parent edge is typed: trained-on, distilled-from, quantized-from, prompted-by, evaluated-against. The edge type tells you what was inherited.

Lineage is queryable in both directions. Walk up to ancestors ("what shaped this?") or down to descendants ("what does this shape?"). Both queries are first-class. Most provenance systems get the upward query right and ignore the downward one because it's expensive and nobody asks until it's too late. But the downward query is the one you need when you discover a problem upstream and need to know the blast radius.
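A minimal sketch of what "queryable in both directions" can mean in practice, assuming an in-memory graph. The class name, method names, and artifact ids are all hypothetical; only the edge types come from the list above. The one design point it illustrates is indexing both directions at write time, so the downward query is never the expensive one.

```python
from collections import defaultdict, deque


class LineageGraph:
    """In-memory lineage graph; every edge points from a child to a parent."""

    def __init__(self):
        self._parents = defaultdict(list)   # child  -> [(parent, edge_type)]
        self._children = defaultdict(list)  # parent -> [(child, edge_type)]

    def add_edge(self, child, parent, edge_type):
        # Index both directions on write so neither query needs a full scan.
        self._parents[child].append((parent, edge_type))
        self._children[parent].append((child, edge_type))

    @staticmethod
    def _walk(start, index):
        # Breadth-first traversal; the seen set handles shared ancestors.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbor, _edge_type in index.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def ancestors(self, node):
        """Upward query: what shaped this artifact?"""
        return self._walk(node, self._parents)

    def descendants(self, node):
        """Downward query: what does this artifact shape (the blast radius)?"""
        return self._walk(node, self._children)


# Hypothetical artifact ids, for illustration only.
graph = LineageGraph()
graph.add_edge("ft-ckpt4", "corpus-q2", "trained-on")
graph.add_edge("ft-ckpt4-int4", "ft-ckpt4", "quantized-from")
graph.add_edge("student-small", "ft-ckpt4", "distilled-from")

print(sorted(graph.ancestors("ft-ckpt4-int4")))
print(sorted(graph.descendants("corpus-q2")))
```

A real store would persist both indexes; the point here is only that the blast-radius question becomes a single call.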

Mutations are explicit. When an artifact is derived from a parent with some change, the change is recorded as a typed mutation. Genetic mutations come in flavors (point, insertion, deletion, duplication, recombination); model mutations have analogous ones (continued training, parameter pruning, layer freezing, adapter merge, RLHF pass). "Same model with one extra epoch" is one kind of edge; "same model quantized to 4-bit" is another. Both produce a child, but they relate to the parent in different ways, and the edge type tells you which.

Lineage is a first-class artifact, not metadata. The load-bearing one. Most pipelines treat provenance as a sidecar, a JSON file next to the artifact, indexed weakly if at all. The genetic framing inverts that. The lineage graph is the artifact you most care about; individual nodes are how it's instantiated. The graph has its own schema, storage, access patterns, SLOs. You version it, audit it, query it. The artifacts are projections of nodes; the graph is the system.

When those four properties are present, you have a genealogy, not a parts list. The questions that used to take an afternoon and a Slack thread take a query.

What a lineage-aware AI pipeline looks like

Concretely. The pipeline has a provenance store as a primary subsystem, not an observability afterthought. Every meaningful creation event (corpus build, fine-tune launch, eval run, serving deployment, prompt commit) emits a node with typed edges to its parents. Schemas for nodes and edges are enforced at ingestion, the way a type system is enforced at compile time.
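One way to read "enforced at ingestion", sketched under assumptions: the `Edge` class and the `ingest` helper are hypothetical, and the allowed set contains only the edge types named earlier in this piece. A malformed edge is rejected at write time rather than discovered during an incident.

```python
from dataclasses import dataclass

# Edge types named in the text; a real schema would likely carry more.
ALLOWED_EDGE_TYPES = frozenset({
    "trained-on", "distilled-from", "quantized-from",
    "prompted-by", "evaluated-against",
})


@dataclass(frozen=True)
class Edge:
    child: str
    parent: str
    edge_type: str

    def __post_init__(self):
        # Validation runs on construction, so an invalid edge never exists.
        if self.edge_type not in ALLOWED_EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type!r}")
        if not self.child or not self.parent:
            raise ValueError("edge needs both a child and a parent id")
        if self.child == self.parent:
            raise ValueError("an artifact cannot be its own parent")


def ingest(store, raw):
    """Validate a raw event dict and append the resulting edge (hypothetical helper)."""
    edge = Edge(child=raw["child"], parent=raw["parent"], edge_type=raw["edge_type"])
    store.append(edge)
    return edge


store = []
ingest(store, {"child": "ft-ckpt4", "parent": "corpus-q2", "edge_type": "trained-on"})
try:
    ingest(store, {"child": "ft-ckpt4", "parent": "corpus-q2", "edge_type": "came-from"})
except ValueError as err:
    print("rejected:", err)
```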

Pipeline tools (trainer, eval harness, deployment controller, prompt registry) all write into the same graph, not into separate manifest files reconciled later. The graph is the source of truth; artifacts carry a stable identifier pointing at their node. Most teams skip this because making N tools agree on a graph schema is more political than technical. It's worth eating the cost. The alternative is N parallel "lineage" stories that disagree at every reconciliation.

The graph is content-addressed where it can be. Hashable artifacts (datasets, model weights, prompt templates, eval suites) carry the hash in their identity. Two nodes with the same hash are the same node, because the identity rule says so. This is the atomic-molecular discipline applied at the node level. Atoms are immutable, typed, small, stably identified. Lineage edges compose them. The graph is the population.
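A small sketch of the identity rule, assuming SHA-256 over the artifact's raw bytes. The `kind:sha256:digest` id format and the `register` helper are illustrative choices, not a prescribed scheme.

```python
import hashlib


def node_id(kind, payload):
    """Derive a node identity from the artifact type and the hash of its bytes."""
    return f"{kind}:sha256:{hashlib.sha256(payload).hexdigest()}"


nodes = {}


def register(kind, payload, metadata):
    # Identical bytes map to an identical id, so re-registering is a no-op.
    nid = node_id(kind, payload)
    nodes.setdefault(nid, metadata)
    return nid


a = register("prompt-template", b"Answer using only the cited context.", {"author": "team-a"})
b = register("prompt-template", b"Answer using only the cited context.", {"author": "team-b"})
c = register("prompt-template", b"Answer using only the cited context!", {"author": "team-a"})
print(a == b, a == c, len(nodes))  # True False 2
```

Two teams registering the same template land on one node, and a one-character change produces a new one.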

Queries are part of the developer surface. "Everything downstream of this dataset version" is an API call, not a forensic exercise. "Every production prompt depending on a base model older than ninety days" is a dashboard, not an audit project. When queries are easy, the team starts asking them prophylactically instead of in postmortems.

The graph is also where policy hooks anchor. The decisions-as-code pattern that governs deployments governs lineage: "no production model may descend from a corpus that hasn't passed the PII filter at version >=3" is enforced against the graph at promotion time. The check walks lineage upward; if any ancestor fails, promotion is blocked. The graph makes the policy enforceable; the manifest framing makes the same policy a wish.
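The PII rule above, sketched as an upward walk. Everything here is an assumption made for illustration: the node dictionaries, the `pii_filter_version` field name, and the choice to treat a corpus with no recorded version as unfiltered.

```python
def check_promotion(candidate, parents, nodes, min_pii_version=3):
    """Walk lineage upward; return (allowed, offending corpus ids)."""
    stack, seen, violations = [candidate], set(), []
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        node = nodes[node_id]
        # A corpus with no recorded filter version counts as unfiltered.
        if node["kind"] == "corpus" and node.get("pii_filter_version", 0) < min_pii_version:
            violations.append(node_id)
        stack.extend(parents.get(node_id, []))
    return (not violations, sorted(violations))


# Hypothetical lineage: one ancestor corpus only passed filter version 2.
nodes = {
    "prod-model": {"kind": "model"},
    "ft-ckpt4": {"kind": "model"},
    "corpus-q2": {"kind": "corpus", "pii_filter_version": 3},
    "corpus-logs": {"kind": "corpus", "pii_filter_version": 2},
}
parents = {
    "prod-model": ["ft-ckpt4"],
    "ft-ckpt4": ["corpus-q2", "corpus-logs"],
}
print(check_promotion("prod-model", parents, nodes))  # (False, ['corpus-logs'])
```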

Treating provenance as a first-class artifact

The design discipline that follows is the part I want to underline. Same shape as the other metaphor pieces: the framing changes the work.

If provenance is metadata, it gets the budget of metadata: a few hundred bytes next to the artifact, an index nobody owns, a schema that drifts because no one's job depends on it. When something goes wrong upstream, you spend a week reconstructing what should have been a query. The cost is invisible until you need it, then catastrophic.

If provenance is a first-class artifact, it gets the budget of one. Schema review. SLOs. Versioning. Backups. An owning team. Tests that fail when the graph isn't ingested correctly. The cost is visible up front and cheaper than the alternative because the alternative compounds. That discipline is what separates a pipeline that answers the audit question in a meeting from one that schedules a sprint to find out.

The cultural piece is harder than the technical. Engineers like to ship the model and treat lineage as exhaust. Reframing it so lineage is the product and the model is one node, so the standard answer to "what changed?" is a graph diff, not a release note, takes deliberate work. The metaphor helps because it makes the framing self-justifying. Nobody seriously argues that the genome of one cell tells you what's wrong with the organism.

Where the metaphor has limits

Genetics gives you single-parent inheritance for asexual reproduction and dual-parent for sexual; AI artifacts can have N parents, and the metaphor needs a stretch. Adapter merges, ensembles, and RAG retrievals at inference all involve many parents whose contributions are weighted, sometimes opaquely. The fix is to keep the inheritance structure but admit weighted, multi-parent edges. Hybridization works as a mental model; it's just more common in pipelines than in nature.
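A sketch of what a weighted, multi-parent edge might look like as data. The `ParentEdge` class, the `merged-from` edge type, and the adapter names are hypothetical; `weight=None` is one possible way to record a contribution that is known to exist but is opaque.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParentEdge:
    parent: str
    edge_type: str
    weight: Optional[float] = None  # None: the contribution exists but is opaque


def known_weight(edges):
    """Sum of the contributions that were actually recorded."""
    return sum(e.weight for e in edges if e.weight is not None)


def opaque_parents(edges):
    """Parents whose share of the child nobody wrote down."""
    return [e.parent for e in edges if e.weight is None]


# Hypothetical adapter merge with three parents, one of them unweighted.
merged_adapter = [
    ParentEdge("adapter-support", "merged-from", 0.5),
    ParentEdge("adapter-billing", "merged-from", 0.25),
    ParentEdge("adapter-legacy", "merged-from"),
]
print(known_weight(merged_adapter), opaque_parents(merged_adapter))
```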

The other limit: biology has natural selection telling you which lineages matter. Pipelines don't. You have to choose deliberately what's worth admitting: emit a node for every prompt evaluation and the graph melts; emit only for promoted artifacts and you lose debug resolution. The granularity choice is unavoidable and the metaphor doesn't make it easier. But once set, the rest carries through cleanly.

The discipline, not the helix

The metaphor isn't load-bearing on its own. The discipline is: treat lineage as a first-class object, not an annotation. Name parents. Type edges. Make mutations explicit. Make the graph queryable in both directions. Anchor policies, audits, and debugging sessions in the graph rather than per-artifact files.

Call it lineage, provenance, ancestry, or genealogy: whichever word doesn't already mean something else in your codebase. The point is that you have a graph, the graph is owned, and the graph is the answer to the questions that matter. Manifests stay useful as the on-disk projection of a node. They stop carrying weight they were never built to carry.

I keep coming back to the genetic framing because the audit question (where did this come from) is structurally a lineage question, and biology has had the right vocabulary for a hundred and fifty years. Borrow it. Skip the nucleotides. Treat descent as a thing you design for, not reconstruct after the fact.

- Sid

Traceability as a debugging tool, not a compliance one

Sid Smith — Tue, 02 Jun 2026 13:00:00 GMT

Here's a claim that sounds backwards and is, after a few years of holding the on-call pager and a few more sitting across from auditors, the thing I'm most confident about in this series.

Traceability is not a compliance feature. It's a debugging feature. The compliance use case is a side effect of the debugging use case. Build for debugging, you get compliance for free. Build for compliance, you almost never get debugging, and you frequently get neither, because the compliance shape of the trail isn't actually the shape an auditor wants once the question gets sharp.

I have watched this play out enough times to be tired of it. Every platform I've joined had a traceability story that began life as a compliance line item. Somebody scoped it against a control framework, picked the events they thought the auditor cared about, structured the emit to match an export format, and shipped. Six months later, payments started failing in a way nobody could explain, and the trail that was supposed to satisfy the auditor turned out to be useless for explaining what happened to the payment. The on-call engineer ended up in grep and Slack, like always.

The two use cases pull the design in different directions, and most teams don't notice the pull until they're already on the wrong side of it.

The 3am question

The use case I want to design for is the one I've actually had to answer at 3am. Not hypothetically. With the pager going off and a customer escalating in a parallel thread.

A payment failed. Or, worse, it didn't fail: it succeeded but routed to the wrong account, or succeeded but the customer never got the receipt, or succeeded twice for reasons the system swears are impossible. I have a transaction ID. I have a vague timestamp. I have a customer who is angry. I have about twenty minutes before this becomes a postmortem.

What I need from the trail in that moment is not what the compliance framework asks for. The framework asks: was this action authorized, by whom, under what rule. Real questions, and they matter. But they are not the questions that get me out of the incident.

The questions that get me out are: what did the system see, in what order, what did it do with each piece, what was the state of every dependency it touched, what came back, what side effect propagated where, what almost happened but didn't because some retry succeeded on the third try. I need to walk backwards from the symptom to the cause without leaving the trail. Every node along the way has to carry enough context that I can reconstruct local state without ssh-ing into a box that was decommissioned an hour ago.

That is a debugging trace. Rich, contextual, carrying inputs, outputs, intermediate state, decision points, retries, fallbacks, and the actual data the system was reasoning about, not just IDs pointing at data that may or may not still exist. Generous, because the cost of an extra field at emit time is nothing and the cost of a missing field at debug time is the entire incident.

The compliance shape

The compliance trace looks different. It is sparse. It is structured. It is optimized for export to a system the auditor's team uses. It carries the events the framework named (auth.granted, policy.evaluated, record.created), with the fields the framework named, in the schema the framework prescribed.

It is, in its purest form, a list of decisions, each annotated with the authority that permitted them. It's what you'd design if your only customer were an auditor with thirty rows to look at and a checklist for each.

The compliance shape isn't wrong. It answers real questions, and the five questions every audit trail must answer are the ones a good compliance trace was built to handle. But the compliance shape, if you build only for it, leaves out almost everything the on-call engineer needs. No record of what state the dependency was in. No record of what the agent saw before it picked the tool. No record of the request that almost succeeded on the second retry. The fields that don't matter for export are the fields that matter for debug.

And here's the part that took me too long to internalize: the compliance shape, even on its own terms, often fails. The auditor's first question is the one the framework anticipated. Their second question (the one they ask because something in the first answer didn't quite sit right) is almost always one the compliance trace cannot answer, because answering it requires the context the design left out for the sake of a clean export.

Why debug-first dominates

Now the claim. A trace designed for the debugging case is strictly more powerful than a trace designed for the compliance case. Strictly. Not "usually." Not "on average." Strictly. The debug trace contains everything the compliance trace contains, plus the contextual richness the compliance trace omits.

The work of producing a compliance export from a debug trace is a projection: you select the fields the framework names, you filter to the events the framework cares about, you reshape the schema to match the export format. That work is mechanical. A small amount of code, run on demand, against the same trail the on-call engineer is using. The compliance team gets exactly what they need, and you maintain one trail instead of two.
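The projection really is that small. A sketch, assuming the debug trail is a list of dictionaries; the event names are the ones mentioned earlier, while the field list and the sample rows are invented for illustration.

```python
# Events and fields a hypothetical control framework asks for.
COMPLIANCE_EVENTS = {"auth.granted", "policy.evaluated", "record.created"}
COMPLIANCE_FIELDS = ("event", "timestamp", "actor", "rule_id", "rule_version")


def compliance_view(debug_trace):
    """Project the rich trail down to the rows and columns an export needs."""
    return [
        {field: event.get(field) for field in COMPLIANCE_FIELDS}
        for event in debug_trace
        if event["event"] in COMPLIANCE_EVENTS
    ]


debug_trace = [
    {"event": "policy.evaluated", "timestamp": "2026-06-02T03:04:05Z",
     "actor": "rule-engine", "rule_id": "payout-limit", "rule_version": 7,
     "inputs": {"tier": "A", "region": "EU", "amount": 4200}, "retries": 2},
    {"event": "dependency.timeout", "timestamp": "2026-06-02T03:04:06Z",
     "actor": "ledger-client", "duration_ms": 3000},
]
export = compliance_view(debug_trace)
print(export)
```

The filter drops the timeout row and the select drops `inputs` and `retries`. Nothing in the export could be used to put them back, which is the asymmetry the argument rests on.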

The reverse projection does not exist. You cannot reconstruct the debug trace from the compliance trace. The information was never captured. The trail that was sparse-by-design is sparse forever.

Which leads to the cheapest, simplest, most operationally honest path: design for the debug case, derive the compliance case from it. One source of truth. One emit pipeline. One schema, generous, with the compliance projection as a query, not as a separate trail.

The expensive, fragile, two-team path is: design two trails. Pay the cost of consistency between them. Discover, eighteen months in, that they have drifted, that one is missing events the other has, that the compliance audit pulls a row and the debug trail can't reproduce it. Then ship a project to reconcile them, on a deadline, while the auditor waits.

I have watched that project happen. It is not a project anyone wants to be on.

What the debug-first trace actually looks like

Concretely, the trace has a small set of properties.

Every event carries the inputs the system saw at decision time. Not pointers to inputs. The actual values. If the rule engine evaluated against tier=A, region=EU, amount=4200, customer_age_days=87, those four fields are in the row. The upstream service might be down, retention might be shorter there, the field might be named differently. The decision row carries the inputs locally.

Every event carries the outputs and the side effects. What the system returned. What it wrote. What downstream call it kicked off. The IDs of the writes, with enough context that you can find them again without joining across four services.

Every event carries a coordination identifier that lets you walk the chain: the orchestrator's view of the run, with each participant labeled, as I covered in the five questions. Every step carries the identifier and an index. You can walk forward or backward without guessing.

Every event carries the rule that allowed it, with the version inline. The forward trace tells you what the system did; the rule pointer tells you what it was supposed to do. Both belong in the same row, because at debug time you need to know not just "what happened" but "was what happened actually correct."

Every event carries timing rich enough to be diagnostic. Start time, end time, duration, which dependencies it waited on, which retries it ran. The compliance shape doesn't need any of this. The debug shape lives or dies on it.
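The five properties above, sketched as a single row type. The field names, the `DebugEvent` class, and the `walk_back` helper are assumptions rather than a standard schema; the input values are the ones from the rule-engine example.

```python
from dataclasses import dataclass, field


@dataclass
class DebugEvent:
    run_id: str          # coordination identifier shared by every step in a run
    step_index: int      # position in the chain
    event: str
    inputs: dict         # actual values seen at decision time, not pointers
    outputs: dict
    side_effects: list   # ids of writes and downstream calls
    rule_id: str
    rule_version: int    # version inline, so the row stands alone
    started_at: float    # epoch seconds
    ended_at: float
    waited_on: list = field(default_factory=list)
    retries: int = 0

    @property
    def duration(self):
        return self.ended_at - self.started_at


def walk_back(events, run_id, from_step):
    """Return the steps of one run at or before from_step, newest first."""
    chain = [e for e in events if e.run_id == run_id and e.step_index <= from_step]
    return sorted(chain, key=lambda e: e.step_index, reverse=True)


events = [
    DebugEvent("run-42", 0, "policy.evaluated",
               inputs={"tier": "A", "region": "EU", "amount": 4200, "customer_age_days": 87},
               outputs={"decision": "allow"}, side_effects=[],
               rule_id="payout-limit", rule_version=7,
               started_at=10.0, ended_at=10.5),
    DebugEvent("run-42", 1, "ledger.write",
               inputs={"amount": 4200}, outputs={"status": "ok"},
               side_effects=["ledger-row-981"],
               rule_id="payout-limit", rule_version=7,
               started_at=10.5, ended_at=13.0,
               waited_on=["ledger-db"], retries=2),
]
for e in walk_back(events, "run-42", from_step=1):
    print(e.step_index, e.event, e.duration, e.retries)
```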

The cost of this richness is paid once, at design time, in the standards library that owns the schema. The cost of not having it is paid every time the on-call engineer reconstructs an incident from grep and intuition. I have priced both. The richness is cheaper.

When audit-only blows up

The companion failure to "build for compliance, never get debug" is the team that did build a compliance-focused trail, and it works, and they pass the audit. Then a real incident hits (a payment misroute, a model invocation that did the wrong thing, an agent that ran a tool it shouldn't have), and the trail is missing exactly the contextual fields that would let them figure out what happened.

The team writes the postmortem with "we believe" in it. They commit to enriching the trail. They ship the enrichment, and now they have two trails (the original compliance one and a new debugging one bolted on) and both decay independently. The decay modes I covered in why traceability dies in most platforms apply twice. By the second incident the new trail has drifted from the schema. By the third, nobody is sure which one is standard.

The fix is not to ship two trails. The fix is to start with the debug-shaped trail and derive the audit view from it. One trail. One schema. One owner.

The reframe, said plainly

Audit trails are a real obligation. They are also a derivative artifact. The thing you should be building is the trail that the on-call engineer needs at 3am. The auditor's view is a query against that trail, not a separate system.

If your team is staffing a compliance project to build an audit trail and the debugging story is "we'll figure that out when an incident happens," you have the priorities inverted. Reverse them. Build the debug trail. Make the auditor a downstream consumer of the same data, with their own projection. You will spend less, get a better debug experience, and (paradoxically) pass the audit more cleanly than the team that built for the audit, because the second-order question will have an answer waiting in the rich trace instead of in a Slack thread that begins "we believe."

It's the pattern across every platform I've seen survive both shapes of pressure. The teams that built for debug are the ones whose on-call engineers come out of incidents with answers and whose audits feel like data extraction. The teams that built for compliance are still in grep at 3am and still in long meetings with the auditor at 10am, rationalizing why both situations are temporary.

They aren't. They're the design, doing what it was designed to do.

- Sid

Pricing the service: subscription, per-resolution, outcome-based

Sid Smith — Sat, 30 May 2026 13:00:00 GMT

This is the closer of the 4-piece year-one series. After this the blog goes back to the regular weekly mix: news roundups on Sundays, deeper-dive pieces midweek, the architecture stuff when something is worth writing about. Thank you for reading 22 in a row.

For the closer, the topic that everyone in the MVP series quietly wanted me to get to: how do you actually charge for any of this.

Pricing decision tree

Quick recap of where we are. A consultant signs up, picks a starter pack, uploads their material, gets a working surface in five minutes (two weeks back), runs their AI from a supervisor view while customers ask questions through a customer view (last week's piece), and the inference cost is being managed by routing across Haiku / Sonnet / Opus / Llama on Bedrock per loop (the piece that started this series). Now: somebody has to pay for it, and the pricing model is part of the product, not separate from it.

Three pricing shapes are realistic for this kind of product: subscription, per-resolution, outcome-based. Each one fits some verticals beautifully and some terribly. The cost-of-goods math behind each one is actually knowable, which is the unglamorous gift of this architecture.

Let me walk through how I think about it.

The three shapes

Subscription. Flat monthly fee, fair-use cap. The consultant pays you, you give them a tenant, customers use the surface, the meter never runs in front of the consultant or the customer. Predictable revenue for you, predictable cost for them. The downside: if the consultant's customer base 10x's, you eat the cost overage. If it never grows, you charge the same as you would for a tenant doing 100x the volume.

Per-resolution. You charge per query that gets answered (or per ticket that gets closed, or per contract reviewed, or per resume marked up; the unit varies by vertical). The meter runs in proportion to the work the AI does. Aligns cost-to-value almost perfectly. Downside: customers and consultants both hate watching meters, and "what counts as a resolution" becomes a definitional argument that eats into trust.

Outcome-based. You charge a fee tied to a measurable outcome the AI produced. A successfully placed candidate (HR consultant). A signed deal a sales-discovery brief contributed to (sales). A contract issue caught that would otherwise have leaked through (legal). Highest possible alignment, highest possible price tag, highest possible measurement and dispute risk.

None of these is right or wrong. They're trade-offs across four dimensions: revenue predictability, cost alignment, sales friction, and dispute risk. The right shape depends on the consultant's vertical and the consultant's own customer relationships.

The decision tree I actually use

Three questions I ask, in order, before suggesting a pricing model to a vertical.

Is the unit of work clearly definable from outside the system?

Per-resolution and outcome-based both depend on having a unit that the customer and consultant agree counts. Some verticals have this naturally. Contract review: a contract is a contract. Resume coaching: a resume is a resume. Interview rubric: a candidate writeup is a candidate writeup. You can charge per unit and nobody argues.

Other verticals don't have a clear external unit. Sales discovery: what's a unit? A call prep brief? A research session? A whole pursuit? The consultant and customer might define it three different ways and the product can't enforce any of them without irritating somebody. In those, subscription is the safe default because the meter problem doesn't exist.

How variable is per-tenant volume?

If your tenant base is going to span "consultant doing 30 customer queries a month" to "consultant doing 30,000," subscription pricing breaks for somebody, usually for you, on the high end. Per-resolution scales with use, which is what you want when the spread is wide.

For verticals with naturally narrow spread (say, medical second-opinion review, where each specialist's volume is bounded by their own throughput), subscription works fine.

How directly attributable is the AI's work to a measurable outcome?

Outcome-based pricing only works when you can prove the AI moved the needle. A legal-pro product that catches a clause that would have cost the customer $50k is straightforwardly outcome-attributable. A career-coach product that helped someone get a job is somewhat attributable but lots of other things contributed. A marketing-positioning advisor whose AI-assisted brief contributed to a quarter's better revenue is barely attributable at all without a much bigger measurement apparatus.

If attribution is clean, outcome-based gets you the highest revenue per customer. If it's muddy, don't bother; you'll spend all your effort defending the bill.
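One possible reading of the three questions as code, taking them strictly in order with the earliest decisive answer winning. The function name and the boolean inputs are hypothetical, and the text does not say how to break the tie when a vertical qualifies for both per-resolution and outcome-based pricing, so treat the ordering as an assumption.

```python
def suggest_pricing(unit_definable, volume_spread_wide, attribution_clean):
    """Return a starting pricing shape for a vertical (a sketch, not a rule)."""
    # 1. No externally agreed unit: there is nothing safe to meter.
    if not unit_definable:
        return "subscription"
    # 2. Wide per-tenant volume spread: price has to scale with use.
    if volume_spread_wide:
        return "per-resolution"
    # 3. Narrow spread and clean attribution: outcome-based is defensible.
    if attribution_clean:
        return "outcome-based"
    # Narrow spread and muddy attribution: flat fee works fine.
    return "subscription"


print(suggest_pricing(unit_definable=False, volume_spread_wide=True, attribution_clean=False))
print(suggest_pricing(unit_definable=True, volume_spread_wide=True, attribution_clean=True))
print(suggest_pricing(unit_definable=True, volume_spread_wide=False, attribution_clean=False))
```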

The cost-of-goods math, honestly

Here's where the architecture from the MVP series pays back. Because you have observability and audit on day one (piece #13) and you've thought about cost as a design input from the start (piece #15), you can actually compute COGS per query. Most AI products can't.

Per-query cost has three layers.

Layer 1: Bedrock tokens. Variable per query, varies dramatically by which model the router picked (see the model-selection piece). Triage with Haiku is fractions of a cent. Diagnose with Sonnet is low single-digit cents. The 5-10% of cases that escalate to Opus are 5-10x that. Llama batch work is low. If you log per-query model selection (you should be), you can compute exact Bedrock cost per query and roll it up to per-tenant per-month.
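The roll-up can be sketched in a few lines, assuming each logged query records tenant, month, model, and token count. The prices below are placeholders, not real Bedrock rates, and a single blended per-token price is a simplification, since real pricing distinguishes input from output tokens.

```python
from collections import defaultdict

# Placeholder USD prices per 1,000 tokens. These are NOT real Bedrock rates.
PRICE_PER_1K_TOKENS = {"haiku": 0.001, "sonnet": 0.01, "opus": 0.05, "llama": 0.0005}


def query_cost(row):
    """Cost of one logged query under the placeholder price table."""
    return row["tokens"] / 1000 * PRICE_PER_1K_TOKENS[row["model"]]


def monthly_bedrock_cost(query_log):
    """Roll per-query cost up to (tenant, month)."""
    totals = defaultdict(float)
    for row in query_log:
        totals[(row["tenant"], row["month"])] += query_cost(row)
    return dict(totals)


# Hypothetical log rows.
query_log = [
    {"tenant": "acme", "month": "2026-05", "model": "haiku", "tokens": 2000},
    {"tenant": "acme", "month": "2026-05", "model": "sonnet", "tokens": 3000},
    {"tenant": "globex", "month": "2026-05", "model": "opus", "tokens": 1000},
]
totals = monthly_bedrock_cost(query_log)
for key, usd in sorted(totals.items()):
    print(key, round(usd, 4))
```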

Layer 2: RDS + storage + bandwidth. Per-tenant overhead. The pgvector store grows with the consultant's corpus. The audit table grows with usage. RDS instance cost is shared across tenants. CloudWatch logs are real money at scale. Plus S3 for artifacts. This layer is harder to attribute exactly per query, but you can attribute it per tenant per month with reasonable accuracy.

Layer 3: Mac Studio amortized. The local stack from piece #5 (fine-tuning, batch inference, transcription, image gen) has a fixed capital cost and an electricity bill. Spread that across your tenant base in proportion to the share of work each tenant pushes through the local pipeline. For most products this layer is small per-tenant per-month if your tenant base is healthy. If you have three tenants, the Mac Studio is expensive per query. If you have 300, it's basically free.
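A sketch of that allocation, assuming you track some per-tenant measure of local work (jobs, GPU-minutes, whatever you log). The 300 USD monthly figure is a made-up stand-in for amortized hardware plus electricity.

```python
def amortized_local_cost(monthly_fixed_cost, local_work_by_tenant):
    """Split a fixed monthly cost across tenants in proportion to local work."""
    total = sum(local_work_by_tenant.values())
    if total == 0:
        # Nobody used the local stack this month; nothing to attribute.
        return {tenant: 0.0 for tenant in local_work_by_tenant}
    return {
        tenant: monthly_fixed_cost * work / total
        for tenant, work in local_work_by_tenant.items()
    }


MONTHLY_FIXED = 300  # hypothetical: amortized capital plus electricity, in USD

three = amortized_local_cost(MONTHLY_FIXED, {"t1": 1, "t2": 1, "t3": 1})
many = amortized_local_cost(MONTHLY_FIXED, {f"t{i}": 1 for i in range(300)})
print(three["t1"], many["t0"])  # 100.0 1.0
```

Same box, same bill: the per-tenant share falls from 100 to 1 as the base grows from three tenants to 300.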

Add the three layers, and you have a per-tenant-per-month COGS number you can put up against any of the three pricing models and check whether your margin is real.

The number that matters: what's the gross margin at typical tenant volume? If subscription pricing puts you at 40% margin on a typical tenant and 5% margin on a heavy-use tenant, your subscription tier needs a usage cap or a heavy-use overage rate. If per-resolution pricing puts you at a consistent 60% margin across tenant sizes, that's the right shape for that vertical.
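The margin check itself is simple arithmetic once the three layers are known. A sketch with invented numbers, chosen only so that they reproduce the 40% and 5% cases above:

```python
def gross_margin(monthly_price, bedrock, infra, local):
    """Gross margin as a fraction of price, given the three COGS layers."""
    cogs = bedrock + infra + local
    return (monthly_price - cogs) / monthly_price


PRICE = 100  # hypothetical flat subscription price, USD per month

# Hypothetical per-tenant monthly costs for the three layers.
typical = gross_margin(PRICE, bedrock=40, infra=15, local=5)
heavy = gross_margin(PRICE, bedrock=75, infra=15, local=5)
print(f"typical {typical:.0%}, heavy {heavy:.0%}")  # typical 40%, heavy 5%
```

In this example the only layer that moved is Bedrock tokens, which is why a usage cap or an overage rate is the lever that restores the margin.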

How the three shapes play across verticals

Three quick walkthroughs, then a fourth on the cross-vertical pattern.

An IT-ops consultant doing infrastructure triage and resolution. Volume per tenant is highly variable: small managed-services shops doing 50 tickets a week, large ones doing 5,000. Unit of work (a resolved ticket) is naturally well-defined. Outcome attribution is direct (the ticket either got resolved or it didn't). Per-resolution wins. Meter the resolved-tickets count, charge per ticket, and set a tiny baseline subscription so you have predictable floor revenue plus the per-unit upside.

A career coach doing resume + positioning review. Volume is narrower (a coach has only so many candidates per month), unit is clear (a resume), outcome is muddy (job offers come from many sources). Subscription per coach with a fair-use cap is the cleanest shape. Maybe a small per-extra-resume overage for coaches who go over the cap. Outcome-based is a tar pit here; too many factors contribute to landing a role.

A legal pro auto-reviewing contracts against their playbook. Volume varies but is mostly bounded by the lawyer's own bandwidth. Unit (a clause flagged, a contract reviewed) is well-defined. Outcome attribution is occasionally crisp ("this clause would have cost the client $X if it shipped, we caught it") but mostly fuzzy. Hybrid: subscription floor plus a per-contract-reviewed line item. The lawyer knows monthly cost will be in a band. The product gets paid more when used more.

The pattern across verticals. Most consultant-AI products end up at subscription with a usage component on top. Pure subscription leaves money on the table with high-volume tenants and, if they grow far enough, loses margin on them. Pure per-resolution puts a meter in front of the customer that nobody enjoys watching. Hybrid is boring and right.

The cost-of-goods math here is only possible because the architecture from the MVP series logs everything per query in a structured way. If you skipped the audit-on-day-one investment from piece #13, you cannot price a hybrid model honestly. You'll be guessing at margin.

The free-tier question

Yes. Have one. Here's why and how.

A free tier in this product isn't "free chat with no value." It's "let the consultant use the onboarding flow and get to the five-minute moment, then let them try a small number of real customer queries before committing." Five minutes from the onboarding piece is the trial.

The free tier exists to let the consultant prove the product works on their material before they pay you. That's the highest-leverage demo you can run, and it scales: every signup runs it themselves, so you don't have to give a sales call.

The cost of the free tier is real. A free tenant takes RDS rows, embeds documents (storage), runs Bedrock calls (per-token cost), generates audit rows. So you cap it. The numbers I've seen work:

50 customer-side queries total in the free tier (lifetime, not monthly).
Limited corpus size (say 50 MB of uploads or 200 documents).
Full feature access, no neutered functionality; if you make the trial weak, the conversion will be weak.
Auto-suspend after the cap until they convert; don't auto-bill, don't surprise-charge.
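The caps above translate into a small gate. A sketch, assuming the tenant record tracks lifetime queries, corpus size, and document count; the function name and return shape are hypothetical, and enforcing both corpus limits at once is my choice (the list offers them as alternatives).

```python
FREE_QUERY_CAP = 50      # lifetime customer-side queries, not monthly
FREE_CORPUS_MB = 50      # upload size limit
FREE_DOCUMENT_CAP = 200  # upload count limit


def free_tier_status(queries_used, corpus_mb, documents):
    """Decide what a free tenant may still do. No billing side effects."""
    return {
        # Past the query cap the tenant is suspended until they convert.
        "can_query": queries_used < FREE_QUERY_CAP,
        # Uploads stop at whichever corpus limit is reached first.
        "can_upload": corpus_mb < FREE_CORPUS_MB and documents < FREE_DOCUMENT_CAP,
    }


print(free_tier_status(queries_used=49, corpus_mb=10, documents=20))
print(free_tier_status(queries_used=50, corpus_mb=10, documents=20))
```

The gate only ever returns a status. It has no path that charges anyone, which keeps the "don't surprise-charge" rule structural rather than a matter of discipline.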

Cost per free tenant works out to a manageable number of dollars per signup. Conversion rate from free to paid in this kind of product, when the onboarding actually delivers the five-minute moment, lands somewhere in the 8-15% range based on what I've seen elsewhere. The economics work if your paid plan margin can carry roughly 7-12 free tenants per paid one. For most of the verticals here, it can.
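The 7-12 figure is roughly the reciprocal of the conversion rate: at 15% about 6.7 signups produce one paid tenant, at 8% it takes 12.5. A sketch of that check, with a made-up free-tenant cost and a made-up margin over whatever payback window you choose:

```python
def signups_per_paid(conversion_rate):
    """Free signups needed, on average, to produce one paying tenant."""
    return 1 / conversion_rate


def free_tier_affordable(paid_margin_usd, cost_per_free_tenant_usd, conversion_rate):
    """True if one paid tenant's margin covers the free signups behind it."""
    carried_cost = cost_per_free_tenant_usd * signups_per_paid(conversion_rate)
    return carried_cost <= paid_margin_usd


for rate in (0.08, 0.15):
    print(rate, round(signups_per_paid(rate), 1))

# Hypothetical: 5 USD per free tenant, 100 USD margin per paid tenant.
print(free_tier_affordable(paid_margin_usd=100, cost_per_free_tenant_usd=5, conversion_rate=0.08))
```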

Don't fall into "free tier with rate limits per day." That just teaches the consultant your product feels stingy. Generous but bounded in total beats stingy per day but open-ended in time.

Pricing changes are product changes

One thing I want to leave you with as the operate series wraps.

A pricing change in this kind of product isn't a marketing change. It's a product change, because the unit you're charging on (per resolution, per contract, per resume) has to be measured by the system, displayed in the consultant view, defended in the audit trail, and capped or metered in the customer view. Switching from subscription to per-resolution means engineering work in every one of those places.

Which means: pick one model to launch with, run it for at least a quarter, watch where it breaks, and only then think about adding the second. The biggest pricing mistake I see startups make in this space isn't picking the wrong model. It's flipping the model after three months because revenue isn't where they hoped, and creating a billing dumpster fire that takes another quarter to clean up.

Pick deliberately. Wire the meter end-to-end. Watch what the data tells you. And be prepared to defend whatever you charge against the COGS math, because the customer who challenges you on it is doing you a favor: they're telling you they care.

What's next

This wraps the four-piece operate series. The full 22-article arc, 18 MVP pieces (cap article here) plus these 4, was built around one idea: the architecture that turns any consultant's secret sauce into a working AI-powered product is a known shape. What changes is what's yours.

If you've followed the whole run, the next ask I'd put to you is the one I keep putting to myself: what's the smallest version you'd actually ship? Not the version you'd build if you had a year. The version you'd put in front of one consultant tomorrow. That's the cut line.

Back to the regular cadence from here. News roundups on Sundays, deep-dives midweek, whatever bites me hard enough to write about as it happens. Thanks for reading, and if you're shipping anything in this space, drop me a line. I want to hear what cracks.

The customer view vs the consultant view: two surfaces, one product

Sid Smith — Fri, 29 May 2026 13:00:00 GMT

Last week I wrote about getting a new consultant from signup to a working surface in five minutes. That assumes one thing the new consultant doesn't always realize on day one: they're not just a user. They're a supervisor of their own AI.

Which means this product has two faces. Two UIs sitting on the same backend, showing different things, optimized for different work, measured by different numbers.

Two views, one backend

I want to talk about that split because I see people get it wrong in the same way every time. They build one surface (the customer-facing chat) and then bolt on a "settings page" or "admin panel" later for the consultant. The consultant view ends up being a forms-and-tables afterthought that nobody enjoys using. So the consultant doesn't use it. So the system doesn't learn. So the product stays stuck in human-approve-everything mode forever.

The fix is to treat the consultant view as a real product surface from day one. Not "admin." A product. The other half of what you're selling.

What each surface is actually for

Let me ground it before going technical.

The customer view is the front of the house. A small business owner needs help with a hiring decision and they're paying an HR consultant whose surface is built on this product. They open the app, type "I have two finalists for a senior PM role. Here are their backgrounds. Here's the role. What would you push on in the next round?" and they want an answer. Maybe right now, maybe in 20 minutes. They don't care which.

The consultant view is the back of the house. The HR consultant whose name is on the product opens it Monday morning and sees: 23 queries from customers since Friday. 18 have been auto-resolved (the AI handled it confidently, the answer went out, all logged). 4 are in their approval queue (the AI drafted an answer, low-medium confidence, the consultant has to sign off). 1 is escalated (the AI tagged it as outside-rubric, hand-it-to-the-human). The consultant works that queue, approves what's good, edits what's almost-good, denies what's wrong, and reads the auto-resolved trail for quality control.

Two surfaces. Same data underneath. Different jobs.

What the customer view actually shows

Plain shape: ask a question, get an answer. Or get a "we're working on it" status if the answer is queued for review.

That's it. Everything else on the customer side is decoration.

The trap I see people fall into: trying to make the customer view "smart." Showing confidence scores. Surfacing which retrieved documents got used. Letting the customer pick a model. None of this. The customer pays for the consultant's expertise delivered through software. They don't want to look at the engine room.

What the customer view does need:

A question box that handles the obvious things: markdown, file attach, and (optionally) voice input.
A clear "answer pending review" state for queued items, with an honest estimate of when they'll see something. Not "soon." A real time band.
A history of their own past queries so they can scroll back and reference prior answers.
A graceful state when something falls outside what the AI can handle, with the human-only fallback the consultant chose (see the failure-modes piece, #14, for what that looks like).
And, optionally, the ability to mark an answer as "this didn't help" so the consultant sees the miss.

That's the surface. Clean. Calm. A typing box and a panel of answers. The work is invisible.

The behind-the-scenes path is the three-loop pattern from piece #9: triage routes the query, diagnose runs retrieval-augmented generation (RAG, which pulls the consultant's relevant material into the prompt, if you want to read up later), and resolve either ships the answer straight to the customer or hands it to the approval queue. The customer view doesn't show any of that. It shows "thinking..." and then "here's your answer" or "this is queued, expect ~15 minutes."
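
A minimal sketch of that last hop, assuming the resolve loop hands back a small result object. Resolution and customer_status are hypothetical names, and the wording of the messages is illustrative. The point is how little of the internal routing survives into what the customer sees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    """Hypothetical result of the resolve loop."""
    route: str                          # "auto", "queue", or "escalate"
    answer: Optional[str] = None
    eta_minutes: Optional[int] = None


def customer_status(res: Resolution) -> str:
    """Collapse internal routing into the few states the customer sees."""
    if res.route == "auto" and res.answer is not None:
        return res.answer
    if res.route == "queue":
        if res.eta_minutes is None:
            return "This is queued for review."
        return f"This is queued, expect ~{res.eta_minutes} minutes."
    # Escalations (and anything unexpected) use the human-only fallback.
    return "This one needs a person. Your consultant will follow up."
```

Confidence scores, retrieved sources, and model names never appear in the return value, which is the whole design choice.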

What the consultant view actually shows

This is the surface I underestimate every time I sketch a new product and then regret.

The consultant view is a working tool. It has to feel good to use because the consultant will be in it five days a week. Six panels, roughly:

The queue. A list of pending items. Customer query at the top, the AI's drafted answer in the middle, the retrieved sources (with hover or click to see the actual cited content), the confidence signal, and three buttons: approve, edit-and-approve, deny. Edit-and-approve is by far the most-used. Deny is a learning signal: denied items get pattern-mined into future eval cases.

Resolved history. Everything the AI auto-resolved (didn't need human approval) shown in a scrollable feed. The consultant skims it. They're spot-checking. If they see something off, they click in and reclassify it back into the approval gate retroactively, which both fixes the customer-facing record (with audit) and feeds back into the confidence threshold.

Pattern view. The interesting one. A view that clusters customer queries by topic or intent over time and shows which clusters the AI handles well (high confidence, low edit rate, no complaints) and which it doesn't (low confidence, high edit rate, denials). This is where the consultant decides what to add training material on next. "Oh, every query about offer-stage negotiation is getting edited. I should drop in my offer-stage playbook."

Persona controls. The voice-shaping settings from onboarding (see last week's piece) plus the ability to tune them as they learn. Tone, length, hedge level, the specifics of how their AI should and shouldn't talk to customers. Plus an upload-more-content path that drops new material into the retrieval store.

Approval-gate thresholds. The actual knob from piece #12. On day one, this is set so everything goes to the queue. As the consultant builds confidence in certain query classes (and the data backs it up) they can let those classes auto-resolve. The view shows the current threshold per class and the suggested-by-data threshold, side by side.

Audit trail. Every decision, who made it (AI or human), what evidence it used, when. Searchable. (See piece #13 for the audit-on-day-one argument; this is the surface where that audit becomes useful instead of just compliant.)
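
The pattern view in particular is less exotic than it sounds once the queue decisions are logged. Here's a rough sketch of the per-cluster numbers, assuming each decision has already been assigned a cluster label upstream (the clustering itself is out of scope here). The function names and the min_total cutoff are hypothetical.

```python
from collections import defaultdict


def cluster_stats(decisions):
    """decisions: iterable of (cluster, action) pairs, where action is
    'approve', 'edit', or 'deny'. Returns per-cluster edit and deny rates."""
    counts = defaultdict(lambda: {"approve": 0, "edit": 0, "deny": 0})
    for cluster, action in decisions:
        counts[cluster][action] += 1
    stats = {}
    for cluster, c in counts.items():
        total = sum(c.values())
        stats[cluster] = {
            "total": total,
            "edit_rate": c["edit"] / total,
            "deny_rate": c["deny"] / total,
        }
    return stats


def weakest_clusters(stats, min_total=5):
    """Clusters with enough volume, sorted by how often the consultant stepped in."""
    eligible = [(name, s) for name, s in stats.items() if s["total"] >= min_total]
    return sorted(eligible,
                  key=lambda pair: pair[1]["edit_rate"] + pair[1]["deny_rate"],
                  reverse=True)
```

The top of that sorted list is the "drop in your playbook here" prompt for the consultant.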

Want to go deeper on the gate mechanics? The threshold logic and how it moves over time is in The approve-deny gate and when it goes away. The view I'm describing here is the surface that makes that mechanism tractable for a human.

Three consultants, three surfaces, same backend

A product PM offering decision-coaching as a service. Customer side: a junior PM at a Series A startup types in "Should we ship the feature now or after we redo the onboarding?" and gets back a structured analysis using the PM's framework, citing two of the PM's past write-ups. Approval queue side: the PM whose name is on the product reviews 6-8 of these a day during launch, approves most, edits a couple, denies one. Pattern view tells them this week's recurring miss is around technical-debt trade-offs, so they upload a new write-up on that. The customer never sees any of this.

A medical specialist doing second-opinion review. Customer side: a patient (or a primary-care physician they're consulting on behalf of a patient) submits a case description and supporting documents and gets back a structured second-opinion analysis. Approval queue side: the specialist sees every case in the queue. Always. There is no auto-resolve in this vertical: the threshold is locked at 100% human review forever, by design, because the stakes don't allow otherwise. The consultant view here is doing a different job: not "decide what to auto-resolve" but "review and sign each one efficiently." Same surface; the threshold knob just doesn't move.

A legal pro auto-reviewing contracts against their playbook. Customer side: a small-business owner uploads an NDA they were sent and asks "are there terms in here I should push back on?" and gets a structured response: clauses flagged, suggested edits, escalations marked. Consultant side: the legal pro sees every flagged clause in a queue, with the playbook entry that triggered it shown next to the model's draft. Approve, edit, deny. After a year, ~70% of common-clause flags auto-resolve and the consultant only sees the unusual ones. The pattern view shows them which clause types are still requiring frequent edits.

Three verticals, three thresholds, three different rhythms, and the consultant view supports all of them because the components (queue, history, patterns, persona, threshold, audit) compose differently per vertical and per consultant.
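
To make "three thresholds" concrete, here's a minimal sketch of a per-class gate setting with a lock. GatePolicy, the 0-to-1 confidence scale, and the 0.85 value are assumptions for illustration, not the actual gate logic from piece #12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Hypothetical gate setting for one query class."""
    auto_resolve_above: float   # confidence needed to skip the queue
    locked: bool = False        # True means human review, always


# The medical vertical: the knob exists but never moves.
ALWAYS_REVIEW = GatePolicy(auto_resolve_above=1.0, locked=True)


def needs_review(confidence: float, policy: GatePolicy) -> bool:
    """True when the draft must go to the consultant's queue."""
    if policy.locked:
        return True
    return confidence < policy.auto_resolve_above


POLICIES = {
    "medical-second-opinion": ALWAYS_REVIEW,
    "contract-common-clause": GatePolicy(auto_resolve_above=0.85),  # illustrative value
}
```

Keeping the lock as its own flag, rather than encoding it as a very high threshold, means nobody can loosen it by nudging a number.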

Different metrics matter for each surface

This is where I see teams confuse themselves.

For the customer view, the metrics that matter are:

Time to first useful answer (auto-resolved median + queued median).
Repeat-use rate. Does the customer come back?
"Didn't help" rate on answers (the explicit signal).
Implicit signal: ratio of follow-up questions to original questions (high follow-up means the first answer didn't fully land).

These are customer outcome metrics. They tell you whether the surface is delivering value to the buyer.

For the consultant view, the metrics that matter are:

Time spent in queue per day (lower = AI getting better; this should bend down over time).
Edit rate on approved items (lower over time = AI matching the consultant's voice better).
Pattern-view → upload conversion (did the consultant act on the gap the patterns surfaced?).
Threshold migration (how many query classes have moved from "always review" to "auto-resolve" over time).

These are leverage metrics. They tell you whether the surface is letting the consultant do more work without scaling their hours linearly.
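
Edit rate over time is the easiest of these to compute from the approval log. A rough sketch, assuming each approved item is recorded with its date and whether the consultant edited it first. The function name and the input shape are mine.

```python
from collections import defaultdict
from datetime import date


def weekly_edit_rate(approvals):
    """approvals: iterable of (day, was_edited) for items the consultant approved.
    Returns {(iso_year, iso_week): edit_rate}, ordered by week."""
    buckets = defaultdict(lambda: [0, 0])   # [edited, total]
    for day, was_edited in approvals:
        iso = day.isocalendar()
        key = (iso[0], iso[1])
        buckets[key][0] += int(was_edited)
        buckets[key][1] += 1
    return {week: edited / total
            for week, (edited, total) in sorted(buckets.items())}


rates = weekly_edit_rate([
    (date(2026, 6, 1), True),
    (date(2026, 6, 2), False),
    (date(2026, 6, 8), False),
])
```

If the values in that dict aren't trending down over a couple of months, the AI isn't getting closer to the consultant's voice, whatever the other dashboards say.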

The same dashboard does not serve both. They need to be two dashboards, watched by different people, telling different stories. I have one customer in mind every time I look at the customer dashboard, and one consultant in mind every time I look at the other.

The thing that makes this hard

It's tempting, when shipping, to ship the customer side first and the consultant side as "v0.5, just a queue, we'll add patterns later." I have done this. It backfires every time.

Here's why. The consultant view is what produces the training signal that makes the customer view better. Every approve, edit, deny, retroactive-reclassify is a labeled data point that improves retrieval ranking, prompt tuning, confidence calibration, and eventually feeds the fine-tuning loop running on the Mac Studio (see piece #5). If the consultant view is bad, the consultant doesn't use it well. If they don't use it well, the AI doesn't improve. If the AI doesn't improve, the product is a worse version of stock Claude in a wrapper.

The consultant view is the moat. It's where the consultant's secret sauce gets refined every week. Ship it first-class on day one.

The captured-judgment shape from piece #2 is the thing this surface produces. Onboarding (last week) gets the consultant in; the consultant view is what keeps the secret sauce flowing in week by week.

What to ship first if you're shipping this

If you're at MVP and trying to decide what makes it into v1, my rank order on the consultant view:

The queue (approve / edit-and-approve / deny, the minimum loop).
The resolved history (read-only at first; reclassify-retroactive can wait).
The audit trail (because the audit table from the data layer needs a UI on top, even a crappy one).
The persona + content upload tools (so the consultant can iterate without your help).
The pattern view (the highest-leverage surface but the one you can ship at v1.5 once you have data to cluster).
The threshold knob (only matters once you have enough approved examples to consider auto-resolve, usually 30+ days in).

Customer view is simpler in scope but the bar for polish is much higher. The customer's experience of your product is one or two screens, and those screens have to feel as good as a consumer chat app. Spend disproportionate design time there even though there's less to build.

The next piece in this series, and the closer of this 4-article run before the regular content cadence picks back up, is about how you actually charge for any of this. Subscription, per-resolution, outcome-based. The pricing decision tree, the cost-of-goods math, and the "free tier so consultants can try it" question. Next week.

Onboarding new tenants: the five-minute path from signup to working AI

Sid Smith — Thu, 28 May 2026 13:00:00 GMT

Last week's piece was about picking the right Bedrock model on evidence. This week is about the moment before any of that matters: a new consultant has just signed up and is staring at a blank tenant.

This is the moment every AI product gets wrong. The marketing site promised "your AI assistant, trained on your expertise." The signup flow took 90 seconds. Then the new tenant lands on a dashboard that says "Upload your knowledge base to begin" and they have no idea what that means, no idea what shape the upload should take, no idea whether the thing they have on Google Drive is the right thing, and no idea what'll come out the other end.

Five-minute onboarding

If their first session takes more than about five minutes to produce something they can show another human being, they're gone. Not "churned in week two" gone. Gone today, before they ever come back.

So the onboarding flow is not a thing you bolt on after the product works. It is the product, for the first session. Everything I said in piece #2 about captured judgment being the value is true, but the customer can't see it on day one. What they can see is whether the thing they typed produced a useful-looking output. That's the deliverable for minute five.

What "working AI in five minutes" actually means

Let me pin this down because it's tempting to weasel out of.

Five minutes from "I clicked sign up" to "I can paste a question into my surface and get an answer that sounds like me, on a topic I care about, using examples I gave it." Not a demo with somebody else's data. Not a generic chat that could have come from raw Claude. Their voice. Their topic. Their examples. Working.

This is hard. The architecture from the MVP series helps: Cognito auth, tenant-scoped data, RDS+pgvector for retrieval, and Bedrock for inference are all already wired (see pieces #6 and #7). But the spine doesn't bootstrap the consultant's content. That's the problem.

The trick I've landed on: starter packs plus a guided first pass.

How the five-minute path actually works

Four screens. That's the budget.

Screen one: pick your vertical. Sales discovery, marketing positioning, product decision-coaching, IT-ops triage, contract review against a playbook, second-opinion medical review, portfolio diagnosis, interview rubric, resume coaching. Pick one. This is not "what's your job title." This is "which of the prebuilt starter shapes is closest to what you do." It seeds everything downstream.

Screen two: import what you've already got. Three buttons: upload files (PDF, DOCX, MD), paste text, or connect a source (Google Drive, Notion, Dropbox, whatever I've wired up). The customer drops in anywhere from one document to 50. The system doesn't care which yet; it just needs something to embed.

Screen three: shape your voice. Three sliders or three short prompts: how formal, how long, how much hedging vs. how directive. Plus a free-form "Anything I should know about how you talk to clients?" field. This is the persona-shaping step. Customer-facing it's three settings; behind the scenes it's a more structured object that shapes prompts and retrieval downstream. (I'm being deliberately vague about that structure; there's a patent boundary I'm staying behind.)

Screen four: try it. A sample question (auto-generated from their starter-pack vertical) sitting in a prompt box, with a "Run" button. They click. Five to fifteen seconds, and an answer comes back. It's not perfect, but it's clearly theirs: it cited one of the documents they uploaded, it used the voice settings they picked, and it sounded like the work they actually do.

That's the five-minute moment. Everything from here is iteration.

The starter-pack trick

The thing that makes screen one through screen four cost ~five minutes instead of five days is the starter pack.

For each vertical, I ship a curated bundle. Think of it like a default kit. It's got:

A reference corpus of generic-but-realistic examples (anonymized, public-domain, or synthetic) for the vertical. Eight to twelve documents. Enough to seed the embeddings before the consultant's own material lands.
A baseline persona shape (formal-but-warm, mid-length, low-hedge) that's a sensible default for that vertical.
A starter prompt template wired into the right retrieval pattern. (RAG is retrieval-augmented generation: the system pulls relevant context from your stored material before asking the model, if you want to read up.)
A first sample question pre-filled so the customer doesn't have to think of one.
Three "next things to try" prompts so once the first question works, there's a path forward.

The starter pack is what gets shown on screen four if the consultant uploaded nothing. It's also what fills the gaps if they uploaded a little. As they add more of their own material, the starter content gets demoted in retrieval weighting and then eventually pulled. The pack is scaffolding.
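
One way to picture the demotion, as a minimal sketch: scale the similarity score of starter-pack hits by a weight that shrinks as the tenant's own corpus grows, and drop them once the weight reaches zero. The linear decay, the 0.5 starting weight, and the 50-document cutoff are all assumptions for illustration, and it assumes similarity scores are positive with higher meaning better.

```python
def starter_weight(tenant_docs, start=0.5, pulled_at=50):
    """Demote starter content linearly as tenant material grows; 0.0 means pulled."""
    remaining = max(0.0, 1.0 - tenant_docs / pulled_at)
    return start * remaining


def rerank(hits, tenant_docs):
    """hits: list of dicts with 'doc_id', 'similarity', and 'origin'
    ('tenant' or 'starter'). Returns hits sorted by adjusted score."""
    weight = starter_weight(tenant_docs)
    scored = []
    for hit in hits:
        if hit["origin"] == "starter":
            if weight == 0.0:
                continue                      # starter pack has been pulled
            scored.append((hit["similarity"] * weight, hit))
        else:
            scored.append((hit["similarity"], hit))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored]
```

Doing the demotion at rerank time rather than deleting rows means the starter content is still there as a fallback for a thin corpus, and still marked so the answer can label it.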

The captured-judgment idea from piece #2 is what the starter pack is a stand-in for. The pack gets the surface working; the consultant's real material is what makes the surface theirs. The first hour is scaffolding; the first month is replacement.

Three verticals, three first sessions

Marketing strategist. Picks "marketing positioning." Drops in eight case studies they wrote for past clients, two of their own writeups on their positioning approach, and a slide deck. Sets the voice sliders: high formality (they work with B2B), medium length, low hedge ("just tell them what I think"). Screen four asks a sample question: "What positioning angle would you recommend for a 50-person dev tools company entering a crowded category?" The answer comes back grounded in two of their case studies plus a starter-pack one (clearly marked), using their voice settings. Total time: 4m 20s. The first thing they do next is paste in a real client situation and ask for real advice.

HR consultant packaging an interview rubric. Picks "interview rubric." Uploads their rubric document, a few sample interview notes, and a one-page philosophy doc. Voice settings: warm but direct, medium length, decisive. Screen four shows a sample candidate writeup with the rubric applied, partly using their actual rubric, partly using a starter-pack scaffold for sections they didn't upload. They immediately spot a category they didn't include in their upload and add it. The product just told them something about their own work.

Financial advisor doing portfolio diagnosis. Picks "portfolio diagnosis." Drops in their portfolio-review template, a few anonymized prior diagnoses, and a brief writeup of their philosophy. Voice settings: high formality, longer responses, conservative-hedged ("when uncertain, flag, don't bet"). Screen four runs a sample portfolio against the prior diagnoses and the template. The output reads like their own writeup. They immediately notice their template doesn't ask about liquidity needs explicitly enough and make a mental note to revise it.

Three verticals, three different starter packs, three different voice profiles, three different surfaces by minute five. Same architecture spine.

What happens on the back end during those five minutes

Curious-reader summary: the system is doing a lot, fast, and most of it is invisible.

The technical version, for anyone running this:

Tenant provisioning. Cognito creates the user. RDS gets a tenant row with row-level security scoped to that tenant from the first query. (This is the "do it on day one" point from piece #6.) Zero "we'll add this later."
Starter-pack seeding. Starter documents for the chosen vertical get embedded into the tenant's pgvector store and marked as starter-pack-origin. They retrieve but at a downweighted rank.
Upload + embed pipeline. Customer uploads hit S3 and kick an EventBridge event, and an embed Lambda chunks and embeds the content into pgvector under the tenant scope. Streaming progress shown to the customer.
Persona shape. The three voice settings get stored in a structured way and bound into the prompt template for that tenant. (Specifics deliberately left vague, patent boundary.)
First inference. Sample question hits the same triage→diagnose path the production app will use. Haiku triages, Sonnet diagnoses with retrieval from the tenant's pgvector (mix of customer content and starter pack), output streams back. The router from last week's piece is doing its job from minute one.
Audit row. Every step gets logged to the audit table from piece #13. Yes, even during onboarding. Especially during onboarding.
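
For the chunking step in that pipeline, here's a rough sketch of the kind of thing the embed Lambda might do before calling the embedding model. Fixed-size character windows with overlap are my assumption (the real chunker may split on structure instead), and the row shape and function names are hypothetical. The part worth noticing is that tenant scope and origin ride along on every chunk.

```python
def chunk_text(text, max_chars=1200, overlap=200):
    """Split text into overlapping character windows."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break                      # this window reached the end of the text
        start += max_chars - overlap
    return chunks


def rows_for_embedding(tenant_id, doc_id, text, origin="tenant"):
    """Attach tenant scope and origin to every chunk before it is embedded."""
    return [
        {"tenant_id": tenant_id, "doc_id": doc_id, "chunk_index": i,
         "origin": origin, "text": chunk}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

Starter-pack documents would go through the same path with origin="starter", which is what lets retrieval downweight them later.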

Five minutes wall-clock. A lot of moving parts. The customer sees none of them, which is the point.

The trap I keep almost falling into

The temptation, every time, is to make screen four "better" by making it ask the consultant for more upfront. More documents. More voice calibration. A multi-step tone interview. "Just five more minutes to really tune this." It feels like quality investment.

It is not. It is churn manufacturing.

The consultant doesn't know what they don't know yet. They've never used a product like this. The only way they figure out which of their material matters is by using the surface with what they uploaded already, seeing what it gets wrong, and adding the missing piece. Iteration with their real material beats upfront perfection every single time.

So the rule I hold: screen four happens by minute five, even if the output isn't great yet. Get them to the surface. Let them see it working. Then the next 30 days is "you noticed it didn't handle X, here's where to drop in your X material." That second loop is where the secret sauce actually lands.

This is the same shape as the day-one approval gate from piece #12. You ship the loop early, knowing it's not yet good, and the loop itself produces the data that makes it good.

How I know onboarding is working

Three numbers I watch.

Time-to-first-output. Median from "clicked sign up" to "saw a generated answer." Target is five minutes; I get alerted if the 75th percentile climbs above eight.

Day-one engagement after the sample. Did they paste in a second question after the auto-generated one? If yes, the surface earned trust. If no, the sample didn't land and I look at what went wrong for that vertical.

Week-one material adds. Did they come back and upload more of their actual content? This is the leading indicator of long-term retention. Tenants who add material in week one keep paying. Tenants who don't, don't.

I don't watch DAU in the first week. I watch material adds.
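
The time-to-first-output check is a few lines with the standard library. A minimal sketch, assuming you already have per-tenant minutes from "clicked sign up" to "saw a generated answer"; the function name and return shape are mine, and the five and eight come from the targets above.

```python
import statistics


def onboarding_alert(minutes, target_median=5.0, p75_alert=8.0):
    """minutes: per-tenant time-to-first-output. Needs at least two samples."""
    median = statistics.median(minutes)
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    p75 = statistics.quantiles(minutes, n=4)[2]
    return {
        "median": median,
        "p75": p75,
        "on_target": median <= target_median,
        "alert": p75 > p75_alert,
    }
```

Alerting on the 75th percentile rather than the median is the useful part: the median can sit at five minutes while a quarter of signups are quietly having a much worse time.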

If you're shipping this

One thing this week: time your own onboarding flow with a stopwatch. From "click sign up" to "see a generated output that uses my actual content." If it's more than five minutes, find the screen that's eating the time. Almost always, it's a screen asking the customer to do work the product could have done for them with a starter pack.

The next piece in this series is about the other half of this product: once the consultant is onboarded, they're not just a customer, they're a supervisor of their own AI. Which means there are actually two distinct surfaces sitting on the same backend, the customer view (ask a question, get an answer) and the consultant view (approve, deny, mine patterns, tune). Next week.

Bedrock model selection: pick on evidence, not vibes

Sid Smith — Wed, 27 May 2026 13:00:00 GMT

The 18-article MVP series wrapped last week with a piece about what I'd cut and what I'd keep. This one starts the year-one series, the part where the MVP is alive, customers are using it, and the questions stop being "what do I build" and start being "what do I run, and how do I make it cheaper without making it worse."

First question, every time: which model.

Bedrock model selection

If you've shipped anything on Bedrock, you already know the trap. There's a default in your code somewhere, anthropic.claude-sonnet-4-something, and it stays there for six months because nobody wanted to touch it. Then your bill triples, or a competitor ships something faster, or you read a benchmark that makes Haiku look like a steal, and you panic-swap to a different model and break a corner case nobody had test coverage for.

This piece is about not doing that. Pick on evidence. Switch on evidence. The evidence is the eval harness from piece #11, and the picking is one decision per loop, not one decision per app.

What "the right model" actually means

Three knobs. Cost, quality, latency. You get to pick two and the third comes along for the ride. That's the whole story.

What changes per use case is which two matter.

Routing the inbound query to the right pipeline? Latency and cost. Quality is a binary (did it pick the right bucket or not) and the buckets are coarse. You can do this with a model the size of a postage stamp.

Diagnosing what a customer actually needs help with, given their context and the consultant's body of work? Quality. Quality. Quality. Latency is fine in the 3-5 second band because the customer is already waiting. Cost matters but not on the same axis.

The hard-edge cases (the contract clause that's almost-but-not-quite the standard, the medical-second-opinion query where the symptom set is unusual, the financial diagnosis where the portfolio doesn't fit any of the standard patterns)? Quality is everything, and you're willing to pay 10x per call because the case happens 1% of the time but it's the 1% the consultant put their name on.

So the model picker isn't "what's the best model." It's "what's the right model for this loop."

The four-way split I actually use

Bedrock gives you a menu. I run four models concurrently and route between them.

Haiku, for triage and routing. Inbound query comes in, Haiku decides which pipeline it belongs in. Sales-discovery prompt or onboarding-fit prompt? IT-ops triage or feature request? Marketing-positioning question or copy-edit request? It's a classifier dressed up as a chat model. Latency is sub-second, cost is rounding error, quality is high enough on coarse buckets that I trust it.

This is the triage loop from piece #9. Haiku is what makes that loop cheap enough to run on every inbound message instead of every fifth one.

Sonnet, for diagnosis. This is the workhorse. The query, the retrieved consultant context (from RAG, which is retrieval-augmented generation if you want to look it up later), the persona shape, the conversation history. Sonnet pulls it together and writes the answer. Or, more often, drafts the answer and sends it to the consultant for approval. 80%+ of my Bedrock spend is here.

Opus, for the hard ones. Two ways into Opus. First, Sonnet flags low confidence and the router hands the query up. Second, the case carries a tag ("high-stakes" or "novel" or "consultant-flagged-for-quality") and goes straight to Opus regardless. A legal-pro tenant doing contract review against their playbook routes 5-8% of clauses to Opus because that's the band where the playbook doesn't quite cover it and the consultant wants the model to think harder.

Llama on Bedrock, for cost-sensitive batch: summarization, re-embedding the corpus when chunking changes, generating eval candidates. Anything that would run overnight on the Mac Studio just fine, except that sometimes the Mac Studio is busy fine-tuning and I want the job in the cloud. Llama 3.x or whatever's current on Bedrock at the time. Quality is good enough for the work, and the per-token price is meaningfully lower than Sonnet.

That's the spread. Haiku at the door, Sonnet for the bulk, Opus for the corners, Llama for the back office.
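
As a sketch, the routing decision fits in one function. The loop names and escalation tags are the ones described above; the model labels are placeholders (not real Bedrock model IDs), and the 0.6 low-confidence cutoff is a made-up number you'd replace with whatever your evals support.

```python
from typing import Optional, Set

# Placeholder labels, not Bedrock model identifiers.
HAIKU, SONNET, OPUS, LLAMA = "haiku", "sonnet", "opus", "llama"
ESCALATION_TAGS = {"high-stakes", "novel", "consultant-flagged-for-quality"}


def pick_model(loop: str,
               tags: Optional[Set[str]] = None,
               sonnet_confidence: Optional[float] = None,
               low_confidence_below: float = 0.6) -> str:
    """Choose a model per loop, escalating diagnose calls to Opus when warranted."""
    tags = tags or set()
    if loop == "batch":
        return LLAMA
    if loop == "triage":
        return HAIKU
    if loop == "diagnose":
        if tags & ESCALATION_TAGS:
            return OPUS                       # tagged cases skip Sonnet entirely
        if sonnet_confidence is not None and sonnet_confidence < low_confidence_below:
            return OPUS                       # Sonnet tried and flagged low confidence
        return SONNET
    raise ValueError(f"unknown loop: {loop}")
```

One decision per loop, not one per app: the function takes the loop as its first argument for exactly that reason.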

How I actually pick, not by reading benchmarks

Here's the part nobody wants to hear. Public benchmarks are useful for narrowing the field. They are useless for the final pick.

Benchmark says Model X beats Model Y by 4 points on MMLU. Cool. My consultant's body of work is none of MMLU. The only thing that tells me whether Model X is right for a portfolio-diagnosis prompt against this financial-advisor's corpus is running both models against my eval set and looking at the pass rate.

The eval harness from piece #11 is the unlock. Golden examples, structured grading rubric, regression detection. Per-model scorecard. When I'm picking between Sonnet and Opus for the diagnose loop, I run the same 200-example set through both, score the outputs, look at the gap, look at the cost-per-pass.

Three numbers come out. Pass rate. Median latency. Cost per query. I write them in a tiny markdown table per loop and I keep that table in the repo. When somebody asks why we're on Sonnet not Opus for the marketing-positioning pipeline, I point at the table.
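
Here's roughly what producing that table row looks like, as a sketch. It assumes the harness emits one record per eval example with a pass/fail, a latency, and a cost; the field names and both function names are hypothetical, not the harness's actual output format.

```python
import statistics


def scorecard(results):
    """results: list of dicts with 'passed' (bool), 'latency_s', and 'cost_usd'
    for each eval example run through one model."""
    n = len(results)
    passes = sum(1 for r in results if r["passed"])
    total_cost = sum(r["cost_usd"] for r in results)
    return {
        "pass_rate": passes / n,
        "median_latency_s": statistics.median(r["latency_s"] for r in results),
        "cost_per_query": total_cost / n,
        "cost_per_pass": total_cost / passes if passes else float("inf"),
    }


def markdown_row(model, card):
    """One line of the per-loop table that lives in the repo."""
    return (f"| {model} | {card['pass_rate']:.0%} | "
            f"{card['median_latency_s']:.1f}s | ${card['cost_per_query']:.4f} |")
```

Cost per pass is the number that makes a cheaper model with a lower pass rate comparable to a pricier one with a higher pass rate.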

Want to go deeper on the harness mechanics? The eval setup itself is in The eval harness, how you know it's working, and the prompt-versioning discipline that lets you compare apples to apples is in Prompts as code.

The cost/quality/latency curve, in numbers I've actually seen

Rough shape, your mileage will vary, do your own evals, but for a triage-diagnose-resolve product running on a consultant's body of work, the numbers I've seen come out something like this.

Haiku for triage: ~300ms median, fractions of a cent per call, 96-98% bucket accuracy on coarse intent classification once you've tuned the prompt. Cheap, fast, good enough.

Sonnet for diagnose: ~2-3s median, low single-digit cents per call (depending on how much retrieved context you cram in, and you'll cram in more than you think), 88-92% pass rate on a well-graded eval set against the consultant's corpus. The number that pays the bills.

Opus for hard cases: 5-8s median, 5-10x the per-call cost of Sonnet, but the pass rate jumps from ~88% on the hard-case subset (where Sonnet was struggling) to ~96%. That gap is the reason Opus exists in your pipeline.

Llama on Bedrock for batch: latency doesn't matter because it's batch, cost is meaningfully under Sonnet, quality on the back-office tasks (summarization, eval generation, re-chunking) is fine.

The thing I want you to internalize: the difference between Sonnet and Opus on the easy 80% of cases is small enough that paying 10x for it is wasteful. The difference on the hard 5-10% is huge. So you route by case, not by app.
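The arithmetic behind routing by case is short enough to write down. This is a sketch with illustrative inputs: the 5% escalated share and the 10x multiplier (the top of the 5-10x range above) are assumptions to plug your own numbers into, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(base_cost: float, escalated_share: float, multiplier: float) -> float:
    """Average per-call cost when a share of traffic goes to a pricier tier.

    base_cost:        per-call cost on the default tier
    escalated_share:  fraction of calls routed to the expensive tier (0.0-1.0)
    multiplier:       expensive-tier cost as a multiple of base_cost
    """
    return base_cost * ((1 - escalated_share) + escalated_share * multiplier)


# Sending 5% of calls to a tier that costs 10x raises the average to about
# 1.45x the base cost, versus 10x if every call went to the expensive tier.
average = blended_cost(1.0, 0.05, 10.0)
```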

Picking by use case, three quick verticals

A sales consultant running discovery-call prep. Haiku triages the inbound: prep request vs. follow-up vs. objection-handling. Sonnet diagnoses: pulls the prospect's company context, the consultant's framework, the prior call notes, drafts the prep brief. Opus rarely fires here unless the deal is flagged as strategic. Most of the spend is Sonnet; latency tolerance is generous because the consultant is reading the brief asynchronously.

An IT-ops consultant doing infrastructure triage. Haiku routes by symptom class. Sonnet diagnoses against the consultant's runbook corpus and the customer's ticket history. Opus fires when the symptom set doesn't match a known runbook. That's the "this is novel, think harder" path. Cost-sensitive because the volume is high; Llama runs the overnight pattern-mining of resolved tickets to find new auto-resolve candidates.

A career coach doing resume + positioning review. Haiku triages: full resume review vs. single-section edit vs. positioning question. Sonnet handles the bulk. Opus fires when the candidate's background is non-standard and the coach has flagged the case for extra care. Latency is generous; quality is everything because the coach's name goes on the output.

Same architecture spine. Different routing thresholds per vertical. The picker is configuration, not code rewrite.
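One way to read "configuration, not code rewrite" is a per-vertical table that a single routing function consults. The sketch below is an assumption about shape only: the vertical keys, threshold values, flag names, and the idea that triage emits a confidence score are all illustrative.

```python
from typing import Iterable

# Hypothetical per-vertical routing config. Thresholds and flag names are
# made up for illustration; real values would come from your own evals.
ROUTING = {
    "sales-prep":   {"escalate_below": 0.50, "escalate_on_flags": {"strategic"}},
    "it-ops":       {"escalate_below": 0.70, "escalate_on_flags": {"no_runbook_match"}},
    "career-coach": {"escalate_below": 0.60, "escalate_on_flags": {"extra_care"}},
}


def pick_tier(vertical: str, confidence: float, flags: Iterable[str] = ()) -> str:
    """Choose the model tier for the diagnose step of one case."""
    cfg = ROUTING[vertical]
    flagged = bool(set(flags) & cfg["escalate_on_flags"])
    if flagged or confidence < cfg["escalate_below"]:
        return "opus"
    return "sonnet"
```

Changing a vertical's behaviour then means editing one row of `ROUTING`, and the diff shows up in review like any other config change.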

When to switch

Three triggers. Just three.

Eval scores drift. You re-run the eval set after a prompt change or a model update and the pass rate moves. If it dropped on Sonnet, maybe Opus is now the right pick for that loop. If it climbed on Haiku, maybe you can demote work down a tier. Re-running evals on a schedule (I do it weekly during active development, monthly after) is what makes this trigger fire when it should.

Cost shape changes. Anthropic ships a new Sonnet, the per-token price drops, the new model beats your current pick on your eval set, you switch. Or your usage shape moves and the model that was cheap at 10k queries/day is no longer cheap at 100k. The cost-model piece, #15, is the place you watch this from.

Customer complaint pattern. This is the one that doesn't show up in evals. Customers report the same kind of bad answer over and over. You go look. Often it's a class your eval set didn't have. Add it to the eval set, re-grade across models, switch if the data says switch. The complaint becomes a permanent test.

What's not on the list: a benchmark blog post made you feel behind, a competitor announced something, your CTO wants to "move to the new thing." Those are signals to test, not signals to switch. Run the eval. Look at the table. Then decide.

The default I ship with

For anyone starting from the architecture in the MVP series and trying to figure out where to begin: ship with Haiku for triage, Sonnet for diagnose, route the lowest-confidence 5% to Opus, and put Llama on whatever batch work the Mac Studio doesn't pick up. That's the default. It's not the right answer for your product. It's the right starting answer.

Then build the eval harness. Then the data tells you what to change.

If you're running this and only get to do one thing this week, do this: pick the loop that costs you the most per month, run it through three models with the same 100-example eval set, and put the numbers in a table. The picker decision after that writes itself.

The next piece in this series is about the other end of the year-one problem: when a brand-new consultant signs up, how do you get them from "I have secret sauce on a shared drive" to "my AI surface is live and answering questions" in five minutes instead of five days.

What I'd cut, what I'd keep: the actual MVP cutline

Sid Smith — Tue, 26 May 2026 13:00:00 GMT

This is the closer for the MVP series. Eighteen pieces ago I started with a question (what does "MVP" actually mean when the value of your product is the AI doing something useful?) and we worked through it: the hybrid cloud-and-local split, the AWS-native shape, auth and multi-tenancy from day one, the secret-sauce capture loop, the three-loop product flow, prompts as code, evals, the approve-deny gate, audit, failure modes, cost, deployment, hybrid sync. A lot of ground.

Today is the part where I'd take all of that and cut it down to the version you actually ship in eight weeks with a small team. Not the version that's "good enough for now and we'll fix it later" (that version makes you cry) but the version where every piece you build is earning its keep, and the pieces you skip are the ones that genuinely don't bite until you have customers telling you they bite.

MVP cutline

I'll do this in three parts. What I'd cut on day one. What I absolutely wouldn't. And what the first thirty days of real customer use will teach you that you cannot, no matter how clever you are, predict in advance.

What I'd cut

These are the things people build into MVPs because they feel important, and which you can almost always defer until the product has earned the right to need them.

A pretty admin UI. Whatever supervisory work the consultant has to do (review the queue, approve diagnoses, mine for patterns) can run on a stripped-down internal tool for the first hundred customers. Retool, an admin-style React page, even a couple of Postgres views and a CLI. The customer-facing surface gets the polish budget. The supervisor surface earns its polish later, when the consultant tells you which three actions they do twenty times a day and you build a button for those three things.

Anything other than email for notifications. SMS, push notifications, in-app real-time toasts, Slack integrations, webhooks for customers: all good ideas, all later. Email is universal, asynchronous, and works. SES, the Simple Email Service, is AWS's outbound email transport, if you want to look it up later. It bills per thousand messages at a rate low enough that a real pilot costs next to nothing; the free-tier allowance has changed over the years, so check the current terms rather than trusting an old number. Build the notification layer as a single "send-event" function that today only knows how to send email. The day you need SMS, you add a path. You don't add five paths and use one.
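A minimal sketch of that single entry point, assuming a small registry of channel senders. `register_channel`, `send_event`, and `send_email` are hypothetical names, and the email sender here is a placeholder that prints where a real one would hand off to SES.

```python
from typing import Callable

# Channel registry. Today only "email" is registered; adding SMS later means
# registering one more sender, not touching any caller of send_event.
_SENDERS: dict[str, Callable[[str, str, str], None]] = {}


def register_channel(name: str, sender: Callable[[str, str, str], None]) -> None:
    _SENDERS[name] = sender


def send_event(recipient: str, subject: str, body: str, channel: str = "email") -> None:
    """The one function the rest of the product calls to notify someone."""
    sender = _SENDERS.get(channel)
    if sender is None:
        raise ValueError(f"no sender registered for channel {channel!r}")
    sender(recipient, subject, body)


def send_email(recipient: str, subject: str, body: str) -> None:
    # Placeholder transport: a real implementation would call SES here.
    print(f"EMAIL to={recipient} subject={subject}")


register_channel("email", send_email)
```

Callers never name a transport unless they want a non-default one, which is what keeps the second channel a small change.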

Multi-region anything. One region, the one closest to your pilot users. The day you have a customer in Singapore complaining about latency, you have a real reason to think about a second region. Until then it's expense and complexity for a problem you don't have.

Caching layers beyond what AWS gives you. No Redis, no ElastiCache, no in-front-of-everything caching layer. Use API Gateway's caching for the obvious GETs. Use CloudFront for static assets and obvious cacheable endpoints. Lambda's own warm execution acts as a small cache. That's enough for an MVP. The day you can prove you have a hot read pattern that's costing you, you add Redis with a clear purpose. Adding it speculatively gives you a cache invalidation problem on top of all your other problems.

A microservices split. One Lambda monorepo, one CDK app, one RDS database. Two or three Lambda functions for the customer-facing API. Maybe a separate Lambda for the heavier async work, but the line is "different concurrency requirements," not "different team owns it." There is no other team. You are the team. Distributed systems problems are the most expensive problems in software, and you do not need to invite them in until they're forced on you.

Custom dashboards and BI. CloudWatch dashboards are ugly and they are sufficient. The metrics that matter (call volume, error rate, p95 latency, Bedrock spend per day, eval pass rate) all fit in a single CloudWatch dashboard you build in twenty minutes. You don't need Datadog or a custom Grafana for the first six months. When you do, you'll know exactly which ten metrics you need on it because you'll have stared at the CloudWatch one daily for six months.

Fancy A/B testing infrastructure. A feature flag library that lets you set a percentage rollout per environment is enough. LaunchDarkly is great for the day you have a real product team running real experiments. Your early A/B is one new prompt vs the old one, fifty users vs fifty users, evals telling you which won. You can do that with a feature flag and a column in the audit log. Don't buy LaunchDarkly in month one.
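A percentage rollout needs very little machinery. This sketch buckets users deterministically by hashing the flag name and user ID; the flag name and the `prompt-v1`/`prompt-v2` labels are hypothetical.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Place a user in a stable 0-99 bucket for this flag.

    The same user always lands in the same bucket, so their experience does
    not flip between requests. percent=0 disables, percent=100 enables for all.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def prompt_version_for(user_id: str, percent_on_new: int = 50) -> str:
    # The returned label is what you would record in the audit log column,
    # so outcomes can be compared per arm afterwards.
    if in_rollout("new-diagnose-prompt", user_id, percent_on_new):
        return "prompt-v2"
    return "prompt-v1"
```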

Most of your "what if" features. The features that the consultant brainstormed in the second discovery call but no customer has actually asked for. Cut those. Build the three things every pilot customer has asked for in the same words. The other ten things are real maybes, and you'll know which ones to build when customers tell you.

What I absolutely would not cut

There are four things where deferring them is the most expensive thing you can do. They show up in the audit logs of every team I've watched fail, marked "we should have done this from the start."

Auth and multi-tenancy. Cognito on day one, tenant ID on every row, row-level security in Postgres on day one. The day-zero version is small, maybe a hundred lines of CDK and a few hundred lines of application code. The bolt-on version, after you have customers and data and assumptions baked in, is a months-long migration that occasionally leaks one customer's data into another customer's view. Don't do that to yourself. The auth and multi-tenancy piece earlier in this series walks the cheap version in detail; that's the floor.

A real audit trail. Every meaningful decision the AI makes (and every action a human takes on top of it) written into an audit table with the actor, the action, the evidence, the timestamp, the outcome. This is not a logging concern. CloudWatch logs disappear. The audit table doesn't. The day a customer asks "why did your system reject my application?" or a regulator asks "show me how this decision was made," you point at a row, and the row has the answer. The observability and audit piece details the row shape; the bar is non-negotiable.
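As a rough sketch of the row described above, with SQLite standing in for the Postgres table so the example is self-contained. The column names, the `tenant_id` column, and the `record` helper are assumptions for illustration, not the row shape from the observability piece.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_audit(conn: sqlite3.Connection) -> None:
    """Create the audit table if it does not exist yet."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS audit (
            id        INTEGER PRIMARY KEY,
            tenant_id TEXT NOT NULL,
            actor     TEXT NOT NULL,   -- e.g. "model:<name>" or "user:<id>"
            action    TEXT NOT NULL,
            evidence  TEXT NOT NULL,   -- JSON: prompt version, retrieved doc ids, ...
            outcome   TEXT NOT NULL,
            at        TEXT NOT NULL    -- ISO-8601 UTC timestamp
        )""")


def record(conn: sqlite3.Connection, tenant_id: str, actor: str, action: str,
           evidence: dict, outcome: str) -> int:
    """Append one audit row and return its id. Rows are never updated."""
    cur = conn.execute(
        "INSERT INTO audit (tenant_id, actor, action, evidence, outcome, at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (tenant_id, actor, action, json.dumps(evidence), outcome,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid
```

Answering "why did your system reject my application?" then becomes a lookup by id rather than a log search.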

An eval harness. Even a small one. Even fifty golden examples and a script that runs them. The point isn't full coverage on day one; the point is a habit. The eval harness is what tells you the prompt change you just shipped didn't make things worse on the cases you already cared about. Without it, you're shipping prompt changes by vibes, and vibes regress silently. The eval harness piece describes the minimum shape; fifty examples is enough to get the discipline started.
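"A script that runs them" can be this small. The sketch assumes golden examples are dicts with `id`, `input`, and `expected` keys and that you supply your own `answer` and `grade` callables; all of those names are hypothetical.

```python
from typing import Callable


def run_evals(examples: list[dict],
              answer: Callable[[str], str],
              grade: Callable[[str, str], bool]) -> dict[str, bool]:
    """Run every golden example and return {example_id: passed}."""
    return {ex["id"]: grade(answer(ex["input"]), ex["expected"]) for ex in examples}


def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """IDs that passed on the previous run and do not pass now."""
    return sorted(i for i, ok in previous.items() if ok and not current.get(i, False))
```

Storing the previous run's result dict alongside the prompt version is enough to make `regressions` the check you run before shipping a prompt change.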

A documented escalation path. When the AI doesn't know, when it returns a low-confidence answer, when the customer asks something off-pattern, there has to be a path that gets a human eyeball on it within a day. For a one-person shop, "human" might be the consultant themselves checking a Slack channel each morning. For a small team, it's a queue with an SLA. Either way, the path exists, the customer knows it exists, and the AI knows when to invoke it. Without an escalation path, the only failure mode is "AI gives a wrong answer and nobody notices until the customer churns." That's the failure mode that ends MVPs.

If you're reading this and your draft architecture is missing any of those four (auth, audit, evals, escalation), go put them in before you ship to your first customer. The other things on the cut list, you can add as needed. These four, you can't bolt on.

What 30 days of real customer use will teach you

Here's the part nobody can predict for you, no matter how good a brief I write or how clever your architecture is. The first thirty days of real customers using the product will teach you four things, and you will not see any of them coming.

The questions they actually ask are not the questions you built for. You designed the discovery prompt for the sales consultant assuming customers would upload transcripts. Half of them paste raw notes from memory. You designed the IT ops triage flow assuming customers would describe symptoms. Half of them paste the entire stack trace and ask "what is this?" Your retrieval, your prompts, your tone, all calibrated to inputs you guessed at. The first thirty days show you what the inputs actually look like, and the gap is always larger than you expected.

The fix is cheap if you're set up for it: capture the actual queries (PII-stripped, in your audit table), categorise them, and update your retrieval and prompt structure for what you're actually seeing. The fix is expensive if you're not set up for it, meaning, if you didn't keep the audit trail in shape and you don't have a way to safely look at customer queries, you're flying blind.

The places it fails are not the places you tested. You hammered the diagnose flow. The thing that fails is the file upload, because the consultant's customers are uploading PDFs three times the size you tested with. You stress-tested the Bedrock path. The thing that breaks is the Cognito password-reset email, because you used the default sender domain and customers' spam filters are eating it. The first thirty days expose the unsexy operational gaps, the ones that have nothing to do with the AI and everything to do with the rest of the product being a real piece of software that real strangers are using.

The fix is alarms and a habit of looking at them. Page yourself on error rate spikes. Look at the dashboard daily, really daily, not "when I think of it." Customer-facing breakage is invisible to you and visible to them.

The AI will be wrong in a way that surprises you. Not the failure modes you anticipated and planned around. A new one. The legal pro's contract-review tool will confidently say a clause is fine when it's actually missing. The financial advisor's portfolio diagnosis will recommend a rebalance based on a misread of the customer's risk profile. The medical specialist's second-opinion review will agree with the original diagnosis when the original was wrong. Whatever the new failure mode is, it will be the one your eval harness didn't have an example for, because if you'd had an example you'd have caught it in dev.

The fix isn't to prevent it (you can't prevent the unknown unknown) but to catch it within a day. Approve-deny gate stays on for the first thirty days, no exceptions. The consultant is in the loop. Patterns get added to the eval set. The eval set grows from fifty to two hundred examples in those first thirty days, every one of them a real near-miss that taught you something.

Customers will tell you what to build next, and they will be partly right. They'll ask for features. They'll ask for fields. They'll ask for integrations. About sixty percent of what they ask for will be the right thing to build, usually a smaller, more specific version of what they asked for. About forty percent will be a misdiagnosis of their underlying need, where they're describing the solution they imagined and the actual job is something else.

The skill is listening to what they're trying to do underneath the feature ask. A customer who says "I need a Salesforce integration" might actually need "I need to stop manually re-entering customer info." Those are very different things and lead to very different builds. The thirty days teach you that translation skill. There is no shortcut.

Where this hands off

The architecture you've shipped at the end of this series (small, hybrid, intentional) gets you through the first thirty days. The next series, Operating an AI product, year one, starts tomorrow. Four pieces, running daily through May 30. We pick up where this one ends: real customer load, real Bedrock model selection on real eval evidence (first piece tomorrow), the five-minute onboarding path for new tenants, the two-surface UI split between customer and consultant, and the pricing model that makes the unit economics work. Operating, not building. The architecture is settled; the question becomes how to run it well.

The MVP series was about getting the shape right. The operating series is about keeping it alive while customers actually use it.

If you've built along with this series, you have the spine. The Lambda, API Gateway, Cognito, RDS-with-pgvector, S3, Bedrock, EventBridge, SQS, CloudWatch on the cloud side. The Mac Studio with mflux, mlx-lm, whisper, and the SQS-poller-batch-runner on the local side. The hybrid sync wired through SQS, EventBridge, S3, and signed manifests. The eval harness, the audit table, the approve-deny gate, the prompt versioning. The cost model in your head. The deployment pipeline that doesn't break things.

That's an AI MVP. That's the shape. The secret sauce sitting on top, that's yours, and it's the only part of the system that's actually you. Everything else generalises across consultants, verticals, products. Architecture is the protagonist of these eighteen pieces. Your secret sauce is the protagonist of your product.

If you're shipping yours soon, my one ask: send me the URL when it's live. I want to see what you built.

Tomorrow we start operating it.

The downgrade pattern for cross-boundary data transfer

Sid Smith — Tue, 26 May 2026 13:00:00 GMT

The first time I defended a cross-boundary data transfer to a compliance officer, I made the mistake every engineer makes. I said, "we redact PII before it leaves the regulated environment." She nodded politely, then asked the question that ended the meeting: "show me the rule that says which fields get redacted, who wrote it, when it was last reviewed, and the log of every record this rule has ever processed."

I had one of those four, at best. The redaction code existed; somewhere in the pipeline a function stripped a list of column names. The rest (the rule as a reviewable artifact, the provenance, the audit trail) were vibes. The pipeline ran on the assumption that the engineer who wrote it had thought about the right things on the right day. That is not a story you can tell a regulator.

Here's how I think about it now, and what I train every team I work with to say. Cross-boundary data transfer isn't a copy. It's a downgrade. The data that crosses into the lower-trust universe is a different artifact from the data that lives in the higher-trust one. Treating it as the same data with some columns removed is how you end up rebuilding the pipeline six months later, after the audit conversation that should never have happened.

This is the piece of the compliance-aware design story I see teams skip most often. The data model conversation happens. The auth conversation happens. The audit conversation happens. The cross-boundary conversation gets folded into "we have a redaction step in the ETL," and that is where the audit defensibility quietly leaves the building.

What "downgrade" means as a primitive

A higher-trust universe (HIPAA-applicable, SOX-applicable, GDPR-restricted, PCI-scoped, pick your regime) has a shape. Every record carries classification. Every access leaves an audit trail. Every operation runs under a default-deny posture where a specific allow-rule fired with a specific subject, purpose, and resource. The platform enforces the laws that apply.

A lower-trust universe doesn't. The analytics warehouse, the BI dashboards, the product-telemetry pipeline, the AI training set, the developer's notebook: different controls, different audit posture, different retention, different blast radius. The moment a record from the regulated universe enters the analytics universe, the controls of the regulated universe stop applying. The lower-trust universe cannot enforce HIPAA on a record it received; it doesn't have the foundation to.

The downgrade is the explicitly named, rule-bound, audited transformation that turns a higher-trust record into something the lower-trust universe can hold without inheriting an obligation it cannot meet. The output is structurally different from the input. Fields are removed, replaced, generalized, hashed, bucketed, or combined such that the resulting record can no longer be re-identified, no longer carries the regulated classification, and no longer triggers the regulated controls. The downgrade is not redaction; redaction is a tactic the downgrade rule may use. The downgrade is the rule.

The shift in framing matters because "redact PII before export" describes an operation. "Downgrade rule R-DG-014 transforms patient-records into the analytics-records shape, owned by Compliance, last reviewed 2026-04-12, applied 2.8M times last quarter" describes an artifact. The auditor asks for the artifact, not the operation.

What actually makes a downgrade defensible

A downgrade pattern that holds up in an audit conversation has four pieces. Skip any one and the system works until somebody looks closely at it.

Explicit downgrade rules

Every cross-boundary transfer is governed by a named rule. Not a function in the ETL code. A rule in the standards repo, with an ID, an owner, a review date, a description of the source shape, a description of the target shape, and the transformation logic that gets you from one to the other. The rule is data, not code. The pipeline reads the rule and applies it; the rule itself is a Decisions as Code artifact that lives in the same standard layer as the t-shirt sizing standards and the tagging conventions.

The shape of the rule is load-bearing. A typical entry I now ship reads: "R-DG-014, source: patient-records-v3 (clinical), target: analytics-records-v1 (analytics), transformations: drop patient_name, drop dob, replace patient_id with HMAC(patient_id, key_2026_q2), generalize zip5 to zip3, bucket age into ten-year bins, drop free-text notes, owner: compliance, reviewed 2026-04-12, next review 2026-07-12." Every column the rule touches is enumerated. Every column it leaves alone is enumerated by the source shape being versioned. Adding a new column to patient-records-v3 invalidates the source shape and forces a rule review before new data crosses.

The failure mode of an implicit rule is silent. The team adds a new free-text field, the ETL function doesn't know about it, the field flows to analytics, the auditor finds it eighteen months later. The cost of that finding is the cost of identifying every downstream consumer and proving the leaked field never propagated. That cost is large enough to fund explicit rules for a decade.
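To make "the rule is data, not code" concrete, here is a sketch that encodes the R-DG-014 example as a dict and applies it with a generic function. The encoding is an assumption: the `notes` column name, the pass-through `diagnosis_code` column, the output names `zip3` and `age_bucket`, and the key are all hypothetical. The shape check at the top is what turns a new source column into a refusal instead of a silent leak.

```python
import hashlib
import hmac

# Hypothetical encoding of the rule described above. Only the transformations
# come from the prose; the column set is illustrative.
RULE = {
    "id": "R-DG-014",
    "source_columns": {"patient_id", "patient_name", "dob", "zip5",
                       "age", "notes", "diagnosis_code"},
    "drop": {"patient_name", "dob", "notes"},
    "hmac": {"patient_id"},
    "zip3": {"zip5"},
    "bucket10": {"age"},
}


def apply_rule(rule: dict, record: dict, key: bytes) -> dict:
    """Apply a downgrade rule to one record, refusing unknown source shapes."""
    if set(record) != rule["source_columns"]:
        # A new or missing column means the versioned source shape changed:
        # refuse, and force a rule review before anything crosses.
        raise ValueError(f"{rule['id']}: record does not match the source shape")
    out = {}
    for col, value in record.items():
        if col in rule["drop"]:
            continue
        if col in rule["hmac"]:
            out[col] = hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()
        elif col in rule["zip3"]:
            out["zip3"] = str(value)[:3]
        elif col in rule["bucket10"]:
            low = (int(value) // 10) * 10
            out["age_bucket"] = f"{low}-{low + 9}"
        else:
            out[col] = value
    return out
```

Because the function reads the rule, changing what crosses the boundary is a change to a reviewed artifact rather than to pipeline code.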

Enforcement that matches the rule

The rule is the artifact; the enforcement is the foundation that ensures no record crosses except through the rule. This is where default-deny does its second-most-useful job. The boundary itself is closed; the only way through it is via a registered downgrade rule. There is no engineer-with-credentials path that bypasses the rule. No "just this once for the analyst" path. No debug pipeline. The only way data crosses is through a transform that names a rule ID, and the foundation refuses any transfer that does not.

What this looks like varies by stack: a network-level egress controller, a database-level row policy, a service-mesh authorization layer, an OPA-backed admission step in the analytics ingest. The mechanism doesn't matter much. The discipline does. The lower-trust universe cannot ingest a record that did not come through a registered downgrade. Anything else is a side door, and the auditor will find it.
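Whatever the mechanism, the contract it enforces is small. The in-process class below only illustrates that contract and is not a substitute for enforcement at the network or database layer; `Boundary` and `BoundaryError` are hypothetical names.

```python
from typing import Callable, Optional


class BoundaryError(Exception):
    """Raised when a transfer is attempted without a registered rule."""


class Boundary:
    """Default-deny gate: a record crosses only by naming a registered rule."""

    def __init__(self) -> None:
        self._rules: dict[str, Callable[[dict], dict]] = {}

    def register(self, rule_id: str, transform: Callable[[dict], dict]) -> None:
        self._rules[rule_id] = transform

    def transfer(self, record: dict, rule_id: Optional[str] = None) -> dict:
        transform = self._rules.get(rule_id) if rule_id is not None else None
        if transform is None:
            # No rule named, or an unknown one: the boundary stays closed.
            raise BoundaryError(f"transfer refused: no registered rule {rule_id!r}")
        return transform(record)
```

There is deliberately no `force=True` argument; the absence of a bypass is the point.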

An audit trail of every crossing

Every record that crosses the boundary emits an audit event: the rule ID that authorized the crossing, the source-record identifier, the target-record identifier (structurally different, the downgrade replaced it), the timestamp, the upstream subject, the batch volume, and a hash of the rule version applied. The audit log lives in the regulated universe, because that's the universe responsible for the obligation the data carried, and the audit trail itself is regulated evidence.

Retention is brutal: years, depending on regime. Volume is large; a high-throughput downgrade pipeline produces millions of events a day. Both are design constraints, not surprises. Teams that ship this well treat the cross-boundary audit log as a foundation, the same way they treat the access audit log: separate trust boundary, append-only, hardware retention, queryable on a defined SLA.

The query the auditor runs is "show me every record that crossed from clinical to analytics last quarter, by rule, with the rule version and owner." That sentence needs to be a query, not a fire drill.
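A sketch of what "a query, not a fire drill" could look like, using an in-memory SQLite table as a stand-in for the real append-only store. The table and column names are assumptions, and the owner is stored on the event here purely to keep the example to one table.

```python
import sqlite3


def make_crossing_log() -> sqlite3.Connection:
    """Create an empty crossing log (in-memory stand-in for the real store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE crossing (
            rule_id TEXT, rule_version_hash TEXT, rule_owner TEXT,
            source_id TEXT, target_id TEXT, subject TEXT,
            source_universe TEXT, target_universe TEXT,
            crossed_at TEXT  -- ISO-8601, so string comparison orders by time
        )""")
    return conn


AUDITOR_QUERY = """
    SELECT rule_id, rule_version_hash, rule_owner, COUNT(*) AS records
    FROM crossing
    WHERE source_universe = ? AND target_universe = ?
      AND crossed_at >= ? AND crossed_at < ?
    GROUP BY rule_id, rule_version_hash, rule_owner
    ORDER BY rule_id, rule_version_hash
"""


def crossings_by_rule(conn: sqlite3.Connection, source: str, target: str,
                      start: str, end: str) -> list[tuple]:
    """Every crossing from source to target in [start, end), grouped by rule."""
    return conn.execute(AUDITOR_QUERY, (source, target, start, end)).fetchall()
```

For "clinical to analytics last quarter" the call is `crossings_by_rule(conn, "clinical", "analytics", quarter_start, quarter_end)` with ISO date strings for the bounds.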

A human approval step for first-of-pattern transfers

The first three pieces handle steady state. The fourth handles new patterns. Every time a downgrade rule is created, modified, or applied to a source shape it hasn't seen before, a human reviews and signs off before the rule goes live. Not a developer. Not the engineer who wrote the rule. A reviewer with the authority to say no on behalf of the regulated universe, typically compliance, sometimes paired with a data steward.

The step is not a rubber stamp. The reviewer reads the rule, reads the source shape, reads the target shape, asks the questions nobody on the engineering team thought to ask. "Why is this field in the target?" "What downstream join might re-identify the subject?" "Has legal reviewed the k-anonymity claim on the bucketed age field?" "What's the deletion path if a subject revokes consent?" The questions are slow on purpose. The step exists because the cost of getting a downgrade rule wrong is the cost of every record that ever crossed under the wrong rule, and that cost compounds.

The step does not block steady-state operation. Once a rule is approved, records flow through it without further intervention. The step fires only on first-of-pattern: new rule, new column on a source shape, new target universe, new transformation on an existing field. Steady state is fast. New patterns are deliberately slow.
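One way to detect first-of-pattern mechanically is to fingerprint everything that defines a pattern and compare it against the set of fingerprints a reviewer has already approved. This is a sketch under that assumption; `pattern_fingerprint` and `needs_review` are hypothetical helpers, and the rule body is assumed to be JSON-serializable.

```python
import hashlib
import json


def pattern_fingerprint(rule: dict, source_columns: list[str], target: str) -> str:
    """Stable hash over the rule body, the source shape, and the target universe.

    Any change to a transformation, a new column on the source, or a new
    target produces a different fingerprint.
    """
    payload = json.dumps(
        {"rule": rule, "source_columns": sorted(source_columns), "target": target},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def needs_review(fingerprint: str, approved: set[str]) -> bool:
    # Steady state: the fingerprint is already approved and records flow.
    # First-of-pattern: anything new or changed waits for a human sign-off.
    return fingerprint not in approved
```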

Teams resist this most. "It'll slow us down." It will, when you create a new rule. It won't, when the pipeline runs. It's the cost you pay once per pattern in exchange for an audit posture that doesn't fall over.

Why this is harder than "redact PII before export"

The redaction framing reduces the problem to a column list. Strip these fields, ship the rest. It is operationally simple, and it has been the dominant pattern for as long as I have built data pipelines.

The downgrade framing forces a different conversation. It starts from the regulated universe's obligations and asks what it would take to release a record from them. That is rarely a column-list answer. It is a question about re-identification risk, combination effects, downstream joins the lower-trust universe might perform, the regulated universe's deletion semantics following the record across the boundary, and the version of the rule and source shape under which the record was downgraded.

A redaction step cannot answer those. A downgrade rule is the artifact that can. The difference is whether the cross-boundary story is a function in a script, which a single engineer can change, which leaves no provenance trail, which the auditor cannot read, or a versioned, owned, reviewed, enforced, audited artifact the regulated universe authored on purpose.

The teams I see ship this well treat the boundary like the regulator already thinks of it. The regulated universe is a closed system with an obligation. The lower-trust universe is a different system without it. Anything crossing between them is, at the moment of crossing, a deliberate release, and a deliberate release is a decision, made by the right people, with provenance. Not a side effect of an ETL job nobody has read in a year.

If your platform handles regulated data and your cross-boundary story is "we redact before export," start with one rule. Pick the highest-volume transfer. Write the rule down. Put it in the standards repo with an owner and a review date. Wire the enforcement so no other path crosses. Turn on the audit log. Run the approval step when you change anything. The first rule takes a quarter; every rule after takes days. The audit conversation that follows is the one I wish I had been ready for the first time.

The data that crosses the boundary is not the data. It's a downgrade. Build like that's what it is.

Sid

The hybrid sync pattern: how cloud and local actually talk

Sid Smith — Mon, 25 May 2026 13:00:00 GMT

The hybrid split between cloud and local is easy to draw on a whiteboard and tricky to make actually run. You sketch a cloud box on the left, a Mac Studio box on the right, an arrow between them labelled "sync," and everyone nods. Then you sit down to build it, and the arrow turns out to be six different arrows doing six different things, and you have to pick the right wire for each one or the whole thing turns into a flaky mess of cron jobs and SSH tunnels.

This piece is the wiring. I'll walk through the actual mechanisms that move work and data between a cloud-side product (the architecture spine we've been describing: Lambda, API Gateway, RDS with pgvector, S3, Bedrock, EventBridge, SQS) and a Mac Studio in the corner doing batch inference, evals, fine-tuning, and image generation. The patterns are concrete. The code stays at the shape level, enough that you can build it, without me writing your boto3 boilerplate for you.

Hybrid sync wiring

The framing throughout: the cloud doesn't reach into your house, and your house doesn't reach into the cloud. Both sides hit AWS services that act as the meeting point. SQS is the inbox. S3 is the warehouse. EventBridge is the alarm clock. That's the model. Everything else falls out of it.

Cloud-to-local: SQS as the pull point

When the cloud has work for the local rig (a batch eval, a transcription job, a fine-tune kickoff, an image-generation request from the back-office UI), the cloud doesn't try to push it. Pushing means the cloud has to know your home IP, get past your router, and authenticate against something running on your Mac. That's a security and reliability swamp.

The pattern is pull. Cloud-side, a Lambda drops a message onto an SQS queue. That message is small (a few kilobytes) and contains a job descriptor: type, ID, parameters, and an S3 location for any large inputs. SQS, short for Simple Queue Service, is AWS's hosted queue: producers drop messages in, consumers pull them out, with at-least-once delivery semantics, if you want to look it up later.

Mac Studio side, a poller process runs under launchd; polling every thirty seconds is a reasonable cadence for batch work. The poller calls ReceiveMessage on the queue, processes whatever it gets, and calls DeleteMessage when it's done. If it crashes mid-process, the message becomes visible again after the visibility timeout, and either this poller or its restarted self picks it up. The reliability comes from idempotency: every job descriptor includes a stable job ID, and the local processor checks "have I already done this one?" before starting, using either a local SQLite ledger or a small entry in S3.

The shape of the local poller, conceptually:

loop:
  msgs = sqs.receive_message(queue, max=10, wait=20)
  for msg in msgs:
    job = parse(msg.body)
    if already_done(job.id): sqs.delete_message(msg); continue
    inputs = s3.get(job.input_uri) if job.input_uri else None
    result = run_job(job, inputs)
    s3.put(job.output_uri, result)
    mark_done(job.id)
    sqs.delete_message(msg)

Notice what's not there. There's no inbound port open on the Mac Studio. There's no WebSocket. There's no cron that wakes up at weird times. There's a poller that asks "is there work?" every thirty seconds. SQS's long-poll (wait=20) means the call blocks until either a message arrives or twenty seconds pass, so the API call count stays sane.
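The already_done and mark_done checks in the loop above could be backed by the SQLite ledger mentioned earlier. Here is a minimal sketch of that, using only Python's standard library. The table name, the column names, and the handle wrapper are my assumptions for illustration, not a fixed schema.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open (or create) the ledger: one row per finished job ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done_jobs ("
        "job_id TEXT PRIMARY KEY, "
        "finished_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def already_done(conn, job_id):
    """True if this job ID has been recorded as finished."""
    row = conn.execute(
        "SELECT 1 FROM done_jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row is not None


def mark_done(conn, job_id):
    """Record the job as finished; repeating the call is harmless."""
    conn.execute("INSERT OR IGNORE INTO done_jobs (job_id) VALUES (?)", (job_id,))
    conn.commit()


def handle(conn, job_id, run_job):
    """Run the job once. Returns True if it ran, False on a redelivery."""
    if already_done(conn, job_id):
        return False
    run_job()
    mark_done(conn, job_id)
    return True
```

On the Mac Studio you would pass a file path instead of ":memory:" so the ledger survives a restart. Note that a crash between run_job and mark_done still means the job runs again on redelivery, so the job itself should be safe to repeat.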

For a financial advisor productizing a portfolio-diagnosis routine, the typical cloud-to-local flow is: customer uploads a portfolio CSV, cloud Lambda drops a "diagnose-portfolio" message on SQS pointing at the CSV in S3, the Mac Studio polls, runs the locally fine-tuned classification model against the holdings, writes the structured diagnosis back to S3, marks the message done. The cloud picks up the result on the next pass (more on how, in a moment).

EventBridge for scheduled work

SQS is the right fit for "the cloud has a piece of work for the local rig, do it whenever you can." It's not the right fit for "run the nightly eval at 2 a.m." or "retrain the secret-sauce model every Sunday." That's what EventBridge schedules are for.

EventBridge can fire a scheduled rule that drops a message onto an SQS queue, hits a Lambda, or pings any other AWS target. For the local-side scheduled work (nightly evals, weekly fine-tunes) the simplest pattern is: EventBridge rule fires on a cron expression, target is the same SQS queue the local poller is reading. The message body declares the scheduled job type. The Mac Studio sees it on the next poll and runs it.

This means everything the Mac Studio does is dispatched through SQS, whether it came from a customer event or a scheduled job. One consumer, one inbox. The Mac Studio doesn't need its own cron table; the schedule lives in AWS, version-controlled in your CDK, and it's the same control plane the cloud uses for everything else.
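One way to picture "one consumer, one inbox" is a small dispatch table keyed on the job type in the message body. This is a sketch under assumptions: the JSON field names ("type", "id") and the "nightly-eval" job type are hypothetical, and the handlers are placeholders for real work.

```python
import json

# Maps a job type string to the function that runs it.
HANDLERS = {}


def handles(job_type):
    """Decorator that registers a function as the handler for one job type."""
    def register(fn):
        HANDLERS[job_type] = fn
        return fn
    return register


@handles("diagnose-portfolio")  # customer-triggered job
def diagnose_portfolio(job):
    return f"diagnosed {job['id']}"


@handles("nightly-eval")  # hypothetical name for a scheduled job
def nightly_eval(job):
    return f"evaluated {job['id']}"


def dispatch(message_body):
    """Route one message body to its handler, whoever put it on the queue."""
    job = json.loads(message_body)
    handler = HANDLERS.get(job["type"])
    if handler is None:
        raise ValueError(f"unknown job type: {job['type']}")
    return handler(job)
```

The poller never needs to know whether a message came from a customer event or a schedule. Adding a new scheduled job is a new handler plus a new rule on the AWS side.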

The marketing strategist's productized brand-positioning method gets the benefit here. Nightly, EventBridge fires a "regenerate brand-asset library" message. Mac Studio polls, picks it up, runs mflux to produce a fresh batch of hero images and social cards based on the latest brand voice fine-tune, writes them to S3 under a versioned prefix, marks done. The cloud-side product just reads the latest version when the customer asks for assets.

S3 as the shared warehouse

Both sides read from and write to S3. That's where anything bigger than a few KB lives. The structure of the bucket matters more than the bucket itself.

The pattern I default to: one bucket per environment (eotm-prod, eotm-staging, eotm-dev), with prefixes carving up the namespace. Roughly:

inputs///...
outputs///...
artifacts/models///...
artifacts/eval-sets///...
artifacts/manifests/.json

Inputs are written by the cloud, read by the Mac Studio. Outputs the other way. Artifacts (model weights, eval golden examples, training data) are written by whichever side trained them and read by the other side as needed.

The two sides don't share credentials. Cloud-side IAM roles let Lambdas read inputs, write outputs, and read artifacts. Mac Studio-side credentials are scoped to a dedicated IAM user with a long-lived access key (stored in the Mac's keychain) that can read inputs, write outputs, and write artifacts under specific prefixes. The Mac Studio cannot read customer data outside the input prefix it was told about. That separation is the audit boundary: the Mac Studio sees only what the cloud explicitly handed it, and only for the job in question.
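To make the Mac Studio side of that split concrete, here is a rough sketch of the policy document as a Python dict. The function name and the Sid labels are mine; the prefixes and the eotm-dev bucket name come from the layout above. Treat it as a starting point, not a reviewed policy.

```python
import json


def mac_studio_policy(bucket):
    """Build an IAM policy document scoped to the prefixes described above."""
    def arn(prefix):
        return f"arn:aws:s3:::{bucket}/{prefix}/*"

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInputs",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [arn("inputs")],
            },
            {
                "Sid": "WriteOutputsAndArtifacts",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [arn("outputs"), arn("artifacts")],
            },
        ],
    }


print(json.dumps(mac_studio_policy("eotm-dev"), indent=2))
```

This version grants read on the whole inputs/ prefix. Narrowing it to the prefix for a single job would need tighter resource patterns or short-lived scoped credentials, which this sketch doesn't attempt.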

Want to go deeper on how this connects to retrieval and the secret sauce? The artifact layout above is the same one that "retrieval is the secret sauce surface" assumes for embeddings and corpus files, and the cost story for S3 (free until you egress) lives in the cost model piece from earlier this week.

Local-to-cloud: signed manifests and event triggers

The trickier direction is local pushing results back into the cloud product. The naive version is "Mac Studio writes results to S3, cloud product polls S3 for new files." That works for small scale. It falls apart for two reasons: polling is inefficient, and you want the cloud to react to a result landing, not discover it ten minutes later.

The pattern I use: signed manifest files kick off downstream events.

The Mac Studio finishes a job. It writes the actual output (say, a trained model file or a JSON diagnosis) under outputs///. Then, as the last step, it writes a small manifest.json next to it. The manifest contains: the job ID, the input it consumed, the output paths, a timestamp, and an HMAC signature using a key shared between cloud and local. HMAC is a way of cryptographically signing a small payload with a shared secret, so the receiver can verify the sender knew the secret, if you want to look it up later.

The manifest write is the trigger. S3 has an event notification rule set up on outputs/*/manifest.json keys; when one lands, S3 fires an EventBridge event. A cloud-side Lambda picks it up, verifies the HMAC (rejecting the event if the signature doesn't match), and then dispatches whatever the downstream work is: update the database row, notify the customer, ping the consultant's supervisor queue, kick off the next stage of the pipeline.

The shape, conceptually:

# local side, end of run_job:
write_output(output_path, result)
manifest = {
  job_id, input_uri, output_uri,
  timestamp, content_hash, signature: hmac(shared_key, ...)
}
s3.put(manifest_path, json(manifest))
# that put triggers S3 -> EventBridge -> Lambda

The signature matters. Without it, anyone with write access to the bucket could drop a manifest and trigger cloud-side actions on data they fabricated. With it, the cloud Lambda has a cheap verification step that proves the manifest came from a process that holds the shared key, which lives only in the Mac Studio's keychain and in Secrets Manager on the cloud side, never in the bucket.
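The sign-and-verify step can be sketched with Python's standard hmac module. The assumptions here are mine: HMAC-SHA256 as the algorithm, a hex-encoded signature, and a sorted-key JSON serialisation so both sides sign identical bytes. The manifest field names follow the pseudocode above.

```python
import hashlib
import hmac
import json


def canonical_bytes(fields):
    """Serialise deterministically so both sides sign the same bytes."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(fields, shared_key):
    """Return a copy of the manifest with a hex HMAC-SHA256 signature added."""
    signature = hmac.new(shared_key, canonical_bytes(fields), hashlib.sha256).hexdigest()
    return {**fields, "signature": signature}


def verify_manifest(manifest, shared_key):
    """Recompute the signature over every field except the signature itself."""
    fields = {k: v for k, v in manifest.items() if k != "signature"}
    expected = hmac.new(shared_key, canonical_bytes(fields), hashlib.sha256).hexdigest()
    claimed = str(manifest.get("signature", ""))
    # compare_digest avoids leaking how much of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), claimed.encode("utf-8"))
```

The local side would call sign_manifest just before the manifest write, and the verifier Lambda would call verify_manifest first and drop the event on False. Any change to a signed field, or a manifest written without the key, fails the check.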

For a career coach packaging their resume-positioning review, the round trip looks like: customer uploads a resume, cloud drops a "review-resume" SQS message, Mac Studio runs the locally fine-tuned positioning model, writes the structured review to S3, writes a signed manifest. Manifest landing triggers a Lambda that verifies the signature, writes the review into the customer's RDS row, marks the job complete, and emails the customer. The cloud product never reaches into the Mac Studio. The Mac Studio never reaches into the cloud product. Both reach into AWS services in the middle.

Model artifacts, the special case

The most important local-to-cloud flow is also the simplest: trained model artifacts.

The Mac Studio fine-tunes a small model on the consultant's annotated examples (the secret sauce). The output is a model file (call it model-v23.safetensors) plus a config. The Mac Studio writes it to artifacts/models//v23/ and then writes a manifest with the version pointer.

Cloud-side, the Lambda that runs inference doesn't know about v23 yet. It loads whatever model version it has cached. The handoff is via cold-start. When a Lambda execution environment cold-starts, its init code reads artifacts/models//current (a small pointer file) to find the version it should load, then downloads that version into the Lambda's temp directory and loads it. The pointer file is what gets updated when a new version is ready; the manifest-trigger Lambda is responsible for swapping it.

This means rolling out a new model is two writes: write the new version artifacts, write the new pointer. Existing warm Lambdas keep serving the old model until they recycle. New cold-starts pick up the new one. There's a brief mixed-version window which is fine for the kind of workloads we're talking about; if you need atomic cutover, you flush the Lambda concurrency, but for an MVP you don't.

This pattern also gives you rollback for free. The old version still sits in S3. Repoint the pointer file at the old version. Next cold-start, you're back on the previous model. No deploy, no CDK, no panic.
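The pointer-file mechanics are small enough to sketch with a dict standing in for the bucket. Everything named here is hypothetical: the ModelStore class, the function names, and the model.safetensors filename. The prefix is passed in by the caller, so the sketch makes no claim about the real key layout.

```python
class ModelStore:
    """Dict-backed stand-in for the bucket; keys mimic object keys."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body

    def get(self, key):
        return self.objects[key]


def publish(store, prefix, version, weights):
    """Two writes: the versioned artifact first, then the pointer."""
    store.put(f"{prefix}/{version}/model.safetensors", weights)
    store.put(f"{prefix}/current", version.encode("utf-8"))


def rollback(store, prefix, version):
    """Repoint at a version that is still sitting in the store."""
    if f"{prefix}/{version}/model.safetensors" not in store.objects:
        raise KeyError(f"{version} was never published under {prefix}")
    store.put(f"{prefix}/current", version.encode("utf-8"))


def cold_start_load(store, prefix):
    """What init code would do: read the pointer, then fetch that version."""
    version = store.get(f"{prefix}/current").decode("utf-8")
    return version, store.get(f"{prefix}/{version}/model.safetensors")
```

Writing the artifact before the pointer matters: a cold start that lands between the two writes still finds the previous pointer and a complete previous version.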

What doesn't go through this wiring

Two things deliberately stay off the hybrid path.

Customer-facing inference. When a customer's query needs an LLM response in real time, it goes to Bedrock, not the Mac Studio. The Mac Studio's latency is fine for a 30-second batch job; it's not fine for a 1.5-second customer interaction. The hybrid path is for batch, scheduled, and back-office work.

Anything containing raw customer PII the Mac Studio doesn't need. The cloud side scrubs and tokenises before the SQS message gets created. If the Mac Studio is doing a job that doesn't need the customer's name and email, it doesn't get them. This is a habit thing; it's easy to be sloppy here when nobody's watching. The day you have a regulator asking where data flows, you'll want to point at the SQS message format and say "those are the only fields that ever cross."
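A cheap way to enforce that habit is an allow-list at the point where the message is built, so a new PII column can't leak by default. This is a sketch under my own assumptions: the field names, the customer_email input, and the keyed-hash token scheme are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical allow-list: the only record fields permitted to cross.
ALLOWED_FIELDS = ("type", "id", "params", "input_uri", "output_uri")


def tokenise(value, key):
    """Replace an identifier with a stable keyed token (not reversible)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def build_message(record, key):
    """Copy allow-listed fields only; refer to the customer by token."""
    message = {k: record[k] for k in ALLOWED_FIELDS if k in record}
    if "customer_email" in record:
        message["customer_token"] = tokenise(record["customer_email"], key)
    return json.dumps(message, sort_keys=True)
```

The tokenising key stays on the cloud side, so the Mac Studio can group work by customer without ever seeing who the customer is.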

The whole shape

Pull all of this together and the hybrid sync pattern is six pieces:

SQS queue as the cloud-to-local inbox. Pull, not push.
EventBridge schedules as the alarm clock for scheduled local work, dropping into the same queue.
S3 prefixes as the shared warehouse for inputs, outputs, artifacts, and manifests.
Signed manifest writes as the local-to-cloud trigger mechanism, via S3 → EventBridge → Lambda.
HMAC verification as the lightweight integrity check on every manifest the cloud picks up.
Cold-start artifact loading as the model-handoff mechanism, with a pointer file enabling instant rollback.

Six things, each doing one job, each easy to reason about independently. The whole pattern fits in maybe four hundred lines of code across both sides, including the poller, the manifest writer, the verifier Lambda, and the IAM scaffolding in your CDK. The wiring is small. The clarity is large.

If you're building this, my one ask: get the signed manifest pattern in from day one. Polling-based versions of this work, sort of, until they don't. The day you go to production and a customer's "is it done yet?" answer depends on a 5-minute poll interval, you'll regret not having the event trigger. Build the trigger now; thank yourself later.

AI in the news — week of May 24, 2026

Sid Smith — Sun, 24 May 2026 16:02:57 GMT

Week ending Sunday May 24. Google I/O was the tech story; the keynote on Tuesday May 19 shipped roughly what was leaked plus a couple of surprises. Meta's 8,000-job cut went live on Wednesday, with a $21B CoreWeave commitment behind it. The Boston Consulting Group dropped a number on the power story (data centres at two-thirds of US home electricity by 2030) that puts the build-out dimensions into perspective. And the Stratos campus in Utah and the Florida proposal pipeline both moved.

Google I/O 2026: Gemini 3.5, Antigravity 2.0, Code Mender, glasses. May 19.

The keynote ran two hours. The Developers Blog has the full roundup. The pieces that matter, in order of how much I'll use them.

Antigravity 2.0. The agent-first developer platform shipped with a CLI that spins up specialised subagents inside cross-platform terminal sandboxing, with credential masking and hardened Git policies built in. Google's demo built a working OS in 12 hours using 93 parallel subagents, 15K+ model requests, 2.6B tokens, and under $1K in API credits. Marketing exercise, but the cost line is the genuinely interesting number. The sandboxing and credential masking are what make this usable from inside a one-person-shop CI loop without rewriting auth flow first. This is the announcement that changes my workflow this week.

Gemini 3.5 across the consumer surfaces. AI Mode in Search at one billion monthly active users in twelve months, per Sundar's keynote post. That is the Search transition that analysts were budgeting three years for. The 1B-MAU number is the structural item this week.

Code Mender. A security tool that finds vulnerabilities and ships patches. Defender-first framing, which is the right framing. The question is whether Google ships it as widely-available infrastructure or holds it in the Cloud-tier feature set. The former is good for the long-tail open-source ecosystem; the latter is good for Google's gross margin.

Intelligent Eyewear, shipping this fall. Two configurations (audio-only, and audio plus an in-frame display) with partners Warby Parker, Gentle Monster, and Samsung. Gemini runs natively for translation, landmark recognition, equation solving. The demos were good. The privacy story for always-on cameras and microphones is still unresolved. I'll wait for the real-world deployment reviews.

Tom's Guide and Engadget both ran live blogs that are useful if you want the second-by-second.

Meta cuts 8,000, $21B to CoreWeave. May 20.

CNBC has the Wednesday rollout. The 8,000-job cut went live as scheduled, about 10% of the company, with an additional 6,000 open roles frozen. Zuckerberg memo: "Success isn't a given. AI is the most consequential technology of our lifetimes." About 7,000 employees are being moved into new AI-focused roles. The cuts landed across Reality Labs, recruiting, sales, and global operations.

The capex line is the part to read alongside the layoff. Meta has lifted 2026 capex guidance by up to $10B to $145B and committed $21B through December 2032 to AI cloud provider CoreWeave. 24/7 Wall Street framed the layoffs as a line item in the AI bill, which is uncomfortable language but accurate. The Next Web has the $56B quarterly revenue context.

The aggregate. TrueUp's tech-layoff tracker reads 142,985 cuts through this week, running at about 1,000 per day. Layoffs.fyi's narrower count is at roughly 113,000. Both methods are credible; the gap is what each counts as a "tech" company. The 2025 full-year Layoffs.fyi total was about 122,000. We will clear that with seven months to spare at the current rate.

My read. The Zuckerberg memo language is the new standard formulation across the cohort: not "AI is doing the work" but "we are not promised success, we cannot afford people we used to afford." Same structural outcome for the people losing the job, different argument about what the company is doing. The pace keeps outrunning the realistic-view forecasts, including mine.

The power story: BCG drops a number, PJM says "years, not decades"

The Boston Consulting Group published the forecast that has been making the rounds this week. AI data centres will consume the same electricity as roughly two-thirds of all US homes by 2030. Their underlying math: data-centre electricity consumption tripling from ~130 TWh in 2022 to ~390 TWh in 2030, with ~70 TWh of that increase attributed specifically to generative AI. Amazon, Google, Meta, and Microsoft are collectively spending roughly $400B per year on AI infrastructure, per Yahoo Finance's summary.

PJM Interconnection, the grid operator covering 65 million Americans from Virginia to Illinois, published a white paper this week saying it has "years, not decades" to fundamentally restructure. New CEO David E. Mills, who took the job May 1, wrote in the foreword that "the current situation is not tenable." The Register has the breakdown and TechCrunch has the operator-level primer. The price evidence: the 2025/2026 PJM capacity auction cleared at $269.92/MW-day (up from $28.92 the year before), the 2026/2027 auction hit the FERC cap at $329.17/MW-day, and the 2027/2028 auction cleared at the cap with a supply shortfall — PJM could not buy enough capacity at maximum price.

My read. The training-cluster question for 2027 is no longer chip availability; the chips will be there. The grid is the constraint, and the auction-cap-with-shortfall print is the strongest market signal of that I have seen. Expect more developers to go off-grid with on-site generation, and expect more sites to chase regions with newer transmission capacity (Texas, Wyoming, the upper Midwest) over Northern Virginia. That second move is already visible in the announced-project map; the Stratos campus below is the more dramatic version of the same pattern.

Stratos in Utah; the Florida pipeline

O'Leary Digital's Stratos campus in Box Elder County, Utah is the buildout to watch on the on-site-generation pattern. Tom's Hardware has the shape: 9 GW at full buildout, generated on-site from natural gas off the Ruby Pipeline, 40,000 acres of unincorporated county land plus 1,200 acres of state and military land, projected cost north of $100B over the life of the build, first gigawatt targeted within two years. The county commission approved on May 4. Demonstrations followed: a few hundred people on May 14 delivered a petition with 7,000+ signatures, and more than 600 rallied again on May 23. Utah News Dispatch has the coverage. I'll watch whether construction proceeds at the spec the developer is pitching, and whether the on-site-gas model gets replicated by other developers facing grid-queue bottlenecks.

Florida's pipeline keeps getting denser. Fox 13 Tampa Bay reports Fort Meade in Polk County approved a $2.6B, 4.4 million-square-foot data centre on April 15, developer Stonebridge, on former phosphate land, with a 20-year development agreement and a $150M tax break. Project Tango in Palm Beach County (202 acres near Loxahatchee) had its zoning hearing postponed from April 23 to July 15 for additional impact studies. Active proposals across at least seven more counties are tracked on floridadatacenters.org. State-level coverage: Florida Phoenix has a bill explainer.

Rack-level power is its own story

The substation is one half of the power story; the rack is the other. Compute Forecast has the trend lines: industry-average rack density at 27 kW in 2026, a 69% year-on-year jump, driven by NVIDIA Hopper and Blackwell deployments. The latest GB200 racks pull 132 kW fully loaded. Next-gen NVIDIA platforms keep moving the ceiling: Vera Rubin Ultra projected at 600 kW per rack, Feynman Ultra at 1.2 MW per rack by 2029. Liquid cooling is now the default for AI deployments, not the upgrade. Tech Zine has the forward look out to 4 MW per rack and the high-voltage shift that makes those numbers possible.

My read. The political opposition lives at the substation and the watershed. The capability problem lives at the rack. Every additional kilowatt of rack density compresses the timeline at the substation end of the wire, which is the squeeze PJM is reporting at the auction.

Smaller items

AI Mode in Search hit 1B MAU in 12 months, per Sundar's I/O post — the structural Search-transition number.
Antigravity 2.0 CLI ships with the credential masking and Git policies that make agent platforms usable in production CI loops.
NVIDIA Vera Rubin Ultra projected at 600 kW per rack; Feynman Ultra at 1.2 MW by 2029.
Stonebridge's Fort Meade build went through despite opposition from 40 of 41 public commenters.
The Deseret News has a running tracker of state-level data-centre moratoriums and ballot initiatives — useful as site-planning context.

Looking ahead

WWDC is June 8, so Apple's on-device-AI story starts landing in next week's roundup. The first wave of teams trying Antigravity 2.0 + the CLI in production should produce real-world reads by Friday. Project Tango's July 15 hearing is the next big data-centre vote. The power-supply story keeps growing — I expect it stays in the rotation for several months.

Sources

Google I/O 2026 — Developers Blog roundup, Sundar's keynote post, Tom's Guide live blog, Engadget live blog.
Meta layoffs + capex — CNBC: Zuckerberg memo, 24/7 Wall Street: $145B AI bill, The Next Web: $56B revenue context, SF Standard: Meta morale.
Layoff aggregates — TrueUp tracker, Layoffs.fyi.
Power and grid — BCG: solving the data-centre power crunch, Yahoo Finance: AI data centres + $400B capex, The Register: PJM grid reality, TechCrunch: PJM under strain.
Utah Stratos — Tom's Hardware: project overview, Utah News Dispatch: May 23 rally, Utah News Dispatch: May 14 rally + petition, CNN: project + opposition, KUER: county vote.
Florida pipeline — Fox 13: Fort Meade approval, CW34: Project Tango delay, Florida Phoenix: bill explainer, floridadatacenters.org.
Rack-level density — Compute Forecast: rack density trends, Tech Zine: 4 MW per rack outlook, Substack: 1 MW rack milestone.
State tracker — Deseret News moratoriums and ballot initiatives.

Deployment: IaC, CI/CD, environments, the minimum shape

Sid Smith — Sun, 24 May 2026 13:00:00 GMT

There's a thing that happens about three weeks into shipping an AI MVP. The product works. The prompt is dialled in. The first pilot users have started clicking around. And you've quietly accumulated a small zoo of resources in the AWS console that you created by hand, in no particular order, while you were trying to get the thing to work. There's a Lambda you can't remember configuring. There's an RDS parameter group with a name like default-pg15-2. There's an S3 bucket whose lifecycle rule you set up at midnight one Tuesday.

This is the part where most MVPs either get on top of their deployment story or get crushed by it. Not because the AWS console is bad. Because the second environment is when the bill comes due. The day you want a staging environment that mirrors production, you discover that you don't actually know what production looks like; you only know what's currently running, which is not the same thing.

Three envs one pipeline

The fix is boring and the fix is well-known: infrastructure as code, version-controlled, deployed by a pipeline. The interesting question is how small you can make that setup and still ship safely. This piece is the minimum I'd actually do for an AI MVP. Not the gold-plated version. Not the "what FAANG does" version. The version that fits a one-person or three-person team and a real budget.

If you've been following the architecture spine through this series (Lambda, API Gateway, Cognito, RDS with pgvector, S3, Bedrock, EventBridge, SQS, CloudWatch, and a Mac Studio on the local side), this piece is how you actually wrap a build-and-deploy pipeline around it.

Pick one tool, and pick CDK

Terraform and CDK both work. They both produce reproducible AWS infrastructure. They both have warts. The argument over which is "better" has been running for years and will run for more. For an AWS-native MVP, pick CDK and stop debating.

The reason isn't religious. It's that CDK lets you write infrastructure in TypeScript (or Python, but I default to TypeScript here), which means your Lambda code and your infrastructure code share a language, a linter, a test runner, and a type system. The Constructs library (the standard CDK pre-built bundles) removes a stack of boilerplate. A RestApi construct wires API Gateway, CloudWatch logging, throttling, and stages in twelve lines. A DatabaseCluster construct wires RDS, security groups, subnet groups, and parameter groups in twenty. CDK stands for Cloud Development Kit; it's AWS's official "write infra as code in a real programming language" tool, if you want to look it up later.

You also get the synth-then-apply rhythm: cdk synth outputs the CloudFormation, cdk diff shows you what's about to change, cdk deploy applies it. The diff step is the single most valuable habit you can build. Read the diff every time. The day you skip reading it is the day you delete a production database.

Terraform isn't worse, exactly. It's broader: it works across clouds, has a stronger community in multi-cloud shops, and the state-file model is more transparent. If your team already knows Terraform cold, use Terraform. If you're choosing fresh and you're on AWS, CDK pays back faster because the Constructs library is genuinely good and the type safety catches mistakes the HCL-and-YAML world doesn't.

Pick one. Don't run both. Don't half-CDK-half-console. The split makes the pipeline brittle and the audit trail useless.

Three environments, the way they actually need to differ

The standard answer is dev / staging / prod, and that's right, but the standard explanation undersells how different they need to be from each other. Let me walk through what they're really for.

Dev is one developer's playground. Every developer should be able to spin up a personal copy (cdk deploy --context env=dev-sid, for example) and tear it down without affecting anyone else. The data is fake. The Bedrock calls are real but cheap (point at Haiku models, low rate limits). RDS is a tiny instance. There's no Multi-AZ. The point is iteration speed, not durability.

Staging is one shared environment that mirrors production's shape. Same instance sizes (cheaper tier if you must, but same topology). Same secrets pattern. Same observability wiring. The difference is the data (staging gets synthetic data or anonymised production samples) and the customers, of whom there are zero. Staging exists so the CI pipeline can deploy to it, run the eval harness against it, run integration tests against it, and let humans poke at the actual UI. Staging is the thing CI breaks if it's going to break.

Prod is what customers touch. Stricter alarms. Full backups. Whatever Multi-AZ or read replicas you've decided you need. Locked down IAM. The blast radius matters here in a way it doesn't in dev or staging.

The shape of the differences matters more than the count of environments. I've seen teams run six environments where they were all subtly different from each other in ways nobody documented. Three environments that are deliberately the same in everything that matters beats six that drift.

The CI/CD pipeline: GitHub Actions, three workflows, no drama

The pipeline I land on for this kind of MVP has three GitHub Actions workflows.

pr.yml runs on every pull request. Lint, unit tests, cdk synth, cdk diff against staging. The diff gets posted as a PR comment so reviewers see exactly what infrastructure is about to change. No deploy happens.

deploy-staging.yml runs on merge to main. It deploys CDK to staging, runs the eval harness against staging Bedrock endpoints, runs integration tests, and if everything passes, tags the commit staging-passed. The eval harness is part of the gate, not a separate concern. A regression in model quality is a regression. You can read more about that bit in the eval harness piece.

deploy-prod.yml runs on manual trigger (a workflow_dispatch with a commit SHA). It only accepts SHAs that carry the staging-passed tag. It deploys, runs a smoke test, and pings a Slack channel. Manual trigger because production deploys should be a person saying "yes, do it now," not a side effect of a merge.

That's the whole CI/CD shape. Three workflows. The most complex one is maybe two hundred lines of YAML. Don't build the elaborate version on day one. Most of the hard problems people solve in CI/CD (canary deployments, automated rollback, blue/green) are solving problems you don't have yet at MVP scale. Add them when you feel the pain.

Want to see how the deploy gate interacts with prompt versioning? The prompt-versioning approach in "prompts as code" and the audit story in "observability and audit, not later" both depend on the deploy pipeline being able to roll back cleanly. They're sibling concerns to this one.

Migrations on RDS, the part everyone hand-waves

The CDK side of RDS is easy. The migrations side is where I see MVPs hurt themselves.

Here's the rule I'd ask of any team: migrations are not part of the CDK deploy. They are a separate step, run as a job, with its own logging and its own retry semantics. CDK builds the database; a migration tool changes the schema inside it.

I default to Flyway or migrate (the Go one) for this; pick whichever your team already knows. The migrations live in db/migrations/ in the same repo, numbered sequentially, with up-only files (down-migrations look great in slides and ruin you in production). The CI pipeline has a separate job, migrate-staging, that runs after the CDK staging deploy but before the eval harness, so the eval runs against the new schema. The same shape exists for prod: migrate-prod runs before the prod deploy is considered done.

The reason migrations need to be their own job is that they have failure modes the CDK deploy doesn't. They can deadlock against running queries. They can run for forty minutes on a big table. They can succeed but leave the application's data in a state nobody expected. Wrapping all of that inside cdk deploy makes the failure mode opaque and the rollback impossible. Pulling it out gives you a job you can rerun, monitor, and reason about independently.

For an IT ops consultant productizing their triage tree, the kind of migration that gets you is something like "add a severity column to the tickets table." It's three lines of SQL. The thing that ruins you isn't the SQL, it's that the new application code expects the column to exist, gets deployed before the migration runs, and starts throwing 500s on every customer query. The fix is the boring rule: migration runs first, app code runs second. Bake the order into the pipeline.
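Flyway and migrate both track applied versions for you, so you wouldn't write this yourself. Still, the "numbered sequentially, up-only" rule is easier to see in a few lines. This sketch assumes a hypothetical file naming scheme (0001_create_tickets.sql and so on), which is not the exact convention of either tool.

```python
import re

# Hypothetical naming scheme: a numeric version, an underscore, a description.
MIGRATION_NAME = re.compile(r"^(\d+)_[a-z0-9_]+\.sql$")


def pending_migrations(filenames, applied_versions):
    """Return unapplied migration files in ascending version order."""
    found = {}
    for name in filenames:
        match = MIGRATION_NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected file in migrations dir: {name}")
        version = int(match.group(1))
        if version in found:
            raise ValueError(f"duplicate migration version {version}")
        found[version] = name
    return [found[v] for v in sorted(found) if v not in applied_versions]
```

The pipeline rule from above then reads as: apply everything pending_migrations returns, and only start the app deploy once that list is empty.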

Secrets, three places, one pattern

Every environment has the same kind of secrets: database passwords, Bedrock API quotas, third-party API keys, signing keys for the audit trail, SMTP credentials for transactional email. The MVP-grade answer is Secrets Manager, with a clean naming convention.

The naming convention I use is eotm///, so eotm/prod/rds/admin-password, eotm/staging/bedrock/throttle-config, and so on. CDK creates the secret resources; the actual values get rotated through Secrets Manager directly, never through CDK and never through a checked-in file. Secrets Manager is AWS's hosted store for sensitive config values; it integrates with KMS for encryption and IAM for access control, if you want to look it up later.

The application Lambdas reference secrets by ARN, not by value. The value is fetched at cold-start (cached for the warm-start window) using the AWS SDK. The Lambda's execution role has IAM permission to read only the secrets for its environment. A prod Lambda cannot read staging secrets, full stop. This separation matters more than people think; the day someone runs a staging test that hits a prod resource by mistake is the day this boundary saves you.
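The fetch-at-cold-start, cache-while-warm behaviour comes down to a module-level cache, since module state survives between invocations of a warm Lambda. In this sketch the fetch argument is injected so it runs without AWS; in a real function it would wrap the Secrets Manager GetSecretValue call. The handler shape and the event field are hypothetical.

```python
# Module-level state lives as long as the execution environment stays warm.
_SECRET_CACHE = {}


def get_secret(secret_arn, fetch):
    """Return the secret value, calling fetch(secret_arn) at most once per ARN."""
    if secret_arn not in _SECRET_CACHE:
        _SECRET_CACHE[secret_arn] = fetch(secret_arn)
    return _SECRET_CACHE[secret_arn]


def handler(event, context, fetch):
    """Hypothetical handler; the ARN would normally come from an env var."""
    password = get_secret(event["secret_arn"], fetch)
    # Never log or return the value itself; the length is only for the demo.
    return {"password_length": len(password)}
```

One trade-off to be aware of: a warm environment keeps serving the cached value after a rotation, until it recycles or you add an expiry to the cache.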

For local development, the pattern is the same with one substitution: developers use named AWS profiles to assume a dev role, the application Lambdas (when run locally via sam local or equivalent) fetch the dev secrets, and nobody (ever) copies a secret value into a .env file in the repo. If you need a local-only value for a workflow that runs entirely offline (say, a Mac Studio job for the legal pro auto-reviewing contracts against their playbook), it lives in the Mac Studio's local keychain, not the repo.

The cross-environment story matters: every secret used in prod has a staging twin, and the staging twin is what dev work points at. A new secret added in prod without a staging twin is a deployment that will break when it goes through the pipeline. Make the absence of a staging twin a CI failure. It's a six-line check; it saves you a 2 a.m. incident.
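The staging-twin check might look something like this. It takes a plain list of secret names (in CI that list would come from listing the secrets in Secrets Manager) and relies on the naming convention above. The function name is mine.

```python
def missing_staging_twins(secret_names, prefix="eotm"):
    """Return prod secret names with no same-path secret under staging."""
    prod_root = f"{prefix}/prod/"
    staging_root = f"{prefix}/staging/"
    staging_paths = {
        name[len(staging_root):]
        for name in secret_names
        if name.startswith(staging_root)
    }
    return sorted(
        name
        for name in secret_names
        if name.startswith(prod_root)
        and name[len(prod_root):] not in staging_paths
    )
```

In the pipeline, a non-empty result fails the job and prints the offending names, which is usually all the diagnosis anyone needs.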

What this gives you

Land all this and you've got a setup that does what it needs to: one CDK app, three environments with deliberate parity, a three-workflow GitHub Actions pipeline that runs evals as a deploy gate, RDS migrations run as their own auditable job, secrets in Secrets Manager with a clean per-environment IAM split.

You don't have canaries. You don't have automated rollback. You don't have a multi-region failover. You don't have blue/green. That's fine. You can ship without those for a long while. What you do have is the smallest setup that lets you change infrastructure on purpose, deploy on purpose, and trace what was deployed when, which is the bar an MVP actually needs to clear.

The day you outgrow this is the day you have ten customers, the eval harness is catching real regressions, and you start to want a canary because rolling back is too slow. That's a good problem. The setup above is the foundation you'd extend toward that, not throw away.

If you're starting today: pick CDK, write three GitHub Actions workflows, treat your migrations as a first-class job, put your secrets in Secrets Manager with a per-environment naming scheme. That's the minimum shape. It's not much. It's enough.

The cost model: what you pay before you have customers

Sid Smith — Sat, 23 May 2026 13:00:00 GMT

The number that matters before you have customers is not your revenue, your conversion rate, or your activation rate. You don't have any of those yet. The number that matters is how long your bank account lasts at the burn rate the architecture is quietly setting for you.

I've watched founders get into a real bind here. They picked a stack that looked free on the marketing page, shipped a working MVP, and three months in they're staring at a bill that says they're spending five hundred dollars a month to serve fourteen pilot users, twelve of whom are friends. The bill isn't huge in absolute terms. It's huge relative to the runway, and it's huge relative to the revenue, which is zero. They are paying retail to subsidise other people's curiosity.

Monthly cost stack

This piece is the runway math. I'll walk through what the AWS free tier actually covers for the AI MVP shape we've been building in this series (Lambda, API Gateway, Cognito, RDS with pgvector, Bedrock, S3, CloudFront, EventBridge, SQS, CloudWatch) and where the bill suddenly stops being "rounding error" and starts being "we should talk." Then I'll show where the Mac Studio side of the split earns its keep, and where it doesn't.

If you missed the earlier piece, the hybrid split between cloud and a local rig is the framing this whole article assumes. Customer-facing stuff lives in the cloud. Training, batch, eval, and back-office image generation live on the Mac Studio. The cost story is a big reason that split exists.

The free tier, what it actually gets you

The AWS free tier is more generous than people think for the always-free chunks, and meaner than people think for the twelve-month chunks. Let me sort which is which for the services this MVP uses.

Lambda is genuinely free for ages. A million requests a month and 400,000 GB-seconds of compute, always free, no twelve-month timer. For an AI product where the customer-facing API is mostly thin handlers calling Bedrock (the model does the work, the Lambda just orchestrates) you can run a real pilot inside that envelope. A Lambda that takes 800 ms and uses 512 MB of memory is about 0.4 GB-seconds per call. That gets you roughly a million calls a month before Lambda itself charges you a cent.
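The Lambda arithmetic above can be checked with a small sketch. It uses the two free-tier limits quoted in the text and integer math to avoid rounding noise; the function name is made up for illustration.

```python
FREE_GB_SECONDS = 400_000   # monthly always-free compute allowance
FREE_REQUESTS = 1_000_000   # monthly always-free request allowance


def free_calls_per_month(duration_ms, memory_mb):
    """Calls per month that fit inside both free-tier limits."""
    # GB-seconds per call = (memory_mb / 1024) * (duration_ms / 1000),
    # rearranged so the division happens once, on integers.
    by_compute = FREE_GB_SECONDS * 1024 * 1000 // (memory_mb * duration_ms)
    return min(by_compute, FREE_REQUESTS)


calls = free_calls_per_month(duration_ms=800, memory_mb=512)
```

Doubling the memory to 1,024 MB halves the compute-bound figure, which is why memory size is worth choosing deliberately even at MVP scale.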

API Gateway will take a bite earlier than you think. The free tier is one million REST API calls a month for the first twelve months, then it costs about $3.50 per million. It's not a lot. But if you're wiring a chatty front end that hits five endpoints to render one screen, you'll burn through the free million faster than your Lambda compute. HTTP APIs are cheaper than REST APIs, about a third the price. For an MVP I default to HTTP API and don't look back unless I need a feature that's REST-only.

Cognito is the surprise in your favour. It gives you 50,000 monthly active users (MAU) free, forever. That's an enormous amount of headroom for a pilot. If you have 50,000 active users and you're worried about the Cognito bill, you have other things to worry about, and they're good things. MAU here means anyone who logged in at all that month, if you want to look it up later.

RDS Postgres is the line item that gets people. The free tier offers 750 hours per month of a db.t4g.micro for the first twelve months (basically a single small instance running 24/7) plus 20 GB of storage. After twelve months, that small instance becomes about $12-15 a month. That's still nothing. The trap is what people pick instead. They look at the chart of instance sizes, decide they want to "leave headroom," pick a db.m6g.large, turn on Multi-AZ for "production readiness," and now they're at $250 a month for a database serving zero queries an hour.

For an MVP, start at db.t4g.micro or db.t4g.small. Don't turn on Multi-AZ until you have a customer who'd notice it being off. Use the free backups. The pgvector extension runs fine on the small instances; the cost of indexing 50,000 embeddings is not your problem at MVP scale.

S3 is cheap until egress. Storage is about 2.3 cents per GB per month for standard, less for infrequent-access tiers. You can store gigabytes of model artifacts, training corpora, transcripts, and eval sets and barely notice. What costs is moving the data out: nine cents per GB egressed to the internet, less to other AWS services in the same region. If your CloudFront cache is doing its job, S3 egress stays small. If you're serving raw S3 URLs to users, you're paying the worst version of this bill.

CloudFront is where bandwidth lives or dies. First terabyte of egress per month is free, forever. After that it's about $0.085 per GB to North America and Europe, more to Asia-Pacific. For static assets and API caching it's a good deal. For streaming video, image-heavy pages, or large model downloads, it goes up fast.

EventBridge and SQS are basically free at MVP scale. EventBridge custom events are a dollar per million. SQS standard queues give you a million requests a month free, forever. You will not notice these on the bill. Don't optimise them; optimise things that matter.

CloudWatch is the silent killer. Free tier gives you ten custom metrics, ten alarms, a million API calls, 5 GB of log ingestion, and 5 GB of log storage. Sounds like a lot. Then you turn on verbose Lambda logging across six functions, and a single bad day of debug logs ingests 8 GB and blows the free tier in one afternoon. Log ingestion is fifty cents per GB and storage is three cents per GB per month after that. Set retention on your log groups before you ship anything. Seven days for dev, thirty for prod. The default is "never expire" and it will quietly cost you.
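Setting retention is easy to script. Below is a minimal sketch, assuming a boto3 CloudWatch Logs client is passed in as logs_client and that log_groups is the list returned by its describe_log_groups call. The helper names and the dev/prod mapping are illustrative. A group with no retentionInDays key is one that never expires.

```python
RETENTION_DAYS = {"dev": 7, "prod": 30}  # the policy suggested above


def groups_needing_retention(log_groups):
    """Names of log groups still on the never-expire default."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_retention(logs_client, log_groups, env):
    """Set retention on every never-expire group; returns the names changed."""
    changed = groups_needing_retention(log_groups)
    for name in changed:
        logs_client.put_retention_policy(
            logGroupName=name,
            retentionInDays=RETENTION_DAYS[env],
        )
    return changed
```

Running this in the deploy pipeline, rather than once by hand, also catches log groups that Lambda creates later on first invocation.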

Secrets Manager is forty cents per secret per month. Trivial. KMS keys are a dollar per key per month plus three cents per 10,000 requests. Trivial. These add up if you create a hundred of them by accident, so don't do that.

Want a deeper tour of the cloud shape these line items belong to? I sketched the box-and-line view in the AWS-native shape I actually start with. This piece is the price tag on each of those boxes.

The line item that actually bites: Bedrock

Everything I just listed is rounding error compared to the model bill.

Bedrock charges per token, in and out, and the prices vary by model. A rough mental model for May 2026: a Sonnet-class model is in the neighbourhood of $3 per million input tokens and $15 per million output tokens. Haiku-class is roughly a tenth of that. Opus-class is roughly four times Sonnet. Tokens are the chunks the model reads and writes; a token is about three-quarters of a word, if you want to look it up later.

The shape of the bill follows the shape of the call. A typical "diagnose" call in a consultant-vertical product (say, the sales consultant's discovery framework asking the model to extract pain points from a transcript) runs maybe 5,000 tokens of input (the system prompt, the retrieved RAG context, the transcript) and 800 tokens of output (the structured pain-point list). On Sonnet that's about 2.7 cents: 1.5 cents of input plus 1.2 cents of output. On Haiku it's about 0.27 cents. On Opus it's about 11 cents.

Now multiply by how often that call fires per real customer interaction. If your product flow is "user uploads a transcript, system runs three Sonnet calls to triage, diagnose, and propose actions," you're at about eight cents per interaction. Five interactions per pilot user per week, fifty pilot users, that's roughly 1,100 interactions and about $90 a month on Bedrock alone, before anyone has paid you a dime. Manageable. Now imagine each interaction fires six calls because you went wide on retrieval, or you used Opus because "it gives better answers," or your prompt grew from 5,000 tokens to 25,000 because you started shoving the whole knowledge base into context instead of doing real retrieval. You can easily be at $400 a month for the same fifty users.

The cost-as-a-design-input principle: every prompt you write should be paired with a per-call cost number. Not someday. The first time you ship the prompt. If a flow is too expensive, you fix it the way you'd fix a bug, with intent and a stopwatch: reach for a smaller model, tighten retrieval, cache where you can, and use the eval harness to prove the cheaper version is still good enough.
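One way to keep that number next to the prompt is a tiny cost function. This is a sketch under stated assumptions: the prices are the rough May 2026 ballpark figures quoted above, not a price list, and the names PRICES, call_cost_usd, and monthly_cost_usd are made up for illustration.

```python
# Rough USD prices per million tokens as (input, output). These are the
# ballpark figures from the text, treated here as assumptions.
PRICES = {
    "haiku": (0.30, 1.50),
    "sonnet": (3.00, 15.00),
    "opus": (12.00, 60.00),
}


def call_cost_usd(model, input_tokens, output_tokens):
    """Cost of one call, counting both input and output tokens."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def monthly_cost_usd(cost_per_call, calls_per_interaction,
                     interactions_per_user_per_week, users):
    """Monthly spend for one flow, using 52/12 weeks per month."""
    interactions = interactions_per_user_per_week * users * 52 / 12
    return cost_per_call * calls_per_interaction * interactions


diagnose = call_cost_usd("sonnet", input_tokens=5_000, output_tokens=800)
pilot = monthly_cost_usd(diagnose, calls_per_interaction=3,
                         interactions_per_user_per_week=5, users=50)
```

Output tokens count too: for 5,000 tokens in and 800 out at the Sonnet rates, the function returns about 2.7 cents (1.5 cents of input plus 1.2 of output), and the three-call, fifty-user pilot comes to just under $90 a month.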

The marketing strategist productizing their brand-positioning method is a great example: the first version of the prompt asks Sonnet to "review this brand" and dumps in twelve pages of brand voice notes. Cost per call: thirty cents. The shipped version retrieves the three most relevant sections, runs Haiku as a router to decide which brand-positioning lens applies, then runs Sonnet on a focused 3,000-token prompt for the actual review. Cost per call: under two cents. Same output quality; the eval harness said so. Fifteen times cheaper.

When the local rig pays back

The Mac Studio side of the split is fixed cost. You buy it once, it sits in the corner, the electricity bill goes up by maybe twenty dollars a month if it's running serious workloads, and that's it. There's no per-token, no per-request, no egress fee. The cost-per-inference for a workload it can run trends to zero as you use it more.

Three workload classes earn back the rig fast.

Image generation for back-office assets. mflux on Apple Silicon will turn out marketing illustrations, blog hero images, internal slide art, and product mockups all day. The cloud equivalent (Bedrock image models or third-party APIs) can run two to ten cents per image. If you're producing a hundred images a week for marketing pages and internal use, that's roughly $10-45 a month for something a Mac Studio does for free. mflux is a Mac-native runtime for image generation models, if you want to look it up later.

Batch inference and eval runs. When you're regression-testing prompts against your golden eval set, you might fire 500 calls in a single eval run. On Sonnet that's $7-8 every time you run evals. Run evals nightly, that's $200-250 a month. Run them locally on a smaller-but-good-enough open model (mlx-lm with a Llama or Qwen variant) and that bill is zero.

Fine-tuning the secret-sauce model. This is the big one. Training even a small model on AWS (SageMaker, Bedrock custom models, or just GPU instances) is real money. A few hundred dollars per training run, easily, when you're iterating. mlx-lm fine-tuning on a Mac Studio handles small-to-medium models on annotated examples for the cost of an evening's electricity. The trained artifact gets uploaded to S3, picked up by cloud Lambda on next cold-start, and customers get the secret-sauce model without the cloud-training bill.

For an HR consultant productizing their interview rubric, the workflow looks like: gather a few hundred annotated interview transcripts, fine-tune a small model locally to classify candidate responses against the rubric, push the model artifact to S3. Cloud-side, Lambda loads the artifact and uses it for the cheap classification step before any expensive Bedrock call fires. The fine-tune happened for the cost of running the rig overnight. The cloud equivalent of that workflow easily runs into four-figure monthly bills if you do it through managed services.

When cloud is genuinely cheaper

It's not free-versus-paid; it's a matrix. A few cases where cloud wins on cost.

Anything customer-facing with real latency requirements. Round-tripping a customer query to a Mac Studio in your house adds 200-500 ms of network latency, plus the home internet connection's variability. The customer-facing path lives in the cloud. Always. The Mac Studio is for batch and back-office.

Models you don't have hardware for. A frontier-class model (anything in the Claude Opus or GPT-5 class) won't run locally on a Mac Studio at meaningful quality and speed. If your product needs that tier of model, you're paying Bedrock or OpenAI. The mitigation is using the cheaper models where they suffice and reserving the expensive model for the calls that genuinely need it.

Spiky workloads. A Lambda that runs once a week is essentially free. Buying a server for it is silly. The cloud is excellent at "do nothing most of the time, scale when needed." Local is excellent at "always be doing something."

The runway picture

Pull all of this together for a representative MVP (fifty pilot users, a hundred Bedrock calls a day across them, modest static traffic, a Mac Studio doing back-office work) and the AWS bill ought to land somewhere in the $40-80 a month range. Most of that is Bedrock. The infrastructure pieces (Lambda, Cognito, RDS small, S3, CloudFront, EventBridge, SQS, CloudWatch with retention set) sum to maybe $20-30 of that.

The Mac Studio amortizes against eighteen to thirty months of cloud workloads it would otherwise have replaced: image generation, evals, fine-tuning. The electricity is rounding error.

This is the burn rate before you have a customer paying you. If your pricing has a Bedrock-cost-per-customer in mind from the start (the pricing piece coming up in the follow-on series goes into how to do that), your unit economics survive contact with the first paying customer. If they don't, you'll find out in week two and be glad you found out cheaply.

The cost model isn't a side concern. It's a design input. Every prompt has a price. Every retrieval has a price. Every audit log line has a storage price. You don't need to obsess. You do need to know the number for each of them, the same way you know whether a function returns the right answer. If you're starting an AI product this quarter, my one ask: write down the per-call cost of your three most-used flows before you ship them. That single discipline saves the runway.

Failure modes: graceful degradation when something's down

Sid Smith — Fri, 22 May 2026 13:00:00 GMT

The first AI product I shipped that handled a real outage gracefully did so almost by accident. Bedrock had a regional throttling event one Tuesday afternoon: calls started returning 429s, then timing out, then returning slowly. The system limped, partially recovered, limped again. Customers kept using it. Nobody noticed for about two hours. I hadn't been clever. I'd spent the previous month being burned by smaller, weirder failures, and out of self-preservation I'd built a few defensive patterns into the request path. When the big one hit, those patterns held.

This piece is about those patterns. The failure modes an AI product hits in its first year are largely predictable. The thing that distinguishes products that survive their first bad outage from ones that lose customers is whether they've decided, in advance, what the system does when something it depends on isn't there. That's the difference between graceful degradation and cascading collapse: an architecture question, decided on day one and revisited every time you add a dependency.

Graceful degradation chain

Layman version. A marketing strategist has productized her brand-positioning method into an AI service for small business owners. A customer asks a question. The model service is rate-limited today. Without a plan, the customer waits, gets a generic timeout, and decides this product is broken. With a plan, the customer either gets a slightly slower answer (fell back to a smaller model), a clear "we're routing this to a reviewer who'll respond within an hour" message (already in the strategist's queue), or (for some questions) gets answered immediately because the AI recognized the question was outside its scope. None of those is the perfect product working perfectly. All of them are products the customer keeps using.

The four failure modes

I keep a short list at the top of my head, and every new feature gets walked through it before I ship.

External model service unavailable or throttled. Bedrock has a bad day, your model is rate-limited, latency spikes past your timeout. This is the one people plan for, and still the one that bites hardest: the failure is rarely binary, so a naive timeout-and-retry pattern piles load onto an already-struggling service.

Back-office infrastructure unavailable. In a hybrid stack, the Mac Studio side runs whisper transcriptions, fine-tune jobs, eval batches, sometimes the secret-sauce model. Power cut, network blip, RAM-stuck job: the cloud half has to know how to handle work that was supposed to run locally. Most stacks don't. They let the SQS queue grow until it dies of old age.

Customer query falls outside what the AI can handle. Nothing is technically broken. The model returned an answer. But the answer was wrong: the question was outside the playbook, the retrieval corpus had nothing relevant, the AI confabulated, and the reviewer couldn't catch it because the answer sounded plausible. A failure of scope detection.

Internal AWS infra failure. Postgres failover takes longer than your Lambda timeout. The vector index is rebuilding. Cold-start cascades. These are normal AWS-day failures with well-understood patterns, but patterns only help if you've actually applied them.

Each gets a different response. Lumping them under "we'll retry" is the failure mode behind the failure modes.

Circuit breakers

A circuit breaker (a piece of code that stops calling a failing service for a while so it can recover, if you want to look it up later) is the most leveraged code you'll write for external-model failures. The wrapper that calls Bedrock keeps a counter of recent failures. When the failure rate crosses a threshold (say, five errors in thirty seconds) the breaker opens. While open, calls don't go to Bedrock; they immediately return a "service unavailable" signal. After a cooldown the breaker enters a half-open state and allows one probe call through. Probe succeeds, breaker closes; probe fails, breaker stays open.

The worst outcome during a partial outage is a thundering herd of your own retries hammering a struggling service. The breaker turns your product into a polite consumer rather than a contributor to the problem. Customers experience fast failures (which sounds bad but is good: fast failure leaves time to fall back) instead of long timeouts.

Key detail: breaker state is per model, not global. Sonnet throttled doesn't mean Haiku is. That's what enables the next pattern.
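The breaker described above can be sketched in a few dozen lines. This is a minimal in-memory version, not a definitive implementation. The class and method names are hypothetical, the five-in-thirty-seconds threshold comes from the example above, and the sixty-second cooldown is an assumption, since no value is given. State here lives in one process; sharing it across concurrent Lambda instances would need an external store.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` failures within `window` seconds."""

    def __init__(self, threshold=5, window=30.0, cooldown=60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock
        self.failures = []      # timestamps of recent failures
        self.opened_at = None   # None means the breaker is closed
        self.probing = False    # True while a half-open probe is in flight

    def allow(self):
        """Return True if a call may go through right now."""
        if self.opened_at is None:
            return True
        if self.probing:
            return False  # only one probe at a time
        if self.clock() - self.opened_at >= self.cooldown:
            self.probing = True  # half-open: let a single probe through
            return True
        return False

    def record_success(self):
        """A call (or the probe) succeeded: close the breaker."""
        self.failures.clear()
        self.opened_at = None
        self.probing = False

    def record_failure(self):
        """A call failed: count it, and open the breaker if over threshold."""
        now = self.clock()
        if self.probing:
            # Probe failed: stay open and restart the cooldown.
            self.opened_at = now
            self.probing = False
            return
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now


# One breaker per model, so a throttled Sonnet does not block Haiku.
breakers = {"sonnet": CircuitBreaker(), "haiku": CircuitBreaker()}
```

The calling wrapper checks allow() before each model call, returns a "service unavailable" signal immediately when it is False, and reports the outcome with record_success() or record_failure(). The injectable clock is there so the state machine can be tested without sleeping.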

Fallback models, the cheaper sibling

Every model call is wrapped in a chain. Try Sonnet first; if its breaker is open or the call fails, try Haiku; if Haiku fails, decide whether to fail loudly, queue for later, or fall back to human-only.

The chain is per call site, not global. A high-stakes diagnose call: Sonnet only, no fallback, fail loudly to the queue. A low-stakes routing call: Sonnet primary, Haiku fallback, keyword classifier as a third tier. The chain lives in version-controlled prompt config, not buried in the model-calling code.

The honest tradeoff: the fallback is usually less capable. So the audit row gets a "which model actually answered" field, the eval suite knows about each tier, and CloudWatch alarms fire on the duration of degraded operation. The consultant knows this morning's batch ran degraded and might warrant extra review. Degraded mode is a state the product knows it's in, not a quiet quality drop.
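A per-call-site chain can be sketched as a lookup plus a loop. This is illustrative only: the CHAINS dictionary stands in for the version-controlled prompt config, and the names ModelUnavailable and call_with_fallback are hypothetical. Each tier is a plain callable, so the keyword classifier and the model calls look the same to the chain.

```python
class ModelUnavailable(Exception):
    """Raised by a tier when its call fails or its breaker is open."""


# Hypothetical per-call-site config; in practice this would be loaded
# from the version-controlled prompt config rather than defined in code.
CHAINS = {
    "diagnose": ["sonnet"],                    # high stakes: no fallback
    "route": ["sonnet", "haiku", "keyword"],   # low stakes: three tiers
}


def call_with_fallback(call_site, prompt, tiers):
    """Try each tier in order; return (answer, tier_that_answered).

    `tiers` maps a tier name to a callable that takes the prompt.
    Raises ModelUnavailable if every tier in the chain fails.
    """
    for name in CHAINS[call_site]:
        try:
            return tiers[name](prompt), name
        except ModelUnavailable:
            continue  # fall through to the next tier
    raise ModelUnavailable(f"all tiers failed for {call_site}")
```

Returning the tier name alongside the answer is what lets the caller fill in the audit field for which model actually answered.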

Queueing for retry

Some failures are best handled by punting. The work isn't time-critical: a transcription job, a batch summary, a fine-tune trigger. Put it on a delay queue, return "we've got this, results in a few minutes," and process when capacity is back.

For a career coach who's productized her resume-and-positioning review service, this maps cleanly. A customer uploads a resume. The back-office summarization model is down today. They don't need feedback in twenty seconds; they're going to read it over coffee. UX: "Thanks, your review is being prepared. We'll email you within ten minutes." The work goes into SQS with a delayed-visibility timeout; the worker retries with backoff until the underlying service comes back.

The patterns are boring. Idempotent job handlers. Bounded retry counts with a dead-letter queue (after N failures, the job goes to the DLQ and a human gets paged, because it has stopped being transient). Visible queue depth in dashboards. Clear customer messaging that doesn't lie about timing.

What you don't want is silent retry forever. I've inherited systems where SQS queues had hundreds of thousands of messages backed up, retrying every thirty seconds against a service deprecated months earlier. Nobody noticed because the retries didn't error visibly. Bound the retries. Page on the DLQ.
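The idempotent, bounded handler can be sketched without any queue library. The sketch below is an assumption-laden illustration: the message shape, MAX_ATTEMPTS, and the in-memory done_ids and dead_letters are stand-ins for a real idempotency table and a real DLQ. SQS can also do the dead-lettering itself through a redrive policy on the queue; the explicit version here just makes the logic visible.

```python
MAX_ATTEMPTS = 5  # assumption: the "N" after which a failure is not transient


def handle_message(message, process, done_ids, dead_letters):
    """Process one queued job at most once, with a bounded retry count.

    message: dict with 'job_id' and 'attempts' (deliveries so far,
             including this one)
    process: callable doing the real work; raises on failure
    done_ids: set of completed job IDs (the idempotency record)
    dead_letters: list standing in for the dead-letter queue
    Returns 'skipped', 'done', 'retry', or 'dead'.
    """
    if message["job_id"] in done_ids:
        return "skipped"  # duplicate delivery: do nothing
    try:
        process(message)
    except Exception:
        if message["attempts"] >= MAX_ATTEMPTS:
            dead_letters.append(message)  # a human gets paged from here
            return "dead"
        return "retry"  # leave it on the queue for a later attempt
    done_ids.add(message["job_id"])
    return "done"


def backoff_seconds(attempt, base=30, cap=900):
    """Exponential backoff: 30s, 60s, 120s, and so on, capped at 15 minutes."""
    return min(cap, base * 2 ** (attempt - 1))
```

The cap matters as much as the bound: without it the delay between attempts grows past anything a customer message could honestly promise.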

When the Mac Studio is just gone

The local side disappears for a dozen reasons: power cut, network outage, a stopped launchd job, an OS update reboot.

Cloud-side detection is a recurring heartbeat: every minute, the worker writes to a Postgres table or S3 key. A scheduled Lambda checks that the heartbeat is recent. If it's stale, the Lambda fires an alarm and starts queueing flagged work in a holding pattern.
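The staleness check itself is a timestamp comparison. A minimal sketch follows, assuming the scheduled Lambda has already read the last heartbeat time from wherever the worker writes it. The three-minute threshold (two missed beats plus slack) and the function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=3)  # assumption: tolerate two missed beats


def heartbeat_is_stale(last_beat, now=None, stale_after=STALE_AFTER):
    """True if the local worker's last heartbeat is older than the threshold."""
    now = now or datetime.now(timezone.utc)
    return now - last_beat > stale_after


def route_work(last_beat, now=None):
    """Decide where flagged work goes: the local rig or the holding queue."""
    if heartbeat_is_stale(last_beat, now):
        return "holding_queue"
    return "local_worker"
```

Using timezone-aware UTC timestamps on both sides avoids the classic bug where the Mac's local clock and the Lambda's clock disagree by a timezone offset and the heartbeat looks hours stale.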

Customer experience depends on the work. For pure batch (fine-tune jobs, weekly evals, asynchronous voice transcription) the customer never notices; bursty work absorbs hours of delay. On the synchronous path (a query needing the locally-hosted fine-tuned model) the call falls back to a cloud model, the audit row records the fallback, and the eval expectations adjust. The product keeps working. It works less well, and the system knows it does, and the consultant knows the system knows.

Want to go deeper on the cloud/local split? The wiring is in the Mac Studio side of the stack and the hybrid sync pattern. The point here: the boundary needs explicit failure semantics, not a hopeful shrug.

The query that's outside scope

People don't classify this as a failure, because nothing technically broke. It's the one that does the most damage over time, because the AI confidently answered a question it shouldn't have touched.

A product PM consultant's decision-coaching service covers prioritization frameworks, opportunity sizing, stakeholder mapping. A customer types: "What's the right legal structure for my new LLC?" The AI has no business answering. It might still produce a plausible paragraph because language models pattern-match anything into sentences. The right response: "That's outside what I'm trained to help with. Here's what I can help with instead."

Mechanism: scope detection at triage. Every query gets a quick in-scope/out-of-scope classification before reaching diagnose: a small LLM call with a tight prompt against the playbook scope. Out-of-scope queries get an honest "this is outside the product" response, with a referral path.

The eval harness needs out-of-scope test cases as much as in-scope ones. The failure mode you're testing for is the AI confabulating an answer to a question it shouldn't have touched. Catch it before a customer does.
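The triage gate can be sketched as a thin function in front of diagnose. This is a sketch, not the product's actual code: the classifier is passed in as a callable (in practice the small LLM call), and the label strings, reply template, and function name are hypothetical. One design choice worth stating is that any unexpected classifier output is treated as out of scope, so the gate fails closed.

```python
OUT_OF_SCOPE_REPLY = (
    "That's outside what I'm trained to help with. "
    "Here's what I can help with instead: {topics}."
)


def triage(query, classify, diagnose, topics):
    """Run the scope check before the expensive diagnose step.

    classify: callable returning 'in_scope' or 'out_of_scope'; anything
              else is treated as out of scope.
    diagnose: callable producing the real answer for in-scope queries.
    topics:   list of things the product does cover, for the referral text.
    """
    if classify(query) != "in_scope":
        reply = OUT_OF_SCOPE_REPLY.format(topics=", ".join(topics))
        return {"in_scope": False, "reply": reply}
    return {"in_scope": True, "reply": diagnose(query)}
```

Because classify is injectable, the out-of-scope eval cases can run against the gate with a stub classifier as well as against the real one.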

The honest UX of "we couldn't"

The hardest pattern to get right is the customer-visible message when degradation kicks in.

Wrong messages I see most: a generic "something went wrong" page; a deceptive "we're processing your request" spinner that loops forever; a lengthy technical explanation the customer can't act on. None preserve trust.

The right shape is short, specific, actionable. Three things: what happened (plain language), what the system is doing (concrete next step), what the customer should do (or shouldn't have to). Example: "We couldn't answer this one automatically. One of our reviewers is taking a look and will respond within an hour. You don't need to do anything; we'll email you." Honest about the limit. Specific about the recovery. Doesn't waste attention.

The mechanism behind the message has to deliver on what it says. The reviewer queue has to exist. Someone has to be watching it. The email has to land. If the message is a lie (if the queue is where requests go to die) you've made the failure worse than just showing an error, because you've also broken trust. The honest message requires honest infrastructure.

Test the failures, deliberately

The pattern I most underuse is deliberate failure injection. On a quiet weekday, flip a flag that simulates Bedrock returning 429s for thirty minutes. Watch what happens. Does the breaker open? Does the fallback chain engage? Does the customer-facing message show up?

The first time I ran one I discovered my fallback chain was correctly configured for nine of ten call sites and silently misconfigured for the tenth, the one nobody had touched in six months, which fell back to no fallback. You don't find that from reading code.

I do this monthly. It takes an hour. It catches the drift you can't see otherwise: the new Lambda that didn't get the wrapper, the prompt without a Haiku variant, the dashboard that stopped firing because a metric name changed.
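The injection flag can be as small as a wrapper around the model call. A minimal sketch, assuming the flag is an environment variable; the variable name INJECT_MODEL_429, the Throttled exception, and the wrapper name are all hypothetical.

```python
import os


class Throttled(Exception):
    """Stands in for a 429 from the model service."""


def with_fault_injection(call_model, flag="INJECT_MODEL_429"):
    """Wrap a model call so an environment flag can simulate throttling."""
    def wrapped(*args, **kwargs):
        # The flag is read on every call, so flipping it takes effect
        # without redeploying the wrapper.
        if os.environ.get(flag) == "1":
            raise Throttled("injected 429")
        return call_model(*args, **kwargs)
    return wrapped
```

Because the injected error is raised at the same point a real 429 would surface, the breaker, the fallback chain, and the customer-facing message all get exercised by the drill.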

The list, plainly

When I add a new dependency, I write down the answers to four questions before merging:

What does the system do if it's slow? (Timeouts, breaker thresholds.)

What does it do if it's unavailable for an hour? (Fallback chain, queueing, customer message.)

What does it do if it returns a wrong answer? (Eval coverage, scope detection, audit field for "which path served this".)

How will I know any of this happened? (Dashboards, alarms, audit fields, queue notifications.)

If any answer is "we'll figure it out when it happens," I don't merge. The figuring-out gets done in the calm hour before the dependency fails, not the panic hour after. Customers don't churn over a perfect product having an outage. They churn over an imperfect product whose outage broke trust. The architecture that keeps trust intact is mostly written down before the bad day arrives.

Observability and audit, not later

Sid Smith — Thu, 21 May 2026 13:00:00 GMT

The cheapest hour I ever spent on an AI product was the one where I added a single field to a logger. The most expensive was the one eighteen months earlier when I shipped without it and spent the next nine months stitching log fragments together every time a customer asked "why did the system do that?"

That field is a trace ID. With it, observability is a query. Without it, observability is archaeology, and the dig site is your customer's loss of patience while you reconstruct what happened.

Trace + audit

This piece is about the day-one observability shape for an AI product, the thing I now refuse to ship without. Not because I love telemetry. Because I want to answer the customer's question, the auditor's question, and my own debugging question with the same query, in seconds.

Layman version. A sales consultant has productized her discovery framework: the AI scores client conversations, suggests follow-ups, drafts summaries. Three months in, a client asks: "Why did your system tell my rep to ask the budget question on call two instead of call three?" That's a real question with a real answer. The system did something. There's a chain of decisions: which prompt fired, which rules matched, what the model returned, who approved it. Pull that chain back in twenty seconds and you keep the customer. Can't, and you've lost their trust no matter how good the actual answer was. Observability is the receipt the system hands you on its way out the door.

The two things people conflate

Observability and audit are related, share infrastructure, but answer different questions.

Observability is the engineer's question. What is the system doing right now? Where is the latency? Which Lambda is failing? Diagnostic. Audience: the team; retention: days to weeks; format: structured logs, metrics, traces.

Audit is the customer's question, the regulator's question. What was decided? By whom? When? On what evidence? Evidentiary. Audience: non-engineers, sometimes lawyers; retention: years; format: a database table you can query with SQL and hand to a non-technical person.

These share a backbone (the trace ID, the structured-log discipline) but they should be separate stores. The observability store rotates and gets pruned. The audit store does not. The observability store holds things you can lose in a retention policy. The audit store holds things you'd be in serious trouble for losing.

Build both. They're cheap on day one. Unrecoverable later.

The trace ID, which is everything

Every customer-facing API request gets a trace ID at the edge, generated at the API Gateway authorizer, attached to request context, propagated to every downstream call. The format is a UUID, because UUIDs are unique without coordination and easy to grep for.

It rides through every hop. The handling Lambda adds it as a structured-log field on every line. Downstream Lambda invocations carry it in the message envelope. Work queued for the Mac Studio side carries it in the SQS message; the local worker logs everything under the same ID. Results pushed back to the cloud carry it through to S3, Postgres, and the audit row.

Done consistently, a CloudWatch Logs Insights query takes ten seconds to write and returns the entire timeline of a customer's request (across every service, every queue, every back-office worker) in chronological order. The query is short because the discipline was long: every log line on every machine includes the trace ID. No exceptions.

Surface the ID in customer-facing error messages (reference ID abc123-...); when a customer reports a problem, the first thing they paste is the exact key you need. Surface it in the consultant's queue UI too. Trivial cost, huge leverage.

Structured logs, not stringified essays

Every log line is a JSON object: trace ID, timestamp, service, operation, tenant ID, user ID, and event-specific data. The message field is a short label, "model_call_complete", "retrieval_failed", "approval_recorded", not a paragraph.

Logs are queries. The moment your logs are free text, you've lost the ability to filter, group, and combine. "Give me every model_call_complete event for tenant X in the last hour grouped by latency bucket" is one Insights query with structured logs. Without them, it's a person and an afternoon.

For a medical specialist running a second-opinion review service, this matters more than usual. When something looks off, the structured log lets you reconstruct the exact prompt fired, the retrieval results returned, the model output, the reviewer action. Stringified logs are stories. Structured logs are evidence.

I lean on a small library every Lambda imports, fifty lines of code wrapping the standard logger, baking in the trace ID and tenant context automatically. It's the most-used import in the codebase. Build the equivalent on day one. Don't let anyone log strings.
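A wrapper of that kind might look like the sketch below. This is not the author's library, only an illustration of the shape: the class name, the event method, and the field names are assumptions based on the fields listed above, built on the standard logging and json modules.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


def new_trace_id():
    """Generate a trace ID at the edge; UUIDs need no coordination."""
    return str(uuid.uuid4())


class StructuredLogger:
    """Emits one JSON object per log line with trace and tenant context."""

    def __init__(self, service, trace_id, tenant_id, user_id=None, logger=None):
        self.context = {
            "service": service,
            "trace_id": trace_id,
            "tenant_id": tenant_id,
            "user_id": user_id,
        }
        self.logger = logger or logging.getLogger(service)

    def event(self, label, operation=None, **data):
        """Log a short event label plus event-specific fields.

        Returns the JSON line so callers and tests can inspect it.
        """
        line = json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": label,
            "operation": operation,
            **self.context,
            "data": data,
        })
        self.logger.info(line)
        return line
```

A handler would create one StructuredLogger per request, with the trace ID from the request context, and call something like log.event("model_call_complete", operation="diagnose", latency_ms=812). Since event only accepts a label and keyword fields, there is no convenient way to log a free-text paragraph.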

The audit table

Separately from the observability stream, a Postgres table whose purpose is to record decisions. Not events. Decisions.

The schema is boring. Columns: primary key, trace ID, tenant, actor (user ID for humans, rule ID for auto-resolved), action type, subject, evidence (JSON blob of what the AI saw, retrieved docs, model output, confidence, rationale), decision, timestamp. A "supersedes" column for when a decision gets reversed by a later one; the chain is preserved.

This is the table you point at when someone asks "why did your system tell my rep to ask the budget question on call two?" Query by tenant and time range, find the row, read the evidence column. Human-approved? Actor field has the user ID. Auto-resolved? Actor field has the rule ID and version. Need more depth? Pivot on the trace ID into the observability logs.

The audit table doesn't get pruned because storage is cheap and regret is expensive. Backups are real backups, and encryption at rest is on by default: KMS, not "we'll get to it." Schema migrations are themselves audit events with a reason.

The most important rule: every state-changing API has to write to the audit table before returning success. If the audit write fails, the API returns an error and the action doesn't happen. This is harsh on purpose. The day you let an action happen without an audit row is the day your audit story has a permanent hole, and you won't know which day it was until a regulator finds the missing row.
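The audit-before-success rule is easiest to enforce by putting the audit row and the state change in one transaction. The sketch below uses Python's built-in sqlite3 purely as a runnable stand-in for Postgres. The column names follow the schema described above, while the approvals table and the approve function are hypothetical examples of a state-changing action.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE audit (
    id INTEGER PRIMARY KEY,
    trace_id TEXT NOT NULL,
    tenant TEXT NOT NULL,
    actor TEXT NOT NULL,
    action_type TEXT NOT NULL,
    subject TEXT NOT NULL,
    evidence TEXT NOT NULL,
    decision TEXT NOT NULL,
    created_at TEXT NOT NULL,
    supersedes INTEGER REFERENCES audit(id)
);
CREATE TABLE approvals (subject TEXT PRIMARY KEY, status TEXT NOT NULL);
"""


def approve(conn, trace_id, tenant, actor, subject, evidence):
    """Record the decision and apply it in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        # Audit row first: if this insert fails, the action never runs.
        conn.execute(
            "INSERT INTO audit (trace_id, tenant, actor, action_type,"
            " subject, evidence, decision, created_at)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (trace_id, tenant, actor, "approval", subject,
             json.dumps(evidence), "approved",
             datetime.now(timezone.utc).isoformat()),
        )
        conn.execute(
            "INSERT OR REPLACE INTO approvals (subject, status) VALUES (?, ?)",
            (subject, "approved"),
        )
```

The API handler returns success only after approve returns. Any exception from the audit insert propagates, the transaction rolls back, and the caller gets an error instead of an unaudited action.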

The hybrid wrinkle: Mac Studio side

The cloud half is one trace continuum. The Mac Studio half, the back-office worker draining SQS, running whisper transcriptions, running eval batches, fine-tuning the small model, is a separate machine, and observability across the boundary is what teams fail at most.

The fix is the same fix: trace ID rides with the work. SQS messages carry the originating trace ID, and the local worker logs every step under it. Result manifests pushed to S3 carry the ID, and EventBridge events back into the cloud carry it. The downstream Lambda that updates the audit table writes a row with the ID still attached.

The local worker ships its logs back to CloudWatch via a small daemon, under the same log-group conventions the cloud Lambdas use. The Insights query that returns the request timeline includes the local worker's hops as if it were just another Lambda. The customer doesn't know there's a Mac in a closet. The trace doesn't either.

Want to go deeper on the cloud-local mechanics? The wiring is in the hybrid sync pattern. The point here: the trace ID and the structured-log discipline cross the boundary unchanged. If your observability falls apart at the edge of your network, you don't have observability. You have a cloud monitor, a separate local monitor, and a habit of swiveling between them.

CloudWatch and what it isn't

CloudWatch is the default for AWS-native systems and a fine one. Logs flow there from every Lambda. Metrics flow from API Gateway, Bedrock, SQS without you doing anything. Insights queries are fast at small scale.

What CloudWatch is not is the audit store. Its retention is rarely "forever." Its query model is awkward for pulling a specific decision and showing it to a non-engineer. Its access controls are coarser than you want for "legal can read these rows but not those." Use CloudWatch for the observability stream and Postgres for the decision record. Engineering surface vs governance surface. Same trace ID, different stores.

Dashboards on day one are small and pointed: request rate by endpoint and tenant; model-call latency distribution; model-call cost per tenant per day; error rate by service; SQS queue depth; local-worker heartbeat; eval-suite pass rate over time. Seven dashboards, single screens, readable in fifteen seconds each. Anything more at this stage is premature, a way to feel observant without being observant.

The parts that bite

PII in logs is a permanent problem if you don't head it off on day one. Customer queries contain personal data, financial details, regulated content. The structured logger has to know what fields to scrub before serialization, and the scrub list is in version-controlled config. Once PII is in CloudWatch, getting it out is messy. Don't put it there.
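
A minimal sketch of scrub-before-serialize, assuming a key-based scrub list. The field names in SCRUB_FIELDS are hypothetical examples; the real list would come from the version-controlled config mentioned above.

```python
import json

# Hypothetical scrub list; in practice this is loaded from version-controlled config.
SCRUB_FIELDS = frozenset({"email", "phone", "ssn", "account_number", "query_text"})
REDACTED = "[REDACTED]"


def scrub(value):
    """Return a copy with any scrub-listed key redacted, at any nesting depth."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key in SCRUB_FIELDS else scrub(val)
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def serialize_log(fields: dict) -> str:
    """Scrub first, then serialize, so raw PII never reaches the log stream."""
    return json.dumps(scrub(fields), sort_keys=True)
```

A key-based scrub only catches PII that sits in a named field. Personal data embedded in free text under an unlisted key will pass straight through, which is one more reason to keep free-text fields out of logs.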

The trace ID has to survive serialization round-trips. SQS messages, EventBridge events, S3 metadata, Postgres JSON columns. Every one is a place a sloppy serialization can drop the field. Test the round-trip with an integration test. Ten minutes to write, saves a debugging session.
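
The round-trip test can be as small as the sketch below. It assumes a JSON envelope with a top-level trace_id field; wrap, unwrap, and worker_hop are hypothetical helpers standing in for whatever actually builds and reads your SQS or EventBridge payloads.

```python
import json


def wrap(trace_id: str, body: dict) -> str:
    """Serialize a work item with the trace ID in a fixed envelope field."""
    return json.dumps({"trace_id": trace_id, "body": body})


def unwrap(raw: str) -> tuple:
    """Deserialize and fail loudly if the trace ID went missing."""
    msg = json.loads(raw)
    trace_id = msg.get("trace_id")
    if not trace_id:
        raise ValueError("message lost its trace_id")
    return trace_id, msg["body"]


def worker_hop(raw: str) -> str:
    """Stand-in for one hop (say, the local worker): do work, re-emit the ID."""
    trace_id, body = unwrap(raw)
    result = {"job": body["job"], "status": "done"}
    return wrap(trace_id, result)


def test_trace_id_round_trip() -> None:
    outbound = wrap("tr-abc123", {"job": "transcribe"})
    trace_id, result = unwrap(worker_hop(outbound))
    assert trace_id == "tr-abc123"
    assert result["status"] == "done"
```

A real integration test would push the message through the actual queue and storage layers. The shape stays the same: send an ID in, assert the same ID comes out the far end.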

Audit writes have to be idempotent. Lambdas retry. SQS at-least-once means duplicate processing. Use an upsert keyed on (trace ID, action type, subject) so a retry doesn't double-record. Otherwise the count of "decisions made" diverges from the actual count, and the report you give the consultant is silently wrong.
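
A sketch of the idempotent write, again using sqlite3 as a stand-in. The ON CONFLICT ... DO NOTHING clause is also valid Postgres syntax, but the table layout and the record_decision name are assumptions for illustration.

```python
import sqlite3


def make_audit_db() -> sqlite3.Connection:
    """In-memory stand-in with a uniqueness constraint on the dedupe key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE audit (
               trace_id TEXT NOT NULL,
               action_type TEXT NOT NULL,
               subject TEXT NOT NULL,
               decision TEXT NOT NULL,
               UNIQUE (trace_id, action_type, subject)
           )"""
    )
    return conn


def record_decision(conn, trace_id, action_type, subject, decision) -> bool:
    """Insert once per (trace_id, action_type, subject); retries are no-ops.

    Returns True if a new row was written, False if it already existed.
    """
    with conn:
        cur = conn.execute(
            "INSERT INTO audit (trace_id, action_type, subject, decision)"
            " VALUES (?, ?, ?, ?)"
            " ON CONFLICT (trace_id, action_type, subject) DO NOTHING",
            (trace_id, action_type, subject, decision),
        )
    return cur.rowcount == 1
```

DO NOTHING keeps the first write and ignores the retry. That fits here because a retried Lambda is replaying the same decision, not making a new one. A genuine reversal would be a new row that points at the old one through the supersedes column.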

Structured-log fields drift. latency_ms, latencyMs, latency all end up in the same log group. Lock the field names with a schema and lint for them.
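
The lint can be a few lines. This sketch assumes a flat schema mapping canonical field names to types; LOG_SCHEMA and its entries are hypothetical.

```python
# Hypothetical canonical schema: field name -> expected Python type.
LOG_SCHEMA = {
    "trace_id": str,
    "tenant": str,
    "event": str,
    "latency_ms": int,
}


def lint_fields(fields: dict) -> list:
    """Return human-readable problems; an empty list means the entry conforms."""
    problems = []
    for name, value in fields.items():
        expected = LOG_SCHEMA.get(name)
        if expected is None:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

Run it in unit tests or inside the logger itself in non-production environments, so latencyMs gets rejected the first time someone types it.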

Day one, not later

You can't defer observability, because the decision points where it would have been cheap (how to log, what fields are universal, where the trace ID gets generated) pass quietly during the first week. By the time you wish you had them, retrofitting means touching every service, every Lambda, every worker. You'll do most of the work but not all of it, and the gaps will bite you on the day you least want to be bitten.

The audit table gets harder to add in proportion to traffic handled without it. On day one the table is empty and the schema is your decision. On day three hundred it's full of rows you wish were structured differently.

So: day one. Trace ID at the edge. Structured logs everywhere. A Postgres audit table every state-changing API writes to before returning. Both layers crossing the cloud-local boundary unchanged. Seven CloudWatch dashboards. PII scrub list locked. Retry idempotency. The whole pile, in the first week, before there's anything to observe. The day a customer asks the question (and they will) you want the answer to be a query, not a dig.

The approve/deny gate, and when it goes away

Sid Smith — Wed, 20 May 2026 13:00:00 GMT

There's a moment in every AI product's life when the founder looks at a queue full of human-approved suggestions and asks: do we still need the human? The honest answer is "for some of these, no, and for some, never." Knowing which is which is most of the job.

Layman version. An IT operations consultant has packaged her decade of triage instinct into an AI helpdesk product. A ticket lands: the office printer keeps going offline. The AI proposes: restart the print spooler on the front-desk PC, reseat the network cable. On day one, that proposal goes to a queue, not the customer. A tech looks at it, nods, approves. The customer gets the answer. Three weeks later the consultant notices this exact suggestion has been approved 47 times, denied zero, with no follow-up complaints. That pattern is the signal. This class can graduate. From now on the AI's answer goes straight to the customer; the human reviews a sample.

[Figure: Confidence ladder]

That arc (suggesting → acting) is the whole point of the approve/deny gate. The gate is not a permanent tax. It's a learning instrument. You build it on day one because you have to. You let pieces of it dissolve over time because the data tells you they can. And the audit trail survives the transition because you designed it to.

Why the gate exists on day one

Two reasons, both non-negotiable.

You don't yet know how the AI fails. Every AI product I've shipped has had a failure mode that wasn't in any pre-launch eval. Not because the eval was bad, but because the world is bigger than your test set. The first hundred real interactions are where you learn what the model actually does in your domain. A human between the AI and the customer during that window is how you learn without burning customers.

Then the audit story. Who decided this? On what evidence? When? On day one the answer should always be "a named human." A product that auto-resolves anything on day one has nobody to point at when something goes wrong.

So: gate. Every action. Day one.

The mechanics are simple. The AI runs through its triage-diagnose-resolve loops (covered in the three loops piece) and produces a proposed action with a confidence band, a rationale, and the evidence it used. The proposal lands in a queue. A reviewer sees it, with approve/deny buttons. Both outcomes get logged with reviewer identity, timestamp, and full context.

Both outcomes. Most teams log the approves. The denies are where the gold is.
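
As a rough sketch of the record this implies, assuming Python dataclasses: a proposal carries its confidence band, rationale, and evidence, and a review record keeps the whole proposal alongside the reviewer's decision. The class and field names are hypothetical. The rule that a deny must carry a reason follows the queue design described later in this piece.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Proposal:
    case_id: str
    category: str
    action: str
    confidence: str          # a band such as "high" / "medium" / "low"
    rationale: str
    evidence: List[str]


@dataclass(frozen=True)
class Review:
    case_id: str
    reviewer_id: str
    decision: str            # "approve" or "deny"
    reason: Optional[str]
    reviewed_at: datetime
    proposal: Proposal       # full context travels with the decision


def review(proposal: Proposal, reviewer_id: str, decision: str,
           reason: Optional[str] = None) -> Review:
    """Build the log record for either outcome; denies must say why."""
    if decision not in ("approve", "deny"):
        raise ValueError(f"unknown decision: {decision}")
    if decision == "deny" and not reason:
        raise ValueError("a deny needs a reason")
    return Review(proposal.case_id, reviewer_id, decision, reason,
                  datetime.now(timezone.utc), proposal)
```

Approves and denies go through the same function and produce the same record type, so there is no code path where one outcome gets logged and the other doesn't.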

Pattern-mining the approvals (and the denies)

After a few weeks, your database knows things nobody else does. Which classes the AI handles well (high approval rates, fast reviews, no follow-up complaints), which classes it fumbles (high deny rates, slow reviews, lots of edits before approval), and the gray middle where humans approve with hesitation visible in the timing data.

Mining that table is how you decide what graduates.

For an HR consultant who's productized her interview-rubric scoring: applications scored "strong yes" with all four criteria present and no flags get approved 98% of the time, in a median of 12 seconds, with zero post-approval reversals across 200 cases. That's a class. It's narrow and well-defined, and the human review is performative: every reviewer just clicks approve. Those clicks consume reviewer time that should go to harder cases.

Meanwhile "weak yes" applications with one criterion missing and a tone flag get approved only 60% of the time, take 4 minutes on average, with a 12% reversal rate. That class is not graduating anywhere. The AI's confidence isn't justified by the outcomes.

The pattern-mining isn't sophisticated. The first version I built was a Postgres view with three columns: input category, AI proposal, human action. Read the rows by hand, eye the rates, pick the obvious classes, write the rule. Later you can make it a formal classifier, but the manual phase is more honest. You see what's actually in the queue.
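
The same three columns can be tallied in a few lines. This is a sketch in Python rather than the Postgres view described above, and the "approve" action label is an assumption about how the human action is recorded.

```python
from collections import defaultdict


def approval_rates(rows):
    """rows: iterable of (input_category, ai_proposal, human_action) tuples.

    Returns {(category, proposal): (total, approval_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [approved, total]
    for category, proposal, action in rows:
        tally = counts[(category, proposal)]
        tally[1] += 1
        if action == "approve":
            tally[0] += 1
    return {
        key: (total, approved / total)
        for key, (approved, total) in counts.items()
    }
```

Sort the output by volume and read it by hand first. The point of the manual phase is to see which classes are obvious before formalizing anything.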

The graduation rule

The rule I now use for letting a class auto-resolve is deliberately conservative.

A class is eligible when all of these are true: approval rate over the last 90 days above 95%; volume of at least 100 cases (so the rate isn't a small-sample artifact); post-approval reversal rate below 1%; the deny reasons that did show up concern scope, formatting, or stylistic preference rather than safety or correctness; and the consultant whose secret sauce drives the product has personally signed off.

That last bit matters. Graduation isn't a system decision. It's a human decision informed by data the system collected. The system makes the decision easy to make and easy to defend.
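
Written as a check, the rule might look like the sketch below. The thresholds come from the rule as stated. The ClassStats shape, the deny-reason labels, and the choice to compute reversal rate as reversals over approvals are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Deny reasons that block graduation. The labels are hypothetical.
BLOCKING_DENY_REASONS = frozenset({"safety", "correctness"})


@dataclass(frozen=True)
class ClassStats:
    """Stats for one class over the trailing 90 days."""
    cases: int
    approvals: int
    reversals: int                 # approvals that were later reversed
    deny_reasons: FrozenSet[str]
    consultant_signed_off: bool


def is_eligible(s: ClassStats) -> bool:
    """True only when every condition of the graduation rule holds."""
    if s.cases < 100 or s.approvals == 0:
        return False
    approval_rate = s.approvals / s.cases
    reversal_rate = s.reversals / s.approvals  # assumed denominator
    return (
        approval_rate > 0.95
        and reversal_rate < 0.01
        and not (s.deny_reasons & BLOCKING_DENY_REASONS)
        and s.consultant_signed_off
    )
```

The sign-off is just another boolean in the conjunction, which is the point: the data can make a class eligible, but nothing graduates until a named person says so.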

When a class graduates: future cases skip the queue and ship the action directly; a sampled audit kicks in (one in twenty cases still gets a post-hoc human review); the audit table gets a new field recording whether this row was human-approved or auto-resolved, which rule version, signed by whom. That field is the bridge between "the AI did it" and "a named human authorized the AI to do it under these conditions."
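
One way to implement the one-in-twenty sample is to hash the case ID instead of rolling a die. That is a design choice of this sketch, not something the process above requires, and the function name and inputs are hypothetical.

```python
import hashlib

SAMPLE_ONE_IN = 20  # the one-in-twenty rate from the text


def needs_posthoc_review(case_id: str, rule_version: str) -> bool:
    """Deterministically pick roughly 1 in 20 auto-resolved cases for review.

    Hashing (rather than random()) means a retry of the same case gets the
    same answer, and the choice can be re-derived later from the audit row.
    """
    digest = hashlib.sha256(f"{rule_version}:{case_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % SAMPLE_ONE_IN == 0
```

Including the rule version in the hash means a new version of the rule draws a fresh sample instead of re-reviewing the same cases.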

The audit trail through the transition

This is the part most teams botch.

The audit table on day one: case ID, input, proposed action, evidence, reviewer ID, decision, timestamp. Reviewer ID always a real human.

On day 200, after several classes have graduated: same columns plus decision-maker (human or rule), rule ID and version, graduation authority (the consultant who signed), sampled review (yes/no, reviewer ID if yes). Every row is still answerable. Every row still has a chain of accountability. The chain just runs through a versioned rule signed by a named person, instead of a live reviewer.

Two things to insist on.

Graduation rules live in the same versioned, reviewed place as your prompts and eval cases. Every rule change goes through PR. I keep mine as small declarative YAML files alongside the prompts. (See prompts as code for why.)

The sampled-review path feeds back into the eval harness. When a post-hoc reviewer finds an auto-resolved case that should have been denied, that case becomes a golden example, the rule gets re-evaluated, and if the failure rate creeps above the threshold, the rule gets pulled. The eval harness is what makes graduation safe. Without it, graduation is just deletion of safety.
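
The pull decision itself can be a small function. In this sketch both the failure threshold and the minimum sample size are assumed values, since the text doesn't fix them; pick your own and keep them in the same versioned file as the rule.

```python
def should_pull_rule(sampled: int, should_have_denied: int,
                     max_failure_rate: float = 0.01,
                     min_sample: int = 20) -> bool:
    """Decide whether a graduated rule goes back behind the gate.

    sampled: post-hoc reviews completed for this rule version.
    should_have_denied: how many of those the reviewer would have denied.
    Both default thresholds are illustrative assumptions.
    """
    if sampled < min_sample:
        return False  # too little data to judge; keep sampling
    return should_have_denied / sampled > max_failure_rate
```

A stricter variant would pull the rule on the first safety-related failure regardless of sample size. Whether that is right depends on how expensive a wrong answer is in the class.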

For the regulated-vertical reader. In medical, legal, and financial domains, the graduation rule may need to stay shallow even when the data says it could go deeper. A medical specialist running a second-opinion review product might decide no class auto-resolves, ever. The gate doesn't have to dissolve; it can just get faster: better summaries, clearer evidence presentation, keyboard shortcuts. Speed of human approval is a separate axis from removing it.

The queue UI

A bad queue is a wall of text with two buttons. The reviewer skims, gets bored, starts clicking approve to clear the backlog. The audit trail says "approved by Jane" but Jane is a rubber stamp.

A good queue leads with a one-line summary, enough to evaluate easy cases at a glance, with everything else collapsed but one click away. Approve is the default-focus button. Enter ships it. Deny opens a small box where the reason is required. Edit-and-approve is a third option, captured as "human modified the AI's proposal before shipping." Those cases are gold for pattern-mining (systematic edits signal a prompt fix).

I track median time to review by class. If it's dropping because cases are obviously fine, the class is a graduation candidate. If it's rising because cases are getting harder, the AI shouldn't be touching those at all, and triage needs to route them elsewhere.

When the gate doesn't go away

Some classes never graduate, and that's the right answer.

Anything where the cost of a wrong answer is asymmetric and large (a contract going out under wrong terms, a financial recommendation a customer will act on, a medical interpretation affecting treatment) stays gated. Even if the AI is right 99% of the time, the 1% is too expensive to absorb. Mark these as manually held and never auto-graduate them, no matter what the numbers say.

Anything where the input distribution is unstable stays gated until it stabilizes. Strong historical approval rates with a recent spike in denials means the world moved. Graduate after the shift has settled.

Anything where the consultant's brand depends on the human touch stays gated as positioning. Some products are the human review; the AI's job is to make it faster and better-evidenced, not to replace it. That's a fine business model. The gate isn't a failure state. It's the product.

The arc, plainly

Day one: gate everything. Both outcomes logged.

Weeks one through twelve: mine the queue. Find classes where approval is reflexive and reversal is rare. Confirm with the consultant. Write the graduation rules.

Month three onward: confident classes auto-resolve, with sampled audit. Hard classes stay queued. The gate has gotten thinner, not disappeared. Every action (auto or human) is still answerable in the audit table.

Year two: the queue is small. The reviewer's role has shifted from "decide every case" to "decide the hard cases and supervise the rules." The audit story is stronger than on day one, because now you can show not just the decisions but the rules behind the decisions, the human who authorized each rule, and the sampled-audit data proving each one continues to behave.

The AI graduates from suggesting to acting. The human graduates from deciding to supervising. The audit trail does neither: it stays the same shape, the same "named human, named evidence, named decision" all the way through. The trail is what makes the rest safe. Build it that way on day one, and graduation is a feature you ship; build it any other way, and graduation is a story you can't tell.