estebanf.com

A Practical Test for AI-Assisted Product Discovery

estebanf — Mon, 25 May 2026 19:42:57 +0000

When you use AI in product discovery, the tempting move is to treat the summary as the customer signal. A product team finishes a week of calls, asks for the themes, and gets a clean answer: customers want “faster onboarding.” That sounds useful because it is short, confident, and easy to bring into a roadmap meeting. The team can turn it into an initiative, write a few tickets, and move on.

The real discovery work starts when someone opens the source interviews. One customer was blocked because procurement needed a permission model before rollout. Another understood the product but could not explain the pricing to their finance team. A third wanted to move faster because migration felt risky and they did not trust their own data. The label “faster onboarding” collapsed three different situations into one product-sounding finding.

That is the risk I keep coming back to with AI-assisted discovery. AI can help product teams read more material, prepare better questions, and spot patterns earlier. But it weakens the work when the output replaces contact with the evidence behind it. The issue is broader than empathy loss. The sharper risk is evidence detachment: the team loses the chain between a product claim and the customer material that supposedly supports it.

The answer is not to keep AI out of discovery. The better standard is to make every AI-assisted insight source-linked and judgment-led, especially when the output will influence a roadmap, pricing change, onboarding redesign, positioning decision, or enterprise commitment.

AI is most useful before the team believes it has an answer

In my experience, the safest uses of AI in discovery happen before the team treats a pattern as true. A PM can scan support tickets before customer interviews and prepare sharper questions. A researcher can compare recent calls with older notes and look for signs that a pattern is changing. A designer can draft prototype variations so customers have something concrete to react to.

Those uses improve the human conversation. They help the team inspect more material, avoid overreacting to the loudest interview, and notice questions worth asking. The AI output does work, but it does not carry the final judgment.

AI can also help after customer conversations. It can summarize transcripts, cluster themes, extract quotes, surface outliers, and compare feedback across channels. It can reduce familiar human failure modes, such as over-weighting a vivid anecdote or forgetting an old research thread that should inform the decision.

The boundary matters. A synthetic persona may help the team brainstorm interview questions. A transcript cluster may help the team find patterns across fifty calls. Neither should become proof that a market segment wants a feature unless someone has inspected enough of the underlying evidence for the decision being made. The practical threshold depends on the cost of being wrong.

Fluency makes weak evidence feel stronger

Product discovery depends on noticing friction. A customer contradicts themselves. A feature request hides a workflow problem. A support ticket sounds urgent until an interview reveals that only one admin role is affected. A sales call says “security,” but the real blocker is internal politics.

When you are swimming in calls, tickets, surveys, and notes, AI compression feels like relief. No one wants to reread every transcript before every planning session. The danger is that polished synthesis can feel more certain than the evidence deserves, especially when the model has done what models often do by default: compress, smooth, name, and organize.

You can see the problem in a common workflow. A team runs ten interviews, uploads the transcripts, and asks AI for the top themes. The model returns five neat clusters. Everyone agrees the synthesis is directionally right. In the planning meeting, the team cites the clusters, not the interviews, and no one asks which customers contradicted the pattern, which segments were missing, or whether one vivid story is doing too much work.

The team did research, but the decision no longer touches the research. The process has to catch that failure mode. Exploratory and reversible choices can move with lighter inspection. High-consequence choices need direct review of quotes, clips, tickets, notes, observations, and outliers.

Traceability is only the starting point

AI research tools are moving toward traceability for a reason. Teams want to move from an insight to the source quote, recording, timestamp, transcript, ticket, or note behind it. That is a good minimum requirement because it lets the team reconstruct how a claim was formed.

But a source link does not make the work good. It can point to a skewed sample, a misread quote, or a repository entry that gives weak evidence an undeserved sense of authority. Traceability improves reviewability. It does not replace judgment.

This distinction matters because the pressure to scale discovery with AI is reasonable. Executives will ask why the team cannot synthesize every call, ticket, review, and survey response. Founders will ask when synthetic users are good enough for early concept testing. PMs will ask why they should wait for interviews when a model can simulate the target persona in seconds.

Those questions deserve a practical answer. Direct customer contact does not scale cleanly, and human researchers also bring bias into the work. That makes the operating model more important, not less. The useful question is what job AI is allowed to do on the path from evidence to decision.

The test before the decision

Before you use an AI-generated summary, cluster, recommendation, persona, or concept as input to a product decision, run a simple diagnostic. I think of it as the evidence-linked discovery test.

First, what source evidence does this claim trace back to? A real trace should lead to a quote, clip, note, ticket, observation, survey response, or artifact. A general reference to “customer interviews” is not enough. If the team cannot reconstruct the path from claim to source, the output should stay in the hypothesis pile.

Second, have we inspected enough underlying material for the decision risk? The team does not need to reread every transcript for every choice, but someone accountable for the decision should inspect enough raw evidence to understand what the synthesis compressed. For a low-risk exploration, that may mean sampling quotes and outliers. For a roadmap commitment, pricing change, or enterprise rollout, the threshold should be higher.

Third, what variance did the AI output smooth over? Look for contradictions, segment differences, edge cases, emotionally charged moments, and places where the same label hides different causes. Ask the model to surface disagreement, not only agreement, then inspect a sample yourself.

Fourth, which parts are findings and which parts are interpretations? “Three admins mentioned permission limits during onboarding” is closer to evidence than “customers want faster onboarding.” Both may be useful, but they should not carry the same weight in a decision.

Fifth, what source types are missing? Interviews may miss behavioral data. Tickets may overrepresent frustrated customers. Sales calls may overrepresent buyers and underrepresent users. Traceability does not fix sampling bias, so the team still needs to ask what the evidence can and cannot represent.

Sixth, did the output preserve uncertainty? Good discovery outputs should show confidence, disagreement, and open questions. If the AI output sounds more certain than the evidence deserves, lower the decision weight.

Seventh, are we using AI to expand evidence review or replace missing evidence? AI is valuable when it helps the team inspect more real material. It becomes risky when it fills gaps the team has chosen not to investigate.

Finally, who owns the judgment? The model can draft a synthesis, suggest a pattern, and point to supporting excerpts. But the product team owns the decision. If no human can explain why the evidence supports the action, the team has delegated judgment without saying so.
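
One way to make the test harder to skip is to record each AI-assisted claim in a structure that cannot be promoted without answers. The sketch below is a minimal illustration, not a prescribed tool. The field names, the `decision_status` function, the source locators, and the inspection thresholds are all assumptions invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical thresholds: raw items a human must inspect per risk level.
MIN_ITEMS_INSPECTED = {"low": 2, "high": 8}


@dataclass
class Claim:
    text: str
    is_interpretation: bool                             # finding vs. interpretation
    sources: list = field(default_factory=list)         # quotes, clips, tickets, notes
    items_inspected: int = 0                            # raw items a human actually read
    disagreements: list = field(default_factory=list)   # contradictions and outliers
    owner: Optional[str] = None                         # the human who owns the judgment


def decision_status(claim: Claim, risk: str) -> str:
    """Return 'hypothesis', 'needs_review', or 'usable' for a given decision risk."""
    if not claim.sources or claim.owner is None:
        return "hypothesis"      # no trace or no owner: stays in the hypothesis pile
    if claim.items_inspected < MIN_ITEMS_INSPECTED[risk]:
        return "needs_review"    # traced, but not inspected enough for this risk
    return "usable"


claim = Claim(
    text="Three admins mentioned permission limits during onboarding",
    is_interpretation=False,
    sources=["interview-07 @ 12:40", "interview-11 @ 03:15", "ticket-2291"],
    items_inspected=3,
    owner="PM for onboarding",
)
print(decision_status(claim, "low"))   # usable
print(decision_status(claim, "high"))  # needs_review
```

The same claim is usable for a low-risk exploration and still needs review for a roadmap commitment, which is the threshold idea from the second question.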

How the test changes the work

Go back to the onboarding example. The AI synthesis says customers want a shorter setup flow. Without the test, the team might create an “onboarding speed” initiative and start removing steps.

With the test, the team inspects the source evidence and finds three different problems. Enterprise customers are waiting on admin approval. Small teams are confused by pricing. Migrating customers are nervous about data loss. A shorter flow may help one group and hurt another. The better roadmap may include permission templates, pricing guidance, and migration reassurance instead of only fewer screens.

The AI summary still did useful work because it pointed the team toward a real area of friction. But the source evidence changed the decision, which is the point of the whole exercise.

AI should move the team faster toward the right questions, not faster past the evidence. The best AI-assisted discovery workflow is not the one that produces the cleanest summary. It is the one that helps the team find the customer realities buried under a confident label.

Design operating contracts for AI agent personas

estebanf — Sun, 17 May 2026 07:44:04 +0000

In July 2025, Replit’s AI coding agent reportedly deleted a production database during a code freeze after being told not to make changes. The model made a bad call. What matters for product teams is that the environment still let the bad call execute.

That is the shift product teams need to absorb. When an AI agent can read context, call tools, update systems, route work, approve requests, or hand something back to a person, it has become a workflow participant. You do not need to treat it like a human, but you do need to specify its operating role with the same seriousness you bring to users, admins, approvers, and support teams.

This is what I mean by an agent persona. Not a fictional character. Not a friendly chatbot identity. Not a simulated employee. An agent persona is a product specification for what the agent reads, decides, routes, acts on, escalates, proves, and can no longer do when access is revoked.

Below is the practical artifact: a canvas for specifying the operating role of an agent before it enters a workflow. The argument for the canvas is simple. If you do not define the role of the agent before launch, production will define it for you through permission mistakes, handoff failures, audit gaps, bad metrics and users who no longer know what the system is doing.

Most product teams already map the people around a workflow. You define the requester, the approver, the admin, and the support team. You map goals, pain points, permissions, and screens. Then an agent gets added, and at first it looks like a feature. It reads the request, checks policy, approves routine cases, and escalates the unusual ones. That sounds like automation until you ask basic product questions.

What can it read, and what can it change? Does it act for the requester, the approver, or itself? What confidence threshold lets it approve? What happens during a freeze, a policy conflict, or a missing field? How do you prove what it did two weeks later?

Those are product requirements, not implementation details.

Workflow fit comes before model capability.

The wrong starting point is, “What can the model do?” The better starting point is, “Where does the workflow need a non-human participant?”

Agents do not create value by being capable in the abstract. They create value when they remove latency, process routine work, improve handoffs, or help humans make better decisions. The supervision cost has to be lower than the work removed.

This is where many agent ideas quietly fail. The model can summarize, classify, retrieve and suggest. But in the real workflow, the user still has to infer what the agent is doing, check its reasoning, supply missing context and carry the risk of trusting it. Delegation becomes slower than doing the work directly.

A useful agent persona starts with a narrow operating role. For example:

It classifies inbound requests and routes them to the right queue.
It checks routine approval criteria but does not approve exceptions.
It drafts a response and cites the records it used.
It updates a system only after a human confirms the final action.
It acts automatically for low-risk reversible changes, but stops for sensitive or irreversible ones.

If you cannot write the agent’s role with that precision, you are probably designing around model capability rather than workflow need.

Permissions and context define the real product boundary.

Human personas usually cover goals, behaviors, pains, and jobs. Agent personas need those, plus a permission model.

The language of “persona” is risky if we use it casually. Security and platform teams will rightly ask why we are talking about personas when the hard problem is identity, authorization, delegation, revocation and auditability. So the term has to be constrained. An agent persona is a bridge from product language to operating controls. It should produce an operating contract, not a personality.

Agents should not be broad service accounts with vague internal trust. If an agent acts on behalf of a user, its access should bind to that user’s role and scope. If it acts for a team, the team’s authority needs to be limited. If it has its own constrained identity, the product needs to define what that identity can read, write, approve, delete, trigger, retain, and for how long.
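
A delegated grant can be sketched as a small record that carries who the agent acts for, what it may do, and when that authority ends. This is an illustrative sketch only. `Grant`, `is_authorized`, and the example identifiers are hypothetical names, not part of any real identity system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Grant:
    agent_id: str
    on_behalf_of: str            # a user, a team, or the agent's own constrained identity
    allowed_actions: frozenset   # e.g. {"read", "approve"}
    expires_at: datetime
    revoked: bool = False


def is_authorized(grant: Grant, action: str, now: datetime) -> bool:
    """Allow an action only if the grant is live, unrevoked, and covers it."""
    if grant.revoked or now >= grant.expires_at:
        return False
    return action in grant.allowed_actions


now = datetime(2026, 5, 17, 9, 0, tzinfo=timezone.utc)
grant = Grant(
    agent_id="expense-agent",
    on_behalf_of="user:ana",
    allowed_actions=frozenset({"read", "approve"}),
    expires_at=now + timedelta(hours=8),
)
print(is_authorized(grant, "approve", now))                    # True
print(is_authorized(grant, "delete", now))                     # False
print(is_authorized(grant, "read", now + timedelta(hours=9)))  # False
grant.revoked = True
print(is_authorized(grant, "read", now))                       # False
```

Expiry and revocation are checked before scope, so a stale or withdrawn grant fails closed regardless of what it once allowed.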

McKinsey’s Lilli incident made the stakes concrete. CodeWall reported that an autonomous offensive agent found unauthenticated endpoints, exploited a SQL injection flaw, and gained access to McKinsey’s internal AI platform. The product lesson is not just that authentication failed. It is that the blast radius ran through the knowledge system itself: retrieval data, tool surfaces, prompts, and decision logic. When those layers are reachable through weak authorization, the risk moves from data leakage to trust poisoning.

For product teams, the practical question is not, “Can the agent connect to Salesforce, Drive, Jira, or the EHR?” The question is, “What authority does the agent carry into that system, for whom, for how long, and how do we take it away?”

Context is the other side of the same problem. Agents need to know which records are authoritative, which fields matter, which prior decisions still apply, and which missing details require a stop. A form field that felt optional for humans may become necessary for agent execution. A process that exists only in a manager’s head is not ready for an agent to act on it.

This is also where prompt injection becomes a product concern. If an agent reads emails, documents, web pages, code comments, or customer chats, it consumes content it did not author and cannot automatically trust. The product has to define which inputs are trusted, which are untrusted, and which tools are reachable after the agent reads them. Summarizing an email carries one level of risk. Reading an email and then issuing a refund, changing an account, or updating a medical record operates under a different risk model entirely.
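
One common way to express this in a product is to narrow the reachable tools once untrusted content enters the session. The sketch below assumes a simple two-tier split. The source names, the tool names, and the `Session` class are hypothetical and exist only to show the shape of the rule.

```python
# Hypothetical tiers: which tools stay reachable once untrusted content is in context.
SAFE_AFTER_UNTRUSTED = {"summarize", "draft_reply"}
ALL_TOOLS = SAFE_AFTER_UNTRUSTED | {"issue_refund", "update_account"}
TRUSTED_SOURCES = {"policy_db", "crm_record"}


class Session:
    def __init__(self):
        self.tainted = False

    def read(self, source: str) -> None:
        # Anything outside the trusted list marks the session as tainted.
        if source not in TRUSTED_SOURCES:
            self.tainted = True

    def reachable_tools(self) -> set:
        return set(SAFE_AFTER_UNTRUSTED) if self.tainted else set(ALL_TOOLS)


session = Session()
session.read("crm_record")
print("issue_refund" in session.reachable_tools())  # True
session.read("inbound_email")
print("issue_refund" in session.reachable_tools())  # False
```

The session never becomes untainted in this sketch. Whether and how trust is restored is a product decision the sketch does not try to answer.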

Human control has to be designed as a workflow state.

Many teams talk about human-in-the-loop as if it means putting a person somewhere near the system. This is too vague.

Human review must be a state in the workflow. It needs a trigger, a handoff package, a response expectation and a record of what happened. Good escalation design separates three types of work. Routine, low-risk, reversible actions can be automated. Cases where the agent lacks context, confidence or authority should be paused for human review. Fast-moving workflows, such as fraud checks, cannot wait for humans, but they still need reliability, monitoring and after-action review.
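
The three types of work can be written as an explicit routing rule. This is a rough sketch under stated assumptions: the dictionary keys, the `route` function, and the 0.8 confidence cutoff are placeholders chosen for illustration, not recommended values.

```python
def route(action: dict) -> str:
    """Classify an action into one of the three types of work described above."""
    if action.get("time_critical"):
        # Too fast for a human in the loop: act, but monitor and review afterwards.
        return "automate_with_after_action_review"
    lacks_basis = not (action.get("has_context") and action.get("has_authority"))
    if lacks_basis or action.get("confidence", 0.0) < 0.8:  # 0.8 is an arbitrary placeholder
        return "pause_for_human_review"
    if action.get("low_risk") and action.get("reversible"):
        return "automate"
    return "pause_for_human_review"  # default to a human when nothing else applies


routine = {"has_context": True, "has_authority": True, "confidence": 0.95,
           "low_risk": True, "reversible": True}
print(route(routine))                             # automate
print(route({**routine, "reversible": False}))    # pause_for_human_review
print(route({"time_critical": True}))             # automate_with_after_action_review
```

The fallthrough returns a human review state, so an action the rule does not recognize is paused instead of executed.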

The design job is to put human control where it changes the outcome. If every action needs approval, people rubber-stamp the system. If no action needs approval, the product creates unsafe autonomy. The boundary is the design decision.

A coding agent can read tickets, inspect failing tests, propose a patch and draft a pull request. But architectural decisions, production database changes, credential changes and destructive commands need a different control surface. A code freeze means little if the agent still has write access to production.

When an agent hands work to a human, the human should not have to reconstruct the agent’s path from scattered logs. The handoff should explain what the agent saw, what it decided, what it changed, what it could not determine, why it stopped and what it recommends next. For higher-risk workflows, the product also needs tool calls, API actions, traces, model versions and timestamps that operations and security teams can inspect later.

The agent persona canvas

1. Workflow fit and net value

What job in the workflow needs a non-human participant, and what work should disappear if the agent succeeds? Start here before autonomy, tools, or model selection. Name the outcome the workflow exists to produce: quality, cycle time, customer satisfaction, error rate, compliance performance, rework, or decision latency. Then name the supervision cost. Ticket volume alone is not enough. Average handle time is not enough.

2. Agent autonomy level

What level of agency are you designing for in each sub-task? Use simple categories from the agent’s perspective:

Agent-as-assistant: the human acts, and the agent supports.
Agent-as-collaborator: the human and agent share the work.
Agent-as-advisor: the agent recommends, and the human acts.
Agent-as-approver: the agent acts within defined limits, and humans intervene at defined points.
Agent-as-actor: the agent acts independently inside a bounded workflow.

A single workflow can have more than one level. The agent may advise on exceptions, approve routine requests and only assist with sensitive changes.

3. Operating role and bounded outputs

What does the agent own? Name the verbs: read, retrieve, classify, summarize, route, decide, draft, update, execute and escalate. Then name what it explicitly does not do. This prevents the agent from expanding quietly from “summarizes tickets” to “changes ticket priority” to “reassigns work” to “closes cases.”

Bounded outputs matter because they turn ambition into product control. If the agent drafts, does it draft a response, a policy decision, a database update, or a customer-facing action? If it routes, can it only recommend a queue, or can it move the case? If it approves, what dollar amount, policy class, customer segment, or risk tier defines the boundary?
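
Naming the verbs and the boundary can be as small as an allowlist plus a limit. The sketch below is illustrative only. `OperatingRole`, `is_in_role`, and the dollar limit are hypothetical, and a real role would likely bound more than one verb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingRole:
    allowed_verbs: frozenset
    approval_limit_usd: float = 0.0   # hypothetical boundary for the "approve" verb


def is_in_role(role: OperatingRole, verb: str, amount_usd: float = 0.0) -> bool:
    if verb not in role.allowed_verbs:
        return False                  # anything not named is out of role
    if verb == "approve":
        return amount_usd <= role.approval_limit_usd
    return True


triage = OperatingRole(frozenset({"read", "classify", "summarize", "route", "escalate"}))
approver = OperatingRole(frozenset({"read", "approve", "escalate"}), approval_limit_usd=500.0)

print(is_in_role(triage, "summarize"))         # True
print(is_in_role(triage, "update"))            # False
print(is_in_role(approver, "approve", 120.0))  # True
print(is_in_role(approver, "approve", 900.0))  # False
```

Because the role is frozen and the check is an allowlist, moving from "summarizes tickets" to "changes ticket priority" requires someone to edit the role on purpose.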

4. Context, delegated authority and adversarial exposure

What does the agent need to read, under whose authority, and what can it still do after reading untrusted inputs?

This question combines context, permission and risk because they are coupled in practice. The agent’s inputs, identity, access scope, freshness, expiration and revocation all belong together. Here are the questions to work through:

Which records, messages, files and tools are authoritative?
Which inputs are untrusted?
Does the agent act for a user, role, team, or constrained agent identity?
How long does access last?
What revokes access?
Which tools remain reachable after the agent reads untrusted content?
Which actions are low-risk, reversible, sensitive, irreversible, or externally visible?

An agent that reads email but cannot write carries one set of risks. Give it the ability to send messages, approve invoices, or update customer records, and the risk profile shifts entirely. The Lilli incident belongs here, not as an AI morality tale, but as a product boundary lesson. If the agent’s corpus, prompts, tools and APIs sit behind weak authorization, the workflow’s real persona is broader than the one the product team wrote down.

5. Human control and escalation contract

Where can a human understand, intervene, halt and resolve?

Understanding means the human can see the decision path. Intervention means the human can correct or redirect before harm. Halt means the human or system can stop the agent from continuing. Resolution means the human receives enough information to finish the job instead of starting over.

Common escalation triggers include low confidence, missing context, policy conflict, unusual request, sensitive action, customer impact, repeated failure, or a request outside the agent’s role. The handoff package should include the decision path, the evidence used, the missing information, the reason for escalation, the recommended next step and the deadline or SLA if time matters.
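
The handoff package described above maps naturally to a record with required fields. This is a minimal sketch. The class name, the field names, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoffPackage:
    decision_path: list
    evidence_used: list
    missing_information: list        # may legitimately be empty
    escalation_reason: str
    recommended_next_step: str
    respond_by: Optional[str] = None  # deadline or SLA, only if time matters


def missing_fields(pkg: HandoffPackage) -> list:
    """Names of required fields the reviewer would find empty."""
    required = ("decision_path", "evidence_used", "escalation_reason", "recommended_next_step")
    return [name for name in required if not getattr(pkg, name)]


pkg = HandoffPackage(
    decision_path=["read request", "checked policy P-12", "amount above routine limit"],
    evidence_used=["request #553", "policy P-12"],
    missing_information=["cost center"],
    escalation_reason="policy conflict",
    recommended_next_step="",
)
print(missing_fields(pkg))  # ['recommended_next_step']
```

A workflow could refuse to enter the human review state while `missing_fields` returns anything, so the reviewer never receives a package that forces them to start over.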

6. Audit, observability and recovery

What records must the product produce so the team can debug, prove and repair the agent’s work?

For simple workflows, this may mean action logs and human-visible history. For complex workflows, it means traces, tool calls, API records, model versions, prompt versions, workflow IDs, evaluation events and cost or latency metrics. If the workflow fails, the team should be able to reconstruct what happened without asking the agent to explain itself after the fact.
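
As a rough illustration of what one such record might hold, the sketch below serializes a single agent action as a JSON line. The function name, the field set, and every example value are hypothetical. A real system would choose its own schema and storage.

```python
import json
from datetime import datetime, timezone


def audit_record(workflow_id, tool, arguments, model_version, prompt_version, at):
    """Serialize one agent action as a JSON line that can be stored and inspected later."""
    return json.dumps(
        {
            "workflow_id": workflow_id,
            "tool": tool,
            "arguments": arguments,
            "model_version": model_version,
            "prompt_version": prompt_version,
            "timestamp": at.isoformat(),
        },
        sort_keys=True,
    )


line = audit_record(
    workflow_id="wf-1042",
    tool="crm.update_ticket",
    arguments={"ticket": "T-88", "priority": "high"},
    model_version="model-2026-05",
    prompt_version="triage-v3",
    at=datetime(2026, 5, 17, 7, 44, tzinfo=timezone.utc),
)
print(line)
```

The record is written at the time of the action by the product, not reconstructed by the agent afterward.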

Recovery belongs in the same section because audit without repair is theater. Who can reverse the action? What can be rolled back? What must be disclosed? Who owns customer remediation? Who updates the workflow so the same failure does not repeat?

7. Adoption readiness and rollout path

Is the workflow documented, integrated, governed and understood well enough for an agent to operate? Look for the basics: clear use case, clean enough data, working integrations, documented process, trained users, governance model and supervision capacity. If those are missing, the smarter move may be to ship a lower-autonomy agent first. Let the agent recommend, summarize and prepare work before it acts.

A sensible rollout narrows the first version on three axes at once: less autonomy, more reversible work and tighter delegated authority. The goal is to keep autonomy tied to evidence that the workflow can support it, not to slow down agent adoption.

Where this fits in the product process

The canvas is the upstream artifact, not the final one. Once you answer it, you can write three downstream specifications with more precision.

The first is the agent identity record: who or what the agent is, what authority it carries, when that authority expires, how it is revoked, and who owns the lifecycle. The second is the workflow blueprint: the steps, decisions, tools, system boundaries, human handoffs and failure states. The third is the control set: the guardrails, logs, alerts, evaluations, prompt-injection protections, approval gates and recovery paths that make the workflow operable.

This matters because the word “persona” can mislead teams if it stops at UX language. The canvas should produce an operating contract that engineering, security, operations, compliance, support, and product can all challenge, rather than a profile with a name, a tone and a personality.

You need answers to six questions before launch. Can you bound the action? Does the workflow give the model enough context to be accurate? Whose authority does the agent carry, how long does it last, and who can revoke it? Does the human get enough information to change the outcome? Do your metrics capture quality and supervision cost, not only volume? Can you reconstruct the decision path after something goes wrong?

A product persona for an AI agent is a specification for a non-human participant in the workflow. It defines the work the agent owns, the authority it carries, the boundaries it respects, the moments where it stops and the evidence it leaves behind. The product environment that allowed the bad call is the thing this work is meant to design out.

AI Agent Governance: The Control Plane for AI Work

estebanf — Sun, 10 May 2026 08:56:50 +0000

The market wants to know how much an AI agent can handle on its own. Enterprises, on the other hand, care about whether they can accept the agent’s actions.

In high-risk enterprise workflows, the most successful systems will not be the ones that act alone. Instead, they will be the ones whose actions a company can approve, review, undo, value, and justify. It is easy to show an AI system taking action. The real challenge is whether the company can manage the results after the fact.

A demo is not the same as a real deployment.

Imagine a support agent dealing with an upset customer. It reads the ticket, checks the account history, decides a refund is deserved, updates Salesforce, sends a confirmation to the customer, and creates a finance record.

In a demo, this seems like progress. A complaint is resolved without any human delay. But for the business, new questions come up. Was the refund in line with company policy? Did this customer’s segment need extra approval? Was the agent allowed to update the CRM? Did finance get the same amount that Salesforce now shows? If not, the company now has a small operational issue and a customer-facing receipt to fix.

A failure doesn’t have to be dramatic to cause problems. Maybe the refund is just over the limit, or the customer needed approval. Now, the CRM and finance system don’t match, and the customer has a confirmation that the company might need to take back.

What you want is a mature execution layer that gives the agent the best chance to catch issues before acting and leaves the organization with as little cleanup as possible afterward. It needs to review the context and business rules thoroughly. If something looks off, it sends the case to a person. If everything is fine, it records the details, including intent, authority, data, approver, systems changed, and how to reverse the action. That’s what separates a clever agent from a truly useful one.

Governance now means three things, though people often use just one word. Policy sets the rules. Observability tracks what happened. Domain-aware execution control decides what the agent can do in each workflow.

The third layer is the most strategic because it’s closest to the actual work. It knows the difference between giving a refund and writing a summary, accepting a contract clause and drafting one, or reconciling a payment and approving an exception.

That’s why autonomy isn’t the right term for most enterprises. What companies really want is delegation.

Delegation means setting clear job limits. It includes permissions, authority boundaries, ways to escalate, a named owner, and consequences if limits are crossed.

A good enterprise agent acts less like an unsupervised worker and more like a responsible colleague with clear limits. It knows which systems it can use, which actions it can take on its own, which need approval, and when to stop if things are unclear. It also keeps a record for others to review later.

The enterprise execution test

The first question shouldn’t just be, “Can it do the task?” That’s for a demo. The real question is whether the system can work within the company’s rules. Before scaling an AI workflow, I’d ask six questions.

Intent: What human direction is the system trying to execute?
Permissions: What systems, data, and tools can it access?
Authority: Which actions can it take alone, and which require approval?
Escalation: When does uncertainty, risk, or sensitivity route the work back to a person?
Rollback: Can the organization reverse or repair the action if the system is wrong?
Auditability: What record proves what happened, why it happened, and who was responsible?

These questions affect how you decide to deploy. A support agent that just summarizes tickets and drafts replies might be ready for wide use. But if it issues refunds, changes account status, updates Salesforce, notifies customers, or triggers finance actions, it needs clear authority limits based on customer segment, region, agent role, and risk level.

A finance agent that matches two systems and flags issues can add value without making decisions that affect clients. But an agent that approves exceptions or releases payments is much riskier. The key question changes from whether the model can find the right answer to whether the company has set clear authority limits.

A coding agent that writes a patch, runs tests, and opens a pull request fits well into the usual review process. But if it merges, deploys, or changes sensitive files, it needs risk-based rules: let low-risk changes go through, require review for anything affecting production, and block sensitive changes unless someone gives clear approval.
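
Those risk-based rules can be sketched as a gate over the files a change touches. This is an assumption-heavy illustration. The path patterns and the `gate` function are hypothetical, and a real repository would define its own sensitive and production paths.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; each team would maintain its own lists.
SENSITIVE_PATTERNS = ["secrets/*", "*.pem", ".github/workflows/*"]
PRODUCTION_PATTERNS = ["deploy/*", "migrations/*"]


def touches(changed_files, patterns) -> bool:
    return any(fnmatchcase(path, pattern) for path in changed_files for pattern in patterns)


def gate(changed_files, explicit_approval: bool = False) -> str:
    """Return 'allow', 'require_review', or 'block' for a proposed change."""
    if touches(changed_files, SENSITIVE_PATTERNS):
        return "allow" if explicit_approval else "block"
    if touches(changed_files, PRODUCTION_PATTERNS):
        return "require_review"
    return "allow"


print(gate(["docs/readme.md"]))                            # allow
print(gate(["migrations/001_add_index.sql"]))              # require_review
print(gate(["secrets/api.key"]))                           # block
print(gate(["secrets/api.key"], explicit_approval=True))   # allow
```

Sensitive paths are checked first, so a change that touches both a sensitive file and a production file is blocked, not merely sent to review.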

This test is simple on purpose. If your team can’t answer these questions, you’re not ready to scale the workflow. You might still be ready to experiment, but that’s a different situation.
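
The six questions can be kept as a plain checklist that decides between scaling and experimenting. The sketch below is one possible shape. The `readiness` function, its return strings, and the example answer are illustrative assumptions.

```python
QUESTIONS = ("intent", "permissions", "authority", "escalation", "rollback", "auditability")


def readiness(answers: dict) -> str:
    """Report whether all six questions have a written answer."""
    unanswered = [q for q in QUESTIONS if not answers.get(q, "").strip()]
    if unanswered:
        return "experiment_only: missing " + ", ".join(unanswered)
    return "ready_to_scale"


print(readiness({"intent": "Resolve refund tickets under policy R-3"}))
# experiment_only: missing permissions, authority, escalation, rollback, auditability
print(readiness({q: "documented" for q in QUESTIONS}))
# ready_to_scale
```

The check only confirms that an answer exists. Whether the answer is good enough is still a human judgment.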

This doesn’t mean that having rules replaces the need for a good product. A weak system will still fail. If employees don’t trust a tool, they won’t use it. If a workflow asks for too many changes, people may stick to old habits. But once a company lets AI take important actions, the deployment question changes. Now, it’s not just about whether the tool is useful; it’s about whether the business can approve, oversee, undo, and defend what the tool does.

Cheap execution changes the human job.

AI makes it cheaper to get acceptable work done. This is real, and it will change how many workflows operate. Often, the system doesn’t have to be better than the top expert. It just needs to do good work fast, cheaply, and reliably enough to change the process.

But making work cheaper doesn’t remove the need for human judgment. It just changes where that judgment is needed. The real value now is in designing how work is delegated: deciding what should be done, what counts as good, when the machine should act, and who is responsible for the outcome.

There’s another risk to consider. If AI workflows focus only on efficiency, they might weaken the places where people develop judgment. For example, junior underwriters learn by reviewing regular cases and exceptions. Junior lawyers learn by comparing drafts and getting feedback. Junior engineers learn by debugging and handling small changes.

The solution isn’t to keep people doing low-value tasks just to keep them busy. Instead, AI workflows should make it easier to follow human direction while keeping the judgment needed to oversee the work. For example, a legal AI workflow should include the lawyer’s judgment, with clear permissions, review steps, records of past decisions, and assigned responsibility. In support, escalation should be built into the process, not seen as a failure. The aim is to create better feedback loops, not to remove people from every step.

The companies that scale AI safely

The point isn’t that every company should build its own AI stack. What matters more is that companies using AI must control the way important work gets done.

This discipline doesn’t mean owning the foundation model, user interface, data warehouse, or every record system. It means controlling the workflow, permissions, action logs, feedback loops, and how results are measured for important tasks.

That’s why workflows in regulated or expert-driven fields need extra care. The real advantage is building domain judgment into repeatable processes. Most companies use AI through tools such as Jira, Salesforce, ServiceNow, Microsoft, GitHub, and other platforms. That’s fine. What matters isn’t owning the software, but owning the rules for delegation: who can ask the agent to act, what it can access or change, when a person must approve, what records are kept, and how the company measures improvement.

For companies using AI, the real competitive question isn’t just which tool has the best model. It’s about which workflows can handle machine actions without losing control. The difference between an impressive AI demo and something a business can really use is whether the company can handle the machine’s mistakes.

Know the agent before it moves the money

estebanf — Tue, 05 May 2026 21:46:05 +0000

A bank receives an instruction to move $4.8 million from a corporate treasury account. The request does not come directly from a human user. It comes from an AI agent operating inside the company’s finance stack. The bank has to decide whether the instruction is a valid delegated action, a misconfigured workflow, or a compromised agent.

KYC tells a bank who the customer is. KYB tells it who the business is. Agentic finance now needs a governance layer between identity and action. Know Your Agent, or KYA, links each AI agent to an accountable owner, a permitted purpose, a maximum authority tier, session-level permissions, and a revocation path. It is distinct from IAM because it governs delegated purpose and session authority, not just credential issuance.
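As a minimal sketch of what a KYA record could hold, the five links named above fit in one small structure. The class name, field names, and tier strings are illustrative assumptions, not a standard or any bank's schema.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """Hypothetical KYA record; every field name here is an assumption."""
    agent_id: str
    owner: str       # accountable human or business owner
    purpose: str     # permitted purpose, in plain language
    max_tier: str    # ceiling for the agent; recorded here, not enforced in this sketch
    session_tiers: dict[str, set[str]] = field(default_factory=dict)  # session id -> tiers granted
    revoked: bool = False  # revocation path: setting this blocks every session

    def may(self, session_id: str, tier: str) -> bool:
        """True only if the agent is not revoked and this session was granted the tier."""
        if self.revoked:
            return False
        return tier in self.session_tiers.get(session_id, set())


agent = AgentRecord(
    agent_id="treasury-agent-01",
    owner="corporate treasurer",
    purpose="daily liquidity optimization",
    max_tier="prepare",
    session_tiers={"s-1": {"retrieve", "recommend", "prepare"}},
)
```

The design point is that authority is looked up per session, so a valid credential alone never answers the question.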

The harder problem is not whether the agent can act but whether the bank can prove the agent had authority to act. Banks can scale agentic finance only when they treat agents as accountable delegates, with verified ownership, bounded authority, session permissions, audit lineage, monitoring, and revocation.

The access question is already here

This is no longer a distant scenario. Large banks are already building and testing internal agentic workflows across operations, technology, risk, compliance, customer service, and payments-adjacent work. Some are assigning digital workers logins, managers, and defined tasks. Others are researching agents that can reason across multiple steps or support financial workflows. Those examples should not be stretched into a claim that banks are already letting external customer agents execute high-value payments without gates. They show something more practical: agentic workflows are entering the bank before the governance model is fully settled.

The near-term path will usually be supervised delegation. Agents will retrieve sensitive data, summarize reports, triage inquiries, draft recommendations, prepare workflow packages, and ask humans or policy systems to approve the next step. That is still enough to require a new control question: what level of authority has this agent reached, and what proof does the bank need at that level?

Reading account data is one level. Recommending a payment is another. Preparing a transfer is another. Initiating it is another. Approving, executing, and reversing financial action are higher still. A useful control model has to separate those levels instead of hiding them inside one broad permission.

Banks already know how to identify people, businesses, applications, service accounts, vendors, and non-human credentials. They also know how to manage delegated authority through mandates, corporate resolutions, payment approval flows, segregation of duties, and audit logs. KYA must not pretend those controls are obsolete. The better argument is narrower: existing controls answer important questions, but they do not fully answer the agent-specific runtime question. What did the agent infer as its scope of authority from a broad instruction, and can the bank verify that inference against the intended delegation?

A service account can be over-permissioned, but it usually executes a defined function. A rules engine can move money, but it follows predefined logic. An AI agent can interpret a goal, decide what context matters, call tools, route work, and ask for more authority based on what it just found.

When a customer says “optimize cash across accounts,” the instruction sounds harmless. But what can the agent infer from it? Can it retrieve balances? Recommend a sweep? Prepare a transfer? Open a product? Trigger an FX transaction? Move funds above a threshold if the opportunity is time-sensitive?

The instruction is broad. The actions are specific. The control model must bridge that gap by attaching permission to the session, not just the agent.

Why the timing matters

There is a regulatory and market reason to solve this now. In the United States, supervisors continue to expect banks to use existing risk-management frameworks for AI, while also assessing whether current guidance fits future AI use. Revised model-risk guidance still matters for model development, validation, testing, and monitoring, but it does not by itself resolve agent-specific delegated authority, runtime permissions, tool access, and action accountability. Agentic systems belong in a separate inventory, related to MRM but not automatically covered by it. Model risk teams should validate applicable models and monitor behavior, but they also need IAM, cybersecurity, operations, legal, compliance, and payments teams at the table.

Singapore is moving faster in public finance-sector AI governance. MAS has advanced public AI risk-management work through Project MindForge, including practical material for financial institutions managing traditional AI, generative AI, and emerging agentic AI. That does not create a finished KYA rulebook, but it does show that agentic AI is being named as a specific governance problem.

The standards environment is also moving. The OpenID Foundation released a whitepaper in October 2025 on identity management for agentic AI, covering authentication, authorization, security, governance, and accountability. Protocol work such as Google’s Agent2Agent (A2A) Protocol (announced April 2025) and Anthropic’s Model Context Protocol is making tool access and agent-to-agent communication easier to standardize. That increases the need for banks to separate interoperability from authorization. These are early signals, not requirements. Banks should monitor protocol maturity, but no vendor framework can substitute for internal control design.

Interoperability is not authorization. A protocol can help agents discover, authenticate, communicate, and call tools. The bank still must decide whether this agent may perform this action for this customer, under this purpose, in this session.

The threat model is broader than a bad model

The dangerous version of agentic finance is an agent that acts through legitimate channels using authority no one fully scoped. One failure mode is inherited authority: an agent acting on behalf of a user can receive more access than the user intended to delegate, or more access than the task requires. That violates least privilege even if every credential technically works.

Another failure mode is rubber-stamp supervision. If the agent prepares the payment package, chooses the counterparty, fills the amount, and routes the instruction, supervision fails when the human sees only the final instruction.

There is also a non-malicious version. The agent was not compromised. It was simply too efficient. Given a “liquidity optimization” goal, it liquidates a long-term hedge to cover a short-term gap, creating tax penalties or policy problems because it lacked a tax-aware or hedge-aware authority constraint. In each case, the issue is permission, purpose, and proof, not intelligence.

Prompt injection makes permission a security boundary

Financial agents will read emails, invoices, contracts, tickets, customer messages, web pages, transaction notes, and internal documents. That content will contain errors, manipulations, and instructions designed to hijack the agent’s behavior. When you design the permission boundary, assume the agent will encounter manipulated content. The model cannot be the only control.

Damage boundary

If the agent can only compare invoices, a malicious invoice can distort a summary or comparison, but it cannot create or submit a payment.

If the agent can prepare but not execute, manipulated content can create a bad draft action, but the draft still has to pass a policy check, human gate, or separate submission step.

If the agent requires fresh authentication for tier changes, a retrieve-token cannot silently become an execute-token. The attack has to defeat the control boundary, not only the model.

This does not eliminate risk. An agent authorized to execute a payment can still be tricked into sending funds to a counterparty that appears legitimate. Banks still need counterparty validation, transaction screening, anomaly detection, confirmation design, delay layers for high-risk actions, and payment-rail controls. Authority tiers do one specific job: they limit blast radius when upstream judgment fails.
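The step-up rule can be sketched in a few lines. This is an illustration under stated assumptions: the token class, the tier strings, and the `fresh_auth` flag (standing in for a real re-authentication step) are all hypothetical.

```python
from dataclasses import dataclass

# Tier order is an assumption for this sketch, lowest authority first.
TIERS = ["observe", "retrieve", "recommend", "prepare",
         "initiate", "approve", "execute", "reverse"]


@dataclass(frozen=True)
class SessionToken:
    agent_id: str
    tier: str


def step_up(token: SessionToken, requested: str, fresh_auth: bool) -> SessionToken:
    """Return a token at the requested tier; moving to a higher tier requires fresh authentication."""
    if TIERS.index(requested) > TIERS.index(token.tier) and not fresh_auth:
        raise PermissionError(
            f"{token.tier} token cannot become {requested} without fresh authentication"
        )
    return SessionToken(token.agent_id, requested)
```

A manipulated document can persuade the model to ask for more authority, but the request still fails at `step_up` unless something outside the model re-authenticates.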

The authority ladder

In my experience, the most useful KYA primitive is the authority ladder, not the acronym. Banks must ask which class of action the agent is allowed to take, not whether the agent can access the system.

Tier	Agent can…	Typical control requirement
Observe	Monitor signals or workflow status	Logged access, no sensitive data exposure
Retrieve	Pull approved data or documents	Scoped data access, session expiry
Recommend	Suggest an action	Evidence trail, source display, no state change
Prepare	Draft instructions or assemble a transaction	Policy check, human review before submission
Initiate	Submit an action for approval	Step-up control, separate approval credential
Approve	Authorize a pending action	Segregation of duties, human or policy gate
Execute	Complete the financial action	Strong authentication, limits, audit trail
Reverse	Undo or remediate eligible actions	Elevated approval, incident record

This ladder lets product teams design agent capability without forcing everything into “read-only” or “can act.” It gives risk and compliance teams a way to decide where supervision has to be meaningful.

Initiate and Approve must be separated. Existing payment controls already distinguish creation, confirmation, release, approval, and audit. Agent workflows must preserve that separation through different credentials, a mandatory human gate, or another segregation-of-duties control. If one agent can prepare, initiate, and approve the same transaction, the workflow has automated around the control instead of implementing it.
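One way to make the ladder and the Initiate/Approve separation checkable is sketched below. Treating the tiers as ordered integers is an assumption of this example, and `violates_segregation` is a hypothetical helper, not an existing control.

```python
from enum import IntEnum


class Tier(IntEnum):
    OBSERVE = 1
    RETRIEVE = 2
    RECOMMEND = 3
    PREPARE = 4
    INITIATE = 5
    APPROVE = 6
    EXECUTE = 7
    REVERSE = 8


def violates_segregation(actors: dict[Tier, str]) -> bool:
    """True when the same actor both initiated and approved a transaction."""
    initiator = actors.get(Tier.INITIATE)
    approver = actors.get(Tier.APPROVE)
    return initiator is not None and initiator == approver
```

For example, a transaction where `agent-1` appears under both `Tier.INITIATE` and `Tier.APPROVE` is flagged, while one initiated by one person and approved by another is not.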

A maturity model for agent authority

Banks do not have to solve every layer at once. They do need to know where they are.

Level	State
0	Agent has no separate identity. Actions are buried under user or app credentials.
1	Agent has identity and owner, but permissions are static.
2	Permissions map to the Authority Ladder.
3	Authority changes by session context and risk.
4	Actions are monitored, auditable, interruptible, and revocable.

Level 0 is the dangerous default. The bank may see an action in a log, but it cannot tell whether a person, app, workflow, or agent actually drove it. Level 1 is better, but static permission is still a poor fit for agents that change behavior by context. The real operating model starts when authority maps to action tiers and changes by session risk. These tiers add initial friction, but they accelerate deployment by giving risk functions the confidence to move agents out of permanent read-only sandbox mode.
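A rough self-assessment could look like the sketch below. It assumes the levels are cumulative, which the table implies but does not state, and the parameter names are my own shorthand for each row.

```python
def maturity_level(has_identity_and_owner: bool, maps_to_ladder: bool,
                   session_aware: bool, monitored_and_revocable: bool) -> int:
    """Return the highest level reached, counting upward until a requirement is unmet."""
    level = 0
    for met in (has_identity_and_owner, maps_to_ladder,
                session_aware, monitored_and_revocable):
        if not met:
            break
        level += 1
    return level
```

Under that assumption, an agent with monitoring but no separate identity still scores 0, which matches the point that buried actions are the dangerous default.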

The delegated authority stack

Behind the ladder is a simple stack: every agent needs an owner and purpose, every agent needs a maximum authority tier plus session authority that changes by context, and every action needs audit and revocation. Take the $4.8 million treasury case. A corporate treasury agent owned by the customer’s treasurer has a declared purpose: daily liquidity optimization and payment preparation. In one session, the bank grants Retrieve, Recommend, and Prepare authority, but not Initiate, Approve, Execute, or Reverse.

The agent retrieves bank balances, approved payment templates, treasury policy, ERP payables, and a cash forecast. It recommends funding a supplier payment and prepares a wire package with beneficiary, amount, purpose code, invoice references, sanctions-screening status, liquidity impact, and policy rationale.

Then the boundary matters. The bank’s authorization layer checks the agent ID, owner, business entity, session tier, mandate, amount, beneficiary, time window, data provenance, and whether the action crosses from Prepare into Initiate. If the wire exceeds the approved threshold, the agent cannot submit it. A human treasury initiator reviews the package and submits it through the corporate portal. A different approver, using separate credentials, approves release.

The audit trail links the corporate entity, human owner, agent identity, session ID, input data, recommendation, prepared transaction, blocked initiation attempt, human initiator, human approver, final payment reference, and any revocation event. If the agent reads a malicious invoice or starts behaving outside its purpose, the bank can revoke the agent’s session authority without disabling the customer’s human users.
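The treasury session can be sketched as a check that logs every request, including the blocked one. The class, field names, and log shape are assumptions for illustration; a real audit record would carry far more lineage than this.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    agent_id: str
    owner: str
    granted: frozenset[str]  # tiers granted for this session only
    audit: list[dict] = field(default_factory=list)

    def request(self, tier: str, detail: str) -> bool:
        """Allow the action only if the tier was granted; record the decision either way."""
        allowed = tier in self.granted
        self.audit.append({"agent": self.agent_id, "owner": self.owner,
                           "tier": tier, "detail": detail, "allowed": allowed})
        return allowed


session = Session("treasury-agent-01", "corporate treasurer",
                  frozenset({"retrieve", "recommend", "prepare"}))
session.request("prepare", "wire package for supplier payment")  # granted tier
session.request("initiate", "submit wire")                       # not granted, still logged
```

Logging the refusal matters as much as logging the approval: the blocked initiation attempt is part of the evidence that the boundary held.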

Multi-agent workflows make the authority chain easier to lose. A treasury agent may ask a document-review agent to inspect a contract, then use that result to prepare a payment instruction. Authority must not silently cascade. The Authority Ladder applies per agent, not per workflow: each sub-agent needs its own owner, purpose, tier, audit trail, and revocation path, or it must operate strictly inside the initiating agent’s bounded session authority.
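The second option, a sub-agent operating strictly inside the caller's session, reduces to taking the lower of two tiers. The tier order and the function name are assumptions of this sketch.

```python
# Tier order is an assumption for this sketch, lowest authority first.
TIERS = ["observe", "retrieve", "recommend", "prepare",
         "initiate", "approve", "execute", "reverse"]


def effective_tier(parent_session_tier: str, sub_agent_max_tier: str) -> str:
    """A sub-agent gets the lower of its own ceiling and the calling agent's session tier."""
    return min(parent_session_tier, sub_agent_max_tier, key=TIERS.index)
```

So a document-review agent with an "execute" ceiling, called from a "prepare" session, works at "prepare" and no higher.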

The readiness test

Before allowing an agent to enter a financial workflow, a bank must be able to answer five questions:

Who owns this agent?
What purpose is it allowed to serve?
What is the highest authority tier it can reach?
What context or risk signal changes that authority?
How can the bank audit, interrupt, or revoke it immediately?

If those answers are vague, the agent may still be useful, but it must stay in Observe, Retrieve, Recommend, or Prepare modes until the control environment improves.
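The readiness test can be expressed as a gate. This sketch simplifies "vague" to "missing or blank", and the key names for the five questions are hypothetical.

```python
# Hypothetical keys, one per readiness question.
QUESTIONS = ("owner", "purpose", "max_tier", "context_rules", "revocation")
SUPERVISED_TIERS = ("observe", "retrieve", "recommend", "prepare")


def allowed_tiers(answers: dict[str, str], requested: tuple[str, ...]) -> tuple[str, ...]:
    """If any of the five answers is missing or blank, keep only the supervised tiers."""
    ready = all(answers.get(q, "").strip() for q in QUESTIONS)
    if ready:
        return requested
    return tuple(t for t in requested if t in SUPERVISED_TIERS)
```

In practice a reviewer, not a string check, decides whether an answer is specific enough; the code only shows where the gate sits.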

Banks can start this week with an inventory. List every AI workflow that can access sensitive data or trigger workflow steps. Assign an owner, classify the maximum authority tier, map connected tools, define prohibited actions, require step-up approval for tier changes, and test prompt-injection scenarios. Only then expand from recommendation to preparation or initiation.

The gateway that matters

Agentic finance will be tempting because it promises faster financial work: faster cash movement, faster fraud investigation, faster reconciliation, faster customer service. But finance does not run on speed alone. It runs on authorized action.

You cannot answer whether an agent can move money until you know its owner and purpose. Questions about speed presume that authority tiers are enforced, not just listed. The word “supervised” is hollow if the human cannot see the context and decision path before approving. Existing governance only protects you if it can stop the agent at the moment its authority changes.

The practical posture is supervised delegation first. Let agents gather context, draft work, recommend actions, and prepare workflows. Let humans, policy engines, and separate credentials retain approval for consequential financial actions until the bank can prove that agent authority is bounded, contextual, auditable, monitored, and revocable.

The goal of KYA is to make delegated action provable. Until the bank can identify the agent, bind it to an owner, limit its purpose, tier its authority, and revoke its session, the agent must stay on the recommend side of the authority line.

The AI product is the workflow

estebanf — Fri, 06 Mar 2026 04:44:55 +0000

When you walk into most enterprise AI conversations today, the discussion starts with the model. Is it smart enough? Is the demo impressive enough? Does the feature set cover enough ground? But talk to buyers in operational, customer-facing, or regulated work and the questions shift. What job does this system do? What is it allowed to change? Who reviews the output? What happens when it is wrong? What business metric improves?

Those questions point to something specific. The unit of adoption is the governed workflow, not the model. A productized AI offering is a bounded operating loop with a trigger, an input, an action boundary, a system of record, a human-control point, an exception path, an audit trail, and a measurable outcome. That gap, between a model that runs in a demo and a workflow that runs with controls, is the difference between a pilot and an offer.

Pick the job first, then the AI

When you look at a supply-chain workflow that starts with a driver check-call, the job is clear enough that you can define the AI’s role without much debate. A late arrival triggers a check. If the delay is real, the system can update the shipment record, reschedule a dock appointment, notify the right team, or escalate to a human. The process has a boundary. It has a handoff. It has a point where the system acts and a point where a person remains responsible. That is not a generic “AI for supply chain” story. That is a bounded exception path with rules.

The evidence points in the same direction. Recent research on enterprise AI adoption identifies workflow redesign, human validation rules, KPI tracking, and operating-model integration as the difference between usage and scale. McKinsey’s 2025 survey found broad AI usage but far less scaled adoption. High performers were more likely to redesign workflows and define when humans validate outputs. Stanford’s 2026 enterprise AI research found that in 42% of studied production deployments, the underlying foundation model was fully interchangeable. The advantage sat in the application and orchestration layer.

The same pattern shows up in insurance. Claims intake, first notice of loss, service triage, and document review are credible starting points because they are repetitive, costly, and measurable. They also carry enough structure to support controls. Many complex decisions still route to human approval. That is often what makes the offer acceptable, not a weakness of it.

The best starting point is rarely the largest possible use case. It is the highest-return workflow with the lowest acceptable risk profile. If the work is important but too open-ended, the AI becomes a liability. If the work is safe but trivial, it never earns budget. The sweet spot is where the buyer can see the pain, define the boundary, and accept the control model. That is why “AI for operations” fails as a pitch. It is too wide to govern and too vague to buy.

The model is rarely the real failure

When AI deployments fail, people blame the model, and most of the time that is the wrong diagnosis. An agent can optimize for speed, skip validation, and still create compliance failures, integration failures, customer confusion, or downstream rework. The model may have done exactly what it was instructed to do. The operating assumptions were wrong. The system was optimized for the wrong metric inside an underspecified process. A strong model inside a weak workflow does not create a deployable product. It creates a faster way to make the wrong decision.

The public examples make this plain. McDonald’s ended its IBM AI drive-thru test in 2024 after a narrow, high-volume workflow still proved brittle in the real world. Air Canada was held liable after its chatbot gave a customer incorrect bereavement-fare information, and the tribunal rejected the idea that the chatbot was separate from the company. Deloitte Australia agreed to partially refund a government report after apparent AI-generated errors, including fabricated or unsupported references, entered a high-trust deliverable. Those were control, verification, and accountability problems, not only model problems. The lesson is not that AI should stay in the lab. It is that automation without a control envelope is a risk transfer, not an offer.

Governance is part of the product

For customer-facing, regulated, or action-taking AI, governance cannot be added after the pilot works. A governed AI workflow needs four things: authority, observability, intervention, and accountability.

Authority defines what the system can access and what it can do. Observability records what it saw, decided, changed, and escalated. Intervention gives humans clear points to review, approve, override, or stop the system. Accountability names the owner of the result when the system is wrong.

This is product architecture, not a policy slogan. Microsoft’s agent-governance guidance frames agents as systems that access data, take actions, and operate with delegated authority. That means enterprises need to identify agents, assign ownership, limit access, observe behavior, and stop unsafe agents. Runtime authorization matters because the real question is not only whether a model response is safe. The question is whether this specific action should execute now, under this identity, policy, approval state, data boundary, and business context.

The better enterprise workflows already reflect that logic. Morgan Stanley’s AI Debrief is a clean example. A client meeting happens. With consent, the system captures notes, surfaces action items, drafts a follow-up email, lets the advisor edit and send, and saves a note into Salesforce. Morgan Stanley also reported 98% adoption of its earlier AI assistant across financial-advisor teams. The workflow is trusted because the control points are visible: consent, advisor review, discretionary send, and CRM persistence.

Governance also needs one more distinction. Human review is not automatically meaningful review. A human approval gate can solve accountability without solving quality if the human only rubber-stamps the machine. Automation-bias research and health-insurance commentary both show that humans can over-rely on automated recommendations. The control model has to define what kind of review is required: a quick edit, a compliance check, a clinical judgment, an operational override, or a full decision by a licensed person.

“Human in the loop” is too vague. The buyer needs to know which human, at which point, with what authority, and with what evidence.

Who carries the cost of error

Regulation is not a uniform force across all AI markets. The pressure is strongest where the workflow affects consequential decisions, regulated rights, customer harm, or public-facing obligations.

California’s SB 1120 shows what workflow-level regulation looks like in practice. In health-insurance utilization review, AI may inform an adverse benefit determination, but a licensed clinician must review the decision. The EU AI Act applies a similar logic more broadly through risk-tiered obligations tied to specific use cases. Colorado’s AI Act pushes high-risk deployers toward risk management, impact assessment, and human appeal.

This is the procurement reality. In consequential workflows, the buyer will ask who reviewed the output, who can override it, what evidence remains, what happens when the model changes, and who carries the cost of error. That last question has moved from theory into contract language, regulation, and litigation. Recent enterprise AI contract patterns include AI-specific addenda, data-training restrictions, model-change notice requirements, AI system registers, and human-review obligations in acceptance criteria. The point is not that every buyer has mastered AI liability. Error ownership has become an active procurement variable, and that matters more than legal sophistication.

Healthcare litigation around AI-assisted coverage decisions shows the risk pattern. A bounded workflow can still create liability if review, escalation, and accountability are weak. Existing laws can assign responsibility even before AI-specific law matures.

For vendors, this changes packaging. A productized AI offer needs to say what it does, where it stops, and who owns the downside. A demo cannot answer that. A workflow design can.

Offer-ready is more than workflow-specific

A narrow AI idea is not automatically a product. McDonald’s proved that a bounded, measurable, and high-volume workflow can still fail when the real-world environment is noisy and the edge cases are ugly. Klarna showed a different version of the same lesson. Its customer-service AI initially looked like a flagship narrow-workflow success, handling two-thirds of customer-service chats in its first month. Later reporting showed that the company had to soften the replacement narrative because some customers preferred humans and complex issues still needed human agents.

These workflows were not too broad. The real issue was that workflow specificity alone did not settle the operating model. A narrow AI idea becomes offer-ready only when the vendor can define the job, the inputs, the systems touched, the action boundary, the exception path, the human-control point, the audit record, the KPI, and the failure owner. Use this test before you call an AI idea an offer.

The minimum viable AI offering canvas

What exact workflow, decision, or handoff improves?
Who uses it, funds it, and owns the result?
What does the AI do: recommend, draft, classify, trigger, execute, or coordinate?
What systems does it read from, write to, or change?
Where must a human review, approve, edit, override, or intervene, and what makes that review substantive?
What controls are required: permissions, logs, audit trail, escalation, and revocation?
What baseline, target metric, and time to impact define success?
What is the cost of error, and who carries it?

The canvas is not asking whether the AI is impressive. It asks whether the offer is commercially legible, governable, measurable, and priced to absorb the risk it creates. If you cannot answer those questions cleanly, you do not have a product. You have a promising capability looking for a job.
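If you want to run the canvas as a checklist, a small helper can list what is still unanswered. The field names are my own shorthand for the eight questions, not part of the canvas itself.

```python
# Hypothetical short names for the eight canvas questions, in order.
CANVAS = (
    "workflow", "user_funder_owner", "ai_action", "systems_touched",
    "human_control", "controls", "success_metric", "cost_of_error",
)


def open_questions(canvas: dict[str, str]) -> list[str]:
    """Return the canvas fields that are still missing or blank, in canvas order."""
    return [key for key in CANVAS if not canvas.get(key, "").strip()]
```

An empty list does not prove the offer is ready, since an answer can be filled in and still be weak, but a non-empty list is a quick signal that it is not.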

The market rewards bounded operating loops

The strongest examples are products wrapped around a bounded job. C.H. Robinson’s missed LTL pickup workflow is the cleanest case. A pickup is missed. The system checks the situation, decides the next step, calls the carrier, and pushes the freight back into motion. The value is an exception-resolution loop tied to visible operating metrics, not a claim about broad intelligence. C.H. Robinson reported that 95% of checks were automated, more than 350 hours of manual work were saved per day, freight moved up to a day faster, and unnecessary return trips fell by 42%.

The reason this example works is not that logistics is uniquely suited to AI. It works because logistics exceptions already have a shape: detect the exception, contact the responsible party, decide the next action, update the system of record, and escalate when the path breaks. AI fits because the operating loop already exists.

FourKites and project44 frame the same idea from another angle. Their products focus on carrier follow-up, ETA-triggered appointment changes, delayed-shipment rescheduling, document collection, freight-audit exceptions, and carrier onboarding. The value sits after the exception fires, when the system has to coordinate across tools, rules, and people.

Claims intake and first notice of loss show the same pattern. They are constrained, repetitive, and costly. They have enough volume to justify automation and enough structure to support controls. The buyer can imagine the job before the pitch is over.

The product becomes legible when the buyer can answer four questions in less than a minute:

What job does it do?
Where does it stop?
When does a human step in?
What outcome improves?

If the vendor cannot answer those questions, the offer is not ready.

Horizontal platforms still matter, but value is vertical

The argument is not that buyers only buy vertical AI. That would be too simple.

Horizontal platforms still attract major enterprise spend. Some companies want broad access first so employees can experiment, standardize identity controls, connect internal data, and build use cases on top. JPMorganChase’s LLM Suite is a good counterexample to any claim that buyers only buy one workflow at a time. Large enterprises with strong security, model-risk, and engineering capacity can buy or build a governed platform first, then attach it to workflows later.

But platform access is not the same as scaled operating value. The commercial unit is still the bounded job, even when the technical substrate is horizontal. Enterprises may sanction AI horizontally, but they scale it vertically. They may buy a platform for access, experimentation, and shared infrastructure. They count value when that platform changes a bounded process with an owner, a control model, and a metric.

Even the commercially strong platforms tend to win inside work surfaces. Salesforce works through CRM processes. Microsoft Copilot works best in roles where the work, data, and collaboration layer already live inside the Microsoft stack. Those are platform businesses, but their adoption still becomes concrete through jobs: qualify a lead, summarize a meeting, draft a response, prepare an analysis, resolve a case, update a record.

Broad platforms create access. Governed workflows create accountable deployment. Both matter, but only one carries the commercial outcome.

Start with the work, not the model

If you are evaluating an AI idea, do not start with the model. Start with the work.

Ask whether the job is narrow enough to govern, valuable enough to fund, and stable enough to support a clear human-control model. Ask whether the data exists, whether the exception categories are understood, and whether the buyer already tracks the metric that should improve. Ask who owns the result when the system is right, and who carries the cost when it is wrong.

If those answers are clear, the idea may be offer-ready. If they are not, the model is not the problem. The offer is not ready.

Most teams keep trying to sell intelligence. The buyer is trying to buy control.

The new AI SaaS lock-in is operational

estebanf — Tue, 20 Jan 2026 15:44:07 +0000

One of the most overlooked questions in an AI SaaS evaluation is not “what features are included.” It is “what happens when we need this vendor to stop?”

That question sounds defensive until you watch a buying team realize what the demo is really showing. A customer-service platform is no longer only summarizing tickets or drafting replies. It is checking order status, updating CRM fields, issuing refunds under a policy threshold, escalating exceptions, and reporting resolution metrics. The feature checklist looks good, the integration plan looks manageable, and the ROI model looks plausible. But the real decision sits underneath the checklist. The buyer is deciding which part of the operation the vendor will improve, coordinate, execute, or quietly become responsible for.

In my experience, AI SaaS buyers need a new buying lens. They should evaluate vendors on two axes at the same time: the operational role the vendor plays and the amount of work the vendor absorbs. The first determines where value appears. The second determines how hard the vendor will be to govern, pause, replace, or exit. AI SaaS does not make old lock-in disappear. In some deployments, it extends lock-in into the operating model. That is workflow absorption.

The risk is the delegation, not the agent

Traditional SaaS buying treated the application as the unit of value. The buyer asked familiar questions: does the product have the features we need? Does it integrate with our stack? What does it cost per seat? Will users adopt it? Can we export our data if we leave? Those questions still matter, but they are incomplete when the vendor starts acting inside the workflow.

AI is pushing SaaS value away from manually used applications and toward workflow-level outcomes. Applications move behind the scenes. The workflow becomes the front door. Agents coordinate work, trigger actions, reconcile data, manage exceptions, and sometimes participate in decisions. A customer-service platform may no longer be just a better ticketing tool if its agents resolve requests across systems. A finance platform may no longer be just an expense dashboard if it reviews transactions, enforces policy, flags fraud, approves low-risk spend, and logs the decision. A healthcare revenue-cycle platform may no longer be just administrative software if it checks claim status, works payer queues, and updates the EHR workflow.

This is why “AI SaaS” is too broad a category to guide buying decisions. A dashboard, a routing layer, a task-performing agent, and an expert-assisted decision system create different kinds of value. They also create different kinds of dependency.

Absorption occurs when a vendor becomes responsible for a repeatable operational function, rather than merely supporting the people who perform it. The threshold is easy to miss. A customer-service AI starts by summarizing cases. The vendor improves visibility. Then it recommends the next-best action, drafts a response, and routes the case to the right queue. The vendor now coordinates work. Then it updates a CRM record, changes an order, issues a refund, or closes the case. The vendor executes work. Then leaders rely on its resolution metrics, exception queues, escalation rules, and performance dashboards to manage the function. The vendor is no longer just a tool inside the workflow. It is part of how the workflow runs.

Public examples point in this direction, with an important caveat: the evidence supports bounded execution more than full autonomy. Salesforce Agentforce materials and customer-service analysis describe agents that can handle order status, refunds, troubleshooting, appointment scheduling, case updates, and profile updates. Intercom’s Fin has moved its pricing language from resolutions toward outcomes, including workflows where the agent gathers context, executes configured procedures, and hands off when policy requires a person. ServiceNow-like workflow platforms increasingly sit above ERP, CRM, and ITSM systems, turning the AI layer into the place where work is assigned and measured. Banking and payments examples now include agents involved in re-underwriting documents, payment dispute resolution, and transaction execution.

The claim is not that agents have taken over whole enterprises. They have not. What is changing is that agents are beginning to execute bounded tasks inside live workflows. Once that happens, the buying decision changes. The buyer has moved from software adoption to operational delegation.

The vendor absorption grid

Buyers need a simple way to name what they are delegating. The Vendor Absorption Grid starts with the vendor’s role.

Vendor role	What the vendor does	Main dependency	Buyer’s minimum control
Visibility	Shows what is happening	Reporting dependency	Data definitions, source access, metric lineage
Coordination	Routes and prioritizes work	Orchestration dependency	Workflow documentation, escalation rules, override paths
Execution	Performs tasks in systems	Continuity dependency	Action logs, rollback, failure recovery, service levels
Judgment	Recommends or participates in decisions	Accountability dependency	Validation rules, approval thresholds, liability terms, decision logs

Then the grid asks how much work the vendor absorbs into its own operating model.

Absorption level	Meaning	Buyer question
Assist	Helps humans do the work	What productivity gain does it create?
Orchestrate	Coordinates how work moves	Can we see and change the workflow logic?
Execute	Performs defined tasks	Can we audit, reverse, and recover actions?
Absorb	Becomes responsible for continuity of the function	Could we still operate if the vendor stopped?

The grid is a diagnostic, not a maturity model. Not every buyer should race toward absorption.
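One way to make the grid usable in a vendor review is to encode the two tables as data and look up a cell. The sketch below is illustrative only: the values are copied from the tables above, but the names `GridCell` and `review_brief` are hypothetical and not part of any tool.

```python
from dataclasses import dataclass

# Minimum controls per vendor role, copied from the first table.
ROLE_CONTROLS = {
    "visibility": ["data definitions", "source access", "metric lineage"],
    "coordination": ["workflow documentation", "escalation rules", "override paths"],
    "execution": ["action logs", "rollback", "failure recovery", "service levels"],
    "judgment": ["validation rules", "approval thresholds", "liability terms", "decision logs"],
}

# Buyer question per absorption level, copied from the second table.
ABSORPTION_QUESTIONS = {
    "assist": "What productivity gain does it create?",
    "orchestrate": "Can we see and change the workflow logic?",
    "execute": "Can we audit, reverse, and recover actions?",
    "absorb": "Could we still operate if the vendor stopped?",
}


@dataclass(frozen=True)
class GridCell:
    """One cell of the grid: a vendor role paired with an absorption level."""
    role: str
    absorption: str

    def __post_init__(self):
        if self.role not in ROLE_CONTROLS:
            raise ValueError(f"unknown vendor role: {self.role}")
        if self.absorption not in ABSORPTION_QUESTIONS:
            raise ValueError(f"unknown absorption level: {self.absorption}")


def review_brief(cell):
    """Return the minimum controls and the buyer question for a grid cell."""
    return {
        "minimum_controls": ROLE_CONTROLS[cell.role],
        "buyer_question": ABSORPTION_QUESTIONS[cell.absorption],
    }
```

Calling `review_brief(GridCell("execution", "execute"))` returns the four execution controls and the audit, reverse, and recover question. The value is in forcing the team to name the cell before the demo, not in the code.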

A visibility vendor may help leaders see bottlenecks faster. The dependency is usually manageable. A coordination vendor may improve prioritization and handoffs, but it also starts to shape who gets work, what gets escalated, and which cases receive attention. An execution vendor may expand capacity, but it creates continuity risk when the system updates records, triggers payments, closes tickets, or approves work.

The scrutiny changes most when a vendor participates in judgment. If the agent recommends eligibility decisions, refund approvals, dispute handling, risk scoring, claims handling, or pricing changes, the buyer needs to know what evidence a human sees, when a human can intervene, and who is accountable when the outcome is disputed. That does not make absorption wrong. In some workflows, a deeply embedded vendor is the rational choice. The vendor may be more capable, more consistent, and more accountable than a fragmented stack of tools and internal handoffs. But the buyer should make that trade-off explicitly.

What changes from classic lock-in

Enterprise buyers already know lock-in. ERP, CRM, ITSM, cloud, billing, procurement systems, RPA, BPO, and managed-services providers have always shaped processes, created switching costs, and given vendors control. AI SaaS does not invent dependency.

What changes is the compression of software, workflow orchestration, decision support, and managed execution into the same vendor surface. Some vendors are beginning to demonstrate pieces of this capability in bounded workflows: interpreting intent, selecting tools, routing work, updating records, escalating exceptions, and measuring outcomes. That compression matters because the system of record becomes less visible to the user. ERP, CRM, ITSM, billing, and support systems may remain underneath. But the AI workflow layer becomes the place where work is assigned, executed, reviewed, and measured.

Classic SaaS lock-in often centered on data, contracts, user adoption, and configuration. AI-era dependency can also span models, orchestration frameworks, runtime environments, prompts, policies, workflow configurations, agent behavior, action history, escalation patterns, and human operating routines.

Connection is not portability. A protocol may help one agent call another tool. It does not automatically make agent memory portable, preserve workflow configuration, transfer escalation habits, recreate behavioral calibration, or tell a new vendor why the old system routed a sensitive case a certain way.

The practical test is simple, even if the answer is difficult: could another vendor operate the workflow with the exported data, configuration, logs, and process documentation, or could it merely connect to the same systems? For many current AI SaaS deployments, full operational portability will be aspirational. The test is still useful because it tells the buyer which dependency is being accepted.

Concentration risk versus fragmentation risk

Many CIOs are not trying to add more composability. They are trying to reduce tool sprawl. That matters because the optionality argument has a serious counterargument: one accountable vendor can be less risky than a scattered stack of tools with fragmented ownership.

Incumbents know this. SAP, Microsoft, Salesforce, ServiceNow, and others pitch unified AI platforms as risk reducers. They are not always wrong. A single vendor can reduce coordination cost, simplify procurement, standardize identity and permissions, unify data models and support, and give the buyer one accountable party when outcomes fall short. But consolidation changes the risk. It does not erase it.

The trade-off is concentration risk versus fragmentation risk. Fragmentation risk asks: can we make all these tools work together, and who owns the outcome when they do not? Concentration risk asks: what happens when one vendor controls the workflow layer, the data context, the agent behavior, the pricing meter, and the roadmap? A buyer can rationally choose concentration. But it should be a conscious choice, not the accidental result of accepting the easiest AI bundle from an incumbent.

Pricing becomes a control surface

Traditional SaaS pricing was often frustrating, but it was legible. Seat counts, modules, contract tiers, and renewal uplifts gave buyers something to model. Agentic SaaS puts pressure on that model. If agents execute work that used to require named users, vendors will price less around access and more around usage, actions, resolutions, outcomes, or premium AI capacity.

That can align incentives. A vendor paid for resolved cases has a reason to improve resolution, just as a vendor paid for recovered revenue has a reason to increase yield. But the same model can create financial risk. If cost scales with agent activity, a successful deployment can create a budget problem. If pricing depends on outcomes, the buyer needs a clean way to define attribution. When the agent acts across several systems, who gets credit for the result? Who carries the cost when it makes an error? Outcome pricing becomes risky when the vendor’s commercial meter scales with the buyer’s operational dependence.

Pricing terms should become part of the governance negotiation, not a separate conversation:

Pricing question	Why it matters
What exactly counts as an action, resolution, or outcome?	The meter must be measurable and disputable.
How does cost scale with automation volume?	More successful automation can mean more variable spend.
Are there caps, bands, or approval thresholds?	Production workflows need budget control.
What triggers renegotiation?	New use cases can change economics fast.
Can the buyer audit the meter?	Outcome pricing fails when measurement is opaque.
Does price protection survive deeper adoption?	Embedded vendors gain renewal power.

Outcome-based pricing can be useful when the vendor truly owns part of the result and the contract makes accountability clear. It becomes risky when the vendor gets variable upside without accepting operational responsibility.
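To see why caps and approval thresholds belong in the pricing conversation, here is a minimal sketch of a per-resolution meter. Every number in it is hypothetical (the price, the volumes, the cap, the threshold), and so are the function names. It only shows how spend scales when automation succeeds.

```python
def metered_spend_cents(resolutions, price_cents, cap_cents=None):
    """Monthly spend in cents for a per-resolution meter, optionally capped."""
    raw = resolutions * price_cents
    if cap_cents is None:
        return raw
    return min(raw, cap_cents)


def needs_approval(resolutions, price_cents, approval_threshold_cents):
    """True when uncapped spend would cross the buyer's approval threshold."""
    return resolutions * price_cents > approval_threshold_cents


# Hypothetical scenario: $1.00 per resolution, volume triples after rollout.
pilot = metered_spend_cents(20_000, 100)                     # 2,000,000 cents ($20,000)
scaled = metered_spend_cents(60_000, 100)                    # 6,000,000 cents ($60,000)
scaled_capped = metered_spend_cents(60_000, 100, 4_000_000)  # held at $40,000 by the cap
```

In this scenario the deployment that works best costs three times the pilot unless a cap or band is negotiated up front. The same arithmetic is what a buyer would need to reproduce to audit the vendor's meter.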

Contracting should match delegation

Procurement teams already know how to negotiate uptime, data export, termination rights, and renewal terms. Those still matter. But they miss the action-level concerns that appear when agents execute work. The contract needs to answer what the vendor can do, under whose authority, in which systems, up to what threshold, with what record, and with what recovery path.

Vendor role	Buying focus	Contract focus
Visibility	Can we trust the operational picture?	Metric definitions, data lineage, source access, audit rights
Coordination	Can we see and override workflow logic?	Routing rules, escalation rights, change notice, configuration export
Execution	Can we audit, reverse, and recover actions?	Action logs, rollback, kill-switch rights, recovery SLAs, failure drills
Judgment	Who remains accountable?	Approval thresholds, decision records, validation rules, liability allocation

Before giving an agent execution rights, the buyer should ask a more specific set of questions:

What actions can the agent take?
In which systems can it act?
Does it act under a user identity, service identity, or vendor identity?
What dollar, customer, legal, or operational thresholds limit its authority?
What actions are prohibited?
Where are action logs stored?
Can the buyer pause the agent immediately?
Can the buyer reverse an action?
What happens if the vendor fails mid-workflow?
Can the team run manually for 30 days?

These are not policy questions in disguise. They are operating requirements. An agent that can issue a refund needs refund thresholds, audit logs, exception rules, reversal paths, and dispute rights. One that approves expenses needs policy logic, fraud flags, approval records, human escalation, and liability allocation. One working a healthcare claim queue needs traceability, payer-rule updates, compliance controls, human review points, and recovery plans when automation fails. The deeper the delegation, the more the contract must move from software access to operational control.
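Several of the questions above can be expressed as a guard that runs before any agent action. The sketch below assumes a hypothetical `AgentPolicy` object with made-up action and system names. It is not any vendor's API, only an illustration of scope limits, a refund threshold, a pause switch, and an action log living in one place.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentPolicy:
    """Hypothetical delegation policy for one agent."""
    allowed_actions: set
    allowed_systems: set
    max_refund_cents: int
    paused: bool = False                      # the kill switch
    action_log: list = field(default_factory=list)

    def authorize(self, action, system, amount_cents=0):
        """Decide whether an action may run, and log the decision either way."""
        if self.paused:
            decision = "blocked: agent paused"
        elif action not in self.allowed_actions or system not in self.allowed_systems:
            decision = "blocked: outside delegated scope"
        elif action == "issue_refund" and amount_cents > self.max_refund_cents:
            decision = "escalate: above refund threshold"
        else:
            decision = "allow"
        # Every request is recorded, including the ones that were refused.
        self.action_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "system": system,
            "amount_cents": amount_cents,
            "decision": decision,
        })
        return decision
```

With a policy that allows `issue_refund` in `billing` up to 5,000 cents, a 2,500-cent refund is allowed, a 7,500-cent refund is escalated, and everything is blocked once `paused` is set. The design choice worth copying is that the log is written on every path, so the buyer can reconstruct refusals as well as actions.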

Minimum governance before the vendor acts

Governance is often treated as a brake on AI adoption. In agentic SaaS, governance is what lets the buyer safely move from assistance to operational value.

Not every buyer can run an enterprise-grade AI governance program on day one. Many midmarket teams lack the architecture staff, procurement maturity, and data governance required for a full optionality review. That does not mean they should ignore the problem. It means they should tier it.

Minimum viable governance should cover the first set of risks before anything goes live. Most teams can start with these seven checks:

Area	Minimum question
Workflow scope	What exact tasks can the agent perform?
Human control	Where can a person pause, override, or reverse the action?
Escalation	What cases must go to a human?
Audit trail	Can we see what the agent did and why?
Cost control	Can spending exceed plan without approval?
Data export	Can we retrieve operational data and decision history?
Failure mode	What happens if the vendor is down?

The buyer does not need to solve every future problem before signing. But the buyer does need to know which problems are being deferred. Use the grid before the contract is signed, not after implementation.

Start by placing the vendor in one cell based on what it will do on day one. Then place it in the cell the roadmap points toward. The gap between those two cells is where dependency grows unnoticed. A vendor may begin by orchestrating (improving prioritization and routing cases across teams), but the roadmap may point toward executing (updating records, issuing refunds, resolving cases, and reporting outcomes). That gap should trigger a different review. The buyer should negotiate for tomorrow’s operating role, not only today’s feature set.
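The day-one versus roadmap comparison can be sketched as a small lookup. Two assumptions here are mine, not part of the grid: that the absorption levels are ordered as the table lists them, and that the buyer questions accumulate as a vendor moves deeper. The function name is hypothetical.

```python
# Absorption levels in table order, each with its buyer question.
ABSORPTION_LEVELS = [
    ("assist", "What productivity gain does it create?"),
    ("orchestrate", "Can we see and change the workflow logic?"),
    ("execute", "Can we audit, reverse, and recover actions?"),
    ("absorb", "Could we still operate if the vendor stopped?"),
]


def questions_to_negotiate(day_one, roadmap):
    """Questions raised by the roadmap level that the day-one level does not raise."""
    names = [name for name, _ in ABSORPTION_LEVELS]
    start = names.index(day_one)
    end = names.index(roadmap)
    # Empty when the roadmap does not go deeper than day one.
    return [question for _, question in ABSORPTION_LEVELS[start + 1 : end + 1]]
```

For a vendor that starts at `orchestrate` and is heading toward `execute`, the function returns the audit, reverse, and recover question. That is the question to settle in the contract now, while the vendor is still only routing work.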

The four questions for the next demo

The answer is not to avoid AI SaaS that absorbs work. Many buyers should use it. The value can be real: faster resolution, fewer manual updates, cleaner handoffs, better exception handling, and one accountable vendor.

The answer is to name the delegation before it becomes invisible. For the next AI SaaS demo, the test is not only “can the agent do the work?” It is these four questions:

Show us the kill switch.
Show us the action log.
Show us the manual fallback plan.
Show us the export path for workflow logic, not just data.

Until the vendor can answer those questions, do not let the agent execute. The test for AI SaaS buyers is simple: if the vendor stopped tomorrow, could the organization still perform the workflow, explain past decisions, control costs, and migrate the operating logic? If the answer is no, the buyer may still choose the vendor. But it should price, govern, and contract for that dependency before the vendor becomes the workflow.

The AI moat is moving to the last mile

estebanf — Thu, 08 Jan 2026 14:43:22 +0000

As frontier models converge for a growing class of work, raw AI capability explains less of the difference between products. That is a narrower claim than it sounds. Model quality still matters. Accuracy matters. Latency matters. Long-context reasoning, code generation, multimodal performance, tool use, and safety behavior can decide whether a product works at all. But for many teams, the strategic question is no longer, “Can we get access to capable AI?” It is, “Can we turn that capability into a workflow users care about, trust, repeat, and eventually cannot replace?”

That is where product sense stops being a soft skill. In the AI era, product sense becomes product judgment: the operational judgment that decides where intelligence belongs, what context it needs, which data should enter the system, when humans stay accountable, and what durable asset the product compounds over time. AI strategy is expanding beyond model access into last-mile judgment, not moving away from models.

The feature that does not change the work

Picture a product team with access to the same frontier models as its competitors. The team ships an AI assistant into an existing B2B product. It can summarize records, answer questions, draft emails, and generate reports. The demo looks good. Users try it once. Then they go back to the old workflow.

The model did not fail. The product never changed the work. It sat beside the workflow instead of inside it. It did not know the user’s permissions, the current customer state, the exception history, the source-of-truth data, or the moment when a decision had to be made. It created another surface to check. This is a common failure mode when AI features are deployed without redesigning the work: impressive capability, weak product judgment.

A better team asks a different question. Not “Where can we add AI?” but “What recurring workflow is expensive, frequent, painful, or strategically important enough that intelligence should live there?” This shift matters because AI now touches nearly every part of the product lifecycle: discovery, design, prototyping, coding, testing, analytics, feedback, competitive analysis, sales, marketing, and operations. Open-weight models are improving. Enterprises increasingly use multiple models by use case. In a 2025 a16z CIO survey, 37% of respondents reported using five or more models in production. That statistic does not prove models are interchangeable. It does show that enterprise AI stacks are becoming multi-model, routed, and use-case specific. In that world, the ability to orchestrate models around proprietary context, workflow constraints, and measurable outcomes becomes a distinct source of value. The model becomes a component. The moat moves to the system around it.

Product sense is not taste

People often reduce product sense to taste: good UX instinct, customer empathy, prioritization, and a feel for what should be built. Those still matter. But AI stretches product sense into something more operational.

In an AI product, product judgment means deciding which workflows deserve intelligence, which context improves or degrades the answer, when the AI should act or stay silent, where human verification must remain, and which part of the value chain the company must own. This is why context has become a product decision, not just an engineering concern. Anthropic’s work on context engineering frames the problem clearly: agents perform better when systems curate what enters memory, what gets retrieved, what gets summarized, and what stays out. That is product judgment.

More context is not automatically better. A financial AI assistant that remembers every conversation may feel powerful until it uses stale preferences, exposes sensitive information, or treats a passing comment as a permanent instruction. The product has to decide what should be remembered, forgotten, hidden, surfaced, and audited. The same logic shows up in Notion’s AI strategy. A generic chatbot can answer questions. An agent with workspace context can help because it understands the documents, projects, owners, permissions, and history inside the user’s real environment. The value comes from connected context, not from the chat box.

Claude in Excel is another useful example. The opportunity is not “AI spreadsheet chat.” The opportunity is intelligence inside a painful, high-frequency workflow. A user is already modeling, reconciling, checking formulas, cleaning data, and preparing analysis in Excel. Asking that user to move into a separate AI interface adds friction. Meeting the user inside the spreadsheet changes the work. This is the product-judgment move: find the work as it really happens, then decide where intelligence reduces effort or improves judgment.

Data alone is not the moat

The easy counterargument is that product judgment matters less than proprietary data. Sometimes that is true. Incumbents with decades of enterprise data, distribution, compliance trust, and system-of-record status can have advantages a sharper startup cannot overcome by taste alone. In commerce, Amazon and Google have behavioral and advertising data that can shape agentic shopping experiences. In vertical AI, domain-specific interaction data and evals can make a smaller specialized system outperform a general one.

But “we have data” is not a strategy. Data creates advantage only when it is relevant, fresh, governed, accessible at the right moment, and tied to a feedback loop that improves the product. Raw data ownership does not help if the product cannot decide which data matters, how to retrieve it, how to protect it, and how to convert it into a better outcome. That is why the strongest AI products combine product judgment with structural assets.

A shopping agent does not win because it can generate a pleasant recommendation. It wins if it understands intent better than a browsing interface: price range, fit, size, availability, support, delivery timing, brand preference, return risk, and the shopper’s past behavior. Retailers can use clickstream data, carts, purchases, ignored results, and returns to understand which products matter in which contexts. The insight is not “add AI to shopping.” It is that shopping is full of hidden constraints. The product has to read them.

The same principle applies in enterprise software. Products that embed into core workflows, connect to proprietary data, and build workflow-native integrations create advantages that are harder to copy than a polished assistant. Products that generate valuable usage data can improve their evals, tune workflows, reduce errors, and create better automation loops. Taste without asset creation is fragile. A good AI wrapper can become a durable company, but only if it compounds into something model providers or incumbents cannot easily absorb: trusted execution, workflow position, proprietary data, distribution, domain-specific evals, implementation knowledge, or an agent-readable system.

The new strategic question: what must we own?

A company building with AI now faces a more precise build-versus-buy decision. It can buy models, infrastructure, and open-weight systems. It can route tasks across vendors, expose tools through MCP servers, build custom agents, wire proprietary data into an existing model, or hire forward-deployed teams to turn customer needs into working systems. The strategic question is not “Which AI should we use?” It is “Which part of the value chain must we own for this product to become defensible?”

For some products, the answer is the data layer. An agent-readable data layer lets AI clients read, query, reason about, write to, and connect data without scraping a human UI. That can become a structural advantage as agents become a normal way users interact with software. For others, the answer is workflow integration. If models are good enough for the task, the company that owns the workflow, permissions, history, and user trust may win. For regulated industries, the answer often includes governance. Healthcare AI adoption can fail when a system disrupts clinical workflow, adds documentation burden, or creates unclear accountability. In finance, legal, healthcare, and insurance, trust, auditability, retention, privacy, and policy enforcement are product features. They decide whether the product can be deployed.

For enterprise AI, the answer may be implementation. The spread of forward-deployed models shows a practical truth: many customers do not have an AI technology problem. They have a business implementation problem. They need the system to fit their workflows, data locality, latency requirements, compliance rules, approval paths, and outcome targets.

This also cuts the other way. Incumbents already own many workflows. Model providers are moving into memory, retrieval, evals, tool-use frameworks, and enterprise services. If they own more of the surrounding system, the last mile does not automatically belong to startups or application companies. Last-mile judgment is a test of where advantage can actually compound, not a slogan for the application layer. The value comes from how models, tools, data, agents, human judgment, and business processes work together inside the customer’s actual environment.

Product judgment also creates speed

There is another reason product judgment matters more as AI gets faster: bad paths get cheaper too. AI lets teams generate prototypes, write code, test variants, analyze feedback, and produce collateral faster. That speed helps only if the team knows what to ignore.

Strong product judgment compresses decision time. It helps a team eliminate the chatbot bolted onto the side of the product, reject the custom model when existing models are sufficient, avoid collecting context that creates privacy risk, and choose a narrow workflow agent with clear escalation rules instead of a general agent with vague authority. Product judgment does not compete with speed. It makes speed useful. Without judgment, AI accelerates activity. With judgment, it accelerates learning.

The last-mile AI strategy test

Before investing in an AI product bet, ask five questions. These questions separate an AI feature from a potential product advantage.

1. Workflow

What recurring workflow are we changing? The workflow should be frequent, expensive, painful, risky, or strategically important enough to matter. If the AI does not change how work gets done, it will compete with every other assistant for attention.

Weak answer: “Users can ask questions about their data.” Strong answer: “Analysts spend three hours every Monday reconciling forecast changes across spreadsheets, CRM exports, and finance assumptions. The product identifies discrepancies, explains variance, and prepares the review packet inside the spreadsheet workflow.”

2. Context

What does the AI know here that a generic model would not know? This can include user history, company data, files, permissions, behavior, domain rules, transaction state, workflow timing, or live system data. But context needs boundaries. The product must decide what to include, persist, retrieve, summarize, hide, or forget.

Weak answer: “We connect the model to all company data.” Strong answer: “For this renewal workflow, the AI sees the contract terms, support history, usage trend, approval policy, and current account owner. It does not see unrelated employee records, stale notes, or private conversations that cannot improve the decision.” A generic assistant has capability. A product with the right context has relevance.
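The strong answer amounts to an allowlist per workflow. A minimal sketch, assuming a hypothetical account record held as a plain dictionary and field names I made up to mirror the renewal example:

```python
# Fields the renewal workflow is allowed to see (names are illustrative).
RENEWAL_CONTEXT_ALLOWLIST = {
    "contract_terms",
    "support_history",
    "usage_trend",
    "approval_policy",
    "account_owner",
}


def scope_context(record, allowlist):
    """Keep only the fields this workflow is allowed to pass to the model."""
    return {key: value for key, value in record.items() if key in allowlist}


# Hypothetical record that mixes useful and out-of-scope fields.
account_record = {
    "contract_terms": "12 months, auto-renew",
    "usage_trend": "down 18% quarter over quarter",
    "account_owner": "j.rivera",
    "employee_records": "out of scope for renewals",
    "private_conversations": "out of scope for renewals",
}

renewal_context = scope_context(account_record, RENEWAL_CONTEXT_ALLOWLIST)
```

An allowlist fails closed: a new field added to the record stays out of the prompt until someone decides it improves the renewal decision. A blocklist would have the opposite default.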

3. Judgment

Where do humans still need taste, accountability, verification, or override authority? This question matters most in high-stakes workflows. AI can expand analysis, draft options, find patterns, and recommend actions. But accountability stays with people and institutions.

Weak answer: “The system decides automatically unless the user stops it.” Strong answer: “The system drafts the recommendation, cites the inputs, flags uncertainty, and routes exceptions to the accountable owner before anything changes in the system of record.” A decision product needs guardrails, escalation paths, audit trails, and clear authority. It should know when to act, when to ask, when to explain, and when to stop. If no one can reconstruct why the system made a recommendation, the product is not ready for serious work.

4. Asset

What durable advantage compounds over time? The answer cannot be “better UX” alone. The product needs an asset that gets stronger with use. That asset could be proprietary data, usage loops, domain-specific evals, workflow integration, trust, distribution, implementation knowledge, agent-readable infrastructure, or customer-specific outcome history.

Weak answer: “Our interface is easier to use than the generic assistant.” Strong answer: “Every completed workflow improves our exception library, eval set, approval logic, and customer-specific outcome history, making the next recommendation more accurate and easier to audit.” This is where many AI wrappers fail. They solve a visible pain, but they do not accumulate anything structural. A competitor with better distribution, deeper data, or model-level access can copy the surface.

5. Outcome

What measurable customer result improves? The answer should be concrete: time saved, revenue gained, errors reduced, decisions improved, compliance strengthened, work removed, cycle time shortened, or margin improved.

Weak answer: “The product makes the team more productive.” Strong answer: “The product cuts the weekly reconciliation cycle from three hours to thirty minutes, reduces formula errors, and gives the finance lead an auditable packet before the forecast meeting.” This question protects teams from demo theater. A product can look intelligent and still fail to change the economics of the customer’s work.

A strong AI strategy can explain exactly which workflow changes, why the product has privileged context, where human judgment remains essential, what asset compounds, and how the customer outcome improves.

The gateway question

The companies that win with AI will not be the ones that simply advertise AI. Many of the strongest AI products will disappear into the work. They will automate a painful step, improve a decision, expose the right context, or make a workflow agent-readable.

The gateway question is simple: what would have to become true for this AI feature to be missed if it disappeared? If the answer is “users would lose a convenient shortcut,” the product may still be useful, but convenience is easy to copy. On the other hand, if users would lose a workflow they now trust, a decision process they can audit, a data loop that improves with every use, or an outcome they can measure, the product is closer to a moat.

Before asking which model to use, ask which workflow deserves intelligence. Before adding more context, ask which context improves the outcome and which context creates risk. Before automating a decision, ask where human judgment, accountability, and override authority must remain. Before claiming a data moat, ask whether the data is fresh, governed, relevant, and used in a feedback loop. AI makes building easier. That does not make strategy easier. It moves strategy closer to the user, closer to the workflow, and closer to the judgment calls that decide what should exist in the first place.

AI products earn autonomy one workflow at a time

estebanf — Thu, 18 Dec 2025 22:02:57 +0000

A demo only has to impress once. An AI product has to work every day. The first version looks magical in a conference room: it answers the clean prompt, completes the happy path, and suggests that broader autonomy is one launch away. Then real users arrive. They ask incomplete questions, use old terminology, need exceptions, and care less about whether the system sounds intelligent than whether it handles the job correctly.

Autonomous agents are not a bad idea. The real lesson is that autonomy is not a launch setting. It is the outcome of a workflow that has proved it can be trusted. Most useful AI products begin by defining what the system should do, what it should not do, what good enough means, what evidence proves readiness, and when a human or another system must take over. The market sells AI as self-directed intelligence. Users trust it when it behaves like a well-designed workflow.

Here is the short version: before an AI product earns more autonomy, it should be able to answer five questions. What recurring user job does this product handle? What does good enough mean for the output? What are the known failure modes? What test or real-user signal proves the workflow is ready? What must be escalated to a human or another system? If the team cannot answer those five, the product is still a demo.

The failure is usually not the model

A familiar pattern has emerged in enterprise AI. Teams run a promising pilot, then struggle to turn it into production value. The cause is rarely one thing: poor data, weak incentives, unclear ownership, cost, change management, and model limits all matter. But one failure mode shows up again and again: the team treats AI capability as the product strategy. A model can generate a response. A product has to know which response matters, how to verify it, where to send it, what to log, what to retry, what to escalate, and what to do when the world does not match the prompt.

Imagine a customer operations team that wants to build an AI agent. The tempting version is a chat box that promises to resolve anything. It looks powerful because it has no visible boundaries. The better version starts smaller: it classifies the request, checks whether required context is missing, drafts a response using the right policy, identifies exceptions, routes uncertain cases to a person, and logs why each step happened. It starts with read-only access, then earns permission to take action after the workflow proves itself. That version looks less magical. It is also closer to a product.
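The smaller version can be sketched as a read-only pipeline. Everything below is a stand-in: the keyword classifier, the request types, and the required fields are hypothetical, and a real system would replace each step. What matters is the shape: classify, check for missing context, draft, route uncertain cases to a person, and log why.

```python
from dataclasses import dataclass, field
from typing import Optional

# Context each request type needs before a draft is attempted (illustrative).
REQUIRED_FIELDS = {
    "refund": ["order_id", "purchase_date"],
    "order_status": ["order_id"],
}


@dataclass
class Outcome:
    route: str                      # "draft_for_review" or "human"
    draft: Optional[str]
    log: list = field(default_factory=list)


def classify(text):
    """Toy keyword classifier standing in for a real model call."""
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "order status" in lowered or "where is my order" in lowered:
        return "order_status"
    return "unknown"


def handle(text, context):
    """Read-only handling: produce a draft or hand off, and never change a system."""
    log = []
    kind = classify(text)
    log.append(f"classified as {kind}")
    if kind == "unknown":
        log.append("no matching workflow, routed to human")
        return Outcome("human", None, log)
    missing = [name for name in REQUIRED_FIELDS[kind] if name not in context]
    if missing:
        log.append(f"missing context {missing}, routed to human")
        return Outcome("human", None, log)
    draft = f"Draft reply for {kind} request on order {context['order_id']}"
    log.append("drafted reply for review, no system changes made")
    return Outcome("draft_for_review", draft, log)
```

A refund request that arrives without a purchase date goes to a person, with the reason in the log. Nothing in this version writes to a system of record, which is the point: action rights get added only after the drafts and handoffs have proved reliable.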

Start with intent, not intelligence

The first question is not “how autonomous can this be?” The first question is “what work is this product responsible for?”

A useful AI product begins with intent infrastructure. That means the team defines the recurring user job, the required inputs, the desired output, the standard for good enough, and the line between what the system can decide and what it must escalate. This sounds basic because the perceived intelligence of the model makes teams skip basic work. When a system can answer almost anything, it is easy to mistake fluency for understanding the job.

A customer support agent cannot simply handle refunds. It needs to know which refunds are eligible, which markets have different rules, which requests require identity verification, which cases require empathy rather than speed, and which exceptions belong with a human agent. A financial analysis assistant cannot simply summarize risk. It needs definitions, thresholds, source rules, and a way to separate fact from inference. The more consequential the workflow, the more explicit the intent must become.

This is why strong AI products often start with narrow, repeated work. A personal assistant that analyzes a daily spreadsheet inside a clearly defined role is more product-shaped than a general assistant that promises to help with business. A Claude Code skill that walks a user through specific questions and reviews marketing advantages for specificity, maturity, and strength is more reliable than a vague prompt to “make this strategy better.” The narrowness is not weakness. It is where the team learns what reliability costs.

Reliability comes from structure

AI products are different from traditional software in one uncomfortable way: they can always produce something. That makes “done” harder to define. A conventional workflow often fails visibly: a button breaks, a field rejects input, a transaction does not complete. An AI system can fail more quietly. It can produce an answer that sounds plausible, misses a policy exception, cites the wrong context, or handles a rare case with misplaced confidence.

That is why workflow structure matters. Reliable AI systems use bounded steps, structured outputs, limited permissions, tests, evals, checkers, and fallback paths. They define what the system is allowed to do before it does it. They do not treat the prompt as the only control layer.

Anthropic’s guidance on building effective agents argues for starting with simple systems, evaluating them, and adding more agentic complexity only when simpler patterns fall short. Berkeley AI Research describes a related idea as compound AI systems, where the product’s performance depends on the surrounding system of models, tools, retrieval, orchestration, and controls. In other words, the harness is part of the product.

The model matters. But two products using the same model can perform very differently because one captures the right context, structures the work, tests the output, controls execution, and escalates uncertainty. The other sends a broad prompt into a broad interface and hopes the answer is good. Hope is not a product control.

The metrics can lie

Throughput is not trust. An AI support system can handle more tickets, reduce operational load, and still damage the experience in the cases that matter most. Volume tells you the system did something. It does not tell you whether the customer got the right outcome, whether the tone was appropriate, whether the edge case was caught, or whether repeat contacts rose after the interaction.

Before scaling, a workflow-first product asks: which cases are simple enough for automation? Which cases require human judgment? What signal tells us that quality is deteriorating? What repeat-contact rate, escalation rate, complaint rate, or reviewer override rate would trigger a rollback? What does good enough mean for customers, not just for operations?
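One way to keep those rollback questions from staying rhetorical is to write the triggers down as data. The sketch below is a minimal Python illustration. The metric names and ceilings are hypothetical, and real values would have to come from the team's own baseline.

```python
# Hypothetical ceilings; real values would come from the team's own baseline.
ROLLBACK_CEILINGS = {
    "repeat_contact_rate": 0.15,
    "complaint_rate": 0.02,
    "reviewer_override_rate": 0.10,
}


def breached_signals(metrics: dict[str, float]) -> list[str]:
    """Return the quality signals that exceed their ceiling.

    A metric missing from `metrics` counts as breached, so an
    unmeasured signal cannot silently pass.
    """
    return [
        name
        for name, ceiling in ROLLBACK_CEILINGS.items()
        if metrics.get(name, float("inf")) > ceiling
    ]


def should_roll_back(metrics: dict[str, float]) -> bool:
    """Roll back if any quality signal is breached, whatever the volume."""
    return bool(breached_signals(metrics))
```

Ticket volume is deliberately absent from the check. Throughput can rise while every signal in this table gets worse.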

Intercom describes Fin as an AI agent, but its public product framing emphasizes support channels, handoff paths, reporting, and review loops. That does not make it less agentic. It makes the autonomy legible. The useful distinction is between an unbounded agent promise and workflow-shaped autonomy, not between workflow and agent.

The product is the harness

A good AI product does not only answer. It manages the conditions under which an answer becomes useful. The harness decides what context reaches the model, what tools the model can use, what structure the output takes, and whether generated text becomes a draft, a recommendation, an action, or an escalation. It records what happened. It gives humans a way to correct the system.

This is why coding agents work best when wrapped in structured task lists, one feature per session, repository principles, tests, browser checks, and documentation. The agent appears autonomous, but the product experience depends on the scaffolding around it. The same pattern applies outside software.

A regulated AI product should define its domain of operation before implementation starts. It should decompose autonomous functions into bounded components and map each component to safety tests, inspections, documentation, and operating rules. In high-risk contexts, this is no longer only good design practice. Regulations such as the EU AI Act increase the pressure for logging, transparency, documentation, and traceability. That pressure points to the same product truth: if the system cannot explain its operating boundary, record its behavior, and show how users should interpret its output, it is not ready for serious autonomy.

The real counterargument is speed

The strongest objection to workflow-first AI is that workflows sound slow, not that they are wrong. That objection deserves respect. Some products can tolerate broader autonomy earlier: consumer tools, coding assistants, internal prototypes, and AI-native products often operate in environments where errors are reversible, users expect imperfection, and expert users can inspect the output.

A startup may also need to learn in public. A rival may ship a rough agent, collect feedback, and improve faster than a team that waits for a perfect harness. In low-stakes categories, that trade-off can be rational. The mistake is turning that exception into the default rule.

The better response to leadership is not “we cannot ship until every control is perfect.” It is “we can ship the amount of autonomy that matches the risk. If the error is reversible and visible, we can move faster. If the error is costly, regulated, or hard for users to detect, we need stronger evidence first.” That sentence changes the conversation. It does not frame workflow discipline as caution. It frames it as the mechanism that lets the team decide how fast it can responsibly move.

For enterprise, regulated, trust-critical, or high-consequence workflows, “move fast” does not remove the need for controls. It changes how quickly the team must build them. The practical rule is risk-calibrated: low-stakes products can expand autonomy faster. High-stakes products need stronger proof before autonomy expands.

The autonomy readiness test

Before asking whether an AI product can be more autonomous, ask whether it is workflow-ready. Use this test when evaluating a product idea, reviewing a roadmap, buying an AI tool, or deciding whether an agent should move from pilot to production.

The first five questions are the minimum: what recurring user job does this product handle, what does good enough mean for the output, what are the known failure modes, what test or real-user signal proves the workflow is ready, and what must be escalated to a human or another system.

Then add the operating questions that determine whether autonomy can safely expand. What inputs, context, systems, APIs, and business rules does the workflow need? What can the AI decide on its own? What permissions should the system start with? What behavior will be observed after launch? What signal would cause the team to restrict scope or pause automation? What evidence would justify expanding autonomy? What is the per-unit cost of the workflow, and when does it become economically viable? How will users know what the system can and cannot decide?

The list is product design for nondeterministic systems, not bureaucracy. A deterministic product needs tests because software breaks. An AI product needs tests, evals, monitoring, and escalation because it can appear to work while being wrong in ways users cannot easily detect.

What evidence-driven autonomy looks like

Autonomy should expand when the evidence supports it. That evidence can take several forms: eval pass rates above a defined threshold, real-user error rates below an acceptable ceiling, production interactions without critical failures, human reviewer override rates falling below a target, successful performance across messy edge cases, or a support team seeing fewer repeat contacts without worse customer sentiment.
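A minimal Python sketch of that idea follows, assuming a four-step autonomy ladder and thresholds that are purely illustrative. The level names, the `Evidence` fields, and the 0.95 and 0.05 cutoffs are hypothetical. The point is only that the same evidence that expands autonomy can also contract it.

```python
from dataclasses import dataclass

# Hypothetical autonomy ladder and thresholds, for illustration only.
LEVELS = ["read_only", "draft", "act_with_approval", "act_alone"]


@dataclass
class Evidence:
    eval_pass_rate: float          # share of eval cases passed
    override_rate: float           # share of outputs reviewers changed
    critical_failures: int         # critical incidents in the review window


def next_level(current: str, ev: Evidence) -> str:
    """Move one step up, stay, or step down based on workflow evidence."""
    i = LEVELS.index(current)
    if ev.critical_failures > 0:
        return LEVELS[max(i - 1, 0)]          # evidence can also contract autonomy
    if ev.eval_pass_rate >= 0.95 and ev.override_rate <= 0.05:
        return LEVELS[min(i + 1, len(LEVELS) - 1)]
    return current
```

The function moves one step at a time in either direction, so a single strong eval run cannot jump a workflow from read-only to acting alone.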

The important part is that the evidence belongs to the workflow. General benchmark improvement does not prove that your customer support agent can handle billing exceptions. A model release note does not prove that your medical summarization tool can manage domain-specific terminology. A successful demo does not prove that your drive-through ordering system can handle noise, accents, jokes, interruptions, and malicious inputs.

That is the difference between a demo eval and a product eval. A demo eval asks “can the system do this once?” A product eval asks “can the system do this repeatedly, under real conditions, with visible failure handling, at an acceptable cost, without damaging trust?”

How to use the test in a product review

Picture a product manager reviewing a proposed AI feature. The team wants to launch an agent that handles customer operations. The demo is strong: it reads a ticket, checks policy, drafts a response, and closes the case.

The product manager should not start by asking whether the agent feels impressive. The review should start with the workflow. What categories of tickets can it handle? What policy sources does it use? What does it do when information is missing? What counts as a correct resolution? What cases must it route to a person? What metric would reveal that customers are unhappy even if ticket volume improves? What permission does it need on day one? What permission should wait?

Those questions turn the product from a claim into an operating system. The answer might still be ambitious. The team might decide that the AI can classify every ticket, draft responses for half of them, automatically resolve a narrow subset, and escalate anything involving account closure, legal terms, billing disputes, or emotional distress. That is autonomy with a map, not anti-autonomy.

The buyer’s version of the test

The same logic applies when buying AI products. “Agentic” has become a sales word. A buyer should ask whether the vendor’s autonomy is visible enough to evaluate. Is this actually an autonomous agent, or a bounded workflow with a conversational interface? What systems does it connect to? What permissions does it need? What does it log? What does it refuse to do? How does it escalate? What evals prove it works in our environment? What happens when our data, users, policies, or edge cases differ from the vendor’s demo?

These questions do not slow procurement. They reduce the chance of buying a proof of concept that never becomes a product. For enterprise buyers, workflow-first positioning is a strength. A narrow, observable, eval-backed deployment is easier for risk, legal, security, and operations teams to approve than a broad autonomy pitch with unclear controls. A simple rule helps: until a vendor can name what the system refuses to do, the product is probably not ready to earn more permission.

Gateways to autonomy

The path to better AI products does not run through less ambition. It runs through better gates. Before you ask whether the AI can act, ask whether the workflow defines the action. Before you scale the agent, ask whether you know what good enough means. Before you grant write access, ask whether read-only performance has earned it. Before you trust the output, ask whether the eval reflects real users, not clean demos. Before you celebrate automation, ask whether quality metrics agree with volume metrics. Before you expand autonomy, ask what evidence would also force you to contract it. Before you buy the agent, ask whether the harness is visible.

A magic trick rewards surprise. A product rewards repeatability. The best AI products do not become useful because they look autonomous on day one. They become useful because the team defines the work, builds the checks, observes the behavior, and expands autonomy only when the workflow proves it can carry the weight.

The PM thinking stack

estebanf — Fri, 05 Dec 2025 18:41:05 +0000

AI makes it easier to produce plausible work. That makes it more important for PMs to know which work should survive.

A PM can now generate a research synthesis, mock a prototype, draft a PRD, summarize support tickets, ask an agent to update a system, and compare five product directions before lunch. The visible output has changed. The harder question has not: is any of it right?

That is the real shift in AI-era product management. The new PM toolkit is not simply ChatGPT, RAG, agents, automation platforms, eval tools, and prototype generators. It is the judgment system around those tools: problem fit, context, constraints, quality criteria, human validation, decision rights, and production learning. AI does not remove the need for product judgment. In many product workflows, it raises the cost of weak judgment because plausible output can reach decisions, users, or systems before anyone has examined whether it deserves to.

The demo is not the product

Picture a product review. An agent retrieves customer context, drafts a plan, updates a ticket, writes a polished customer response, and produces a tidy summary. The room reacts to the tool. It looks fast. It looks competent. It looks like the future.

The experienced PM asks different questions. What user task is this solving? What data did it use? What was it allowed to change? What counted as failure? What happens when the source data is stale? Who reviews the edge cases? How will we know when quality gets worse? These questions are not skepticism for its own sake. They are the product work.

A magical demo can hide the hardest parts of the system. The happy path is often cheap. The last 20 percent is where the product lives: permissions, recovery, ambiguity, latency, trust, compliance, user reliance, and the cost of being wrong.

McDonald’s ended its IBM AI drive-thru test after mixed results and complaints about misunderstood orders. The problem was not that voice AI could never work. The issue was that the real workflow included noise, accents, interruptions, substitutions, impatient customers, and low tolerance for wrong orders. NYC’s MyCity chatbot created a different failure mode. It gave official-looking business advice while warning users that answers could be incorrect. In a context where users treat the interface as authoritative, a disclaimer is not a product control. The product needs grounding, legal review, escalation, and clear limits on what the system can say.

Both examples point to the same lesson: AI tools are not self-validating. A system can sound fluent, complete the requested action, and still be wrong for the user, the workflow, or the risk level. Your job as a PM is to spot that gap before it reaches users.

Start with the problem pattern, not the tool

The most common AI mistake is starting with the capability. Could we use an agent here? Could we add a chatbot? Could RAG solve this? Could we automate the workflow?

Better PMs start one layer earlier. What pattern are we trying to improve? Is the task repetitive, judgment-heavy, high-volume, ambiguous, regulated, emotional, time-sensitive, or error-intolerant? Does the user need a recommendation, a draft, a search result, a workflow action, a decision aid, or a repeatable answer with almost no variance? Is AI better than a rule, a form, a dashboard, a saved search, or a simpler automation?

That framing matters because AI is a set of capabilities with different failure modes, not one tool. An LLM can draft and reason over messy language. Retrieval can ground answers in selected sources. Automation can move work between systems. An agent can pursue a goal using tools. A prototype can create shared understanding before a team invests in buildout. None of those capabilities is automatically useful.

A PM’s job is to match the problem to the right kind of system. Sometimes that means using AI. Sometimes it means using AI only as an assistant, and sometimes it means not using AI at all. This is especially true when the workflow needs strict compliance, auditability, low latency, or near-zero error tolerance. In those cases, the best AI decision may be a no.

Judgment has to become testable

Traditional product work allowed a lot of quality judgment to stay implicit. A PM could say the experience should feel clear, trustworthy, or helpful. A designer could interpret that into flows. Engineers could implement it. Customer feedback could tell the team whether it worked. AI products make that looseness expensive.

When output is probabilistic, “good enough” has to become explicit. The team needs examples, rubrics, known-good answers, failure cases, test datasets, human review rules, production sampling, and thresholds for escalation or rollback. OpenAI’s evaluation guidance makes this point directly: generative AI varies, so teams need objectives, datasets, metrics, and iteration. Traditional tests are not enough when the same input can produce different outputs and quality depends on user context.

In practice, this means a PM cannot stop at “the assistant should be helpful.” For a support assistant, helpful might mean: answers only from the approved knowledge base, cites the policy it used, says when it does not know, routes billing disputes to a human, and never invents legal or refund language. Bad output is just as important to define: fabricated policies, confident answers without sources, tone that hides uncertainty, or recommendations that skip required approval. That is the move from taste to operating criteria.
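Here is a rough Python sketch of what those operating criteria could look like as a check on a single reply. The reply shape (`topic`, `sources`, `says_unknown`, `escalated`) and the rule wording are assumptions for illustration, not a real schema. Rules such as "never invents refund language" would need human or model-graded review and are left out.

```python
# Hypothetical output shape and rule names, for illustration only.
ESCALATE_TOPICS = {"billing_dispute"}


def check_reply(reply: dict, approved_sources: set[str]) -> list[str]:
    """Return the rubric violations for one assistant reply."""
    violations = []
    cited = reply.get("sources", [])

    if reply.get("topic") in ESCALATE_TOPICS and not reply.get("escalated"):
        violations.append("billing dispute not routed to a human")
    if not reply.get("says_unknown") and not cited:
        violations.append("confident answer without sources")
    if any(src not in approved_sources for src in cited):
        violations.append("cites a source outside the approved knowledge base")
    return violations
```

An empty list means the reply passed the mechanical checks only. It does not mean the answer was right, which is why sampling and human review still sit alongside checks like this.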

Evals are a team sport, and PMs do not own every eval. Engineers, data scientists, researchers, designers, legal, security, support, and domain experts all own parts of the quality system. The PM’s job is to anchor the “why,” “for whom,” and “what counts as good.” Subject matter experts label edge cases. Engineers operationalize the harness. Researchers test whether users trust and understand the output. Legal and security define the lines the system cannot cross. The PM skill is translating product judgment into criteria the team can act on.

The scarce skill is acceptance, not generation

The old AI productivity story focused on output volume. More drafts. More summaries. More prototypes. More code. More tickets processed. That story is incomplete. As AI increases output volume, teams need sharper acceptance criteria. What should be trusted, revised, escalated, automated, shipped, or rejected? This is where PM judgment shows up.

AI can synthesize a week of customer interviews into a neat set of themes. The PM still has to decide which insight changes the roadmap, which theme is noise, which quote represents a real pattern, and which customer pain is not worth solving now. AI can create a polished prototype. The PM still has to ask what the prototype proves. Does it validate demand, usability, feasibility, stakeholder alignment, or only visual direction? A vibe-coded prototype can impress a room and fail a user test because the fake details hide the real interaction problem. AI can draft a strategy memo. The PM still has to reject sloppy reasoning, missing constraints, false certainty, and recommendations that do not fit the business.

Call it acceptance discipline, not rejection. The point is knowing the standard well enough that yes means something. The useful PM can say: “This output is wrong because it optimizes for the wrong user.” Or: “This prototype is misleading because it skips the failure state.” Or: “This recommendation ignores a compliance constraint.” Or: “This research summary overweights loud feedback.” Or: “This agent should not have write access because the recovery path is unclear.” Or: “This workflow should not use AI because the user needs a repeatable answer with almost no variance.”

Acceptance discipline also has a political cost. It is easy to say “this is not good enough” in an essay and much harder in a room where leadership likes the demo, engineering wants to ship, and the model’s output looks polished enough to pass. The way through is to make rejection less personal. Put the standard in the rubric, the gateway, the example set, and the approval rule before the demo asks for momentum. That is how teams turn taste into a shared operating system.

Agents turn judgment into permissions

Agents raise the stakes because they do not only produce text. They act. That changes the PM question from “Is the answer good?” to “What is this system allowed to do?”

An agent can execute a goal correctly and still make the wrong product decision. If the goal is underspecified, the agent may pursue a proxy, like completion speed or task closure, that conflicts with user trust or long-term value because the specification did not encode those constraints. This is why agent design should start with bounded autonomy.

Anthropic’s guidance on agents favors simple, composable patterns, clear tool boundaries, prototypes, and comprehensive evaluations. Gartner has warned that many agentic AI projects will be canceled because of cost, unclear value, or weak risk controls. McKinsey’s recent AI survey work points in the same direction: adoption is broad, but scaled organizational impact depends on workflow redesign, governance, and human review practices, not tool access alone.

That does not mean teams should avoid agents. It means they should stop treating autonomy as a feature label. Autonomy is a permission model. Before giving an agent more scope, the team should know what task it owns, what tools it can use, what data it can access, what actions require approval, what it must never do, what users can undo, what the system logs, and what triggers escalation or rollback.

A customer support agent that drafts replies is one product. One that issues refunds is another. One that changes account status, updates billing, and sends legal language is a much riskier system. The difference is the boundary around the model, not the model itself.
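Treating autonomy as a permission model can be sketched in a few lines of Python. The action names and the policy table below are hypothetical and exist only to illustrate the boundary. The one design choice worth noting is that any action the policy does not list is denied by default.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical policy for a support agent; action names are illustrative.
POLICY = {
    "draft_reply": Decision.ALLOW,
    "issue_refund": Decision.NEEDS_APPROVAL,
    "change_account_status": Decision.NEEDS_APPROVAL,
    "send_legal_language": Decision.DENY,
}


def authorize(action: str, audit_log: list) -> Decision:
    """Look up an action; anything not listed is denied by default."""
    decision = POLICY.get(action, Decision.DENY)
    audit_log.append((action, decision.value))
    return decision
```

The same model sits behind all three products in the paragraph above. Only the policy table changes, and every lookup is logged whether or not the action is allowed.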

The PM thinking stack

A useful AI operating model needs to sit between tool access and product action. Call it the PM Thinking Stack, with one important caveat: PMs do not own the whole stack alone. They anchor it. The team operates it.

Use the stack progressively. For an early prototype, start with the problem frame and the permission boundary. If those are unclear, do not spend energy polishing the demo. For a customer-facing beta, add context, evaluation, and acceptance criteria. For production, make the learning loop and governance explicit. Governance and accountability run through all six layers.

1. Problem and workflow frame

Start here before choosing a tool. What user, workflow, pain, or decision are we improving? What are the costs of delay and error? Why is AI better than a rule, search, form, dashboard, or ordinary automation?

This layer protects the team from tool-first thinking. It also creates the first go or no-go gate. If the workflow has low tolerance for error, unclear ownership, or no measurable value, adding AI will not fix it.

2. Context and evidence layer

AI output improves when the system has the right context. It gets worse when context is stale, partial, untrusted, or too broad. Ask: what customer, business, domain, technical, and organizational context does the system need? What sources can it use? How fresh must those sources be? What evidence must it show before users trust it?

This layer matters because many AI failures are not reasoning failures. They are context failures. The system answers from the wrong source, lacks the user’s situation, misses a policy, or treats outdated information as current.

3. Boundaries, permissions, and risk layer

Every AI workflow needs limits. Agentic workflows need them even more. What must the system never do? What data can it read? What systems can it write to? What actions need approval? What escalation, rollback, and logging must exist?

This layer turns abstract governance into product decisions. In a support workflow, for example, drafting a reply can sit inside a lightweight review step. Issuing refunds, changing account status, or sending regulated language needs a different permission model.

4. Evaluation and observability layer

A product team needs to know whether the system works before launch and whether it degrades after launch. What does good mean for this user and task? What examples represent excellent, acceptable, and failed output? What failure cases belong in the regression suite? What should we sample in production? What threshold triggers human review, rollback, or retraining?

This layer is the operating infrastructure for judgment. It turns taste, safety, usefulness, and trust into repeatable checks. Evals do not eliminate risk. They catch known failure modes, compare versions, and make quality visible. They still miss unknown edge cases, distribution shifts, and failures outside the test set. That is why observability and human review belong in the same layer.

5. Acceptance and action layer

AI output should connect to a decision. What action will this output inform? Who decides whether to accept it? What evidence do they need? Should the team ship, revise, reject, escalate, roll back, or stop?

This layer prevents AI from becoming a content machine with no accountability. A faster report only matters if someone decides what action follows. A better draft only matters if someone knows the acceptance standard. An agent only helps if the team knows when it can act without approval.

6. Production learning loop

AI quality changes after launch because users change, data change, models change, prompts change, and edge cases appear. How will we capture user feedback and record rejection reasons? What human override patterns matter? What incidents require eval updates? What model or prompt changes need regression tests?

This layer turns failure into institutional memory. Teams build advantage when expert rejection patterns become shared constraints, examples, and tests.

Cross-cutting layer: Governance and accountability

Governance is not a final review meeting. It has to run across the stack. For AI products, governance means documented ownership, auditability, model and vendor review, security, privacy, compliance, policy controls, and clear accountability for outcomes. NIST’s AI Risk Management Framework, the EU GPAI Code of Practice, and OpenAI’s Preparedness Framework all point in this direction: AI risk management is lifecycle work, not a one-time launch gate.

For PMs, the practical version is simple: never let “human in the loop” stay vague. Name the human. Name the authority. Name the review moment. Name the evidence. Name the escalation threshold.

Tool fluency still matters

This argument can be misread as “tools do not matter.” That is wrong. PMs need hands-on reps with LLMs, retrieval, automation, prototypes, evals, and agents. Junior PMs especially need to use the tools enough to understand their strengths, failure modes, and cost of review.

But tool fluency is the starting point, not the destination. A PM who knows how to prompt but cannot define success criteria will produce more plausible noise. One who can prototype but cannot state what the prototype proves will create false confidence. One who can spec an agent but cannot define permissions, escalation, and rollback will create risk faster than value.

The goal is to connect technical fluency to product judgment, not to become less technical. The more fluent you become with the tools, the harder it can be to keep enough distance to spot their failures. Tool fluency and acceptance discipline have to develop together.

The bottleneck moves upstream

AI can speed up some forms of generation, coding, research synthesis, and prototyping. It does not make execution cheap in every context. When AI increases output speed, the bottleneck often shifts to problem selection, workflow integration, evidence quality, review capacity, risk management, and stakeholder alignment. That is PM territory, but not exclusively. The PM’s value rises when they can help the team decide what is worth building, what is worth trusting, what is worth automating, and what is worth stopping.

This is where the stack has to stay pragmatic. Not every AI experiment needs the same weight. A hackathon prototype needs a clear user, a clear task, and a clear permission boundary. An internal copilot needs a lightweight rubric and a feedback path. A customer-facing agent in a regulated workflow needs deeper evals, monitoring, approvals, and rollback. The standard should scale with the risk. The habit should start early.

The gateways

Before you add AI to a workflow, pass through the problem gateway: does this workflow have a clear user, task, owner, success measure, and reason AI beats a simpler solution? Before you trust AI output, pass through the context gateway: can the team explain what sources, examples, rules, and prior decisions shaped the answer? Before you give a system autonomy, pass through the permission gateway: do you know what it can read, what it can change, what it must never do, and when a human must approve the next step? Before you launch, pass through the quality gateway: have you defined good and bad output with rubrics, examples, evals, human review, production monitoring, and rollback thresholds? Before you act on the output, pass through the decision gateway: who decides whether to ship, revise, reject, escalate, stop, or automate? Before you scale, pass through the learning gateway: will every failure, override, support issue, rejection pattern, and model change make the system easier to evaluate next time?

The PMs who win with AI will not be the ones with the longest tool list. They will be the ones who can make judgment operational. AI gives teams more output. The thinking stack decides what deserves to become product.

The last-mile AI strategy test

estebanf — Mon, 17 Nov 2025 17:40:19 +0000

The easier it gets to add AI, the more valuable it becomes to know where AI does not belong.

Most teams no longer struggle to access models that can summarize, draft, classify, route, recommend, answer, and act well enough to produce impressive demos. They struggle to turn that capability into something users trust, repeat, and pay for. That abundance creates a new failure mode. Teams start asking “where can we add AI?” when the better question is “where does AI improve the work?”

The last mile is everything between model capability and realized user value: the workflow where the model appears, the context it receives, the control the user keeps, the failures the product catches, and the outcome the business measures. As model capability becomes broadly available, durable advantage moves to the product layer: the harness, workflow, data context, trust boundary, and operating loop that turn model output into user value. AI strategy is not won by having AI. It is won by knowing where AI should create value, where it should stay out of the way, and how to prove that it works better than the baseline. A wrapper shows that a model can do something. A product proves that the thing matters inside a real workflow.

The P&L gap

One of the most telling AI statistics is not a model benchmark. MIT Project NANDA reported that 95% of enterprise generative AI deployments produced zero measurable P&L impact. Treat the figure as directional, not as a universal law. The sample, definitions, and maturity of the deployments matter. Many AI projects are early experiments. Some were never designed to move a financial metric. But the pattern names a problem many leaders recognize: weak data, poor workflow fit, no clear business outcome, and no production measurement plan.

A CIO quoted in the research put it plainly: “We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.” The point is not that enterprise AI does not work. The point is that demos do not compound into value unless the product changes a real workflow and measures a real outcome.

Model access is becoming a weaker moat

For many horizontal use cases, model access is no longer the scarce asset it was two years ago. Inference costs have fallen. Frontier models have converged enough that the same base capability can produce very different user experiences depending on the product around it. One product makes the model feel like a natural extension of the work. Another makes the same class of model feel slow, vague, and risky.

Model choice still matters in regulated domains where an accuracy gap can decide whether a product is safe to use. Domain-specific models, fine-tuning, routing, and cost-performance judgment still matter when the cost of a wrong answer is high. But for many teams building AI into horizontal products, the harder question has moved up the stack: what job should the model do, in what workflow, at what quality threshold, with what human control?

Workflow fit beats feature presence

Imagine a product review meeting where every proposed roadmap item has “AI” in the name. One idea is technically impressive. It uses an agent to scan customer accounts, infer next steps, and generate a full action plan. In the demo, it looks powerful. In the actual workflow, it requires reps to leave their CRM, open a separate dashboard, review uncertain recommendations, and copy the useful parts back into the system where their manager measures activity.

Another idea looks boring. It cleans messy account notes inside the CRM, drafts a next-best follow-up, flags uncertainty, and asks the rep to approve before anything reaches the customer. The second idea wins, not because it uses a better model or has a more ambitious vision. It wins because it changes the user’s day without forcing the user into a new one.

This is where many AI products break. They assume users want intelligence as a destination. Most users want less friction in work they already have to do. GitHub Copilot’s success appears to come in part from tight workflow embedding: it sits inside the IDE, helps at the point of composition, and does not ask the developer to move the work somewhere else.

AI creates value when it meets the user where the work already happens. That does not mean every AI product must live inside an existing surface. Some new AI-native workflows will replace old ones. But even then, the product has to earn the behavior change. A standalone AI workspace has to be better enough to justify switching costs, training, governance, and new habits. Workflow fit is strategy, not polish.

The automation boundary is a product decision

Klarna’s customer service AI is a useful public case because it shows both the promise and the risk. Klarna said its AI assistant handled millions of customer conversations and automated roughly two-thirds of customer service chats. For simple questions, such as order status and payment schedules, that kind of automation can work well.

Then the story became more complicated. Klarna faced public scrutiny over the limits of automation and later described a need to rebalance human support in areas where customer needs are more complex. The lesson here is that automation boundaries matter, not that customer service AI failed. In disputes, fraud claims, hardship cases, and sensitive financial conversations, confident but wrong answers about fees, policy, or payment terms are not merely annoying. They create compliance and trust problems.

AI can answer simple transactional questions. AI can draft responses. AI can route cases. AI can summarize context for a human agent. But not every customer problem should move through the same automation path. Good AI strategy requires deciding whether AI should suggest, draft, decide, act, escalate, or stay silent. That decision depends on user trust, risk, reversibility, cost of error, and the emotional weight of the moment. A model that works for password resets may be the wrong product choice for a fraud dispute. A system that can draft a clinical note may not be allowed to make a clinical judgment. An agent that can update a record may need approval before it changes anything that affects money, rights, or safety.
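
One way to make that decision explicit is to write it down as a small policy function. The sketch below is illustrative only: the `Situation` fields, the 1-to-5 cost scale, and the 0.5 and 0.9 confidence cutoffs are assumptions made for the example, not recommended values, and it folds "decide" into "act" to stay short.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    STAY_SILENT = "stay silent"
    ESCALATE = "escalate"
    SUGGEST = "suggest"
    DRAFT = "draft"
    ACT = "act"


@dataclass
class Situation:
    cost_of_error: int   # hypothetical scale: 1 (trivial) to 5 (severe)
    reversible: bool     # can the action be undone cheaply?
    sensitive: bool      # disputes, fraud claims, hardship, clinical judgment
    confidence: float    # hypothetical confidence score, 0.0 to 1.0


def choose_mode(s: Situation) -> Mode:
    # Emotionally or legally heavy moments go to a person first.
    if s.sensitive:
        return Mode.ESCALATE
    # Low confidence: say nothing rather than produce an almost-right answer.
    if s.confidence < 0.5:
        return Mode.STAY_SILENT
    # Costly or irreversible: AI prepares the work, a human approves it.
    if s.cost_of_error >= 4 or not s.reversible:
        return Mode.DRAFT
    # Cheap, reversible, and high confidence: let the system act.
    if s.confidence >= 0.9:
        return Mode.ACT
    return Mode.SUGGEST


password_reset = Situation(cost_of_error=1, reversible=True, sensitive=False, confidence=0.95)
fraud_dispute = Situation(cost_of_error=5, reversible=False, sensitive=True, confidence=0.95)
refund_update = Situation(cost_of_error=4, reversible=True, sensitive=False, confidence=0.95)

print(choose_mode(password_reset).value)  # act
print(choose_mode(fraud_dispute).value)   # escalate
print(choose_mode(refund_update).value)   # draft
```

The order of the checks is the product decision. Sensitivity is tested before confidence, so a highly confident model still does not handle a fraud dispute on its own.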

The boundary is also a learning decision. When AI acts without review, the product may lose the correction signal that teaches the system what good looks like. In some cases, the most defensible product will automate less at first, because human review creates the feedback loop the product needs to improve. These are product decisions, not just policy decisions. In AI products, product sense includes confidence, security, cost, auditability, training, and production readiness.

Almost right is still expensive

A model can win a benchmark and still lose the user. Benchmarks test capability under controlled conditions. Products succeed or fail under messy conditions: bad inputs, unclear user intent, slow response times, edge cases, partial trust, broken data, and users who are trying to finish work while being measured by someone else.

A traditional feature usually behaves the same way each time. AI features vary by prompt, context, user, retrieved data, model update, and confidence threshold. They can fail silently. They can be almost right. They can create support burdens that do not appear in a demo. The Stack Overflow developer survey data points to this trust problem: AI tool adoption has risen, while many developers remain skeptical of output accuracy and report frustration with code that is close enough to look useful but wrong enough to require careful review.

That is the danger zone for AI products. Obviously wrong output is easier to reject. Almost-right output creates hidden review costs. Those costs compound inside organizations. Every uncertain output shifts ambiguity onto the next human in the chain. The user has to inspect it, correct it, explain it, defend it, or absorb the risk of acting on it. If the product saves five minutes of drafting but adds ten minutes of verification, the metric that moved was activity, not value. The best AI teams treat launch as the start of measurement, not the end of implementation.
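
The five-versus-ten-minute example can be written as a small calculation. This is a minimal sketch: the function names are made up for illustration, and the 120 tasks per week is an assumed volume, not a figure from any study.

```python
def net_minutes_per_task(drafting_saved: float, verification_added: float) -> float:
    """Minutes the user actually gets back on one task (negative means a loss)."""
    return drafting_saved - verification_added


def weekly_net_hours(drafting_saved: float, verification_added: float, tasks_per_week: int) -> float:
    """Net hours gained or lost per week across all tasks."""
    return net_minutes_per_task(drafting_saved, verification_added) * tasks_per_week / 60


print(net_minutes_per_task(5, 10))   # -5
print(weekly_net_hours(5, 10, 120))  # -10.0
```

A dashboard that counts drafts generated would show 120 uses of the feature. The same week, measured this way, shows ten hours lost.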

Product judgment shapes the moat

Proprietary data, feedback loops, distribution, bundling, business model, and regulatory positioning often look like separate advantages. In many cases, they are. Distribution especially can function independently of product quality. Salesforce, Microsoft, ServiceNow, and other incumbents can push AI features into products customers already use. That is a real advantage.

But distribution answers a different question. Distribution determines whether AI reaches users. Product judgment determines whether it earns repeated use, useful data, and trust. Data moats are also real, but simply having data does not guarantee a useful AI product. Teams still have to decide what data matters, how to collect it, how to refresh it, where to surface it, and how to turn it into a better decision for the user. When implementation gets easier, weak product judgment becomes easier to see.

The last-mile AI strategy test

Before building an AI feature, teams need a gate that is harder to pass than “the model can do it.” Use this test before committing roadmap, budget, or organizational attention.

1. User problem

What specific user pain, decision, or workflow does this improve? If the answer is “users want AI,” the idea is not ready. The problem should exist without the AI feature. The user should already spend time, money, attention, or risk on it.

2. Baseline

What does the user do today, how will you measure AI against that baseline in production, and who owns that measurement after launch? Do not compare the AI to a demo. Compare it to the actual workaround, spreadsheet, colleague, search process, support queue, or manual review path the user relies on now.

For a customer support drafting feature, that could mean measuring current handle time, escalation rate, correction rate, customer satisfaction, and policy errors before AI enters the workflow. The baseline does not have to be perfect. It has to be honest enough to prevent the demo from becoming the control group.

3. Workflow fit

Where does this enter the user’s existing work without creating new friction? The best AI products often feel less like a new destination and more like a better moment inside the current workflow. If the feature requires users to leave the system where work is measured, expect adoption to weaken. If the product is creating a new AI-native workflow, raise the bar. The new behavior has to beat the old one by enough to justify migration, training, and governance.

4. Automation boundary

Should AI suggest, draft, decide, act, escalate, or stay silent? For agentic AI, add a harder question: what does the human review before the agent acts, what can the user override, and what audit trail exists after the action?

5. Quality threshold

What reliability level is required before users trust it? Name the dangerous failure mode. In some contexts, a visible error is tolerable. In others, a silent or confidently wrong answer is unacceptable. The threshold should match the cost of being wrong.

A low-risk summarization tool can tolerate a different error profile than a fraud, clinical, legal, or payment workflow. Start with the cost of the error, compare against the current human or process baseline, then decide what level of review the system needs before the output matters.

6. Trust mechanism

How will the product show uncertainty, allow correction, preserve control, and recover from failure? Trust is a product behavior, not a message in onboarding. Users trust systems that make limits visible and give them control when the system is unsure.

7. Defensibility

What gets stronger over time? The answer might be proprietary data, workflow context, feedback loops, distribution, habit, or ownership of a measurable business outcome. If nothing compounds, the feature is likely a thin layer over model access.

8. Measurement commitment

What P&L or user outcome metric will this be measured against, who owns it, and what happens if the metric does not move within a defined window? This is where many AI projects reveal their weakness. They measure prompts sent, seats activated, or demos completed. Useful AI should move a real outcome: lower handling time with stable satisfaction, faster cycle time with fewer errors, higher conversion with lower manual effort, better compliance with less review burden. If the metric does not move, the team needs permission to change the workflow, narrow the use case, lower the automation level, or kill the feature.
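
A measurement commitment can be written down before launch as a small record with a decision rule attached. The sketch below assumes a metric where lower is better (handling time) and a guardrail that must not drop (satisfaction). The field names, thresholds, and 90-day window are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Commitment:
    metric: str
    owner: str
    target: float           # lower is better in this sketch
    guardrail: str
    guardrail_floor: float  # guardrail must stay at or above this value
    window_days: int


def review(c: Commitment, observed: float, guardrail_observed: float, days_elapsed: int) -> str:
    # A win on the primary metric does not count if the guardrail broke.
    if guardrail_observed < c.guardrail_floor:
        return "revisit: guardrail broken"
    if observed <= c.target:
        return "continue"
    if days_elapsed < c.window_days:
        return "keep measuring"
    # Window closed and the metric did not move: change the workflow,
    # narrow the use case, lower the automation level, or kill the feature.
    return "revisit: metric did not move"


c = Commitment(
    metric="handle time (minutes)",
    owner="support product lead",
    target=10.0,
    guardrail="customer satisfaction",
    guardrail_floor=4.2,
    window_days=90,
)

print(review(c, observed=9.5, guardrail_observed=4.3, days_elapsed=45))   # continue
print(review(c, observed=11.8, guardrail_observed=4.3, days_elapsed=45))  # keep measuring
print(review(c, observed=11.8, guardrail_observed=4.3, days_elapsed=95))  # revisit: metric did not move
print(review(c, observed=9.0, guardrail_observed=3.9, days_elapsed=45))   # revisit: guardrail broken
```

The useful part is less the code than the fact that the owner, the window, and the consequence are named before anyone has results to defend.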

How to use the test

The test works best before building, not after launch. Take a proposed AI feature and force it through the questions in sequence. Do not let the team skip from “user problem” to “defensibility.” Most weak ideas try to do that: they start with a vague pain, assume AI will solve it, then claim a data flywheel will appear later.

The stronger sequence is narrower. Start with one painful workflow. Define the baseline. Decide where AI enters. Set the automation boundary. Name the required quality threshold. Design the trust mechanism. Then ask what compounds if the product works.
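
The "in sequence" rule can be enforced with a simple gate that stops at the first unanswered question. This is a rough sketch that uses the eight question names from the test. The `first_gap` helper and the sample proposal are hypothetical, and a real review would judge the quality of each answer, not only whether one exists.

```python
from typing import Dict, Optional

SEQUENCE = [
    "user problem",
    "baseline",
    "workflow fit",
    "automation boundary",
    "quality threshold",
    "trust mechanism",
    "defensibility",
    "measurement commitment",
]


def first_gap(answers: Dict[str, str]) -> Optional[str]:
    """Return the first question in order with no answer, or None if all are answered."""
    for question in SEQUENCE:
        if not answers.get(question, "").strip():
            return question
    return None


# A proposal that jumps from the problem straight to the moat.
proposal = {
    "user problem": "Reps lose time rewriting messy account notes.",
    "defensibility": "A data flywheel will appear later.",
}

print(first_gap(proposal))  # baseline
```

The defensibility answer is ignored until the baseline exists, which is the behavior the sequence is meant to force.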

This sequence changes the roadmap conversation. A team ranking AI use cases by executive excitement will choose visible, demo-friendly ideas. A team ranking by customer impact, technical feasibility, risk posture, business value, and measurement clarity will choose smaller use cases that can actually reach production. That difference matters. Many useful AI products begin with unglamorous work: cleaning inputs, drafting recommendations, routing cases, summarizing context, flagging uncertainty, or reducing review time. The value comes from making the work better, not making the demo louder.

Before adding AI to a surface, ask whether the user problem would matter without AI. Before choosing a model, ask what baseline the product must beat. Before building a standalone experience, ask whether the AI belongs inside a workflow users already trust. Before automating a step, ask what the human must review, control, or override. Before launching, ask what failure mode would damage trust fastest.

If the model gets ten times better, does this product become obsolete, or does the thing you own become more valuable?