<?xml version="1.0" encoding="UTF-8" standalone="no"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" version="2.0">

<channel>
	<title>Artificial Intelligence</title>
	<atom:link href="https://aws.amazon.com/blogs/machine-learning/feed/" rel="self" type="application/rss+xml"/>
	<link>https://aws.amazon.com/blogs/machine-learning/</link>
	<description>Official Machine Learning Blog of Amazon Web Services</description>
	<lastBuildDate>Thu, 09 Apr 2026 17:33:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>Understanding Amazon Bedrock model lifecycle</title>
		<link>https://aws.amazon.com/blogs/machine-learning/understanding-amazon-bedrock-model-lifecycle/</link>
					
		
		<dc:creator><![CDATA[Saurabh Trikande]]></dc:creator>
		<pubDate>Thu, 09 Apr 2026 17:33:28 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Machine Learning]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<guid isPermaLink="false">0b68fb49f6a084b3383633311e7892a1cb1081eb</guid>

					<description>This post shows you how to manage FM transitions in Amazon Bedrock, so you can make sure your AI applications remain operational as models evolve. We discuss the three lifecycle states, how to plan migrations with the new extended access feature, and practical strategies to transition your applications to newer models without disruption.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; regularly releases new foundation model (FM) versions with better capabilities, accuracy, and safety. Understanding the model lifecycle is essential for effective planning and management of AI applications built on Amazon Bedrock. Before migrating your applications, you can test these models through the Amazon Bedrock console or API to evaluate their performance and compatibility.&lt;/p&gt; 
&lt;p&gt;This post shows you how to manage FM transitions in Amazon Bedrock, so you can make sure your AI applications remain operational as models evolve. We discuss the three lifecycle states, how to plan migrations with the new extended access feature, and practical strategies to transition your applications to newer models without disruption.&lt;/p&gt; 
&lt;h2&gt;Amazon Bedrock model lifecycle overview&lt;/h2&gt; 
&lt;p&gt;A model offered on Amazon Bedrock can exist in one of three states: &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-lifecycle.html" target="_blank" rel="noopener noreferrer"&gt;Active, Legacy, or End-of-Life (EOL)&lt;/a&gt;. A model’s current state is visible both on the Amazon Bedrock console and in API responses. For example, when you make a &lt;a href="https://docs.aws.amazon.com/bedrock/latest/APIReference/API_GetFoundationModel.html" target="_blank" rel="noopener noreferrer"&gt;GetFoundationModel&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/bedrock/latest/APIReference/API_ListFoundationModels.html" target="_blank" rel="noopener noreferrer"&gt;ListFoundationModels&lt;/a&gt; call, the state of the model is shown in the &lt;code&gt;modelLifecycle&lt;/code&gt; field of the response.&lt;/p&gt; 
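&lt;p&gt;As a minimal sketch, you can automate this check with the AWS SDK for Python (Boto3). The &lt;code&gt;list_foundation_models&lt;/code&gt; call and the &lt;code&gt;modelLifecycle&lt;/code&gt; response field are part of the Bedrock control-plane API; the grouping helper below is illustrative.&lt;/p&gt;

```python
# Sketch: group Amazon Bedrock foundation models by lifecycle state.
# group_by_lifecycle is a pure helper over the ListFoundationModels
# response shape; list_models_by_state makes the real call and needs
# AWS credentials plus boto3.
from collections import defaultdict

def group_by_lifecycle(model_summaries):
    """Group modelSummaries entries by their modelLifecycle status."""
    groups = defaultdict(list)
    for summary in model_summaries:
        status = summary.get("modelLifecycle", {}).get("status", "UNKNOWN")
        groups[status].append(summary["modelId"])
    return dict(groups)

def list_models_by_state(region_name="us-east-1"):
    """Call the Bedrock control-plane API (requires AWS credentials)."""
    import boto3
    client = boto3.client("bedrock", region_name=region_name)
    summaries = client.list_foundation_models()["modelSummaries"]
    return group_by_lifecycle(summaries)
```

&lt;p&gt;Running &lt;code&gt;list_models_by_state()&lt;/code&gt; from a scheduled job gives you an early signal when a model you depend on moves to &lt;code&gt;Legacy&lt;/code&gt;.&lt;/p&gt;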
&lt;p&gt;The following diagram illustrates the details around each model state.&lt;/p&gt; 
&lt;p&gt;&lt;img class="alignnone size-full wp-image-122007" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/12/22/ml-19718-image-1.png" alt="" width="1588" height="762"&gt;&lt;/p&gt; 
&lt;p&gt;The state details are as follows:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;ACTIVE&lt;/strong&gt; – Active models receive ongoing maintenance, updates, and bug fixes from their providers. While a model is &lt;code&gt;Active&lt;/code&gt;, you can use it for inference through APIs like &lt;code&gt;InvokeModel&lt;/code&gt; or &lt;code&gt;Converse&lt;/code&gt;, customize it (if supported), and request quota increases through AWS Service Quotas.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;LEGACY&lt;/strong&gt; – When a model provider transitions a model to &lt;code&gt;Legacy&lt;/code&gt; state, Amazon Bedrock notifies customers at least 6 months before the EOL date, providing essential time to plan and execute a migration to newer or alternative model versions. During the &lt;code&gt;Legacy&lt;/code&gt; period, existing customers can continue using the model, though new customers might be unable to access it, and accounts that have not invoked the model for 15 days or more might lose access. Purchasing new provisioned throughput model units also becomes unavailable, and model customization capabilities might face restrictions. For models with EOL dates after February 1, 2026, Amazon Bedrock introduces an additional phase within the &lt;code&gt;Legacy&lt;/code&gt; state: 
  &lt;ul&gt; 
   &lt;li&gt;&lt;strong&gt;Public extended access period&lt;/strong&gt; – After spending a minimum of 3 months in &lt;code&gt;Legacy&lt;/code&gt; status, the model enters this extended access phase. Active users can continue using it for at least another 3 months until EOL. During extended access, quota increase requests through AWS Service Quotas are not expected to be approved, so plan your capacity needs before a model enters this phase. During this period, pricing may be adjusted (see Pricing during extended access below), and customers will receive notifications about the transition date and any changes.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;END-OF-LIFE (EOL)&lt;/strong&gt; – When a model reaches its EOL date, it becomes completely inaccessible across all AWS Regions unless specifically noted in the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-lifecycle.html#versions-for-eol" target="_blank" rel="noopener noreferrer"&gt;EOL list&lt;/a&gt;. API requests to EOL models will fail unless special arrangements exist between the customer and provider for continued access. The transition to EOL requires proactive customer action—migration doesn’t happen automatically. Organizations must update their application code to use alternative models before the EOL date arrives.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;After a model launches on Amazon Bedrock, it remains available for at least 12 months and stays in &lt;code&gt;Legacy&lt;/code&gt; state for at least 6 months before EOL. This timeline helps customers plan migrations without rushing.&lt;/p&gt; 
&lt;h2&gt;Pricing during extended access&lt;/h2&gt; 
&lt;p&gt;During the extended access period, pricing may be adjusted by the model provider. If pricing changes are planned, you will be notified in the initial legacy announcement and before any subsequent changes take effect, so there will be no surprise retroactive price increases. Customers with existing private pricing agreements with model providers or those using provisioned throughput will continue to operate under their current pricing terms during the extended access period. This makes sure customers who have made specific arrangements with model providers or invested in provisioned capacity will not be unexpectedly affected by any pricing changes.&lt;/p&gt; 
&lt;h2&gt;Communication process for model state changes&lt;/h2&gt; 
&lt;p&gt;Customers will receive a notification 6 months prior to a model’s EOL date when the model provider transitions a model to Legacy state. This proactive communication approach ensures that customers have sufficient time to plan and execute their migration strategies before a model becomes EOL.&lt;br&gt; Notifications include details about the model being deprecated, important dates, extended access availability, and when the model will be EOL. AWS uses multiple channels to ensure these important communications reach the right people, including:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Email notifications&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://health.console.aws.amazon.com/health/home#/account/dashboard/scheduled-changes?viewType=table"&gt;AWS Health Dashboard&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;Alerts in the Amazon Bedrock console&lt;/li&gt; 
 &lt;li&gt;Programmatic access through the AWS Health API&lt;/li&gt; 
&lt;/ul&gt; 
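&lt;p&gt;Notifications surfaced in the AWS Health Dashboard can also be retrieved with the AWS Health API. The sketch below builds a &lt;code&gt;DescribeEvents&lt;/code&gt; filter for upcoming Bedrock scheduled changes; it assumes Boto3 and a Business or Enterprise Support plan (required for the Health API), and the exact filter values are illustrative.&lt;/p&gt;

```python
# Sketch: retrieve Bedrock lifecycle notifications via the AWS Health API.
# bedrock_lifecycle_filter is a pure helper; describe_bedrock_events makes
# the real call and needs boto3, AWS credentials, and a Business or
# Enterprise Support plan. The filter values are illustrative.
def bedrock_lifecycle_filter():
    """Build a DescribeEvents filter for upcoming Bedrock scheduled changes."""
    return {
        "services": ["BEDROCK"],
        "eventTypeCategories": ["scheduledChange"],
        "eventStatusCodes": ["upcoming"],
    }

def describe_bedrock_events():
    """The AWS Health API is served from the us-east-1 global endpoint."""
    import boto3
    health = boto3.client("health", region_name="us-east-1")
    return health.describe_events(filter=bedrock_lifecycle_filter())["events"]
```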
&lt;h2&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127325" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/Picture-1.png" alt="" width="838" height="618"&gt;&lt;/h2&gt; 
&lt;p&gt;To make sure you receive these notifications, verify and configure your account contact email addresses. By default, notifications are sent to your account’s root user email and alternate contacts (operations, security, and billing). You can review and update these contacts on your &lt;a href="https://console.aws.amazon.com/billing/home#/account"&gt;AWS Account page&lt;/a&gt; in the Alternate contacts section. To add additional recipients or delivery channels (such as Slack or email distribution lists), go to the &lt;a href="https://console.aws.amazon.com/notifications"&gt;AWS User Notifications console&lt;/a&gt; and choose AWS managed notifications subscriptions to manage your delivery channels and account contacts. If you are not receiving expected notifications, check that your email addresses are correctly configured in these settings and that notification emails from health@aws.com are not being filtered by your email provider.&lt;/p&gt; 
&lt;h2&gt;Migration strategies and best practices&lt;/h2&gt; 
&lt;p&gt;When migrating to a newer model, update your application code and check that your &lt;a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html" target="_blank" rel="noopener noreferrer"&gt;service quotas&lt;/a&gt; can handle expected volume. Planning ahead helps you transition smoothly with minimal disruption.&lt;/p&gt; 
&lt;h3&gt;Planning your migration timeline&lt;/h3&gt; 
&lt;p&gt;Start planning as soon as a model enters &lt;code&gt;Legacy&lt;/code&gt; state:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Assessment phase&lt;/strong&gt; – Evaluate your current usage of the legacy model, including which applications depend on it, typical request patterns, and specific behaviors or outputs that your applications rely on.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Research phase&lt;/strong&gt; – Investigate the recommended replacement model, understanding its capabilities, differences from the legacy model, new features that could enhance your applications, and the new model’s Regional availability. Review API changes and documentation.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Testing phase&lt;/strong&gt; – Conduct thorough testing with the new model and compare performance metrics between models. This helps identify adjustments needed in your application code or prompt engineering.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Migration phase&lt;/strong&gt; – Implement changes using a phased deployment approach. Monitor system performance during transition and maintain rollback capability.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Operational phase&lt;/strong&gt; – After migration, continuously monitor your applications and user feedback to make sure they’re performing as expected with the new model.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Technical migration steps&lt;/h3&gt; 
&lt;p&gt;Test your migration thoroughly:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Update API references&lt;/strong&gt; – Modify your application code to reference the new model ID. For example, change from &lt;code&gt;anthropic.claude-3-5-sonnet-20240620-v1:0&lt;/code&gt; to &lt;code&gt;anthropic.claude-sonnet-4-5-20250929-v1:0&lt;/code&gt;, or use &lt;a href="https://aws.amazon.com/blogs/machine-learning/unlock-global-ai-inference-scalability-using-new-global-cross-region-inference-on-amazon-bedrock-with-anthropics-claude-sonnet-4-5/" target="_blank" rel="noopener noreferrer"&gt;global cross-Region inference&lt;/a&gt; with &lt;code&gt;global.anthropic.claude-sonnet-4-5-20250929-v1:0&lt;/code&gt;. Update prompt structures according to the new model’s best practices. For more detailed guidance, refer to &lt;a href="https://aws.amazon.com/blogs/machine-learning/migrate-from-anthropics-claude-sonnet-3-x-to-claude-sonnet-4-x-on-amazon-bedrock/" target="_blank" rel="noopener noreferrer"&gt;Migrate from Anthropic’s Claude Sonnet 3.x to Claude Sonnet 4.x on Amazon Bedrock&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Request quota increases&lt;/strong&gt; – Before fully migrating, make sure you have sufficient quotas for the new model by requesting increases through the AWS Service Quotas console if necessary.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Adjust prompts&lt;/strong&gt; – Newer models might respond differently to the same prompts. Review and refine your prompts according to the new model’s specifications. You can also use tools such as the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-management-optimize.html" target="_blank" rel="noopener noreferrer"&gt;prompt optimizer in Amazon Bedrock&lt;/a&gt; to assist with rewriting your prompt for the target model.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Update response handling&lt;/strong&gt; – If the new model returns responses in a different format or with different characteristics, update your parsing and processing logic accordingly.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Optimize token usage&lt;/strong&gt; – Take advantage of efficiency improvements in newer models by reviewing and optimizing your token usage patterns. For example, models that support &lt;a href="https://aws.amazon.com/bedrock/prompt-caching/" target="_blank" rel="noopener noreferrer"&gt;prompt caching&lt;/a&gt; can reduce the cost and latency of your invocations.&lt;/li&gt; 
&lt;/ul&gt; 
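&lt;p&gt;To make the first step above a one-line change, keep the model ID in a single configuration value. The sketch below uses the Bedrock Runtime &lt;code&gt;Converse&lt;/code&gt; API with the model IDs from the example; the request-building helper and its inference settings are illustrative.&lt;/p&gt;

```python
# Sketch: keep the model ID in one configuration value so migrating is a
# one-line change. The Converse request shape follows the Bedrock Runtime
# API; the inference settings here are illustrative defaults.
LEGACY_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
TARGET_MODEL_ID = "anthropic.claude-sonnet-4-5-20250929-v1:0"

def build_converse_request(model_id, prompt):
    """Assemble a Converse request; only model_id changes during migration."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }

def invoke(prompt, model_id=TARGET_MODEL_ID):
    """Make the real call (requires AWS credentials and boto3)."""
    import boto3
    runtime = boto3.client("bedrock-runtime")
    response = runtime.converse(**build_converse_request(model_id, prompt))
    return response["output"]["message"]["content"][0]["text"]
```

&lt;p&gt;During testing you can pass &lt;code&gt;LEGACY_MODEL_ID&lt;/code&gt; explicitly to compare outputs before cutting over.&lt;/p&gt;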
&lt;h3&gt;Testing strategies&lt;/h3&gt; 
&lt;p&gt;Thorough testing is critical for a successful migration:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Side-by-side comparison&lt;/strong&gt; – Run the same requests against both the legacy and new models to compare outputs and identify any differences that might affect your application. For production environments, consider shadow testing—sending duplicate requests to the new model alongside your existing model without affecting end users. With this approach, you can evaluate model performance, latency, error rates, and other operational factors before full migration. Perform A/B testing for user impact assessment by routing a controlled percentage of live traffic to the new model while monitoring key metrics such as user engagement, task completion rates, satisfaction scores, and business KPIs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Performance testing&lt;/strong&gt; – Measure response times, token usage, and other performance metrics to understand how the new model performs compared to the legacy version. Validate business-specific success metrics.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Regression and edge case testing&lt;/strong&gt; – Make sure existing functionality continues to work as expected with the new model. Pay special attention to unusual or complex inputs that might reveal differences in how the models handle challenging scenarios.&lt;/li&gt; 
&lt;/ul&gt; 
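&lt;p&gt;The side-by-side comparison described above can be sketched as a small harness that runs the same prompts against both models and records output and latency. The &lt;code&gt;compare_models&lt;/code&gt; helper below is illustrative; it accepts any invoke callable, so it can wrap a real Bedrock Runtime client or a stub during dry runs.&lt;/p&gt;

```python
# Sketch: run the same prompts against the legacy and candidate models and
# record output plus latency for review. invoke_fn is any callable taking
# (model_id, prompt) and returning text, so a stub works for dry runs.
import time

def compare_models(invoke_fn, prompts, legacy_id, candidate_id):
    """Return one comparison row per prompt."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for label, model_id in (("legacy", legacy_id), ("candidate", candidate_id)):
            start = time.perf_counter()
            output = invoke_fn(model_id, prompt)
            row[label] = {
                "output": output,
                "latency_s": round(time.perf_counter() - start, 3),
            }
        rows.append(row)
    return rows
```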
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;The model lifecycle policy in Amazon Bedrock gives you clear stages for managing FM evolution. Transition periods offer extended access options, and provisions for fine-tuned models help you balance innovation with stability.&lt;/p&gt; 
&lt;p&gt;Stay informed about model states through the AWS Health Dashboard, plan migrations when models enter the &lt;code&gt;Legacy&lt;/code&gt; state, and test newer versions thoroughly. These guidelines can help you maintain continuity in your AI applications while using improved capabilities in newer models.&lt;/p&gt; 
&lt;p&gt;If you have further questions or concerns, reach out to your AWS team. We want to help you and facilitate a smooth transition as you continue to take advantage of the latest advancements in FM technology.&lt;/p&gt; 
&lt;p&gt;For continued learning and implementation support, explore the official &lt;a href="https://docs.aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock documentation&lt;/a&gt; for comprehensive guides and &lt;a href="https://docs.aws.amazon.com/bedrock/latest/APIReference/" target="_blank" rel="noopener noreferrer"&gt;API references&lt;/a&gt;. Additionally, visit the &lt;a href="https://aws.amazon.com/blogs/machine-learning/" target="_blank" rel="noopener noreferrer"&gt;AWS Machine Learning Blog&lt;/a&gt; and AWS Architecture Center for real-world case studies, &lt;a href="https://aws.amazon.com/blogs/machine-learning/migrate-from-anthropics-claude-3-5-sonnet-to-claude-4-sonnet-on-amazon-bedrock/" target="_blank" rel="noopener noreferrer"&gt;migration best practices&lt;/a&gt;, and reference architectures that can help optimize your model lifecycle management strategy.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="size-full wp-image-38198 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/06/15/Saurabh-Trikande.jpg" alt="" width="100" height="118"&gt;Saurabh Trikande&lt;/strong&gt; is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="size-full wp-image-116211 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/09/10/melanie_ml19602.png" alt="Melanie" width="100" height="133"&gt;Melanie Li&lt;/strong&gt;, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="alignleft size-thumbnail wp-image-117485" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/10/03/derrchoo-100x133.jpg" alt="" width="100" height="133"&gt;Derrick Choo&lt;/strong&gt;&amp;nbsp;is a Senior Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="wp-image-117483 size-thumbnail alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/10/03/jldean-100x133.jpg" alt="" width="100" height="133"&gt;Jared Dean &lt;/strong&gt;is a Principal AI/ML Solutions Architect at AWS. Jared works with customers across industries to develop machine learning applications that improve efficiency. He is interested in all things AI, technology, and BBQ.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;img loading="lazy" class="size-full wp-image-105388 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/04/28/jbodea.jpeg" alt="" width="100" height="133"&gt;&lt;strong&gt;Julia Bodia&lt;/strong&gt; is Principal Product Manager for Amazon Bedrock.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;img loading="lazy" class="wp-image-127557 size-full alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/01/pooja-1.jpeg" alt="" width="100" height="133"&gt;&lt;strong&gt;Pooja Rao&lt;/strong&gt; is a Senior Program Manager at AWS, leading quota and capacity management and supporting business development for the Bedrock Go-To-Market team. Outside of work, she enjoys reading, traveling, and spending time with her family.&lt;/p&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>The future of managing agents at scale: AWS Agent Registry now in preview</title>
		<link>https://aws.amazon.com/blogs/machine-learning/the-future-of-managing-agents-at-scale-aws-agent-registry-now-in-preview/</link>
					
		
		<dc:creator><![CDATA[Preethi C N]]></dc:creator>
		<pubDate>Thu, 09 Apr 2026 17:28:20 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<guid isPermaLink="false">3f19bed60e801f394bcf1bb23adf8920f84fd039</guid>

					<description>Today, we're announcing AWS Agent Registry (preview) in AgentCore, a single place to discover, share, and reuse AI agents, tools, and agent skills across your enterprise.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;Now available through Amazon Bedrock AgentCore, use AWS Agent Registry to discover, share, and reuse agents, tools, and agent skills across your organization.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;As enterprises scale to hundreds or thousands of agents, platform teams face three critical challenges: visibility (knowing what agents exist across the organization), control (governing who can publish and what becomes discoverable organization-wide), and reuse (preventing teams from rebuilding capabilities that already exist). Without a centralized system, agent sprawl accelerates, compliance risks grow, and development effort is wasted on duplicate work.&lt;/p&gt; 
&lt;p&gt;These challenges are compounded by reality: no organization’s agent landscape lives entirely within one provider. Agents are built across AWS services, other cloud platforms, and on-premises environments. A registry that only covers part of the stack leaves the rest invisible, and invisible agents can’t be discovered, governed, or reused.&lt;/p&gt; 
&lt;p&gt;Solving this requires more than a place to list what exists. Platform teams need to build agents, publish them with approval workflows, help teams discover and reuse what exists, govern who can publish and consume, monitor what’s running in production, and retire what’s no longer needed. Today, we’re announcing AWS Agent Registry (preview) in AgentCore, a single place to discover, share, and reuse AI agents, tools, and agent skills across your enterprise.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener noreferrer"&gt;AgentCore&lt;/a&gt; is the platform to build, connect, and optimize agents at scale, designed from the ground up for agents: open to any model, any framework, any enterprise architecture. Whether you’re shipping your first agent or your thousandth, you have one platform that scales with you. The registry extends that same flexibility to how you organize and govern what you’ve built. It indexes agents regardless of where they’re built or hosted – on AWS, other cloud providers, or on premises.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;What’s available in preview today&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The registry stores metadata for every agent, tool, MCP server, agent skill, and custom resource as a structured record. It captures who published each record, what protocols it implements, what it exposes, and how to invoke it. The registry supports established standards like MCP and A2A natively, with the flexibility to define custom schemas for your organization. There are two ways to register a record. You can provide metadata manually through the console, AWS SDK, or API, specifying capability descriptions, ownership, compliance status, and usage documentation. Or you can point to an MCP or A2A endpoint, and the registry will automatically pull in the details. Your registry can reflect your full agent landscape from day one, not only the pieces that happen to run on AWS.&lt;/p&gt; 
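&lt;p&gt;To make the record concept concrete, the sketch below assembles a hypothetical registry record with the fields described above (publisher, protocol, endpoint, capabilities, and custom metadata). The field names and the &lt;code&gt;make_registry_record&lt;/code&gt; helper are assumptions for illustration; the actual AWS Agent Registry schema may differ.&lt;/p&gt;

```python
# Hypothetical sketch of a registry record, based on the fields described
# in this post; the real AWS Agent Registry schema may differ.
def make_registry_record(name, publisher, protocol, endpoint,
                         capabilities, custom_metadata=None):
    """Assemble a draft record; records start as drafts before approval."""
    supported = {"MCP", "A2A", "CUSTOM"}
    if protocol not in supported:
        raise ValueError("unsupported protocol: " + protocol)
    return {
        "name": name,
        "publisher": publisher,
        "protocol": protocol,
        "endpoint": endpoint,
        "capabilities": list(capabilities),
        "customMetadata": dict(custom_metadata or {}),
        "status": "DRAFT",  # draft, then pending approval, then approved
        "version": 1,
    }
```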
&lt;p&gt;The registry is accessible through the AgentCore &lt;a href="https://us-east-1.console.aws.amazon.com/bedrock-agentcore/registry?region=us-east-1" target="_blank" rel="noopener"&gt;Console&lt;/a&gt;, APIs, and as an MCP server. Any MCP-compatible client can query it directly, including Kiro and Claude Code. For organizations with custom identity providers, OAuth-based access means that teams can build their own discovery UIs without requiring IAM credentials.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-128059 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/08/ml-20825-image-1-new.png" alt="" width="3024" height="1844"&gt;&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;&lt;em&gt;Finding what already exists&lt;/em&gt;&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Without a central registry, developers search externally for third-party tools or duplicate work that a neighboring team already shipped. You lose visibility into what’s been built, who owns it, and whether it’s approved for use. The registry solves this with hybrid search that combines keyword and semantic matching: all queries use keyword matching, but longer, natural language queries also use semantic understanding to surface conceptually related results. This means a search for “payment processing” surfaces tools tagged as “billing” or “invoicing,” even if they’re named differently. Discovery becomes the path of least resistance. Teams can search by name, description, and resource type to find what already exists before building something new. Developers search the registry first. If a vetted capability exists, they use it. If it doesn’t, they build it, register it, and make it available to everyone else. You can see what exists across your organization.&lt;/p&gt; 
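&lt;p&gt;The routing behind hybrid search can be illustrated with a toy sketch (not the registry’s actual implementation): every query gets keyword matching, while multi-word, natural-language queries also get a semantic pass, faked here with a small synonym table standing in for embedding similarity.&lt;/p&gt;

```python
# Toy illustration of hybrid search routing (not the registry's actual
# implementation). All queries use keyword matching; multi-word queries
# also expand terms via a tiny synonym table that stands in for
# embedding-based semantic similarity.
SYNONYMS = {
    "payment": {"billing", "invoicing"},
    "payments": {"billing", "invoicing"},
}

def hybrid_search(records, query):
    """records: dicts with 'name' and 'description'; returns matching names."""
    words = query.lower().split()
    terms = set(words)
    if len(words) >= 2:  # longer queries get the semantic pass
        for word in words:
            terms.update(SYNONYMS.get(word, set()))
    hits = []
    for record in records:
        text = (record["name"] + " " + record["description"]).lower()
        if any(term in text for term in terms):
            hits.append(record["name"])
    return hits
```

&lt;p&gt;With this routing, a query like “payment processing tools” surfaces a record described as “billing,” mirroring the behavior described above.&lt;/p&gt;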
&lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;For Zuora, an AI-first monetization and revenue management platform deploying 50 agents across Sales, Finance, Product, and Developer teams, the AWS Agent Registry in AgentCore gives Principal Architects a unified view to discover, manage, and catalog every agent, tool, and skill in use. This centralized approach enables teams to find and reuse existing assets rather than rebuilding from scratch. Standardized metadata ensures each agent and tool includes consistent details on ownership and capabilities, giving teams end-to-end visibility and accountability across the entire agent ecosystem.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;– Pete Hirsch, Chief Product and Technology Officer, Zuora&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-128056" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/08/ml-20825-image-2.png" alt="" width="1422" height="784"&gt;&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;&lt;em&gt;Governing what gets published&lt;/em&gt;&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Without governance, anyone can register anything. You lose control over what becomes discoverable, can’t enforce standards, can’t track ownership, and can’t manage agents from development to retirement. When you have a few agents, you can manage them in a spreadsheet. When you have hundreds or thousands, you need a system that enforces standards automatically.&lt;/p&gt; 
&lt;p&gt;The registry gives you control over what gets published and who can access it. Admins use IAM policies to define who can register agents, tools, and agent skills and who can discover them. Every record follows an approval workflow: they start as drafts, move to pending approval, and become discoverable to the broader organization once approved. The registry tracks agents across their entire lifecycle, from initial development through deployment to eventual retirement. Records are versioned to track changes over time, and organizations can deprecate records that are no longer in use. The registry provides hooks to integrate your existing approval workflows. You can add custom metadata to each entry through a record, capturing information like team ownership, compliance status, or deployment environment.&lt;/p&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;Southwest Airlines is enabling an enterprise-wide agent catalog and governance across the enterprise. AWS Agent Registry in AgentCore solves the critical discoverability challenge— enabling teams to find and reuse existing agents instead of rebuilding capabilities from scratch. With managed governance across multiple platforms, every agent carries standardized ownership metadata and policy enforcement. This will prevent agent sprawl across the organization while establishing the foundation for scaling thousands of agents with enterprise-grade governance from day one.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;– Justin Bundick, VP AI and Intelligent Platforms, Southwest Airlines&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-128057" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/08/ml-20825-image-3.png" alt="" width="1342" height="810"&gt;&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Where we’re headed &lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;We’re building toward a future where the registry spans every AWS service where agents are built, including Amazon Quick and Kiro. Agents will be automatically indexed the moment they’re deployed. Developers will search from the IDE, business users will discover agents in their workspace, and admins will govern from the console, all backed by the same source of truth. Cross-registry federation will let you connect multiple registries and search across them as one. You will be able to define categories and taxonomies that match how your organization thinks about agents, backed by structured metadata schemas capturing ownership, compliance status, cost center, and whatever else your governance model requires. Over time, operational intelligence from AgentCore Observability will surface alongside registry records: invocation counts, latency, uptime, and usage patterns, helping you understand not only what exists, but what’s actively working in production.&lt;/p&gt; 
&lt;p&gt;Beyond AWS Agent Registry, we’re building toward connecting with external partner catalogs. We’re excited about early partner interest in centralized discovery and governance across your technology landscape.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Get started&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Today’s preview is the starting line. No more rebuilding what already exists. No more agents deployed without visibility. The AWS Agent Registry gives you one place to discover, govern, and reuse every agent across your enterprise.&lt;/p&gt; 
&lt;p&gt;AWS Agent Registry is available in preview today through AgentCore in five &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agentcore-regions.html" target="_blank" rel="noopener noreferrer"&gt;AWS Regions&lt;/a&gt;: US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Europe (Ireland).&lt;/p&gt; 
&lt;p&gt;Get started with AWS Agent Registry through the AgentCore &lt;a href="https://us-east-1.console.aws.amazon.com/bedrock-agentcore/registry?region=us-east-1" target="_blank" rel="noopener"&gt;Console&lt;/a&gt;. Learn more by reading the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/registry.html" target="_blank" rel="noopener"&gt;documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-128054" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/08/Preethi-Color-Final.jpg" alt="" width="1054" height="1268"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Preethi CN&lt;/h3&gt; 
  &lt;p&gt;Preethi CN is Director of AgentCore in the Agentic AI Organization, with over 20 years of expertise in embedded and cloud software development. In her 14 years at Amazon, she has architected large-scale distributed systems and driven AI innovations across Retail, Alexa, and AWS, delivering breakthroughs in multimodal AI. She led speech recognition for Alexa, Computer Vision services at AWS, and generative AI transformation that revolutionized how organizations extract insights from unstructured content at scale. As a technical advisor to the Agentic AI Organization, she has provided strategic oversight across Amazon Quick, Kiro, and AWS Transform. Most recently, she crafted the vision and led the launch of AgentCore, the platform for building, connecting, and optimizing production-ready AI agents at scale.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Embed a live AI browser agent in your React app with Amazon Bedrock AgentCore</title>
		<link>https://aws.amazon.com/blogs/machine-learning/embed-a-live-ai-browser-agent-in-your-react-app-with-amazon-bedrock-agentcore/</link>
					
		
		<dc:creator><![CDATA[Sundar Raghavan]]></dc:creator>
		<pubDate>Thu, 09 Apr 2026 17:06:07 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Amazon Machine Learning]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<category><![CDATA[AIML]]></category>
		<category><![CDATA[artificial-intelligence]]></category>
		<guid isPermaLink="false">4ebd6f69bbc9b32a4e0686b243d4f64076a626af</guid>

					<description>This post walks you through three steps: starting a session and generating the Live View URL, rendering the stream in your React application, and wiring up an AI agent that drives the browser while your users watch. At the end, you will have a working sample application you can clone and run.</description>
										<content:encoded>&lt;p&gt;When you build AI-powered applications, your users must understand and trust AI agents that navigate websites and interact with web content on their behalf. When an agent interacts with web content autonomously, your users require visibility into those actions to maintain confidence and control, which they don’t currently have.&lt;/p&gt; 
&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-tool.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Browser&lt;/a&gt; &lt;code&gt;BrowserLiveView&lt;/code&gt; component addresses this challenge by providing a real-time video feed of the agent’s browsing session directly within your React application. This component, part of the &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript/tree/main" target="_blank" rel="noopener noreferrer"&gt;Bedrock AgentCore TypeScript SDK&lt;/a&gt;, streamlines the integration by embedding a live browser stream with three lines of JavaScript XML (JSX).&lt;/p&gt; 
&lt;p&gt;The &lt;code&gt;BrowserLiveView&lt;/code&gt; component uses the &lt;a href="https://aws.amazon.com/hpc/dcv/" target="_blank" rel="noopener noreferrer"&gt;Amazon DCV&lt;/a&gt; protocol to render the browser session, creating transparency into agent actions. Implementation requires only a presigned URL from your server, without requiring you to build streaming infrastructure.&lt;/p&gt; 
&lt;p&gt;This post walks you through three steps: starting a session and generating the Live View URL, rendering the stream in your React application, and wiring up an AI agent that drives the browser while your users watch. At the end, you will have a working sample application you can clone and run.&lt;/p&gt; 
&lt;h2&gt;Why embed Live View in your application&lt;/h2&gt; 
&lt;p&gt;Embedding Live View inside your own application unlocks additional value for your users at scale.&lt;/p&gt; 
&lt;p&gt;With an embedded Live View, your users follow every navigation, form submission, and search query as the agent performs it. They get immediate visual confirmation that the agent is on the right page, interacting with the correct elements, and progressing through the workflow. This real-time feedback loop gives end users direct insight into agent behavior without waiting for the final result.&lt;/p&gt; 
&lt;p&gt;Users who delegate browsing tasks to an AI agent are more confident when they can observe the work. Watching the agent fill in a form field by field is more reassuring than receiving a text confirmation. For regulated workflows, visual evidence of agent actions can support audit requirements.&lt;/p&gt; 
&lt;p&gt;In workflows that require human supervision, like handling customer accounts and processing sensitive data, a supervisor can use the embedded Live View to watch the agent in real time and intervene if needed, without leaving your application.&lt;/p&gt; 
&lt;p&gt;Organizations also gain audit trail support through visual evidence of agent actions, which proves valuable for compliance requirements and troubleshooting scenarios. Combined with session recordings to Amazon Simple Storage Service (Amazon S3) and console-based session replay, you get both real-time observation and post-hoc review.&lt;/p&gt; 
&lt;h2&gt;How it works&lt;/h2&gt; 
&lt;p&gt;The integration has three components.&lt;/p&gt; 
&lt;p&gt;The user’s web browser runs a React application containing the &lt;code&gt;BrowserLiveView&lt;/code&gt; component, which receives a SigV4-presigned URL and establishes a persistent WebSocket connection to receive the DCV video stream from a remote browser session. The React application handles video rendering and user interface presentation while maintaining the WebSocket connection for continuous streaming.&lt;/p&gt; 
&lt;p&gt;The application server functions as an AI agent within the Amazon Bedrock session lifecycle, orchestrating the connection between client browsers and cloud-hosted browser sessions. It starts sessions using the Amazon Bedrock AgentCore API and generates SigV4-presigned URLs that grant secure, time-limited access to the Live View stream. This layer handles session management, authentication, and stream distribution.&lt;/p&gt; 
&lt;p&gt;AWS Cloud hosts Amazon Bedrock AgentCore Browser and Amazon Bedrock services that provide the underlying browser automation and streaming capabilities. Amazon Bedrock AgentCore hosts the isolated cloud browser sessions within AWS Cloud and provides both the automation endpoint (Playwright CDP) and the Live View streaming endpoint (DCV).&lt;/p&gt; 
&lt;p&gt;The key efficiency advantage with this architecture is that the DCV Live View stream flows directly from Amazon Bedrock AgentCore to the user’s browser. It doesn’t pass through your application server. Your server generates the URL and runs the agent, but the video stream is a direct WebSocket connection from AWS to the client. This helps minimize latency and reduce infrastructure requirements.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/architecture_liveview.png"&gt;&lt;img loading="lazy" class="alignnone size-large wp-image-127767" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/architecture_liveview-1024x566.png" alt="" width="1024" height="566"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Figure 1: &lt;/strong&gt;Solution architecture showing the data flow between three components. The numbered arrows in the diagram represent the following data flows:&lt;br&gt; &lt;strong&gt;Arrow 1 (gray, solid): &lt;/strong&gt;The client sends prompts and polls status from the Application Server using REST.&lt;br&gt; &lt;strong&gt;Arrow 2 (orange, solid): &lt;/strong&gt;The Application Server calls the Amazon Bedrock Converse API for AI model reasoning.&lt;br&gt; &lt;strong&gt;Arrow 3 (blue, solid): &lt;/strong&gt;The Application Server runs browser tools against Amazon Bedrock AgentCore Browser using &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-quickstart-playwright.html" target="_blank" rel="noopener noreferrer"&gt;Playwright Chrome DevTools Protocol&lt;/a&gt; (CDP).&lt;br&gt; &lt;strong&gt;Arrow 4 (red, dashed): &lt;/strong&gt;The DCV Live View stream flows directly from Amazon Bedrock AgentCore to the User Browser, bypassing the Application Server.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you begin, verify that you have the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://nodejs.org/" target="_blank" rel="noopener noreferrer"&gt;Node.js 20 or later&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;An AWS account &lt;/strong&gt;in a &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agentcore-regions.html" target="_blank" rel="noopener noreferrer"&gt;supported AWS Region&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;AWS credentials with &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-quickstart.html#browser-prerequisites" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Browser permissions&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;Access to an AI model to drive the agent (this post uses the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-call.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Converse API&lt;/a&gt; with Anthropic Claude, but Live View is model-agnostic and you can use a model provider or agent framework of your choice)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;strong&gt;Important: &lt;/strong&gt;Live View (Steps 1 and 2) requires only &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-quickstart.html#browser-prerequisites" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore permissions&lt;/a&gt;. It does not depend on Amazon Bedrock or any specific AI model. The AI agent in Step 3 uses the Amazon Bedrock Converse API, which requires additional Amazon Bedrock permissions, but this is specific to our sample. You can substitute a model provider or agent framework of your choice. Use temporary credentials from &lt;a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html" target="_blank" rel="noopener noreferrer"&gt;AWS IAM Identity Center&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html" target="_blank" rel="noopener noreferrer"&gt;AWS Security Token Service (AWS STS)&lt;/a&gt;. Do not use long-lived access keys. Follow the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html" target="_blank" rel="noopener noreferrer"&gt;principle of least privilege&lt;/a&gt; when configuring &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt; permissions.&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;Install the &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore TypeScript SDK&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;npm install bedrock-agentcore&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For the AI agent in Step 3, you also need the AWS SDK for JavaScript:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;npm install @aws-sdk/client-bedrock-runtime&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The code in this post runs in two environments: the server-side code (Steps 1 and 3) runs in Node.js, and the client-side code (Step 2) runs in a React application bundled with Vite. The sample application at the end of this post packages everything together.&lt;/p&gt; 
&lt;h2&gt;Step-by-step implementation&lt;/h2&gt; 
&lt;h3&gt;1: Start a browser session and generate the Live View URL&lt;/h3&gt; 
&lt;p&gt;On your application server, use the &lt;code&gt;Browser&lt;/code&gt; class to start a session and generate the presigned URL. The API returns a session identifier and a streaming URL, which the server converts into a presigned URL with a defined expiration time (300 seconds by default). The presigned URL carries SigV4 credentials in its query parameters, so no secrets reach the browser. Pass this URL to your frontend through an API endpoint.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;import { Browser } from 'bedrock-agentcore/browser'

const browser = new Browser({
  region: 'us-west-2'
})
await browser.startSession({
  viewport: { width: 1920, height: 1080 }
})

const signedUrl =
  await browser.generateLiveViewUrl()
// Send signedUrl to your frontend via API&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
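&lt;p&gt;Because the presigned URL expires (300 seconds by default), it can help to check validity before handing it to the frontend. The following sketch is not part of the AgentCore SDK; it assumes only the standard SigV4 query parameters (&lt;code&gt;X-Amz-Date&lt;/code&gt;, &lt;code&gt;X-Amz-Expires&lt;/code&gt;) that presigning emits:&lt;/p&gt;

```typescript
// Sketch (not SDK code): check whether a SigV4-presigned URL is still
// within its validity window. Relies on the standard X-Amz-Date
// (YYYYMMDDTHHMMSSZ) and X-Amz-Expires (seconds) query parameters.
function isPresignedUrlValid(url: string, now: Date = new Date()): boolean {
  const params = new URL(url).searchParams
  const amzDate = params.get('X-Amz-Date')         // e.g. 20260409T173328Z
  const expiresParam = params.get('X-Amz-Expires') // e.g. "300"
  if (!amzDate || !expiresParam) return false
  // Parse the compact ISO 8601 timestamp into a UTC epoch value
  const m = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/.exec(amzDate)
  if (!m) return false
  const signedAt = Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6])
  return now.getTime() < signedAt + Number(expiresParam) * 1000
}
```

&lt;p&gt;If the check fails, generate a fresh URL with &lt;code&gt;generateLiveViewUrl()&lt;/code&gt; rather than passing a stale one to the component.&lt;/p&gt;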
&lt;h3&gt;2: Render the BrowserLiveView component in your React app&lt;/h3&gt; 
&lt;p&gt;In your React application, import the &lt;code&gt;BrowserLiveView&lt;/code&gt; component from the Bedrock AgentCore TypeScript SDK and render it with the presigned URL. The component handles the WebSocket connection, DCV protocol negotiation, video stream decoding, and frame rendering. It automatically scales to fit its parent container while preserving the aspect ratio. The &lt;code&gt;remoteWidth&lt;/code&gt; and &lt;code&gt;remoteHeight&lt;/code&gt; props must match the viewport that you set in Step 1; mismatched values cause cropping or black bars.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;import { BrowserLiveView }
  from 'bedrock-agentcore/browser/live-view'

&amp;lt;BrowserLiveView
  signedUrl={presignedUrl}
  remoteWidth={1920}
  remoteHeight={1080}
/&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;After adding this component, the Live View begins streaming as soon as the presigned URL is valid and the browser session is active. You should see the remote browser’s desktop appear within the component’s container. If the container remains empty, verify that the presigned URL hasn’t expired and that the browser session is still running.&lt;/p&gt; 
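&lt;p&gt;The cropping and black-bar behavior follows from simple aspect-ratio math. The following sketch (not SDK code) shows how a fixed remote resolution is fitted into a container while preserving the aspect ratio; if the declared remote size doesn’t match the real viewport, the scale factor is computed against the wrong ratio:&lt;/p&gt;

```typescript
// Sketch (not part of the SDK): fit a remote viewport into a container
// while preserving aspect ratio. When remoteWidth/remoteHeight don't
// match the actual session viewport, the scale factor is derived from
// the wrong ratio, which produces cropping or black bars.
interface Size { width: number; height: number }

function fitStream(remote: Size, container: Size): Size {
  // Scale by the tighter dimension so the whole frame stays visible
  const scale = Math.min(
    container.width / remote.width,
    container.height / remote.height,
  )
  return {
    width: Math.round(remote.width * scale),
    height: Math.round(remote.height * scale),
  }
}
```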
&lt;h3&gt;3: Connect an AI agent to drive browser actions&lt;/h3&gt; 
&lt;p&gt;With the Live View streaming, you need something interesting to watch. The following example uses the Amazon Bedrock Converse API, but Live View is model agnostic. You can use an AI model or agent framework of your choice to drive the browser.&lt;/p&gt; 
&lt;p&gt;The code creates a &lt;code&gt;PlaywrightBrowser&lt;/code&gt; client, which starts a new AgentCore Browser session and connects to it using the Playwright Chrome DevTools protocol. This is the same type of cloud browser session as Step 1 but accessed through the Playwright automation interface rather than the Live View interface.&lt;/p&gt; 
&lt;p&gt;The model decides which browser tools to call, including &lt;code&gt;navigate&lt;/code&gt;, &lt;code&gt;click&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;getText&lt;/code&gt;, &lt;code&gt;getHtml&lt;/code&gt;, and &lt;code&gt;pressKey&lt;/code&gt;. Your server runs these tools and feeds the results back to the model for the next iteration.&lt;/p&gt; 
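&lt;p&gt;The Converse API expects each tool in the &lt;code&gt;toolSpec&lt;/code&gt; format with a JSON Schema input definition. A minimal, illustrative definition for the &lt;code&gt;navigate&lt;/code&gt; tool might look like the following (the exact schemas in the sample application may differ):&lt;/p&gt;

```typescript
// Illustrative toolConfig in the Converse toolSpec format; the tool
// names and schemas here are modeled on the sample application and are
// assumptions, not copied from it.
const browserTools = {
  tools: [
    {
      toolSpec: {
        name: 'navigate',
        description: 'Load a URL in the cloud browser session',
        inputSchema: {
          json: {
            type: 'object',
            properties: {
              url: { type: 'string', description: 'Absolute URL to open' },
            },
            required: ['url'],
          },
        },
      },
    },
    // ...click, type, getText, getHtml, and pressKey follow the same shape
  ],
}
```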
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;import { BedrockRuntimeClient, ConverseCommand }
  from '@aws-sdk/client-bedrock-runtime'
import { PlaywrightBrowser }
  from 'bedrock-agentcore/browser/playwright'

const browser = new PlaywrightBrowser({
  region: 'us-west-2'
})
await browser.startSession()

// Define browser tools as JSON Schema
// (navigate, click, type, getText, and more)

while (step &amp;lt; maxSteps) {
  const response = await bedrockClient.send(
    new ConverseCommand({
      modelId: modelId,
      system: [{ text: systemPrompt }],
      messages,
      toolConfig: browserTools,
    })
  )

  if (response.stopReason === 'tool_use') {
    // Run browser tool, add result
    // to conversation, continue loop
  } else {
    break // Final answer from model
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The model is configurable. You can use Anthropic Claude, Amazon Nova, or any other Amazon Bedrock model that supports tool use. Every tool call that the model makes is visible to your user through the Live View. They see the browser navigate, the search box fill in, and the results page load.&lt;/p&gt; 
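&lt;p&gt;The control flow of this loop can be exercised without AWS access by stubbing the Converse client. The following self-contained sketch uses simplified stand-in types, not the SDK’s, to show the &lt;code&gt;stopReason&lt;/code&gt;-driven cycle:&lt;/p&gt;

```typescript
// Self-contained sketch of the tool-use loop with a stubbed client.
// Types and responses are simplified stand-ins, not SDK types.
type StubResponse = {
  stopReason: 'tool_use' | 'end_turn'
  toolUse?: { name: string; input: Record<string, unknown> }
}

// Stub model: requests one navigation, then returns a final answer.
function stubConverse(turn: number): StubResponse {
  return turn === 0
    ? {
        stopReason: 'tool_use',
        toolUse: { name: 'navigate', input: { url: 'https://en.wikipedia.org' } },
      }
    : { stopReason: 'end_turn' }
}

function runAgentLoop(maxSteps = 5): string[] {
  const log: string[] = []
  for (let step = 0; step < maxSteps; step++) {
    const response = stubConverse(step)
    if (response.stopReason === 'tool_use' && response.toolUse) {
      // In the real loop: run the browser tool via PlaywrightBrowser,
      // append the tool result to the message history, and call Converse again.
      log.push(`tool:${response.toolUse.name}`)
    } else {
      log.push('final-answer')
      break // model returned its final answer
    }
  }
  return log
}
```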
&lt;p&gt;&lt;strong&gt;Note: &lt;/strong&gt;The TypeScript SDK also includes a &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript/tree/main/src/tools/browser/integrations/vercel-ai" target="_blank" rel="noopener noreferrer"&gt;Vercel AI SDK integration&lt;/a&gt; (&lt;code&gt;BrowserTools&lt;/code&gt;) that wraps these browser operations as framework-native tools.&lt;/p&gt; 
&lt;h2&gt;Try it using the sample application&lt;/h2&gt; 
&lt;p&gt;We built a complete sample application on GitHub that puts Steps 1–3 together. The sample includes a React dashboard with the embedded Live View, an activity log showing agent reasoning and actions, and a Fastify server running the AI agent. The agent navigates to Wikipedia, searches for a topic, reads the page content, and summarizes what it finds while you watch every step. You can download it from the &lt;a href="https://github.com/awslabs/bedrock-agentcore-samples-typescript/tree/main/use-cases/browser-live-view-agent" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub repository&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/browser_live_view.gif"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127779" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/browser_live_view.gif" alt="" width="960" height="757"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Figure 2: &lt;/strong&gt;The sample application mid-run. The left panel shows the &lt;code&gt;BrowserLiveView&lt;/code&gt; component streaming a Wikipedia page that the agent has navigated to. The right panel shows the activity log with timestamped tool calls (&lt;code&gt;navigate&lt;/code&gt;, &lt;code&gt;getText&lt;/code&gt;, &lt;code&gt;click&lt;/code&gt;). At the bottom, the prompt input field and Launch Agent button are visible.&lt;/p&gt; 
&lt;h3 id="_To_clone_and_run"&gt;To clone and run the sample application&lt;/h3&gt; 
&lt;p&gt;Complete the following steps to clone and run the sample application.&lt;/p&gt; 
&lt;ol start="1"&gt; 
 &lt;li&gt;Clone the repository and navigate to the sample folder.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;git clone https://github.com/awslabs/bedrock-agentcore-samples-typescript.git
cd bedrock-agentcore-samples-typescript
cd use-cases/browser-live-view-agent&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Install the dependencies.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;npm install&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Export your AWS credentials.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;export AWS_ACCESS_KEY_ID=&amp;lt;your-access-key&amp;gt;
export AWS_SECRET_ACCESS_KEY=&amp;lt;your-secret-key&amp;gt;
export AWS_SESSION_TOKEN=&amp;lt;your-session-token&amp;gt;
export AWS_REGION=us-west-2&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;strong&gt;Important: &lt;/strong&gt;Use temporary credentials. Do not commit credentials to source control.&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Start the application. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-bash"&gt;npm run dev&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Open &lt;code&gt;http://localhost:5173&lt;/code&gt;, enter a prompt, and choose &lt;strong&gt;Launch Agent&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Bundler configuration&lt;/h2&gt; 
&lt;p&gt;The &lt;code&gt;BrowserLiveView&lt;/code&gt; component uses the &lt;a href="https://docs.aws.amazon.com/dcv/latest/websdkguide/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon DCV Web Client SDK&lt;/a&gt;, which ships vendored files inside the &lt;code&gt;bedrock-agentcore&lt;/code&gt; npm package. You don’t need to download or install DCV separately. Your Vite configuration needs three additions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;resolve.alias&lt;/code&gt; points the &lt;code&gt;dcv&lt;/code&gt; and &lt;code&gt;dcv-ui&lt;/code&gt; bare specifiers to the vendored SDK files.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;resolve.dedupe&lt;/code&gt; ensures that React and shared dependencies resolve from your &lt;code&gt;node_modules&lt;/code&gt;, not from the vendored path.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;viteStaticCopy&lt;/code&gt; copies DCV runtime files (workers, WASM decoders) to your build output.&lt;/li&gt; 
&lt;/ul&gt; 
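&lt;p&gt;Taken together, the three additions might look like the following sketch. The vendored paths are placeholders, not the real package layout; copy the exact values from the sample’s &lt;code&gt;vite.config.ts&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch of the three Vite additions. The <...> path segments are
// placeholders; take the real values from the sample's vite.config.ts.
import path from 'node:path'
import { defineConfig } from 'vite'
import { viteStaticCopy } from 'vite-plugin-static-copy'

export default defineConfig({
  resolve: {
    // Point the bare `dcv` / `dcv-ui` specifiers at the vendored SDK files
    alias: {
      dcv: path.resolve('node_modules/bedrock-agentcore/<vendored-dcv-path>'),
      'dcv-ui': path.resolve('node_modules/bedrock-agentcore/<vendored-dcv-ui-path>'),
    },
    // Keep React and shared dependencies resolving from your node_modules
    dedupe: ['react', 'react-dom'],
  },
  plugins: [
    // Copy DCV runtime files (workers, WASM decoders) to the build output
    viteStaticCopy({
      targets: [
        { src: 'node_modules/bedrock-agentcore/<dcv-runtime-files>', dest: 'dcv' },
      ],
    }),
  ],
})
```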
&lt;p&gt;The sample application’s &lt;code&gt;vite.config.ts&lt;/code&gt; has the complete configuration ready to use. For more details on the &lt;code&gt;BrowserLiveView&lt;/code&gt; component, see the &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript/tree/main/src/tools/browser/live-view" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;live-view source directory&lt;/strong&gt;&lt;/a&gt; in the TypeScript SDK.&lt;/p&gt; 
&lt;h2&gt;Clean up resources&lt;/h2&gt; 
&lt;p&gt;To avoid incurring charges, stop the browser session and shut down the application when you’re done:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;In the application UI, choose &lt;strong&gt;Stop Session&lt;/strong&gt; to end the Amazon Bedrock AgentCore Browser session.&lt;/li&gt; 
 &lt;li&gt;In your terminal, press Ctrl+C to stop the development servers.&lt;/li&gt; 
 &lt;li&gt;If you created any IAM roles or policies specifically for this demo, &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_delete.html" target="_blank" rel="noopener noreferrer"&gt;delete them from the IAM console&lt;/a&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Amazon Bedrock AgentCore Browser sessions incur charges while active. For pricing details, refer to the &lt;a href="https://aws.amazon.com/bedrock/agentcore/pricing/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore pricing page&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Next steps&lt;/h2&gt; 
&lt;p&gt;Now that you have a working Live View integration, here are some things to explore.&lt;/p&gt; 
&lt;p&gt;To get started, clone the &lt;a href="https://github.com/awslabs/bedrock-agentcore-samples-typescript/tree/main/use-cases/browser-live-view-agent" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;sample application&lt;/strong&gt;&lt;/a&gt;, fill in your AWS credentials, and run &lt;code&gt;npm run dev&lt;/code&gt; to see the full demo in action. For instructions, refer to the &lt;a href="#_To_clone_and_run"&gt;&lt;strong&gt;To clone and run the sample application&lt;/strong&gt;&lt;/a&gt; section in this post.&lt;/p&gt; 
&lt;p&gt;The sample application defaults to Anthropic Claude, but you can switch to Amazon Nova or another Amazon Bedrock model that supports tool use by setting the &lt;code&gt;BEDROCK_MODEL_ID&lt;/code&gt; environment variable. For a list of available models and their tool use capabilities, refer to the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-supported-models-features.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Bedrock model documentation&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;The React dashboard in the sample application is a starting point for your own implementation. You can adapt the layout to match your design system, integrate the Live View into an existing application, or add controls that let users intervene mid-workflow. For guidance on building React applications with the AgentCore SDK, refer to the &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript/blob/main/docs/BROWSER_LIVE_VIEW.md" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Bedrock AgentCore TypeScript SDK documentation&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;The &lt;code&gt;BrowserLiveView&lt;/code&gt; component supports multiple instances on the same page, each streaming a different browser session. This capability is useful for monitoring dashboards. The component’s source code, including scaling logic and DCV authentication flow, is available in the &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript/tree/main/src/tools/browser/live-view" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;live-view source directory&lt;/strong&gt;&lt;/a&gt; in the TypeScript SDK.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, you learned how to use the &lt;code&gt;BrowserLiveView&lt;/code&gt; component to embed a Live View of an Amazon Bedrock AgentCore Browser session into your React application. The three-step implementation, combined with an architecture that streams video directly from AWS to client browsers, makes live agent visualization accessible without specialized streaming expertise.&lt;/p&gt; 
&lt;p&gt;For a deeper look at Amazon Bedrock AgentCore Browser capabilities, refer to the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-tool.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Bedrock AgentCore Browser documentation&lt;/strong&gt;&lt;/a&gt;. If you have feedback or questions, open an issue in the &lt;a href="https://github.com/aws/bedrock-agentcore-sdk-typescript/issues" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;GitHub repository&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;strong&gt;Important: &lt;/strong&gt;This sample application is intended for local development and demonstration. For production use, add authentication to your API endpoints, enable HTTPS, restrict CORS origins, implement rate limiting, and follow the AWS Well-Architected Framework security pillar.&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-90341" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2024/10/23/sundar.jpeg" alt="" width="82" height="123"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;a href="https://www.linkedin.com/in/sundar-raghavan-4838a526/" target="_blank" rel="noopener"&gt;Sundar Raghavan&lt;/a&gt;&lt;/h3&gt; 
  &lt;p&gt;Sundar Raghavan is a Senior Solutions Architect at AWS on the Agentic AI Foundation team. He shaped the developer experience for Amazon Bedrock AgentCore, contributing to the SDK, CLI, and starter toolkit, and now focuses on integrations with AI agent frameworks. Previously, Sundar worked as a Generative AI Specialist, helping customers design AI applications on Amazon Bedrock. In his free time, he loves exploring new places, sampling local eateries, and embracing the great outdoors.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-127746" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/radheshyam-200x300.jpeg" alt="" width="81" height="122"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;a href="https://www.linkedin.com/in/radhe-shyam-75626833" target="_blank" rel="noopener"&gt;Radhe Shyam&lt;/a&gt;&lt;/h3&gt; 
  &lt;p&gt;Radhe Shyam is a Senior Front End Engineer on the Agentic AI Foundation team at AWS, where he builds the user experiences for Amazon Bedrock AgentCore, including browser session replay and live view tooling for agentic workflows. With nearly seven years at Amazon spanning domains from Amazon SageMaker Canvas to Prime Video, he is passionate about building performant, accessible front-end systems that bring complex AI and ML capabilities to a broader audience of builders.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-127770" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/SauravDas-683x1024.jpeg" alt="" width="81" height="121"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;a href="https://www.linkedin.com/in/the-saurav-das/" target="_blank" rel="noopener"&gt;Saurav Das&lt;/a&gt;&lt;/h3&gt; 
  &lt;p&gt;Saurav Das is part of the Amazon Bedrock AgentCore Product Management team. He has more than 15 years of experience working with cloud, data, and infrastructure technologies. He has a deep interest in solving customer challenges centered around data and AI infrastructure.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Introducing stateful MCP client capabilities on Amazon Bedrock AgentCore Runtime</title>
		<link>https://aws.amazon.com/blogs/machine-learning/introducing-stateful-mcp-client-capabilities-on-amazon-bedrock-agentcore-runtime/</link>
					
		
		<dc:creator><![CDATA[Evandro Franco]]></dc:creator>
		<pubDate>Thu, 09 Apr 2026 14:47:57 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">f3d7c75c035fc0f9ece6f077ddf091fd61c25664</guid>

					<description>In this post, you will learn how to build stateful MCP servers that request user input during execution, invoke LLM sampling for dynamic content generation, and stream progress updates for long-running tasks. You will see code examples for each capability and deploy a working stateful MCP server to Amazon Bedrock AgentCore Runtime.</description>
										<content:encoded>&lt;p&gt;Stateful MCP client capabilities on &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/mcp-stateful-features.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Runtime&lt;/a&gt; now enable interactive, multi-turn agent workflows that were previously impossible with stateless implementations. Developers building AI agents often struggle when their workflows must pause mid-execution to ask users for clarification, request large language model (LLM)-generated content, or provide real-time progress updates during long-running operations, stateless MCP servers can’t handle these scenarios. This solves these limitations by introducing three client capabilities from the MCP specification:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Elicitation (request user input mid-execution)&lt;/li&gt; 
 &lt;li&gt;Sampling (request LLM-generated content from the client)&lt;/li&gt; 
 &lt;li&gt;Progress notification (stream real-time updates)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These capabilities transform one-way tool execution into bidirectional conversations between your MCP server and clients.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" target="_blank" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; is an open standard defining how LLM applications connect with external tools and data sources. The specification defines server capabilities (tools, prompts, and resources that servers expose) and client capabilities (features clients offer back to servers). While our previous release focused on hosting stateless MCP servers on AgentCore Runtime, this new capability completes the bidirectional protocol implementation. Clients connecting to AgentCore-hosted MCP servers can now respond to server-initiated requests. In this post, you will learn how to build stateful MCP servers that request user input during execution, invoke LLM sampling for dynamic content generation, and stream progress updates for long-running tasks. You will see code examples for each capability and deploy a working stateful MCP server to Amazon Bedrock AgentCore Runtime.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;From stateless to stateful MCP&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The original MCP server support on AgentCore used stateless mode: each incoming HTTP request was independent, with no shared context between calls. This model is straightforward to deploy and reason about, and it works well for tool servers that receive inputs and return outputs. However, it has a fundamental constraint. The server can’t maintain a conversation thread across requests, ask the user for clarification in the middle of a tool call, or report progress back to the client as work happens.&lt;/p&gt; 
&lt;p&gt;Stateful mode removes that constraint. When you run your MCP server with &lt;code&gt;stateless_http=False&lt;/code&gt;, AgentCore Runtime provisions a dedicated microVM for each user session. The microVM persists for the session’s lifetime (up to 8 hours, or 15 minutes of inactivity, per the &lt;code&gt;idleRuntimeSessionTimeout&lt;/code&gt;&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-lifecycle-settings.html#configuration-attributes" target="_blank" rel="noopener noreferrer"&gt;setting&lt;/a&gt;), with CPU, memory, and filesystem isolation between sessions. The protocol maintains continuity through a &lt;code&gt;Mcp-Session-Id&lt;/code&gt; header: the server returns this identifier during the initialize handshake, and the client includes it in every subsequent request to route back to the same session.&lt;/p&gt; 
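&lt;p&gt;To make the session-routing contract concrete, the following minimal sketch models it in plain Python. This is an illustration of the protocol behavior only, not AgentCore’s actual implementation; the class and method names are made up:&lt;/p&gt; 

```python
import uuid

class SessionRouter:
    """Toy model of Mcp-Session-Id routing, for illustration only."""

    def __init__(self):
        # session_id maps to that session's private state
        # (a dedicated microVM, in the real service).
        self._sessions = {}

    def initialize(self):
        # During the initialize handshake the server mints a session ID
        # and returns it in the Mcp-Session-Id response header.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {}
        return session_id

    def route(self, session_id):
        # Every subsequent request carries the header; unknown or expired
        # IDs get a 404, telling the client to re-initialize.
        state = self._sessions.get(session_id)
        if state is None:
            return 404, None
        return 200, state

    def expire(self, session_id):
        # Stands in for the 8-hour cap or the 15-minute idle timeout.
        self._sessions.pop(session_id, None)
```

&lt;p&gt;A client that receives the 404 simply runs the initialize handshake again to obtain a fresh session ID and a fresh session.&lt;/p&gt; 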
&lt;p&gt;The following table summarizes the key differences:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Stateless mode&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Stateful mode&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;stateless_http setting&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;TRUE&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;FALSE&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Session isolation&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Dedicated microVM per session&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Dedicated microVM per session&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Session lifetime&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Up to 8 hours; 15-min idle timeout&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Up to 8 hours; 15-min idle timeout&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Client capabilities&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Not supported&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Elicitation, sampling, progress notifications&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Simple tool serving&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Interactive, multi-turn workflows&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;When a session expires or the server is restarted, subsequent requests with the old session ID return a 404. At that point, clients must re-initialize the connection to obtain a new session ID and start a fresh session. The configuration change to enable stateful mode is a single flag in your server startup:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;mcp.run(
    transport="streamable-http",
    host="0.0.0.0",
    port=8000,
    stateless_http=False  # Enable stateful mode
)&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Beyond this flag, the three client capabilities become available automatically once the MCP client declares support for them during the initialization handshake.&lt;/p&gt; 
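&lt;p&gt;For reference, that declaration happens in the client’s &lt;code&gt;initialize&lt;/code&gt; request. The following abbreviated sketch shows the shape of a request from a client that supports elicitation and sampling; the field values are illustrative, so consult the MCP specification for the authoritative schema:&lt;/p&gt; 

```python
# Abbreviated sketch of an MCP initialize request; values are illustrative.
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-11-25",
        # An empty object means "supported"; omitting a key means the
        # server must not issue requests that rely on that capability.
        "capabilities": {
            "elicitation": {},
            "sampling": {},
        },
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}
```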
&lt;h2&gt;&lt;strong&gt;The three new client capabilities&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Stateful mode brings three client capabilities from the MCP specification. Each addresses a different interaction pattern that agents encounter in production workflows.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Elicitation&lt;/strong&gt;&amp;nbsp;allows a server to pause execution and request structured input from the user through the client. The tool can ask targeted questions at the right moment in its workflow, gathering a preference, confirming a decision, or collecting a value that depends on earlier results. The server sends an&amp;nbsp;&lt;code&gt;elicitation/create&lt;/code&gt;&amp;nbsp;request with a message and an optional JSON schema describing the expected response structure. The client renders an appropriate input interface, and the user can accept (providing the data), decline, or cancel.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Sampling&lt;/strong&gt; allows a server to request an LLM-generated completion from the client through &lt;code&gt;sampling/createMessage&lt;/code&gt;. This is the mechanism that makes it possible for tool logic on the server to use language model capabilities without holding its own model credentials. The server provides a prompt and optional model preferences; the client forwards the request to its connected LLM and returns the generated response. Practical uses include generating personalized summaries, creating natural-language explanations of structured data, or producing recommendations based on earlier conversation context.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Progress notifications&lt;/strong&gt;&amp;nbsp;allow a server to report incremental progress during long-running operations. Using&amp;nbsp;&lt;code&gt;ctx.report_progress(progress, total)&lt;/code&gt;, the server emits updates that clients can display as a progress bar or status indicator. For operations that span multiple steps, for example, searching across data sources, this keeps users informed rather than watching a blank screen.&lt;/p&gt; 
&lt;p&gt;All three capabilities are opt-in at the client level: a client declares which capabilities it supports during initialization, and the server must only use capabilities the client has advertised.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Elicitation: server-initiated user input&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Elicitation is the mechanism by which an MCP server pauses mid-execution and asks the client to collect specific information from the user. The server sends an&amp;nbsp;&lt;code&gt;elicitation/create&lt;/code&gt;&amp;nbsp;JSON-RPC request containing a human-readable message and a&amp;nbsp;&lt;code&gt;requestedSchema&lt;/code&gt;&amp;nbsp;that describes the expected response. The client presents this as a form or prompt, and the user’s response (or explicit decline) is returned to the server so execution can continue. The MCP specification supports two elicitation modes:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Form mode&lt;/strong&gt;: structured data collection directly through the MCP client. Suitable for preferences, configuration inputs, and confirmations that don’t involve sensitive data.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;URL mode&lt;/strong&gt;: directs the user to an external URL for interactions that must not pass through the MCP client, such as OAuth flows, payment processing, or credential entry.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The response uses a three-action model: &lt;code&gt;accept&lt;/code&gt; (user provided data), &lt;code&gt;decline&lt;/code&gt; (user explicitly rejected the request), or &lt;code&gt;cancel&lt;/code&gt; (user dismissed without choosing). Servers should handle each case appropriately.&lt;/p&gt; 
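&lt;p&gt;Before looking at the full tool, it helps to see the three-action model in isolation. The following sketch uses simplified stand-in types (FastMCP’s real result classes differ in detail) to show one reasonable handling policy, in particular that &lt;code&gt;decline&lt;/code&gt; is a final answer while &lt;code&gt;cancel&lt;/code&gt; can be retried later:&lt;/p&gt; 

```python
from dataclasses import dataclass, field

# Simplified stand-ins for MCP elicitation results; FastMCP's real types
# (AcceptedElicitation and friends) differ in detail.
@dataclass
class ElicitResult:
    action: str                  # "accept", "decline", or "cancel"
    data: dict = field(default_factory=dict)

def handle(result):
    """One reasonable policy for the three response actions."""
    if result.action == "accept":
        return f"proceeding with {result.data}"
    if result.action == "decline":
        # An explicit "no": respect it and don't re-prompt.
        return "aborted: user declined"
    # "cancel": dismissed without deciding; safe to offer again later.
    return "paused: user cancelled"
```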
&lt;p&gt;&lt;strong&gt;Server&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;add_expense_interactive&lt;/code&gt;&amp;nbsp;tool walks a user through four sequential questions before writing to Amazon DynamoDB. Each step defines its expected input as a separate Pydantic model, because the form mode schema must be a flat object. You could collect all four fields in a single model with four properties, but splitting them here gives the user one focused question at a time, which is the interactive pattern elicitation is designed for.&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;&lt;strong&gt;agents/mcp_client_features.py&lt;/strong&gt;&lt;/code&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import os
from pydantic import BaseModel
from fastmcp import FastMCP, Context
from fastmcp.server.elicitation import AcceptedElicitation
from dynamo_utils import FinanceDB

mcp = FastMCP(name='ElicitationMCP')

_region = os.environ.get('AWS_REGION') or os.environ.get('AWS_DEFAULT_REGION') or 'us-east-1'
db = FinanceDB(region_name=_region)

class AmountInput(BaseModel):
    amount: float

class DescriptionInput(BaseModel):
    description: str

class CategoryInput(BaseModel):
    category: str  # one of: food, transport, bills, entertainment, other

class ConfirmInput(BaseModel):
    confirm: str  # Yes or No

@mcp.tool()
async def add_expense_interactive(user_alias: str, ctx: Context) -&amp;gt; str:
    """Interactively add a new expense using elicitation.

    Args:
        user_alias: User identifier
    """
    # Step 1: Ask for the amount
    result = await ctx.elicit('How much did you spend?', AmountInput)
    if not isinstance(result, AcceptedElicitation):
        return 'Expense entry cancelled.'
    amount = result.data.amount

    # Step 2: Ask for a description
    result = await ctx.elicit('What was it for?', DescriptionInput)
    if not isinstance(result, AcceptedElicitation):
        return 'Expense entry cancelled.'
    description = result.data.description

    # Step 3: Select a category
    result = await ctx.elicit(
        'Select a category (food, transport, bills, entertainment, other):',
        CategoryInput
    )
    if not isinstance(result, AcceptedElicitation):
        return 'Expense entry cancelled.'
    category = result.data.category

    # Step 4: Confirm before saving
    confirm_msg = (
        f'Confirm: add expense of ${amount:.2f} for {description}'
        f' (category: {category})? Reply Yes or No'
    )
    result = await ctx.elicit(confirm_msg, ConfirmInput)
    if not isinstance(result, AcceptedElicitation) or result.data.confirm != 'Yes':
        return 'Expense entry cancelled.'

    return db.add_transaction(user_alias, 'expense', -abs(amount), description, category)

if __name__ == '__main__':
    mcp.run(
        transport="streamable-http",
        host="0.0.0.0",
        port=8000,
        stateless_http=False
    )&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Each&amp;nbsp;&lt;code&gt;await ctx.elicit()&lt;/code&gt;&amp;nbsp;suspends the tool and sends an&amp;nbsp;&lt;code&gt;elicitation/create&lt;/code&gt;&amp;nbsp;request over the active session. The&amp;nbsp;&lt;code&gt;isinstance(result, AcceptedElicitation)&lt;/code&gt;&amp;nbsp;check handles&amp;nbsp;&lt;code&gt;decline&lt;/code&gt;&amp;nbsp;and&amp;nbsp;&lt;code&gt;cancel&lt;/code&gt;&amp;nbsp;uniformly at every step.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Registering an&amp;nbsp;&lt;code&gt;elicitation_handler&lt;/code&gt;&amp;nbsp;on&amp;nbsp;&lt;code&gt;fastmcp.Client&lt;/code&gt;&amp;nbsp;is both how the handler is wired in and how the client advertises elicitation support to the server during initialization.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import asyncio
from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

# Pre-loaded responses simulate the user answering each question in sequence
_responses = iter([
    {'amount': 45.50},
    {'description': 'Lunch at the office'},
    {'category': 'food'},
    {'confirm': 'Yes'},
])

async def elicit_handler(message, response_type, params, context):
    # In production: render a form and return the user's input
    response = next(_responses)
    print(f'  Server asks: {message}')
    print(f'  Responding:  {response}\n')
    return response

transport = StreamableHttpTransport(url=mcp_url, headers=headers)

async with Client(transport, elicitation_handler=elicit_handler) as client:
    await asyncio.sleep(2)  # allow session initialization
    result = await client.call_tool('add_expense_interactive', {'user_alias': 'me'})

print(result.content[0].text)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Running this against the deployed server:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;Server asks: How much did you spend?
Responding:&amp;nbsp; {'amount': 45.5}

Server asks: What was it for?
Responding:&amp;nbsp; {'description': 'Lunch at the office'}

Server asks: Select a category (food, transport, bills, entertainment, other):
Responding:&amp;nbsp; {'category': 'food'}

Server asks: Confirm: add expense of $45.50 for Lunch at the office (category: food)? Reply Yes or No
Responding:&amp;nbsp; {'confirm': 'Yes'}

Expense of $45.50 added for me&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The complete working example, including DynamoDB setup and AgentCore deployment, is available in the&amp;nbsp;&lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/01-tutorials/01-AgentCore-runtime/08-mcp-e2e/02-client-e2e" target="_blank" rel="noopener noreferrer"&gt;GitHub sample repository&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Use elicitation when your tool needs information that depends on earlier results, is better collected interactively than upfront, or varies across users in ways that cannot be parameterized in advance. A travel booking tool that first searches destinations and then asks the user to choose among them is a natural fit. A financial workflow that confirms a transaction amount before submitting is another. Elicitation isn’t appropriate for sensitive inputs like passwords or API keys; use URL mode or a secure out-of-band channel for those.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Sampling: server-initiated LLM generation&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Sampling is the mechanism by which an MCP server requests an LLM completion from the client. The server sends a&amp;nbsp;&lt;code&gt;sampling/createMessage&lt;/code&gt;&amp;nbsp;request containing a list of conversation messages, a system prompt, and optional model preferences. The client forwards the request to its connected language model (subject to user approval) and returns the generated response. The server receives a structured result containing the generated text, the model used, and the stop reason.&lt;/p&gt; 
&lt;p&gt;This capability inverts the typical flow: instead of the client asking the server for tool results, the server asks the client for model output. The benefit is that the server doesn’t need API keys or a direct model integration. The client retains full control over which model is used, and the MCP specification calls for a human-in-the-loop step where users can review and approve sampling requests before they are forwarded.&lt;/p&gt; 
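&lt;p&gt;That human-in-the-loop step can be pictured as a small gate in the client. The callables below are hypothetical stand-ins for a real client’s UI prompt and model integration, not fastmcp APIs:&lt;/p&gt; 

```python
def forward_with_approval(request, ask_user, invoke_llm):
    """Sketch of the human-in-the-loop step the specification calls for.

    All three arguments are hypothetical callables standing in for a real
    client's UI and model integration; this is not a fastmcp API.
    """
    preview = str(request.get("messages", ""))[:200]
    if ask_user(f"Server requests an LLM completion: {preview}"):
        return invoke_llm(request)
    # The user vetoed the request; the server gets an error instead of text.
    return {"error": "sampling request rejected by user"}
```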
&lt;p&gt;Servers can express model preferences using capability priorities (&lt;code&gt;costPriority,&amp;nbsp;speedPriority,&amp;nbsp;intelligencePriority&lt;/code&gt;) and optional model hints. These are advisory; the client makes the final selection based on what models it has access to.&lt;/p&gt; 
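&lt;p&gt;One way to picture the advisory priorities is as weights in a scoring function over the client’s available models. The catalog, model names, and scores below are entirely made up for illustration; a real client may select models however it chooses:&lt;/p&gt; 

```python
# Hypothetical catalog: each model scored 0-1 on cost-efficiency, speed,
# and intelligence. All names and values are made up for illustration.
CATALOG = {
    "fast-small-model": {"cost": 0.9, "speed": 0.9, "intelligence": 0.4},
    "balanced-model":   {"cost": 0.6, "speed": 0.6, "intelligence": 0.7},
    "frontier-model":   {"cost": 0.2, "speed": 0.3, "intelligence": 0.95},
}

def pick_model(cost_priority=0.0, speed_priority=0.0, intelligence_priority=0.0):
    """Rank available models by the server's advisory priorities.

    The client still makes the final call; preferences only bias the choice.
    """
    def score(attrs):
        return (cost_priority * attrs["cost"]
                + speed_priority * attrs["speed"]
                + intelligence_priority * attrs["intelligence"])
    return max(CATALOG, key=lambda name: score(CATALOG[name]))
```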
&lt;p&gt;&lt;strong&gt;Server&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;analyze_spending&lt;/code&gt;&amp;nbsp;tool fetches transactions from DynamoDB, builds a prompt from the structured data, and delegates the analysis to the client’s LLM via&amp;nbsp;&lt;code&gt;ctx.sample()&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;agents/mcp_client_features.py&lt;/strong&gt;&amp;nbsp;(added tool, same file as elicitation)&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;@mcp.tool()
async def analyze_spending(user_alias: str, ctx: Context) -&amp;gt; str:
    """Fetch expenses from DynamoDB and ask the client's LLM to analyze them.

    Args:
        user_alias: User identifier
    """
    transactions = db.get_transactions(user_alias)
    if not transactions:
        return f'No transactions found for {user_alias}.'

    lines = '\n'.join(
        f"- {t['description']} (${abs(float(t['amount'])):.2f}, {t['category']})"
        for t in transactions
    )

    prompt = (
        f'Here are the recent expenses for a user:\n{lines}\n\n'
        f'Please analyze the spending patterns and give 3 concise, '
        f'actionable recommendations to improve their finances. '
        f'Keep the response under 120 words.'
    )

    ai_analysis = 'Analysis unavailable.'
    try:
        response = await ctx.sample(messages=prompt, max_tokens=300)
        if hasattr(response, 'text') and response.text:
            ai_analysis = response.text
    except Exception:
        pass

    return f'Spending Analysis for {user_alias}:\n\n{ai_analysis}'
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The tool calls&amp;nbsp;&lt;code&gt;await ctx.sample()&lt;/code&gt;&amp;nbsp;and suspends. The server sends a&amp;nbsp;&lt;code&gt;sampling/createMessage&lt;/code&gt;&amp;nbsp;request to the client over the open session. When the client returns the LLM response, execution resumes.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;sampling_handler&lt;/code&gt;&amp;nbsp;receives the prompt from the server and forwards it to a language model. In this example, that’s Claude Haiku on Amazon Bedrock. Registering the handler is also how the client declares sampling support to the server during initialization.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import json
import asyncio
import boto3
from mcp.types import CreateMessageResult, TextContent
from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

MODEL_ID = 'us.anthropic.claude-haiku-4-5-20251001-v1:0'
bedrock = boto3.client('bedrock-runtime', region_name=region)

def _invoke_bedrock(prompt: str, max_tokens: int) -&amp;gt; str:
    body = json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': max_tokens,
        'messages': [{'role': 'user', 'content': prompt}]
    })
    resp = bedrock.invoke_model(modelId=MODEL_ID, body=body)
    return json.loads(resp['body'].read())['content'][0]['text']

async def sampling_handler(messages, params, ctx):
    """Called by fastmcp.Client when the server issues ctx.sample()."""
    prompt = messages if isinstance(messages, str) else ' '.join(
        m.content.text for m in messages if hasattr(m.content, 'text')
    )
    max_tokens = params.maxTokens if params and hasattr(params, 'maxTokens') and params.maxTokens else 300
    text = await asyncio.to_thread(_invoke_bedrock, prompt, max_tokens)
    return CreateMessageResult(
        role='assistant',
        content=TextContent(type='text', text=text),
        model=MODEL_ID,
        stopReason='endTurn'
    )

transport = StreamableHttpTransport(url=mcp_url, headers=headers)

async with Client(transport, sampling_handler=sampling_handler) as client:
    result = await client.call_tool('analyze_spending', {'user_alias': 'me'})

print(result.content[0].text)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Running this against a user with four seeded expenses:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;Spending Analysis for me:

Total Spending: $266.79

Breakdown:
- Food: $130.80 (49%)
- Bills: $120.00 (45%)
- Entertainment: $15.99 (6%)

3 Actionable Recommendations:

1. Meal prep at home — cook groceries into multiple meals to reduce restaurant
&amp;nbsp;&amp;nbsp; spending and lower food costs by 20-30%.

2. Review entertainment subscriptions — audit all subscriptions and cancel
&amp;nbsp;&amp;nbsp; unused services or share family plans.

3. Reduce energy costs — use programmable thermostats, LED bulbs, and unplug
&amp;nbsp;&amp;nbsp; devices to lower electricity bills by 10-15% monthly.&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use sampling when your tool must produce natural-language output that benefits from a language model’s capabilities. A tool that has collected a user’s travel preferences and wants to generate a tailored trip itinerary narrative is a good example. Sampling isn’t appropriate for deterministic operations like database queries, calculations, or API calls with well-defined outputs. We recommend that you use tool logic for those.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Progress notifications: real-time operation feedback&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Progress notifications are events that a server sends during long-running operations to keep the client and the user informed about how much work has been completed.&amp;nbsp;&lt;code&gt;await ctx.report_progress(progress, total)&lt;/code&gt;&amp;nbsp;emits a&amp;nbsp;&lt;code&gt;notifications/progress&lt;/code&gt; message and returns immediately. The server doesn’t wait for a response; it’s fire-and-forget in both directions. The client receives the notification asynchronously and can render a progress bar, log a status line, or use it to prevent the user from assuming the connection has stalled. The pattern is to call&amp;nbsp;&lt;code&gt;report_progress&lt;/code&gt;&amp;nbsp;at each logical step of a multi-stage operation, with&amp;nbsp;&lt;code&gt;progress&lt;/code&gt;&amp;nbsp;incrementing toward&amp;nbsp;&lt;code&gt;total&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Server&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;generate_report&lt;/code&gt;&amp;nbsp;tool builds a monthly financial report in five steps, emitting a progress notification at the start of each one.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;agents/mcp_progress_server.py&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import os
from fastmcp import FastMCP, Context
from dynamo_utils import FinanceDB

mcp = FastMCP(name='Progress-MCP-Server')

_region = os.environ.get('AWS_REGION') or os.environ.get('AWS_DEFAULT_REGION') or 'us-east-1'
db = FinanceDB(region_name=_region)

@mcp.tool()
async def generate_report(user_alias: str, ctx: Context) -&amp;gt; str:
    """Generate a monthly financial report, streaming progress at each stage.

    Args:
        user_alias: User identifier
    """
    total = 5

    # Step 1: Fetch transactions
    await ctx.report_progress(progress=1, total=total)
    transactions = db.get_transactions(user_alias)

    # Step 2: Group by category
    await ctx.report_progress(progress=2, total=total)
    by_category = {}
    for t in transactions:
        cat = t['category']
        by_category[cat] = by_category.get(cat, 0) + abs(float(t['amount']))

    # Step 3: Fetch budgets
    await ctx.report_progress(progress=3, total=total)
    budgets = {b['category']: float(b['monthly_limit']) for b in db.get_budgets(user_alias)}

    # Step 4: Compare spending vs budgets
    await ctx.report_progress(progress=4, total=total)
    lines = []
    for cat, spent in sorted(by_category.items(), key=lambda x: -x[1]):
        limit = budgets.get(cat)
        if limit:
            pct = (spent / limit) * 100
            status = 'OVER' if spent &amp;gt; limit else 'OK'
            lines.append(f'  {cat:&amp;lt;15} ${spent:&amp;gt;8.2f} / ${limit:.2f}  [{pct:.0f}%] {status}')
        else:
            lines.append(f'  {cat:&amp;lt;15} ${spent:&amp;gt;8.2f}  (no budget set)')

    # Step 5: Format and return
    await ctx.report_progress(progress=5, total=total)
    total_spent = sum(by_category.values())
    return (
        f'Monthly Report for {user_alias}\n'
        f'{"=" * 50}\n'
        f'  {"Category":&amp;lt;15} {"Spent":&amp;gt;10}   {"Budget":&amp;gt;8}  Status\n'
        f'{"-" * 50}\n'
        + '\n'.join(lines)
        + f'\n{"-" * 50}\n'
        f'  {"TOTAL":&amp;lt;15} ${total_spent:&amp;gt;8.2f}\n'
    )

if __name__ == '__main__':
    mcp.run(
        transport="streamable-http",
        host="0.0.0.0",
        port=8000,
        stateless_http=False
    )
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Each&amp;nbsp;&lt;code&gt;await ctx.report_progress()&lt;/code&gt;&amp;nbsp;is fire-and-forget: the notification is sent and execution moves immediately to the next step.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;progress_handler&lt;/code&gt;&amp;nbsp;receives&amp;nbsp;progress,&amp;nbsp;total, and an optional&amp;nbsp;message&amp;nbsp;each time the server emits a notification. Registering the handler is how the client declares progress support during initialization.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import logging
logging.getLogger('mcp.client.streamable_http').setLevel(logging.ERROR)

from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport

async def progress_handler(progress: float, total: float | None, message: str | None):
    pct = int((progress / total) * 100) if total else 0
    filled = pct // 5
    bar = '#' * filled + '-' * (20 - filled)
    print(f'\r  Progress: [{bar}] {pct}% ({int(progress)}/{int(total or 0)})',
          end='', flush=True)
    if total and progress &amp;gt;= total:
        print('  Done!')

transport = StreamableHttpTransport(url=mcp_url, headers=headers)

async with Client(transport, progress_handler=progress_handler) as client:
    result = await client.call_tool('generate_report', {'user_alias': 'me'})

print(result.content[0].text)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;As the server moves through its five stages, the client renders the bar in place:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;  Progress: [####----------------] 20% (1/5)
&amp;nbsp; Progress: [########------------] 40% (2/5)
&amp;nbsp; Progress: [############--------] 60% (3/5)
&amp;nbsp; Progress: [################----] 80% (4/5)
&amp;nbsp; Progress: [####################] 100% (5/5)&amp;nbsp; Done!&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use progress notifications for any tool call that takes more than a few seconds and involves discrete, measurable steps. Operations like searching multiple data sources, running a sequence of API calls, processing a batch of records, or running a multi-step booking workflow are all good candidates. A tool that completes in under a second generally does not need progress reporting; the overhead of emitting events is not worthwhile for fast operations.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, you have been introduced to stateful MCP client capabilities on Amazon Bedrock AgentCore Runtime. We explained the difference between stateless and stateful MCP deployments, walked through elicitation, sampling, and progress notifications with code examples, and showed how to deploy a stateful MCP server into AgentCore Runtime. With these capabilities, you can build MCP servers that engage users in structured conversations, use the client’s LLM for content generation, and provide real-time visibility into long-running operations, all hosted on managed, isolated infrastructure powered by AgentCore Runtime. We encourage you to explore the following resources to get started:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/01-tutorials/01-AgentCore-runtime/08-mcp-e2e" target="_blank" rel="noopener noreferrer"&gt;GitHub sample code&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-how-it-works.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Runtime documentation&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/mcp-stateful-features.html" target="_blank" rel="noopener noreferrer"&gt;Stateful MCP features documentation&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" target="_blank" rel="noopener noreferrer"&gt;MCP specification 2025-11-25&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/accelerate-development-with-the-amazon-bedrock-agentcore-mcpserver/" target="_blank" rel="noopener noreferrer"&gt;Prior post: Hosting MCP servers on AgentCore Runtime&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;hr&gt; 
&lt;h2&gt;&lt;strong&gt;About the Authors&lt;/strong&gt;&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127657" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/03/ml20073-image-1.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Evandro Franco&lt;/h3&gt; 
  &lt;p&gt;Evandro Franco is a Sr. Data Scientist working on Amazon Web Services. He is part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on top of AWS, mainly on Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience working with technology, from software development, infrastructure, serverless, to machine learning. In his free time, Evandro enjoys playing with his son, mainly building some funny Lego bricks.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-127658" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/03/ml20073-image-2.png" alt="" width="300" height="300"&gt;
  &lt;/div&gt; 
  &lt;h3&gt;Phelipe Fabres&lt;/h3&gt; 
  &lt;p&gt;Phelipe Fabres is a Sr. Solutions Architect for Generative AI at AWS for Startups. He is part of a global Frontier AI team with a focus on customers that are building foundation models/LLMs/SLMs. He has extensive experience with agentic systems and software-driven AI systems. He has more than 10 years of experience in software development, from monoliths to event-driven architectures, and holds a Ph.D. in Graph Theory. In his free time, Phelipe enjoys playing with his daughter, mainly board games and drawing princesses.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127659" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/03/ml20073-image-3.jpeg" alt="" width="208" height="208"&gt;
  &lt;/div&gt; 
  &lt;h3&gt;Zihang Huang&lt;/h3&gt; 
  &lt;p&gt;Zihang Huang is a Solutions Architect at AWS and an expert in agentic solutions for connected vehicles, smart home, renewable energy, and industrial IoT. Currently, he focuses on agentic AI solutions with AgentCore, physical AI, IoT, edge computing, and big data. Before AWS, he gained technical experience at Bosch and Alibaba Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-medium wp-image-127660" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/03/ml20073-image-4-225x300.jpg" alt="" width="225" height="300"&gt;
  &lt;/div&gt; 
  &lt;h3&gt;Sayee Kulkarni&lt;/h3&gt; 
  &lt;p&gt;Sayee Kulkarni is a Software Development Engineer on the AWS Bedrock AgentCore service. Her team is responsible for building and maintaining the AgentCore Runtime platform, a foundational component that enables customers to leverage agentic AI capabilities. She is driven by delivering tangible customer value, and this customer-centric focus motivates her work. Sayee has led key initiatives including MCP Stateful capabilities and other core platform features, enabling customers to build more sophisticated and production-ready AI agents.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Customize Amazon Nova models with Amazon Bedrock fine-tuning</title>
		<link>https://aws.amazon.com/blogs/machine-learning/customize-amazon-nova-models-with-amazon-bedrock-fine-tuning/</link>
					
		
		<dc:creator><![CDATA[Bhavya Sruthi Sode]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 19:51:50 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Nova]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">b788a9cc21bca716ad316bd590218b47ee4c78f6</guid>

					<description>In this post, we'll walk you through a complete implementation of model fine-tuning in Amazon Bedrock using Amazon Nova models, demonstrating each step through an intent classifier example that achieves superior performance on a domain specific task. Throughout this guide, you'll learn to prepare high-quality training data that drives meaningful model improvements, configure hyperparameters to optimize learning without overfitting, and deploy your fine-tuned model for improved accuracy and reduced latency. We'll show you how to evaluate your results using training metrics and loss curves.</description>
										<content:encoded>&lt;p&gt;Today, we’re sharing how &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; makes it straightforward to customize &lt;a href="https://aws.amazon.com/nova/" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova models&lt;/a&gt; for your specific business needs. As customers scale their AI deployments, they need models that reflect proprietary knowledge and workflows — whether that means maintaining a consistent brand voice in customer communications, handling complex industry-specific workflows or accurately classifying intents in a high-volume airline reservation system. Techniques like prompt engineering and Retrieval-Augmented Generation (RAG) provide the model with additional context to improve task performance, but these techniques do not instill native understanding into the model.&lt;/p&gt; 
&lt;p&gt;Amazon Bedrock supports three customization approaches for Nova models: supervised fine-tuning (SFT), which trains the model on labeled input-output examples; reinforcement fine-tuning (RFT), which uses a reward function to guide learning toward target behaviors; and model distillation, which transfers knowledge from a larger teacher model into a smaller, faster student model. Each technique embeds new knowledge directly into the model weights, rather than supplying it at inference time through prompts or retrieved context. With these approaches, you get faster inference, lower token costs, and higher accuracy on the tasks that matter most to your business. Amazon Bedrock manages the training process automatically, requiring only that you upload your data to &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt; and initiate the job through the AWS Management Console, CLI, or API. Deep machine learning expertise is not required. Nova models support on-demand invocation of customized models in Amazon Bedrock. This means you pay only per-call at the standard rate for the model, instead of needing to purchase more expensive allocated capacity (Provisioned Throughput).&lt;/p&gt; 
&lt;p&gt;In this post, we’ll walk you through a complete implementation of model fine-tuning in Amazon Bedrock using Amazon Nova models, demonstrating each step through an intent classifier example that achieves superior performance on a domain specific task. Throughout this guide, you’ll learn to prepare high-quality training data that drives meaningful model improvements, configure hyperparameters to optimize learning without overfitting, and deploy your fine-tuned model for improved accuracy and reduced latency. We’ll show you how to evaluate your results using training metrics and loss curves.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Understanding fine-tuning and when to use it&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Context-engineering techniques such as prompt engineering or Retrieval-Augmented Generation (RAG) place information into the model’s prompt. These approaches offer significant advantages: they take effect immediately with no training required, allow for dynamic information updates, and work with multiple foundation models without modification. However, these techniques consume context window tokens on every invocation, which can increase cumulative costs and latency over time. More importantly, they do not generalize well. The model is simply reading instructions each time rather than having internalized the knowledge, so it can struggle with novel phrasings, edge cases, or tasks that require reasoning beyond what was explicitly provided in the prompt. Customization techniques, by comparison, incorporate the new knowledge directly into the model by adding an adapter matrix of additional weights and customizing those (“parameter-efficient fine-tuning”, aka “PEFT”). The resulting customized model has acquired new domain-specific skills. Customization allows faster and more efficient small models to reach performance comparable to larger models in the specific training domain.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;When to fine-tune: &lt;/strong&gt;Consider fine-tuning when you have a high-volume, well-defined task where you can assemble quality labeled examples or a reward function. Use cases include training a model to correctly render your company’s logo, embedding brand tone and company policies into the model, or replacing a traditional ML classifier with a small LLM. For example, Amazon Customer Service &lt;a href="https://aws.amazon.com/blogs/machine-learning/transforming-enterprise-operations-four-high-impact-use-cases-with-amazon-nova/" target="_blank" rel="noopener noreferrer"&gt;customized Nova Micro&lt;/a&gt; for specialized customer support to improve accuracy and reduce latency, improving accuracy by 5.4% on domain-specific issues and 7.3% on general issues.&lt;/p&gt; 
&lt;p&gt;Fine-tuned small LLMs like Nova Micro are increasingly replacing traditional ML classifiers for tasks such as intent detection. They deliver the flexibility and world knowledge of an LLM at the speed and cost of a lightweight model. Unlike classifiers, LLMs handle natural variation in phrasing, slang, and context without retraining, and fine-tuning sharpens their accuracy further for the specific task. We demonstrate this with an intent classifier example later in this blog.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;When NOT to fine-tune: &lt;/strong&gt;Fine-tuning requires assembling quality labeled data or a reward function and executing a training job, which involves upfront time and cost, so it may not pay off for low-volume or rapidly evolving tasks where prompt engineering or RAG is sufficient. For high-volume applications, however, this initial investment can reduce per-request inference costs and latency.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Customization approaches&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Amazon Bedrock offers three customization approaches for Nova models:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Supervised fine-tuning (SFT)&lt;/strong&gt; customizes the model to learn patterns from labeled data that you supply. This post demonstrates this technique in action.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Reinforcement fine-tuning (RFT)&lt;/strong&gt; takes a different approach, using training data combined with a reward function, either custom code or an LLM acting as a judge, to guide the learning process.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Model distillation&lt;/strong&gt;, for scenarios requiring knowledge transfer, lets you compress insights from large teacher models into smaller, more efficient student models suitable for resource-constrained devices.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Amazon Bedrock automatically uses parameter efficient fine-tuning (PEFT) techniques appropriate to the model for customizing Nova models. This reduces memory requirements and accelerates training compared to full fine-tuning, while maintaining model quality. Having established when and why to use fine-tuning, let’s explore how Amazon Bedrock simplifies the implementation process, and which Nova models support this customization approach.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Understanding Amazon Nova models on Amazon Bedrock&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Amazon Bedrock fully automates infrastructure provisioning, compute management, and training orchestration. You upload data to S3 and start training with a single API call, without managing clusters and GPUs or configuring distributed training pipelines. It provides clear documentation for data preparation (including format specifications and schema requirements), sensible hyperparameter defaults (such as &lt;code&gt;epochCount&lt;/code&gt;, &lt;code&gt;learningRateMultiplier&lt;/code&gt;), and training visibility through loss curves that help you monitor convergence in real-time.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Nova Models: &lt;/strong&gt;Several of the Nova models support fine-tuning (see the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/custom-model-supported.html" target="_blank" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;). After training completes, you have the option to host the customized Nova model on Amazon Bedrock using cost-effective on-demand inference, at the same low inference price as the non-customized model.&lt;/p&gt; 
&lt;p&gt;Nova 2 Lite, for example, is a fast, cost-effective reasoning model. As a multimodal foundation model, it processes text, images, and video within a 1-million token context window. This context window supports analysis of documents longer than 400 pages or 90-minute videos in a single prompt. It excels at document processing, video understanding, code generation, and agentic workflows. Nova 2 Lite supports both SFT and RFT.&lt;/p&gt; 
&lt;p&gt;The smallest Nova model, Nova Micro, is also particularly useful because it offers fast, low-cost inference with LLM intelligence. Nova Micro is ideal for pipeline processing tasks done as part of a larger system, such as fixing addresses or extracting data fields from text. In this post, we show an example of customizing Nova Micro for a classification task instead of building a custom data science model. The following table shows both Nova 1 and Nova 2 models and their availability as of publication time, indicating which models currently allow RFT or SFT. These capabilities are subject to change; see the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html" target="_blank" rel="noopener noreferrer"&gt;online documentation&lt;/a&gt; for the most current model availability and &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/custom-model-supported.html" target="_blank" rel="noopener noreferrer"&gt;customization&lt;/a&gt; support, and the &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/what-is-nova.html" target="_blank" rel="noopener noreferrer"&gt;Nova User Guide&lt;/a&gt; for more detail on the models.&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Capabilities&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Status&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Bedrock fine-tuning&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Nova Premier&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Most capable model for complex tasks and teacher for model distillation&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text, images, video (excluding audio)&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Generally available&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Can be used as a teacher for model distillation&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Nova Pro&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Multimodal model with best combination of accuracy, speed, and cost for a wide range of tasks&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text, images, video&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Generally available&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;SFT&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Nova 2 Lite&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Low cost multimodal model with fast processing&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text, images, video&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Generally available&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;RFT, SFT&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Nova Lite&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Low cost multimodal model with fast processing&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text, images, video&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Generally available&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;SFT&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Nova Micro&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Lowest latency responses at low cost&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Text&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Generally available&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;SFT&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Now that you understand how Nova models support fine-tuning through the Amazon Bedrock managed infrastructure, let’s examine a real-world scenario that demonstrates these capabilities in action.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Use case example – intent detection (replacing traditional ML models)&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Intent detection determines the category of the user’s intended interaction from their input. For example, in an airline travel assistance system, the user might be trying to get information about a previously booked flight or asking a question about airline services, such as how to transport a pet. Systems often route the inquiry to specific agents based on intent. Intent detection systems must operate quickly and economically at high volume.&lt;/p&gt; 
&lt;p&gt;The traditional solution for such a system has been to train a machine-learning model. While this is effective, developers are more often turning to small LLMs for these tasks. LLMs offer more flexibility, can quickly be modified through prompt changes, and come with extensive world knowledge built in. Their understanding of shorthand, texting slang, equivalent words, and context can provide a better user experience, and the LLM development experience is familiar for AI engineers.&lt;/p&gt; 
&lt;p&gt;For our example, we will customize the Nova Micro model on the open-source &lt;a href="https://www.kaggle.com/datasets/hassanamin/atis-airlinetravelinformationsystem" target="_blank" rel="noopener noreferrer"&gt;Airline Travel Information System (ATIS)&lt;/a&gt; dataset, an industry-standard benchmark for intent-based systems. Nova Micro achieves 41.4% accuracy on ATIS with no customization, but a simple training job customizing it for the specific task improves its accuracy to 97%.&lt;/p&gt; 
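Accuracy figures like these can be reproduced with a small evaluation harness once the customized model is deployed. A hedged sketch: the model deployment ARN, the test-file path, and the record field names are placeholders, and the abbreviated system prompt stands in for the full intent list used in training (the prompt must match the one in the training data).

```python
import json

# Must match the system prompt used in the training data (abbreviated here).
SYSTEM_PROMPT = ("Classify the intent of airline queries. "
                 "Respond with only the intent name, nothing else.")

def accuracy(predictions, labels):
    """Fraction of exact (whitespace-insensitive) matches."""
    hits = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return hits / len(labels)

def classify(client, model_id, utterance):
    """Single intent prediction via the Bedrock Runtime Converse API."""
    resp = client.converse(
        modelId=model_id,
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": utterance}]}],
        inferenceConfig={"maxTokens": 20, "temperature": 0.0},
    )
    return resp["output"]["message"]["content"][0]["text"]

if __name__ == "__main__":
    import boto3  # local import: only needed when actually calling AWS
    runtime = boto3.client("bedrock-runtime")
    model_id = "arn:aws:bedrock:us-east-1:111122223333:custom-model-deployment/example"  # placeholder
    with open("atis-test.jsonl") as f:  # held-out test set; path is a placeholder
        rows = [json.loads(line) for line in f]
    preds = [classify(runtime, model_id, r["text"]) for r in rows]
    print(f"accuracy: {accuracy(preds, [r['intent'] for r in rows]):.1%}")
```

Running the same harness against the base and the customized model gives a like-for-like comparison on the held-out test set.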
&lt;h2&gt;&lt;strong&gt;Technical implementation: Fine-tuning process&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The two critical factors that drive model fine-tuning success are &lt;strong&gt;data quality &lt;/strong&gt;and &lt;strong&gt;hyperparameter selection&lt;/strong&gt;. Getting these right determines whether your model converges efficiently or requires costly retraining. Let’s walk through each component of the implementation process, starting with how to prepare your training data.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Data preparation&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Amazon Bedrock requires JSONL (JSON Lines) format because it supports efficient streaming of large datasets during training, so your data can be processed incrementally without memory constraints. This format also simplifies validation, because each line can be checked independently for errors. Verify that each row in the JSONL file is valid JSON; if the file format is invalid, the Amazon Bedrock model creation job will fail with an error. For more detail, see the &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/fine-tune-prepare-data-understanding.html" target="_blank" rel="noopener noreferrer"&gt;documentation on Nova model fine-tuning&lt;/a&gt;. We used a script to format the ATIS dataset as JSONL. Nova Micro accepts a separate validation set, so we then split off 10% of the data as a validation set (Nova 2 models do this automatically during customization). We also reserved a test set of records, which the model was not trained on, to facilitate clean testing results.&lt;/p&gt; 
&lt;p&gt;For our intent classifier example, our input data is text only. However, when fine-tuning multimodal models, also make sure you are using only supported image formats (PNG, JPEG, and GIF). Make sure your training examples span the important cases. Validate your dataset with your team and remove ambiguous or contradictory answers before fine-tuning.&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;{"schemaVersion": "bedrock-conversation-2024", "system": [{"text": "Classify the intent of airline queries. Choose one intent from this list: abbreviation, aircraft, aircraft+flight+flight_no, airfare, airfare+flight_time, airline, airline+flight_no, airport, capacity, cheapest, city, distance, flight, flight+airfare, flight_no, flight_time, ground_fare, ground_service, ground_service+ground_fare, meal, quantity, restriction\n\nRespond with only the intent name, nothing else."}], "messages": [{"role": "user", "content": [{"text": "show me the morning flights from boston to philadelphia"}]}, {"role": "assistant", "content": [{"text": "flight"}]}]}&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Prepared row in a training data sample (note that although it appears wrapped, JSONL format is really a single row per example)&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important: The system prompt appears in the training data. The system prompt used for training must match the system prompt used for inference, because the model learns the system prompt as context that triggers its fine-tuned behavior.&lt;/strong&gt;&lt;/p&gt; 
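Because one invalid line fails the whole training job, it pays to validate the file before uploading. A minimal sketch; the required top-level keys below are an assumption inferred from the bedrock-conversation-2024 sample shown above, so adjust them to the schema documentation for your model.

```python
import json

# Assumed required keys, based on the sample record above; "system" is treated
# as optional per record. Check the Nova data-preparation docs for your model.
REQUIRED_KEYS = {"schemaVersion", "messages"}

def validate_jsonl(lines):
    """Return a list of (line_number, error) tuples; empty means the file looks valid."""
    errors = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            errors.append((i, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((i, f"invalid JSON: {exc}"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((i, f"missing keys: {sorted(missing)}"))
    return errors

if __name__ == "__main__":
    with open("focused-training-data-v2.jsonl") as f:  # file name from the console example
        problems = validate_jsonl(f)
    for lineno, msg in problems:
        print(f"line {lineno}: {msg}")
```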
&lt;h3&gt;&lt;strong&gt;Data privacy considerations&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;When fine-tuning with sensitive data:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Anonymize or mask PII (names, email addresses, phone numbers, payment details) before uploading to Amazon S3.&lt;/li&gt; 
 &lt;li&gt;Consider data residency requirements for regulatory compliance.&lt;/li&gt; 
 &lt;li&gt;Amazon Bedrock does not use your training data to improve base models.&lt;/li&gt; 
 &lt;li&gt;For enhanced security, consider using Amazon Virtual Private Cloud (VPC) endpoints for private connectivity between S3 and Amazon Bedrock, eliminating exposure to the public internet.&lt;/li&gt; 
&lt;/ul&gt; 
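For the first bullet, even a lightweight masking pass before upload helps. A minimal sketch; the regex patterns are illustrative assumptions only, and a production pipeline should use a vetted PII-detection service or library rather than hand-rolled patterns.

```python
import re

# Illustrative patterns only; order matters (card numbers would otherwise
# also match the broader phone pattern).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text):
    """Replace each matched span with a [TYPE] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    print(mask_pii("Contact jane.doe@example.com or call 555-123-4567"))
```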
&lt;h3&gt;&lt;strong&gt;Key hyperparameters&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Three hyperparameters control your training job’s behavior. Amazon Bedrock sets reasonable defaults, and you can often use them with no adjustment, but understanding them helps you optimize results and reach your target accuracy. &lt;strong&gt;Getting these settings right can save you hours of training time and minimize compute costs.&lt;/strong&gt; Here are the hyperparameters for the Nova understanding models; consult the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models-hp.html" target="_blank" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for other models.&lt;/p&gt; 
&lt;p&gt;The first hyperparameter, &lt;code&gt;epochCount&lt;/code&gt;, specifies how many complete passes the model makes through your dataset. Think of it like reading a book multiple times to improve comprehension. After the first read you might retain 60% of the material; a second pass raises comprehension to 80%. However, after you understand 100% of the material, additional readings waste training time without producing gains. Amazon Nova models support 1 to 5 epochs with a default of 2. Larger datasets typically converge with fewer epochs, while smaller datasets benefit from more iterations. For our ATIS intent classifier example with ~5000 combined samples, we set &lt;code&gt;epochCount&lt;/code&gt; to 3.&lt;/p&gt; 
&lt;p&gt;The &lt;code&gt;learningRateMultiplier&lt;/code&gt; controls how aggressively the model learns from errors; it is essentially the step size for corrections. If the learning rate is too high, the model can overshoot and jump to wrong conclusions. If the rate is too low, it forms conclusions slowly. We use 1e-5 (0.00001) for the ATIS example, which provides stable, gradual learning. The &lt;code&gt;learningRateWarmupSteps&lt;/code&gt; parameter gradually increases the learning rate to the specified value over a set number of iterations, which reduces instability at the start of training. We use the default value of 10 for our example.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why this matters to you:&lt;/strong&gt; Setting the right epoch count avoids wasted training time and costs. Each epoch represents another pass through the complete training data, which will increase the number of tokens processed (the main cost in model training—see “Cost and training time” later in this post). Too few epochs mean your model might not learn the training data effectively enough. Finding this balance early saves both time and budget. The learning rate directly impacts your model’s accuracy and training efficiency, potentially meaning the difference between a model that converges in hours versus one that never reaches acceptable performance.&lt;/p&gt; 
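As a back-of-the-envelope check on training cost, the total tokens processed scale linearly with the epoch count, since each epoch is one full pass over the dataset. The per-example token count below is an assumed illustrative figure, not a measurement of the ATIS data.

```python
def training_tokens(examples, avg_tokens_per_example, epochs):
    """Total tokens processed during training: one full pass per epoch."""
    return examples * avg_tokens_per_example * epochs

if __name__ == "__main__":
    # Illustrative numbers: ~5,000 ATIS samples, an assumed ~120 tokens each, 3 epochs.
    total = training_tokens(5_000, 120, 3)
    print(f"{total:,} training tokens")  # 1,800,000 training tokens
```

Adding a fourth epoch would add another 600,000 tokens of training cost, which is why it only makes sense if the loss curve shows the model is still learning.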
&lt;h2&gt;&lt;strong&gt;Starting a fine-tuning job&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The prerequisite of fine-tuning is creating an S3 bucket with training data.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;S3 bucket setup &lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Create an S3 bucket in the same region as your Amazon Bedrock job with the following security configurations:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Enable server-side encryption&lt;/strong&gt; (SSE-S3 or SSE-KMS) to protect training data at rest.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Block public access&lt;/strong&gt; on the bucket to prevent unauthorized exposure.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Enable S3 versioning&lt;/strong&gt; to protect training data from accidental overwrites and track changes across training iterations.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Apply the same encryption and access controls to your output S3 bucket. Upload your JSONL file to the new S3 bucket under the /training-data prefix. Versioning is especially valuable when you’re experimenting with different dataset versions to optimize results.&lt;/p&gt; 
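The bucket settings above can also be scripted. A sketch using boto3 with a hypothetical bucket name; the SSE-S3 choice and region default are assumptions, so substitute SSE-KMS and your own region as needed.

```python
def bucket_security_settings():
    """Configuration payloads mirroring the three recommendations above."""
    return {
        "encryption": {  # SSE-S3; swap in SSE-KMS for customer-managed keys
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
        "public_access": {
            "BlockPublicAcls": True, "IgnorePublicAcls": True,
            "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
        },
        "versioning": {"Status": "Enabled"},
    }

def create_training_bucket(bucket, region="us-east-1"):
    import boto3  # local import keeps the config helper testable without AWS
    s3 = boto3.client("s3", region_name=region)
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":  # other regions require an explicit location
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(**kwargs)
    cfg = bucket_security_settings()
    s3.put_bucket_encryption(
        Bucket=bucket, ServerSideEncryptionConfiguration=cfg["encryption"])
    s3.put_public_access_block(
        Bucket=bucket, PublicAccessBlockConfiguration=cfg["public_access"])
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration=cfg["versioning"])

if __name__ == "__main__":
    create_training_bucket("amzn-s3-demo-bucket")  # bucket name from the examples
```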
&lt;p&gt;To create a supervised fine-tuning job:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;In the &lt;a href="https://console.aws.amazon.com/" target="_blank" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt;, choose &lt;strong&gt;Amazon Bedrock&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Test, Chat/Text playground&lt;/strong&gt; and confirm that Nova Micro appears in the model selector drop-down list.&lt;/li&gt; 
 &lt;li&gt;Under Custom model, choose &lt;strong&gt;Create&lt;/strong&gt;, and then select &lt;strong&gt;Supervised fine-tuning job&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127722" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-1.png" alt="Amazon Bedrock Custom Models management interface showing three customization techniques: Reinforcement fine-tuning (new), Supervised fine-tuning, and Distillation, with a models management section displaying action buttons and navigation menu." width="1347" height="579"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 1:&lt;/em&gt;&lt;em&gt; Creating supervised fine-tuning job&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Specify “&lt;strong&gt;Nova Micro&lt;/strong&gt;” model as the source model.&lt;/li&gt; 
 &lt;li&gt;In the Training data section, enter the S3 URI path to your JSONL training file (for example, &lt;code&gt;s3://amzn-s3-demo-bucket/training-data/focused-training-data-v2.jsonl)&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;In the Output data section, specify the S3 URI path where training outputs will be stored (for example, &lt;code&gt;s3://amzn-s3-demo-bucket/output-data/&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Expand the Hyperparameters section and configure the following values: &lt;code&gt;epochCount: 3&lt;/code&gt;, &lt;code&gt;learningRateMultiplier: 1e-5&lt;/code&gt;, &lt;code&gt;learningRateWarmupSteps: 10&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;Select the IAM role with least-privilege S3 access permissions or you can create one. The role should have: 
  &lt;ul&gt; 
   &lt;li&gt;Scoped permissions limited to specific actions (&lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:PutObject&lt;/code&gt;) on specific bucket paths (for example,&lt;code&gt; arn:aws:s3:::your-bucket-name/training-data/*&lt;/code&gt; and &lt;code&gt;arn:aws:s3:::your-bucket-name/output-data/*&lt;/code&gt;)&lt;/li&gt; 
    &lt;li&gt;Condition keys where appropriate, to avoid over-provisioning access.&lt;/li&gt; 
   &lt;li&gt;For detailed guidance on S3 permission best practices and security configurations, refer to the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html" target="_blank" rel="noopener noreferrer"&gt;AWS IAM Best Practices documentation&lt;/a&gt;.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Choose Create job.&lt;/li&gt; 
&lt;/ol&gt; 
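The console steps above map to a single CreateModelCustomizationJob API call. A hedged boto3 sketch: the job name, role ARN, bucket, and base-model identifier are placeholders (look up the exact identifier in the documentation), and hyperparameter values are passed as strings.

```python
def build_job_request(job_name, model_name, role_arn, bucket):
    """Request body for create_model_customization_job, mirroring the
    console settings above; hyperparameter values are strings."""
    return {
        "jobName": job_name,
        "customModelName": model_name,
        "roleArn": role_arn,
        # Placeholder; confirm the exact base-model identifier in the docs.
        "baseModelIdentifier": "amazon.nova-micro-v1:0:128k",
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {
            "s3Uri": f"s3://{bucket}/training-data/focused-training-data-v2.jsonl"},
        "outputDataConfig": {"s3Uri": f"s3://{bucket}/output-data/"},
        "hyperParameters": {
            "epochCount": "3",
            "learningRateMultiplier": "0.00001",
            "learningRateWarmupSteps": "10",
        },
    }

if __name__ == "__main__":
    import boto3  # local import: only needed to actually submit the job
    bedrock = boto3.client("bedrock")
    request = build_job_request(
        "nova-micro-atis",
        "nova-micro-atis-custom",
        "arn:aws:iam::111122223333:role/BedrockFineTuneRole",  # placeholder role
        "amzn-s3-demo-bucket",
    )
    job = bedrock.create_model_customization_job(**request)
    print("started:", job["jobArn"])
```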
&lt;h3&gt;&lt;strong&gt;Monitoring job status&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;To monitor the training job’s status and convergence:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Monitor the job status in the &lt;strong&gt;Custom models &lt;/strong&gt;dashboard.&lt;/li&gt; 
 &lt;li&gt;Wait for the &lt;strong&gt;Data validation &lt;/strong&gt;phase to complete, followed by the &lt;strong&gt;Training &lt;/strong&gt;phase (completion time ranges from minutes to hours depending on dataset size and modality).&lt;/li&gt; 
 &lt;li&gt;After training completes, choose your job name to view the &lt;strong&gt;Training metrics &lt;/strong&gt;tab and verify the loss curve shows proper convergence.&lt;/li&gt; 
 &lt;li&gt;If the job completes successfully, a custom model is created and ready for inference. You can deploy the customized Nova model for on-demand inference.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127721" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-2.png" alt="AWS Bedrock console showing completed fine-tuning job for Nova Micro model nova-micro-atis-20260209 with data validation and training status both marked as completed on February 9, 2026." width="1494" height="1028"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 2: Verifying job status&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
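The same status check can be done programmatically by polling GetModelCustomizationJob; a sketch with a placeholder job ARN (the terminal-state set below reflects the API's documented status values).

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}

def is_done(status):
    """True once the customization job has reached a terminal state."""
    return status in TERMINAL_STATES

def wait_for_job(bedrock, job_arn, poll_seconds=60):
    """Poll get_model_customization_job until the job finishes."""
    while True:
        resp = bedrock.get_model_customization_job(jobIdentifier=job_arn)
        print("status:", resp["status"])
        if is_done(resp["status"]):
            return resp
        time.sleep(poll_seconds)

if __name__ == "__main__":
    import boto3  # local import: only needed when actually polling AWS
    job_arn = "arn:aws:bedrock:us-east-1:111122223333:model-customization-job/example"  # placeholder
    result = wait_for_job(boto3.client("bedrock"), job_arn)
    if result["status"] == "Completed":
        print("custom model ARN:", result.get("outputModelArn"))
```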
&lt;h3&gt;&lt;strong&gt;Evaluating training success&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;With Amazon Bedrock, you can evaluate your fine-tuning job’s effectiveness through training metrics and loss curves. By analyzing the training loss progression across steps and epochs, you can assess whether your model is learning effectively and determine if hyperparameter adjustments are needed for optimal performance. Amazon Bedrock customization automatically stores training artifacts, including validation results, metrics, logs, and training data in your designated S3 bucket, giving you complete visibility into the training process. Training metrics data lets you track how your model performs with specific hyperparameters and make informed tuning decisions.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127720" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-3.png" alt="Training metrics table showing decreasing loss values across 5 training steps in epoch 0, from 4.04 to 2.34" width="514" height="266"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 3: Example training metrics in CSV format&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You can visualize your model’s training progress directly from the Amazon Bedrock Custom Models console. Select your customized model to access detailed metrics, including an interactive training loss curve that shows how effectively your model learned from the training data over time (Figure 4). The loss curve gives insight into how training progressed and whether hyperparameters need modification for effective training.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127719" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-4.png" alt="Training loss graph showing decreasing model performance metrics from 2.9 to 0.6 over 600 training steps for model examplebank-large-20260119-183250" width="1150" height="1086"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 4: Analyzing the loss curve from the training metrics&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;This loss curve shows that the model is performing well: the decreasing loss confirms the model successfully learned from your training data. Ideally, while the model is learning, the training loss and validation loss curves should track each other closely. A well-configured model shows steady convergence—the loss decreases smoothly without dramatic fluctuations. If you see oscillating patterns in your loss curve (wild swings up and down), reduce your &lt;code&gt;learningRateMultiplier&lt;/code&gt; by 50% and restart training. If your loss decreases too slowly (a flat or barely declining curve), increase your &lt;code&gt;learningRateMultiplier&lt;/code&gt; by 2x. If your loss plateaus early (flattens before reaching good accuracy), increase your &lt;code&gt;epochCount&lt;/code&gt; by 1-2 epochs.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127718" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-5.png" alt="Machine learning training loss curves showing three scenarios: converging too slow, oscillating, and optimal convergence patterns" width="1430" height="805"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 5:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;&lt;strong&gt;Understanding the loss curve&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; Your loss curve tells the complete story. A smooth downward trend means success. Wild oscillations mean that your learning rate is too high. Flat lines mean you need more epochs or better data. Monitor this one metric to avoid costly retraining.&lt;/p&gt; 
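&lt;p&gt;The triage rules above can be sketched as a small helper. This is an illustrative heuristic, not part of Amazon Bedrock; the thresholds are assumptions you would tune for your own runs, and the sample losses come from Figure 3.&lt;/p&gt;

```python
# Heuristic triage of the loss-curve patterns described above; the thresholds
# are illustrative assumptions, not Bedrock-documented values.

def diagnose_loss_curve(losses, plateau_tol=0.02):
    """Suggest a hyperparameter adjustment from a list of per-step training losses."""
    deltas = [later - earlier for earlier, later in zip(losses, losses[1:])]
    rises = sum(1 for d in deltas if d > 0)
    if rises >= len(deltas) // 2 + 1:
        # Oscillating: loss climbs about as often as it falls
        return "reduce learningRateMultiplier by 50% and restart"
    tail = deltas[len(deltas) // 2:]
    if plateau_tol > max(abs(d) for d in tail):
        # Flattened early: loss barely moves in the second half of training
        return "increase epochCount by 1-2 epochs"
    if sum(deltas) > -0.5:
        # Declining, but too slowly overall
        return "increase learningRateMultiplier by 2x"
    return "converging normally"

# The decreasing losses from Figure 3 look healthy:
print(diagnose_loss_curve([4.04, 3.60, 3.10, 2.70, 2.34]))  # converging normally
```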
&lt;h2&gt;&lt;strong&gt;Customization best practices&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Maximizing your fine-tuning success&lt;/strong&gt; starts with data quality. Small, high-quality datasets consistently outperform large, noisy ones. Focus on curating labeled examples that accurately represent your target domain rather than collecting massive volumes of mediocre data. Each training sample should be properly formatted and validated before use, as clean data directly translates to better model performance. Remember to specify an appropriate system prompt.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Common pitfalls to avoid&lt;/strong&gt; include over-training (running too many epochs after convergence), suboptimal data formatting (inconsistent JSON/JSONL structures), and poorly chosen hyperparameter settings. We recommend validating your training data format before starting and monitoring loss curves actively during training. Watch for signs that your model has converged; continuing training beyond this point wastes resources without improving results.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Cost and training time&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Training the customized Nova Micro model for our ATIS example with 4,978 combined examples and 3 training epochs (~1.75M total tokens) completed in about 1.5 hours and cost only $2.18, plus a $1.75 monthly recurring storage fee for the model. On-Demand inference using customized Amazon Nova models is charged at the same rate as the non-customized models. See the &lt;a href="https://aws.amazon.com/bedrock/pricing/" target="_blank" rel="noopener noreferrer"&gt;Bedrock pricing&lt;/a&gt; page for reference. The managed fine-tuning provided by Amazon Bedrock and the Amazon Nova models bring fine-tuning well within cost thresholds for most organizations. The ease of use and cost effectiveness opens new possibilities for customizing models to produce better and faster results without maintaining long prompts or knowledge bases of information specific to your organization.&lt;/p&gt; 
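&lt;p&gt;The arithmetic behind that estimate can be reproduced with a back-of-the-envelope sketch. The tokens-per-example and per-million-token figures below are implied by this example, not official prices; always check the Bedrock pricing page for current rates.&lt;/p&gt;

```python
# Back-of-the-envelope reconstruction of the training cost above. The
# tokens-per-example and per-million-token figures are implied by this example,
# not official prices; check the Bedrock pricing page for current rates.

examples = 4978
epochs = 3
avg_tokens_per_example = 117       # implied by ~1.75M total tokens / (4,978 x 3)
implied_rate_per_million = 1.25    # USD, implied by $2.18 / 1.75M tokens

total_tokens = examples * epochs * avg_tokens_per_example
cost = total_tokens / 1_000_000 * implied_rate_per_million
print(f"{total_tokens:,} training tokens, about ${cost:.2f}")  # 1,747,278 training tokens, about $2.18
```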
&lt;p&gt;&lt;strong&gt;Deploying and testing the fine-tuned model&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Consider on-demand inference for unpredictable or low-volume workloads. Use the more expensive provisioned throughput when needed for consistent, high-volume production workloads requiring guaranteed performance and lower per-token costs.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Model security considerations:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Restrict model invocation using IAM resource policies to control which users and applications can invoke your custom model.&lt;/li&gt; 
 &lt;li&gt;Implement authentication/authorization for API callers accessing the on-demand inference endpoint through IAM roles and policies.&lt;/li&gt; 
&lt;/ul&gt; 
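&lt;p&gt;As a sketch of the first point, an identity-based IAM policy can scope invocation to a single custom model deployment. The ARN below is a placeholder, and the exact actions and resources will depend on your account setup.&lt;/p&gt;

```python
import json

# Illustrative identity-based policy scoping invocation to one custom model
# deployment. The ARN is a placeholder; adapt actions and resources to your account.
custom_model_deployment_arn = (
    "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/EXAMPLE"
)
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInvokeCustomModelOnly",
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
            "Resource": custom_model_deployment_arn,
        }
    ],
}
print(json.dumps(policy, indent=2))
```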
&lt;p&gt;&lt;strong&gt;Network security: &lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/vpc-interface-endpoints.html" target="_blank" rel="noopener noreferrer"&gt;Configure VPC endpoints&lt;/a&gt; for Amazon Bedrock to keep traffic within your AWS network.&lt;/li&gt; 
 &lt;li&gt;Restrict network access to training and inference pipelines using security groups and network ACLs.&lt;/li&gt; 
 &lt;li&gt;Consider deploying resources within a VPC for additional network-level controls.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;To deploy the model, enter a unique deployment name and a description that explains what the custom model is used for, then choose &lt;strong&gt;Create&lt;/strong&gt; (Figure 6).&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127717" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-6.png" alt="Custom model on-demand deployment interface showing a three-step workflow and a table of model deployments with status tracking" width="1426" height="534"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 6: Deploying a custom model with on-demand inference&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;After the status changes to “Active”, the model is ready for use by your application and can be tested in the Amazon Bedrock playground. Choose &lt;strong&gt;Test in playground&lt;/strong&gt; (Figure 7).&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127716" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-7.png" alt="AWS Bedrock console screenshot showing the Custom Model Deployment Overview page for &amp;quot;nova-micro-atis-eval&amp;quot; deployment with active status, creation timestamp, and associated custom model details." width="825" height="286"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 7: &lt;/em&gt;&lt;em&gt;Testing the model from the deployed inference endpoint&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Logging and monitoring:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Enable the following for security auditing and incident response:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;AWS CloudTrail for Amazon Bedrock API call logging&lt;/li&gt; 
 &lt;li&gt;Amazon CloudWatch for model invocation metrics and performance monitoring&lt;/li&gt; 
 &lt;li&gt;S3 access logs for tracking data access patterns&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Testing the model in the playground:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;To test inference with the custom model, we use the Amazon Bedrock playground with the following example prompt (the system prompt followed by the user query):&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Classify the intent of airline queries. Choose one intent from this list: abbreviation, aircraft, aircraft+flight+flight_no, airfare, airfare+flight_time, airline, airline+flight_no, airport, capacity, cheapest, city, distance, flight, flight+airfare, flight_no, flight_time, ground_fare, ground_service, ground_service+ground_fare, meal, quantity, restriction\n\nRespond with only the intent name, nothing else. I would like to find a flight from charlotte to las vegas that makes a stop in st. louis&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;If called on the base model, the same prompt returns a less accurate answer.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Note that the system prompt provided with the training data for fine-tuning must be included with your prompt during invocation for best results. Because the playground does not provide a separate place to put the system prompt for our custom model, we include it in the preceding prompt string.&lt;/p&gt; 
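&lt;p&gt;Outside the playground, the same rule applies programmatically: send the training-time system prompt with every request. The following sketch builds a Converse API request with boto3; the deployment ARN is a placeholder, and the abbreviated system prompt stands in for the full prompt shown above.&lt;/p&gt;

```python
# Sketch of invoking the fine-tuned model with the Converse API. The deployment
# ARN is a placeholder, and the system prompt is abbreviated; send the full
# training-time system prompt with every call.

SYSTEM_PROMPT = (
    "Classify the intent of airline queries. Choose one intent from this list: "
    "abbreviation, aircraft, airfare, airline, airport, flight, flight_time, "
    "ground_service, meal, quantity, restriction. "
    "Respond with only the intent name, nothing else."
)

def build_converse_request(model_id, query):
    """Assemble kwargs for bedrock-runtime converse(); the system prompt rides along."""
    return {
        "modelId": model_id,
        "system": [{"text": SYSTEM_PROMPT}],
        "messages": [{"role": "user", "content": [{"text": query}]}],
        "inferenceConfig": {"maxTokens": 32, "temperature": 0},
    }

request = build_converse_request(
    "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/EXAMPLE",  # placeholder
    "i would like to find a flight from charlotte to las vegas that makes a stop in st. louis",
)
# import boto3
# reply = boto3.client("bedrock-runtime").converse(**request)
# print(reply["output"]["message"]["content"][0]["text"])
```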
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127715" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-8.png" alt="Screenshot of the Amazon Bedrock Chat/Text Playground interface demonstrating an airline query intent classification system with performance metrics and a sample user query." width="1068" height="451"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 8: &lt;/em&gt;&lt;em&gt;Manually evaluating a customized model in the test playground&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Evaluating your customized model&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;After you have trained your model, you must evaluate its real-world performance. A common evaluation is “LLM as a judge,” where a larger, more capable model with access to a full RAG database scores the trained model’s responses against the expected responses. Amazon Bedrock provides the Amazon Bedrock Evaluations service for this purpose (or you can use your own framework). For guidance, refer to the blog post &lt;a href="https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/" target="_blank" rel="noopener noreferrer"&gt;LLM-as-a-judge on Amazon Bedrock Model Evaluation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Your evaluation should use a test set of questions and answers, prepared using the same method as your training data, but kept separate so the model has not seen the exact questions. Figure 9 shows that the fine-tuned model achieves 97% accuracy on the test data set, compared with 41.4% for the base Nova Micro model, an improvement of 55.6 percentage points.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127714" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/ML-20219-9.png" alt="Bar chart comparing ATIS intent classification accuracy between base Nova Micro (41.4%) and fine-tuned Nova Micro (97.0%), showing a 55.6% improvement through fine-tuning at $2.18 training cost" width="2969" height="2089"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Figure 9: Evaluation of fine-tuning results vs. base model&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
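&lt;p&gt;If you want a lightweight check before running a full evaluation job, exact-match accuracy on a held-out set is straightforward to compute. The queries and the stubbed predictor below are illustrative; in practice &lt;code&gt;predict()&lt;/code&gt; would call your deployed model.&lt;/p&gt;

```python
# Minimal exact-match scoring for a held-out test set. The queries and the
# stubbed predictor are illustrative; in practice predict() calls the deployed model.

test_set = [
    {"query": "what does fare code qo mean", "intent": "abbreviation"},
    {"query": "show flights from boston to denver", "intent": "flight"},
    {"query": "how much is a first class ticket to dallas", "intent": "airfare"},
]

def exact_match_accuracy(examples, predict):
    hits = sum(1 for ex in examples if predict(ex["query"]).strip().lower() == ex["intent"])
    return hits / len(examples)

# Stub standing in for the fine-tuned model's responses (one wrong on purpose).
canned = {
    "what does fare code qo mean": "abbreviation",
    "show flights from boston to denver": "flight",
    "how much is a first class ticket to dallas": "ground_fare",
}
print(round(exact_match_accuracy(test_set, lambda q: canned[q]), 2))  # 0.67
```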
&lt;h2&gt;&lt;strong&gt;Beyond Amazon Bedrock customization&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Amazon Bedrock’s simplified customization experience will meet many customer needs. If you need more extensive control over customization, Amazon SageMaker AI provides a broader range of customization types and more detailed control over hyperparameters; see the blog post &lt;a href="https://aws.amazon.com/blogs/aws/announcing-amazon-nova-customization-in-amazon-sagemaker-ai/" target="_blank" rel="noopener noreferrer"&gt;Announcing Amazon Nova customization in Amazon SageMaker AI&lt;/a&gt; for more detail.&lt;/p&gt; 
&lt;p&gt;For cases where even more extensive customization is needed, &lt;a href="https://aws.amazon.com/nova/forge/" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova Forge&lt;/a&gt; provides a strategic alternative to building foundation models from scratch. While fine-tuning teaches specific task behaviors through labeled examples, Nova Forge uses continued pre-training to build comprehensive domain knowledge by immersing the model in millions to billions of tokens of unlabeled, proprietary data. This approach is ideal for organizations with massive proprietary datasets, highly specialized domains requiring deep expertise, or those building long-term strategic foundational models that will serve as organizational assets.&lt;/p&gt; 
&lt;p&gt;Nova Forge goes beyond standard fine-tuning by offering advanced capabilities including data mixing to mitigate catastrophic forgetting during full-rank supervised fine-tuning, checkpoint selection for optimal model performance, and bring-your-own-optimizer (BYOO) for multi-turn reinforcement fine-tuning. While requiring greater investment through an annual subscription and longer training cycles, Forge can deliver a significantly more cost-effective path than training foundation models from scratch. This approach is ideal for building strategic AI assets that serve as long-term competitive advantages. For Nova Forge customization examples, see the &lt;a href="https://github.com/aws-samples/amazon-nova-samples/tree/main/customization" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova Customization Hub&lt;/a&gt; on GitHub.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;As we have demonstrated through our intent classifier example, the Amazon Bedrock managed fine-tuning capabilities, together with the Nova and Nova 2 models, make AI customization accessible at low cost and with low effort. This simplified approach requires minimal data preparation and hyperparameter management, minimizing the need for dedicated data science skills. You can customize models to improve latency and reduce inference cost by reducing the tokens of contextual information that the model must process. Fine-tuning Nova models on Amazon Bedrock transforms generic foundation models into powerful, domain-specific tools that deliver higher accuracy and reduced latency, at low training cost. The ability of Amazon Bedrock to host the Nova models using &lt;a href="https://docs.aws.amazon.com/nova/latest/nova2-userguide/on-demand-inference.html" target="_blank" rel="noopener noreferrer"&gt;On-Demand inference&lt;/a&gt; allows you to run the model at the same per-token pricing as the base Nova model. See the &lt;a href="https://aws.amazon.com/bedrock/pricing/" target="_blank" rel="noopener noreferrer"&gt;Bedrock pricing page&lt;/a&gt; for current rates.&lt;/p&gt; 
&lt;p&gt;To get started with your own fine-tuning project using Amazon Bedrock, explore the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/custom-models.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock fine-tuning documentation&lt;/a&gt; and review sample notebooks in the &lt;a href="https://github.com/aws-samples/amazon-nova-samples/tree/main/customization" target="_blank" rel="noopener noreferrer"&gt;AWS Samples GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-128080" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/08/bhavya.jpg" alt="" width="2229" height="2326"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Bhavya Sruthi Sode&lt;/h3&gt; 
  &lt;p&gt;&lt;b&gt;Bhavya Sruthi Sode&lt;/b&gt; is a Technical Account Manager at Amazon Web Services, focused on AI/ML. She helps customers design resilient, scalable, and secure cloud architectures while driving successful outcomes in their enterprise cloud environments. With a background in Machine Learning, she is passionate about helping organizations transform their AI aspirations into practical solutions.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-29797 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/06/david-rostcheck-photo.png" alt="David Rostcheck" width="100" height="140"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;David Rostcheck&lt;/h3&gt; 
  &lt;p&gt;&lt;b&gt;David Rostcheck&lt;/b&gt; is a Sr. Specialist Solutions Architect at Amazon Web Services, focused on AI/ML, Bedrock, and agent solutions. He enjoys helping our customers deliver effective AI-based solutions to production.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Human-in-the-loop constructs for agentic workflows in healthcare and life sciences</title>
		<link>https://aws.amazon.com/blogs/machine-learning/human-in-the-loop-constructs-for-agentic-workflows-in-healthcare-and-life-sciences/</link>
					
		
		<dc:creator><![CDATA[Pierre de Malliard]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 19:48:07 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Amazon Machine Learning]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Healthcare]]></category>
		<category><![CDATA[Life Sciences]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<category><![CDATA[AI/ML]]></category>
		<guid isPermaLink="false">7fa91e859bb6495386818795e261253aa02631ed</guid>

					<description>In healthcare and life sciences, AI agents help organizations process clinical data, submit regulatory filings, automate medical coding, and accelerate drug development and commercialization. However, the sensitive nature of healthcare data and regulatory requirements like Good Practice (GxP) compliance require human oversight at key decision points. This is where human-in-the-loop (HITL) constructs become essential. In this post, you will learn four practical approaches to implementing human-in-the-loop constructs using AWS services.</description>
										<content:encoded>&lt;p&gt;In healthcare and life sciences, AI agents help organizations process clinical data, submit regulatory filings, automate medical coding, and accelerate drug development and commercialization. However, the sensitive nature of healthcare data and regulatory requirements like Good Practice (GxP) compliance require human oversight at key decision points. This is where human-in-the-loop (HITL) constructs become essential. In this post, you will learn four practical approaches to implementing human-in-the-loop constructs using AWS services.&lt;/p&gt; 
&lt;h2&gt;Why human-in-the-loop matters in healthcare&lt;/h2&gt; 
&lt;p&gt;Healthcare and life sciences organizations face unique challenges when deploying AI agents:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Regulatory compliance –&lt;/strong&gt; GxP regulations require human oversight for sensitive operations. For example, deleting patient records or modifying clinical trial protocols can’t proceed without documented authorization.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Patient safety –&lt;/strong&gt; Medical decisions affecting patient care must have clinical validation before execution.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Audit requirements –&lt;/strong&gt; Healthcare systems need complete traceability of who approved what actions and when.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Data sensitivity –&lt;/strong&gt; Protected Health Information (PHI) requires explicit authorization before access or modification.&lt;/p&gt; 
&lt;p&gt;HITL constructs provide the necessary control points while maintaining the efficiency gains of agentic automation to meet these requirements.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;We present four complementary approaches to implementing HITL in agentic workflows. Each workflow is suited for different scenarios and risk profiles as described in our &lt;a href="https://aws.amazon.com/blogs/machine-learning/a-guide-to-building-ai-agents-in-gxp-environments/" target="_blank" rel="noopener noreferrer"&gt;guide to building AI agents in GxP Environments&lt;/a&gt;. We build these patterns using the &lt;a href="https://strandsagents.com/" target="_blank" rel="noopener noreferrer"&gt;Strands Agents&lt;/a&gt; framework, &lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore&lt;/a&gt; Runtime, and the Model Context Protocol (MCP), with code examples that you can adapt for your own use cases.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Agentic Loop Interrupt (Agent Framework Hook System) –&lt;/strong&gt; We use the Strands Agent Framework Hooks to enforce the human-in-the-loop policy. With the hooks, we can intercept tool calls before their execution.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Tool Context Interrupt –&lt;/strong&gt; The human-in-the-loop approval logic can also be implemented within the tool logic directly for fine-grained, tool-specific control and flexibility. The session context can be used for custom approval logic.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Remote Tool Interrupt (AWS Step Functions) –&lt;/strong&gt; In some cases, one might want to send an approval request to a third party system or person asynchronously. We demonstrate this pattern by sending a notification to an external approver using Amazon Simple Notification Service (Amazon SNS). The agent session continues without blocking while approval proceeds in the background.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;MCP Elicitation –&lt;/strong&gt; The MCP protocol recently introduced elicitation, which is used by servers to request additional information from users through the client during interactions. The MCP’s native elicitation protocol allows for real-time, interactive approval using server-sent events (SSE) for stateful, two-way communication.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Architecture&lt;/h2&gt; 
&lt;p&gt;The solution architecture uses the Strands Agents Framework for agent lifecycle management and interrupt handling, deployed on Amazon Bedrock AgentCore Runtime for serverless scalability and session isolation. AWS Step Functions orchestrates asynchronous approval workflows with Amazon SNS, while MCP servers expose tools to the agent through the MCP—also deployed on AgentCore Runtime.&lt;/p&gt; 
&lt;h2&gt;Implementation details&lt;/h2&gt; 
&lt;p&gt;All the code for these architecture patterns is available publicly in the &lt;a href="https://github.com/aws-samples/sample-human-in-the-loop-patterns" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Each of the following methods demonstrates a self-contained approach. The agent deploys on Amazon Bedrock AgentCore Runtime with access to healthcare tools at different sensitivity levels. Low-risk operations, like looking up a patient’s name, execute without approval, while high-risk actions, like retrieving vitals or medical conditions, require human authorization. Operations such as patient discharge require external supervisor approval through email notification.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Method 1: Agentic loop hook local tool interrupt&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The Strands Agent Framework provides a &lt;strong&gt;hook system&lt;/strong&gt; that intercepts tool calls &lt;strong&gt;before&lt;/strong&gt; execution at the agent loop level. This enforces a blanket HITL policy across sensitive tools without modifying the tools themselves.&lt;/p&gt; 
&lt;p&gt;A &lt;code&gt;HookProvider&lt;/code&gt; registers a callback on &lt;code&gt;BeforeToolCallEvent&lt;/code&gt;. When a sensitive tool is invoked, the hook fires an &lt;code&gt;interrupt&lt;/code&gt;, pausing the agent loop until the human responds. The user can reply with “y” (approve once), “n” (deny), or “t” (trust—approve this tool for the rest of the session):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;class ApprovalHook(HookProvider):
    SENSITIVE_TOOLS = ["get_patient_condition", "get_patient_vitals"]

    def register_hooks(self, registry: HookRegistry, **kwargs: Any) -&amp;gt; None:
        registry.add_callback(BeforeToolCallEvent, self.approve)

    def approve(self, event: BeforeToolCallEvent) -&amp;gt; None:
        tool_name = event.tool_use["name"]
        if tool_name not in self.SENSITIVE_TOOLS:
            return

        # Skip if user previously chose "trust always" for this tool
        approval_key = f"{tool_name}-approval"
        if event.agent.state.get(approval_key) == "t":
            return

        approval = event.interrupt(
            approval_key,
            reason={"reason": f"Authorize {tool_name} with args: {event.tool_use.get('input', {})}"},
        )
        if approval.lower() not in ["y", "yes", "t"]:
            event.cancel_tool = f"User denied permission to run {tool_name}"
            return

        if approval.lower() == "t":
            event.agent.state.set(approval_key, "t")  # trust tool for the rest of the session
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The hook is attached to the agent at construction—tools remain completely unaware of the approval logic:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;agent = Agent(
    hooks=[ApprovalHook()],
    tools=[get_patient_name, get_patient_condition, get_patient_vitals],
)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Method 2: Tool context interrupt&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Instead of a centralized hook, the approval logic is embedded directly inside each tool using&amp;nbsp;&lt;code&gt;tool_context.interrupt()&lt;/code&gt;. This gives fine-grained, per-tool control: each tool can implement its own access rules based on session context. In this example, the agent session carries a&amp;nbsp;&lt;code&gt;user_role&lt;/code&gt;, and a shared&amp;nbsp;&lt;code&gt;check_access&lt;/code&gt; function enforces role-based access: non-physicians are denied outright, while physicians are prompted for approval. As in Method 1, the trust option caches the approval for the rest of the session:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;def check_access(tool_context, patient_id: str, action: str):
    user_role = tool_context.agent.state.get("user_role") or "Non-Physician"

    if user_role != "Physician":
        return f"Access denied: {action} requires Physician role (current: {user_role})"

    approval_key = f"{action}-{patient_id}-approval"
    if tool_context.agent.state.get(approval_key) == "t":
        return None  # previously trusted

    approval = tool_context.interrupt(
        approval_key,
        reason={"reason": f"[{user_role}] Authorize {action} for patient {patient_id}"},
    )
    if approval.lower() not in ["y", "yes", "t"]:
        return f"Physician denied access to {action} for patient {patient_id}"

    if approval.lower() == "t":
        tool_context.agent.state.set(approval_key, "t")
    return None  # approved
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Method 3: Asynchronous tool approval using AWS Step Functions&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;In many enterprise scenarios, the approval flow requires authorization from a third-party approver who is not the person invoking the agent. This necessitates an asynchronous approval workflow that can operate independently of the agent session. One effective approach uses &lt;strong&gt;AWS Step Functions&lt;/strong&gt; to orchestrate these external approval processes.&lt;/p&gt; 
&lt;p&gt;In this pattern, the agent tool triggers a Step Functions workflow that sends an approval request to an external approver by email through Amazon SNS. The tool polls for the approval result and updates the agent session state accordingly. The user can also check the approval status later using a separate &lt;code&gt;check_discharge_status&lt;/code&gt; tool. The &lt;code&gt;discharge_patient&lt;/code&gt; tool starts the Step Functions execution and polls for the result:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;@tool(context=True)
def discharge_patient(tool_context, patient_id: str, reason: str) -&amp;gt; str:
    # Skip workflow if already approved in this session
    if tool_context.agent.state.get("external-approver-state") == "approved":
        return f"Patient {patient_id} discharged (pre-approved). Reason: {reason}"

    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"patient_id": patient_id, "action": "discharge", "reason": reason}),
    )
    return f"Waiting for approval. Execution ARN: {response['executionArn']}"
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This asynchronous approach enables non-blocking operations: users aren’t forced to wait for approvals that can take hours or days, and agent execution continues independently. Step Functions maintains a detailed audit trail with complete execution history, persists state across session timeouts, and integrates with existing enterprise communication channels like email, Slack, or Microsoft Teams. When a user starts a sensitive workflow, the tool triggers a Step Functions execution and the agent returns a confirmation that the workflow was launched. At any time, the user can check for a state update to confirm that the workflow completed.&lt;/p&gt; 
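&lt;p&gt;The companion status check can be a thin wrapper over the Step Functions &lt;code&gt;DescribeExecution&lt;/code&gt; API. In this sketch the boto3 call is commented out so the message-mapping helper can be exercised offline; the message wording is illustrative.&lt;/p&gt;

```python
# Sketch of a companion check_discharge_status helper. summarize_execution is a
# pure mapping over the Step Functions DescribeExecution response, so it can be
# exercised offline; the message wording is illustrative.

def summarize_execution(resp):
    status = resp["status"]  # RUNNING | SUCCEEDED | FAILED | TIMED_OUT | ABORTED
    if status == "RUNNING":
        return "Discharge approval is still pending with the external approver."
    if status == "SUCCEEDED":
        return "Discharge approved and completed."
    return f"Discharge workflow ended with status {status}; escalate to a supervisor."

# In the real tool:
# import boto3
# resp = boto3.client("stepfunctions").describe_execution(executionArn=execution_arn)
print(summarize_execution({"status": "RUNNING"}))
```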
&lt;p&gt;&lt;strong&gt;Method 4: MCP elicitation &lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;MCP recently introduced elicitation, which allows MCP servers to request additional information or approval from users during tool execution. This approach follows protocol standards and provides a dynamic mechanism for prompting users at runtime without requiring parameters to be hardwired upfront. It can be used to authorize a tool call and include some business justification.&lt;/p&gt; 
&lt;p&gt;When a sensitive tool is called, the MCP server pauses execution and sends an approval prompt back through the MCP client to the end user. The user sees the prompt, makes a decision, and the server resumes—either proceeding with the operation or denying access. This two-way communication is enabled by MCP’s streamable HTTP transport, which maintains a stateful connection between client and server.&lt;/p&gt; 
&lt;p&gt;On the MCP server, the approval logic is a single &lt;code&gt;ctx.elicit()&lt;/code&gt; call inside each sensitive tool:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;@server.tool
async def get_patient_condition(patient_id: str, ctx: Context) -&amp;gt; str:
    """Get patient condition. Sensitive — requires approval via MCP elicitation."""
    result = await ctx.elicit(
        f"⚠ Approve access to SENSITIVE condition data for patient {patient_id}?"
    )
    if result.action != "accept":
        return f"Access to condition data for patient {patient_id} DENIED."
    return f"Patient {patient_id} condition: Hypertension Stage 2, Type 2 Diabetes"
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;On the agent side, an &lt;code&gt;elicitation_callback&lt;/code&gt; is registered with the MCP client. When the server calls &lt;code&gt;ctx.elicit()&lt;/code&gt;, this callback fires, relaying the approval prompt to the user and returning their decision back to the server. For local agents, this is a terminal prompt. For agents deployed on AgentCore Runtime, we use a WebSocket connection to relay the elicitation to the remote end user in real time:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-127007 " src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/Screenshot-2026-03-24-at-1.45.21 PM.png" alt="" width="504" height="212"&gt;&lt;/p&gt; 
&lt;p&gt;This approach keeps the approval logic entirely within the MCP server’s tool definitions. The agent itself has no knowledge of which tools require approval, so you can add or modify approval requirements independently.&lt;/p&gt; 
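As a concrete illustration, the relay logic can be sketched as a plain Python callback. This is a local simulation, not the actual MCP client API: the names `make_elicitation_callback` and `ask_user` are illustrative stand-ins.

```python
# Local simulation of an agent-side elicitation callback. The names here
# (make_elicitation_callback, ask_user) are illustrative stand-ins, not the
# actual MCP client API.
def make_elicitation_callback(ask_user):
    """Build a callback that relays server approval prompts to the user.

    ask_user is whatever channel the agent has: input() for a local
    terminal agent, or a WebSocket send/receive pair on AgentCore Runtime.
    """
    def elicitation_callback(prompt: str) -> dict:
        answer = ask_user(prompt)
        action = "accept" if answer.strip().lower() in ("y", "yes", "approve") else "decline"
        return {"action": action}
    return elicitation_callback

# Simulate an approving user instead of a live terminal or WebSocket.
callback = make_elicitation_callback(lambda prompt: "yes")
print(callback("Approve access to SENSITIVE condition data for patient 1234?"))
# prints {'action': 'accept'}
```

Because the callback only relays prompts and decisions, the same function works unchanged whether the user sits at a terminal or behind a WebSocket.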
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;You can use these human-in-the-loop (HITL) constructs to build safe, compliant AI agent deployments in healthcare and life sciences. By implementing the appropriate HITL pattern for your use case, you can deploy production-ready workflows that scale from pilot projects to enterprise-wide deployments. Start by identifying which operations in your workflow require human oversight. Then, select the HITL pattern that matches your approval requirements—centralized (Method 1), tool-specific (Method 2), asynchronous (Method 3), or real-time (Method 4).&lt;/p&gt; 
&lt;p&gt;For more information about Amazon Bedrock AgentCore, visit the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the author&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-127008 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/profile_picture-100x150.jpeg" alt="" width="100" height="150"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Pierre de Malliard&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Pierre de Malliard&lt;/strong&gt; is a Senior AI/ML Solutions Architect at Amazon Web Services and supports customers in the Healthcare and Life Sciences Industry. Pierre has 10+ years of experience building machine learning applications and platforms. In his spare time, he enjoys playing the piano and spending time in nature.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Building intelligent audio search with Amazon Nova Embeddings: A deep dive into semantic audio understanding</title>
		<link>https://aws.amazon.com/blogs/machine-learning/building-intelligent-audio-search-with-amazon-nova-embeddings-a-deep-dive-into-semantic-audio-understanding/</link>
					
		
		<dc:creator><![CDATA[Madhavi Evana]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 19:45:13 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Nova]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">42f7a3cdb13904d196748f642bd12a9fc2e82cc7</guid>

					<description>This post walks you through understanding audio embeddings, implementing Amazon Nova Multimodal Embeddings, and building a practical search system for your audio content. You'll learn how embeddings represent audio as vectors, explore the technical capabilities of Amazon Nova, and see hands-on code examples for indexing and querying your audio libraries. By the end, you'll have the knowledge to deploy production-ready audio search capabilities.</description>
										<content:encoded>&lt;p&gt;If you’re looking to enhance your content understanding and search capabilities, audio embeddings offer a powerful solution. In this post, you’ll learn how to use &lt;a href="https://aws.amazon.com/ai/generative-ai/nova/" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova Multimodal Embeddings&lt;/a&gt; to transform your audio content to searchable, intelligent data that captures acoustic features like tone, emotion, musical characteristics, and environmental sounds.&lt;/p&gt; 
&lt;p&gt;Finding specific content in large audio libraries presents real technical challenges. Traditional search methods like manual transcription, metadata tagging, and speech-to-text conversion work well for capturing and searching spoken words. However, these text-based approaches focus on linguistic content rather than acoustic properties like tone, emotion, musical characteristics, and environmental sounds. Audio embeddings address this gap. They represent your audio as dense numerical vectors in high-dimensional space that encode both semantic and acoustic properties. These representations let you perform semantic search using natural language queries, match similar-sounding audio, and automatically categorize content based on what it sounds like rather than just metadata tags. Amazon Nova Multimodal Embeddings, announced on October 28, 2025, is a multimodal embedding model available in Amazon Bedrock [1]. It’s a unified model that supports text, documents, images, video, and audio, enabling accurate cross-modal retrieval through a single model.&lt;/p&gt; 
&lt;p&gt;This post walks you through understanding audio embeddings, implementing Amazon Nova Multimodal Embeddings, and building a practical search system for your audio content. You’ll learn how embeddings represent audio as vectors, explore the technical capabilities of Amazon Nova, and see hands-on code examples for indexing and querying your audio libraries. By the end, you’ll have the knowledge to deploy production-ready audio search capabilities.&lt;/p&gt; 
&lt;h2&gt;Understanding Audio Embeddings: Core Concepts&lt;/h2&gt; 
&lt;h3&gt;Vector Representations for Audio Content&lt;/h3&gt; 
&lt;p&gt;Think of audio embeddings as a coordinate system for sound. Just as GPS coordinates pinpoint locations on Earth, embeddings map your audio content to specific points in high-dimensional space. Amazon Nova Multimodal Embeddings gives you four output dimension options: 3,072 (default), 1,024, 384, or 256 [1]. Each embedding is a float32 array. Individual dimensions encode acoustic and semantic features—rhythm, pitch, timbre, emotional tone, and semantic meaning—all learned through the model’s neural network architecture during training. Amazon Nova uses Matryoshka Representation Learning (MRL), a technique that structures embeddings hierarchically [1]. Think of MRL like Russian nesting dolls. A 3,072-dimension embedding contains all the information, but you can extract just the first 256 dimensions and still get accurate results. Generate embeddings once, then choose the size that balances accuracy with storage costs. There is no need to reprocess your audio when trying different dimensions—the hierarchical structure lets you truncate to your preferred size.&lt;/p&gt; 
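The truncation idea can be sketched in a few lines of NumPy. A random vector stands in for a real Nova embedding here, and renormalizing after truncation (so cosine comparisons stay well scaled) is an assumption of this sketch rather than something the API does for you.

```python
import numpy as np

# A random vector stands in for a real 3,072-dimension Nova embedding.
full = np.random.default_rng(0).standard_normal(3072).astype(np.float32)

def truncate(embedding, dim):
    """Keep the first `dim` dimensions and renormalize (sketch assumption)."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)

small = truncate(full, 256)
print(small.shape)  # (256,)
```

The same stored 3,072-dimension vector can thus serve 1,024-, 384-, or 256-dimension indexes without regenerating embeddings.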
&lt;p&gt;&lt;strong&gt;How you measure similarity:&lt;/strong&gt; When you want to find similar audio, you compute cosine similarity between two embeddings v₁ and v₂ [1]:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;similarity = (v₁ · v₂) / (||v₁|| × ||v₂||)&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Cosine similarity measures the angle between vectors, giving you values from -1 to 1. Values closer to 1 indicate higher semantic similarity. When you store embeddings in a vector database, it uses distance metrics (distance = 1 – similarity) to perform k-nearest neighbor (k-NN) searches, retrieving the top-k most similar embeddings for your query.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt; Suppose you have two audio clips—”a violin playing a melody” and “a cello playing a similar melody”—that generate embeddings v₁ and v₂. If their cosine similarity is 0.87, they cluster near each other in vector space, indicating strong acoustic and semantic relatedness. A different audio clip like “rock music with drums” generates v₃ with cosine similarity 0.23 to v₁, placing it far away in the embedding space.&lt;/p&gt; 
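The similarity formula above can be computed directly with NumPy. The three-dimensional vectors below are toy values for illustration only; real Nova embeddings have 256 to 3,072 dimensions.

```python
import numpy as np

def cosine_similarity(v1, v2):
    # similarity = (v1 . v2) / (||v1|| * ||v2||)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

violin = np.array([0.9, 0.3, 0.1])   # toy embedding for "violin melody"
cello  = np.array([0.8, 0.4, 0.2])   # toy embedding for "cello melody"
drums  = np.array([-0.2, 0.9, 0.4])  # toy embedding for "rock with drums"

print(round(cosine_similarity(violin, cello), 2))  # high: acoustically related
print(round(cosine_similarity(violin, drums), 2))  # low: unrelated
```

A vector database performs the same computation, just over millions of stored vectors with an index that avoids comparing against every one.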
&lt;h3&gt;Audio Processing Architecture and Modalities&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Understanding the end-to-end workflow:&lt;/strong&gt; Before diving into technical details, let’s look at how audio embeddings work in practice. There are two main workflows:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127391" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ML-20119-image1.png" alt="" width="1261" height="681"&gt;&lt;/p&gt; 
&lt;p&gt;Figure 1 – End-to-end audio embedding workflow&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Data ingestion and indexing flow:&lt;/strong&gt; During the ingestion phase, you process your audio library in bulk. You upload audio files to Amazon S3, then use the asynchronous API to generate embeddings. For long audio files (over 30 seconds), the model automatically segments them into smaller chunks with temporal metadata. You store these embeddings in a vector database along with metadata like filename, duration, and genre. This happens once for your entire audio library.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Runtime search flow:&lt;/strong&gt; When a user searches, you use the synchronous API to generate an embedding for their query—whether it’s text like “upbeat jazz piano” or another audio clip. Because queries are short, and users expect fast results, the synchronous API provides low-latency responses. The vector database performs a k-NN search to find the most similar audio embeddings, returning results with their associated metadata. This entire search happens in milliseconds.&lt;/p&gt; 
&lt;p&gt;When you submit audio-only inputs, temporal convolutional networks or transformer-based architectures analyze your acoustic signals for spectro-temporal patterns. Rather than working with raw waveforms, Amazon Nova operates on audio representations like mel-spectrograms or learned audio features, which allows efficient processing of high-sample-rate audio [1]. Audio is sequential data that requires temporal context. Your audio segments (up to 30 seconds) pass through architectures with temporal receptive fields that capture acoustic patterns across time [1]. This approach captures rhythm, cadence, prosody, and long-range acoustic dependencies spanning multiple seconds—preserving the full richness of your audio content.&lt;/p&gt; 
&lt;h3&gt;API Operations and Request Structures&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;When to use synchronous embedding generation:&lt;/strong&gt; Use the &lt;code&gt;invoke_model&lt;/code&gt; API for runtime search when you need embeddings for real-time applications where latency matters [1]. For example, when a user submits a search query, the query text is short, and you want to provide a fast user experience—the synchronous API is ideal for this:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import boto3
import json
 
# Create the Bedrock Runtime client.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
 
# Define the request body for a search query.
request_body = {
    "taskType": "SINGLE_EMBEDDING",  # Use for single items
    "singleEmbeddingParams": {
        "embeddingPurpose": "GENERIC_RETRIEVAL",  # Use GENERIC_RETRIEVAL for queries
        "embeddingDimension": 1024,  # Choose dimension size
        "text": {
            "truncationMode": "END",  # How to handle long inputs
            "value": "jazz piano music"  # Your search query
        }
    }
}
 
# Invoke the Nova Embeddings model.
response = bedrock_runtime.invoke_model(
    body=json.dumps(request_body),
    modelId="amazon.nova-2-multimodal-embeddings-v1:0",
    contentType="application/json"
)
 
# Extract the embedding from response.
response_body = json.loads(response["body"].read())
embedding = response_body["embeddings"][0]["embedding"]  # float32 array
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Understanding request parameters:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;taskType&lt;/strong&gt;: Choose &lt;code&gt;SINGLE_EMBEDDING&lt;/code&gt; for single items or &lt;code&gt;SEGMENTED_EMBEDDING&lt;/code&gt; for chunked processing [1, 2]&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;embeddingPurpose&lt;/strong&gt;: Optimizes embeddings for your use case—&lt;code&gt;GENERIC_INDEX&lt;/code&gt; for indexing your content, &lt;code&gt;GENERIC_RETRIEVAL&lt;/code&gt; for queries, &lt;code&gt;DOCUMENT_RETRIEVAL&lt;/code&gt; for document search [1]&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;embeddingDimension&lt;/strong&gt;: Your output dimension choice (3072, 1024, 384, 256) [1]&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;truncationMode&lt;/strong&gt;: How to handle inputs exceeding context length—&lt;code&gt;END&lt;/code&gt; truncates at the end, START at beginning [1]&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;What you get back:&lt;/strong&gt; The API returns a JSON object containing your embedding:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "embeddings": [
    {
      "embedding": [0.123, -0.456, 0.789, ...],  // float32 array
      "embeddingLength": 1024
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;When to use asynchronous processing:&lt;/strong&gt; Amazon Nova Multimodal Embeddings supports two approaches for processing large volumes of content: the asynchronous API and the batch API. Understanding when to use each helps you optimize your workflow.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Asynchronous API:&lt;/strong&gt; Use the &lt;code&gt;start_async_invoke&lt;/code&gt; API when you need to process large individual audio or video files that exceed the synchronous API limits [1]. This is ideal for:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Processing single large files (multi-hour recordings, full-length videos)&lt;/li&gt; 
 &lt;li&gt;Files requiring segmentation (over 30 seconds)&lt;/li&gt; 
 &lt;li&gt;When you need results within hours but not immediately&lt;/li&gt; 
&lt;/ul&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;# model_input is a SEGMENTED_EMBEDDING request body for your audio file
# (see the segmentationConfig example later in this post for the parameters).
response = bedrock_runtime.start_async_invoke(
    modelId="amazon.nova-2-multimodal-embeddings-v1:0",
    modelInput=model_input,
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/output/"}
    }
)
invocation_arn = response["invocationArn"]
# Poll job status
job = bedrock_runtime.get_async_invoke(invocationArn=invocation_arn)
status = job["status"]  # "InProgress" | "Completed" | "Failed"
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;When your job completes, it writes output to Amazon S3 in JSONL format (one JSON object per line). For AUDIO_VIDEO_COMBINED mode, you’ll find the output in &lt;code&gt;embedding-audio-video.jsonl&lt;/code&gt; [1].&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Batch API:&lt;/strong&gt; Use the batch inference API when you need to process thousands of audio files in a single job [3].&lt;/p&gt; 
&lt;p&gt;This is ideal for:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Bulk processing of your entire audio library (thousands to millions of files)&lt;/li&gt; 
 &lt;li&gt;Cost optimization through batch pricing&lt;/li&gt; 
 &lt;li&gt;Non-time-sensitive indexing operations where you can wait 24-48 hours&lt;/li&gt; 
 &lt;li&gt;Processing many small-to-medium sized files efficiently&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The batch API offers better cost efficiency for large-scale operations and handles job management automatically. You submit a manifest file with all your input files, and the service processes them in parallel, writing results to S3.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Choosing between async and batch:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Single large file or real-time segmentation needs?&lt;/strong&gt; → Use async API&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Thousands of files to process in bulk?&lt;/strong&gt; → Use batch API&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Need results within hours?&lt;/strong&gt; → Use async API&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Can wait 24-48 hours for cost savings?&lt;/strong&gt; → Use batch API&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Learn more about batch inference in the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference-supported.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock batch inference documentation&lt;/a&gt; [3].&lt;/p&gt; 
&lt;h3&gt;Segmentation and Temporal Metadata&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Why you need segmentation:&lt;/strong&gt; If your audio files exceed 30 seconds, you need to segment them [1]. Imagine you have a 2-hour podcast and want to find the specific 30-second segment where the host discusses AI—segmentation makes this possible.&lt;/p&gt; 
&lt;p&gt;You control chunking with the &lt;code&gt;segmentationConfig&lt;/code&gt; parameter:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;"segmentationConfig": {
    "durationSeconds": 15  # Generate one embedding every 15 seconds
}
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This configuration processes a 5-minute audio file (300 seconds) into 20 segments (300 ÷ 15 = 20), generating 20 embeddings [1]. Each segment receives temporal metadata marking its position in your original file.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Understanding segmented output:&lt;/strong&gt; The asynchronous API writes your segmented embeddings to JSONL with temporal metadata [1]:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{"startTime": 0.0, "endTime": 15.0, "embedding": [...]}
{"startTime": 15.0, "endTime": 30.0, "embedding": [...]}
{"startTime": 30.0, "endTime": 45.0, "embedding": [...]}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;How to parse segmented output:&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import json
from boto3 import client
s3 = client("s3", region_name="us-east-1")
# Read JSONL file from S3
response = s3.get_object(Bucket="bucket", Key="output/embedding-audio-video.jsonl")
content = response['Body'].read().decode('utf-8')
segments = []
for line in content.strip().split('\n'):
    if line:
        segment = json.loads(line)
        segments.append({
            'start': segment['startTime'],
            'end': segment['endTime'],
            'embedding': segment['embedding'],
            'duration': segment['endTime'] - segment['startTime']
        })
print(f"Processed {len(segments)} segments")
print(f"First segment: {segments[0]['start']:.1f}s - {segments[0]['end']:.1f}s")
print(f"Embedding dimension: {len(segments[0]['embedding'])}")
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Real-world use case—temporal search:&lt;/strong&gt; You can store segmented embeddings with their temporal metadata in a vector database. When someone searches for “customer complaint about billing,” you retrieve the specific 15-second segments with timestamps, giving you precise navigation to relevant moments within multi-hour call recordings. There is no need to listen to the entire recording.&lt;/p&gt; 
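The temporal search idea can be sketched with toy data: given pre-computed segment embeddings with timestamps, find the segment most similar to a query embedding. The two-dimensional vectors are illustrative stand-ins for real Nova embeddings.

```python
import numpy as np

# Toy temporal search over pre-computed segment embeddings (illustrative data;
# real segment embeddings come from the asynchronous API output above).
segments = [
    {"start": 0.0,  "end": 15.0, "embedding": np.array([0.1, 0.9])},
    {"start": 15.0, "end": 30.0, "embedding": np.array([0.9, 0.1])},
    {"start": 30.0, "end": 45.0, "embedding": np.array([0.5, 0.5])},
]
query = np.array([0.95, 0.05])  # stand-in for the embedded text query

def best_segment(query, segments):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(segments, key=lambda s: cos(query, s["embedding"]))

hit = best_segment(query, segments)
print(f"Jump to {hit['start']:.0f}s-{hit['end']:.0f}s")  # the 15s-30s segment
```

In production, a vector database performs this maximum-similarity step, and the temporal metadata stored alongside each vector gives you the playback offset.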
&lt;h3&gt;Vector Storage and Indexing Strategies&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Referring to the architecture:&lt;/strong&gt; The workflow diagram in Figure 1 showed the end-to-end flow. Now we’re diving deeper into the &lt;strong&gt;Vector Database&lt;/strong&gt; component—the storage layer where your embeddings live during both the ingestion phase and the runtime search phase. This is the critical component that connects your indexed audio embeddings to fast search queries.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Understanding your storage requirements:&lt;/strong&gt; Embeddings are float32 arrays requiring 4 bytes per dimension. Here’s what you’ll need:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;3,072 dimensions&lt;/strong&gt;: 12,288 bytes (12 KB) per embedding&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;1,024 dimensions&lt;/strong&gt;: 4,096 bytes (4 KB) per embedding&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;384 dimensions&lt;/strong&gt;: 1,536 bytes (1.5 KB) per embedding&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;256 dimensions&lt;/strong&gt;: 1,024 bytes (1 KB) per embedding&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Example calculation:&lt;/strong&gt; For 1 million audio clips with 1,024-dimensional embeddings, you need 4 GB of vector storage (excluding metadata and index structures).&lt;/p&gt; 
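The sizing arithmetic above is easy to reproduce with a small helper (a sketch of the 4-bytes-per-float32-dimension rule, excluding metadata and index overhead):

```python
# Storage sizing for float32 embeddings: 4 bytes per dimension.
def storage_bytes(dimension, count=1):
    return dimension * 4 * count

for dim in (3072, 1024, 384, 256):
    print(f"{dim} dims: {storage_bytes(dim)} bytes per embedding")

# 1 million clips at 1,024 dimensions:
print(f"{storage_bytes(1024, 1_000_000) / 1e9:.3f} GB")  # 4.096 GB
```

Run this with your own clip counts and candidate dimensions to compare storage footprints before committing to an index configuration.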
&lt;p&gt;&lt;strong&gt;Choosing your dimension size:&lt;/strong&gt; Larger dimensions give you more detailed representations but require more storage and computation. Smaller dimensions offer a practical balance between retrieval performance and resource efficiency. Start with 1,024 dimensions—it provides excellent accuracy for most applications while keeping costs manageable.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Using Amazon S3 Vectors:&lt;/strong&gt; You can store and query your embeddings using Amazon S3 Vectors [2]:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import boto3

s3vectors = boto3.client("s3vectors", region_name="us-east-1")
# Create vector index
s3vectors.create_index(
    vectorBucketName="audio-vectors",
    indexName="audio-embeddings",
    dimension=1024,
    dataType="float32",
    distanceMetric="cosine"
)
# Store embedding with metadata
s3vectors.put_vectors(
    vectorBucketName="audio-vectors",
    indexName="audio-embeddings",
    vectors=[{
        "key": "audio:track_12345",
        "data": {"float32": embedding},
        "metadata": {
            "filename": "track_12345.mp3",
            "duration": 180.5,
            "genre": "jazz",
            "upload_date": "2025-10-28"
        }
    }]
)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;How metadata enhances your search:&lt;/strong&gt; Metadata attributes work alongside embeddings to provide richer search results. When you retrieve results from the vector database, the metadata helps you filter, sort, and display information to users. For example, the genre field lets you filter results to only jazz recordings, duration helps you find tracks within a specific length range, and filename provides the path to the actual audio file for playback. The &lt;code&gt;upload_date&lt;/code&gt; can help you prioritize recent content or track data freshness. This combination of semantic similarity (from embeddings) and structured metadata creates a powerful search experience.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Querying your vectors:&lt;/strong&gt; k-NN search retrieves the top-k most similar vectors [2]:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;response = s3vectors.query_vectors(
    vectorBucketName="audio-vectors",
    indexName="audio-embeddings",
    queryVector={"float32": query_embedding},
    topK=10,  # Return 10 most similar results
    returnDistance=True,
    returnMetadata=True
)
for result in response["vectors"]:
    print(f"Key: {result['key']}")
    print(f"Distance: {result['distance']:.4f}")  # Lower = more similar
    print(f"Metadata: {result['metadata']}")
 
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Using Amazon OpenSearch Service:&lt;/strong&gt; OpenSearch provides native k-NN search with HNSW (Hierarchical Navigable Small World) indexes for sub-linear query time complexity [1]. This means your searches stay fast even as your audio library grows to millions of files.&lt;/p&gt; 
&lt;p&gt;Index configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "mappings": {
    "properties": {
      "audio_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {"ef_construction": 512, "m": 16}
        }
      },
      "metadata": {"type": "object"}
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;h3&gt;Batch Optimization and Production Patterns&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Why batch processing matters:&lt;/strong&gt; When you process multiple audio files, batching reduces network latency overhead and improves throughput [1]. In the pattern below, embeddings are generated per item, but the vector writes are combined into a single batch call instead of one write per file.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Example batch pattern:&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;texts = ["jazz music", "rock music", "classical music"]
vectors = []
for text in texts:
    response = bedrock_runtime.invoke_model(
        body=json.dumps({
            "taskType": "SINGLE_EMBEDDING",
            "singleEmbeddingParams": {
                "embeddingDimension": 1024,
                "text": {"truncationMode": "END", "value": text}
            }
        }),
        modelId="amazon.nova-2-multimodal-embeddings-v1:0",
        contentType="application/json"
    )
    embedding = json.loads(response["body"].read())["embeddings"][0]["embedding"]
    vectors.append(embedding)
# Batch write to vector store
s3vectors.put_vectors(
    vectorBucketName="audio-vectors",
    indexName="audio-embeddings",
    vectors=[
        {"key": f"text:{text}", "data": {"float32": emb}}
        for text, emb in zip(texts, vectors)
    ]
)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Multilingual support:&lt;/strong&gt; The model supports text inputs in 200+ languages [1]. This enables powerful cross-lingual search scenarios: your customers can search in Spanish for audio content indexed in English, or vice versa. The embeddings capture semantic meaning across languages.&lt;/p&gt; 
&lt;h2&gt;Amazon Nova Audio Multimodal Embeddings Deep Dive&lt;/h2&gt; 
&lt;h3&gt;Technical Specifications&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Model architecture:&lt;/strong&gt; Amazon Nova Multimodal Embeddings is built on a foundation model trained to understand relationships across different modalities—text, images, documents, video, and audio—within a unified embedding space.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Flexible embedding dimensions:&lt;/strong&gt; You get four output dimension options: 3,072, 1,024, 384, and 256. Larger dimensions provide more detailed representations but require more storage and computation. Smaller dimensions offer a practical balance between retrieval performance and resource efficiency. This flexibility helps you optimize for your specific application and cost requirements.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Media processing capabilities:&lt;/strong&gt; For video and audio inputs, the model supports segments of up to 30 seconds, and automatically segments longer files [1]. This segmentation capability is particularly useful when you work with large media files—the model splits them into manageable pieces and creates embeddings for each segment. The output includes embeddings for your video and audio files with temporal metadata.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;API flexibility:&lt;/strong&gt; You can access the model through both synchronous and asynchronous APIs. Use synchronous APIs for querying where latency matters. Use asynchronous APIs for data ingestion and indexing where you can tolerate longer processing times. The asynchronous API supports batch segmentation/chunking for text, audio, and video files. Segmentation refers to splitting a long file into smaller chunks, each of which creates a unique embedding, allowing for fine-grained and more accurate retrieval.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Input methods:&lt;/strong&gt; You can pass content to embed by specifying an S3 URI or inline as a base64 encoding. This gives you flexibility in how you integrate embeddings into your workflow.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;How the workflow works:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;You use Amazon Nova Multimodal Embeddings to generate embeddings for your video or audio clips&lt;/li&gt; 
 &lt;li&gt;You store the embeddings in a vector database&lt;/li&gt; 
 &lt;li&gt;When your end-user searches for content, you use Amazon Nova to generate an embedding for their search query&lt;/li&gt; 
 &lt;li&gt;Your application compares how similar the search query embedding is to your indexed content embeddings&lt;/li&gt; 
 &lt;li&gt;Your application retrieves the content that best matches the search query based on a similarity metric (such as cosine similarity)&lt;/li&gt; 
 &lt;li&gt;You show the corresponding content to your end-user&lt;/li&gt; 
&lt;/ol&gt; 
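The six steps above can be sketched as a tiny in-memory index. This is a local stand-in only: in the sketch, hand-written vectors play the role of Nova embeddings, and a Python list plays the role of the vector database.

```python
import numpy as np

# Minimal in-memory stand-in for the workflow above. The hand-written vectors
# stand in for real Nova embeddings, and the list plays the vector database.
class TinyAudioIndex:
    def __init__(self):
        self.items = []  # (key, unit-normalized embedding)

    def add(self, key, embedding):
        # Step 1-2: embed the clip, store it in the "database".
        v = np.asarray(embedding, dtype=np.float32)
        self.items.append((key, v / np.linalg.norm(v)))

    def search(self, query_embedding, top_k=3):
        # Steps 3-5: embed the query, compare by cosine similarity, rank.
        q = np.asarray(query_embedding, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scored = [(key, float(np.dot(q, v))) for key, v in self.items]
        return sorted(scored, key=lambda kv: kv[1], reverse=True)[:top_k]

index = TinyAudioIndex()
index.add("jazz_clip", [0.9, 0.1, 0.0])
index.add("rock_clip", [0.1, 0.9, 0.1])
print(index.search([0.8, 0.2, 0.0], top_k=1)[0][0])  # jazz_clip
```

Swapping the list for Amazon S3 Vectors or OpenSearch and the hand-written vectors for `invoke_model` output turns this sketch into the production flow described above.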
&lt;p&gt;&lt;strong&gt;Supported inputs:&lt;/strong&gt; Your inputs to generate embeddings can be in text, image, document image, video, or audio form. The inputs refer to both the items you use to create the index and the end-user search queries. The model outputs embeddings which you use to retrieve the assets that best match the query to display to your end-user.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Audio format support:&lt;/strong&gt; Amazon Nova Multimodal Embeddings currently supports MP3, WAV, and OGG as input formats. These formats cover most common audio use cases from music to speech recordings.&lt;/p&gt; 
&lt;h3&gt;Key Capabilities&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Audio-to-Audio search:&lt;/strong&gt; Find acoustically similar content in your library. For example, find all recordings with similar musical characteristics or speaking styles.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Text-to-Audio search:&lt;/strong&gt; Use natural language queries to retrieve relevant audio segments. Search for “upbeat jazz piano” or “customer expressing frustration” and get back matching audio clips.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Cross-modal retrieval:&lt;/strong&gt; Search across images, audio, video, and text simultaneously. This unified approach means you can use one query to search your entire content library regardless of format.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Temporal understanding:&lt;/strong&gt; The model recognizes actions and events within audio over time. This lets you search for specific moments within long recordings.&lt;/p&gt; 
&lt;h3&gt;When to Choose Amazon Nova&lt;/h3&gt; 
&lt;p&gt;Amazon Nova Multimodal Embeddings is designed for production applications requiring scalable performance, rapid deployment, and minimal operational overhead.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why choose Amazon Nova:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Speed to market&lt;/strong&gt;: Deploy in hours or days, not months&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Managed service&lt;/strong&gt;: No infrastructure to maintain or models to train&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cross-modal capabilities&lt;/strong&gt;: One model for all your content types with enterprise level deployment support&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Continuous improvements&lt;/strong&gt;: Benefit from model updates without migration work&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Decision factors to consider:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Scale requirements&lt;/strong&gt;: How many audio files and queries do you need to handle?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Time-to-market&lt;/strong&gt;: How quickly do you need a working solution?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Expertise availability&lt;/strong&gt;: Do you have an engineering team to maintain custom models?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Integration needs&lt;/strong&gt;: Do you need seamless AWS service integration?&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Core application domains:&lt;/strong&gt; Amazon Nova Multimodal Embeddings serves a wide range of applications optimized for multimodal RAG, semantic search, and clustering:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Agentic Retrieval-Augmented Generation (RAG):&lt;/strong&gt; You can use Amazon Nova Multimodal Embeddings for RAG-based applications where the model serves as the embedding for the retrieval task. Your input can be text from documents, images, or document images that interleave text with infographics, video, and audio. The embedding lets you retrieve the most relevant information from your knowledge base that you can provide to an LLM system for improved responses.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Semantic Search:&lt;/strong&gt; You can generate embeddings from text, images, document images, video, and audio to power search applications stored in a vector index. A vector index is a specialized embedding space that reduces the number of comparisons needed to return effective results. Because the model captures the nuance of your user’s query within the embedding, it supports advanced search queries that don’t rely on keyword matching. Your users can search for concepts, not just exact words.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Clustering:&lt;/strong&gt; You can use Amazon Nova Multimodal Embeddings to generate embeddings from text, images, document images, video, and audio. Clustering algorithms can group together items that are close to each other based on distance or similarity. For example, if you work in media management and want to categorize your media assets across similar themes, you can use the embeddings to cluster similar assets together without needing metadata for each asset. The model understands content similarity automatically.&lt;/li&gt; 
&lt;/ul&gt; 
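Semantic search and clustering both reduce to nearest-neighbor comparisons in embedding space. As a rough, self-contained illustration (toy vectors stand in for real Amazon Nova Multimodal Embeddings output, and a Python dict stands in for a vector index):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vec, index, k=2):
    # Rank indexed items by similarity to the query embedding.
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Toy "embeddings": in practice these come from the embedding model.
index = {
    "angry_call.wav":   [0.9, 0.1, 0.0],
    "billing_call.wav": [0.1, 0.9, 0.1],
    "chitchat.wav":     [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the text query
top = knn_search(query, index, k=1)
```

A production system would store the vectors in a purpose-built vector database and let its approximate k-NN index do this ranking at scale.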
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we explored how Amazon Nova Multimodal Embeddings enables semantic audio understanding beyond traditional text-based approaches. By representing audio as high-dimensional vectors that capture both acoustic and semantic properties, you can build search systems that understand tone, emotion, and context, not just spoken words. We covered the end-to-end workflow for building an audio search system, including:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Generating embeddings using synchronous and asynchronous APIs&lt;/li&gt; 
 &lt;li&gt;Segmenting long audio files with temporal metadata&lt;/li&gt; 
 &lt;li&gt;Storing embeddings in a vector database&lt;/li&gt; 
 &lt;li&gt;Performing k-NN search to retrieve relevant audio segments&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This approach allows you to transform large audio libraries into searchable, intelligent datasets that support use cases such as call center analysis, media search, and content discovery.&lt;/p&gt; 
&lt;p&gt;In our implementation, we took a real-world scenario: embedding call center recordings with the Amazon Nova Multimodal Embeddings model to make them searchable by both sentiment and content. Instead of manually tagging calls, we used text queries such as “Find a call where the speaker sounds angry” or “Show me a conversation about billing issues,” and the system pulled out the right audio clips on demand. In other words, we turned audio archives into an experience that is searchable by both tone and topic, without the manual effort. To dive deeper, see the code samples and snippets linked in the References section.&lt;/p&gt; 
&lt;h2&gt;References&lt;/h2&gt; 
&lt;p&gt;[1] &lt;a href="https://aws.amazon.com/blogs/aws/amazon-nova-multimodal-embeddings-now-available-in-amazon-bedrock/" target="_blank" rel="noopener noreferrer"&gt;Blog on Amazon Nova Multimodal Embeddings&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;[2] &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference-supported.html" target="_blank" rel="noopener noreferrer"&gt;Supported Regions and models for batch inference&lt;/a&gt;&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127170 alignnone" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/26/ML-19828-image-5.png" alt="" width="150" height="147"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Madhavi Evana&lt;/h3&gt; 
  &lt;p&gt;Madhavi Evana is a Solutions Architect at Amazon Web Services, where she guides enterprise banking customers through their cloud transformation journeys. She specializes in artificial intelligence and machine learning, with a focus on speech-to-speech translation, video analysis and synthesis, and natural language processing (NLP) technologies.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;img loading="lazy" class="size-full wp-image-127406 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/dan-kolodny.jpeg" alt="" width="100" height="133"&gt;&lt;/h3&gt; 
  &lt;h3 class="lb-h4"&gt;Dan Kolodny&lt;/h3&gt; 
  &lt;p&gt;Dan Kolodny is an AWS Solutions Architect specializing in big data, analytics, and GenAI. He is passionate about helping customers adopt best practices, discover insights from their data, and embrace new GenAI technologies.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="wp-image-127410 alignnone" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/fahisajj-1-100x133.jpg" alt="" width="108" height="144"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Fahim Sajjad&lt;/h3&gt; 
  &lt;p&gt;Fahim is a Solutions Architect at Amazon Web Services (AWS), working with enterprise AWS customers to provide technical guidance and help them achieve their business goals. He specializes in AI/ML technology, data strategy, and advertising and marketing.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Reinforcement fine-tuning on Amazon Bedrock: Best practices</title>
		<link>https://aws.amazon.com/blogs/machine-learning/reinforcement-fine-tuning-on-amazon-bedrock-best-practices/</link>
					
		
		<dc:creator><![CDATA[Nick McCarthy]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 19:43:28 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<guid isPermaLink="false">838d0a2843a9c43701783ddd1c41c77523df9a8b</guid>

					<description>In this post, we explore where RFT is most effective, using the GSM8K mathematical reasoning dataset as a concrete example. We then walk through best practices for dataset preparation and reward function design, show how to monitor training progress using Amazon Bedrock metrics, and conclude with practical hyperparameter tuning guidelines informed by experiments across multiple models and use cases.</description>
					<content:encoded>&lt;p&gt;You can use Reinforcement Fine-Tuning (RFT) in &lt;a href="https://aws.amazon.com/bedrock/?trk=7ecf60df-6136-414c-a7c3-6aa4d2d6019f&amp;amp;sc_channel=ps&amp;amp;ef_id=CjwKCAiAnoXNBhAZEiwAnItcG_quu7odGWcZPLfH1XE3QJu1ybzUZZ6RDd9R5rmqzjyIE5KnOvhfKxoCTtwQAvD_BwE:G:s&amp;amp;s_kwcid=AL!4422!3!795877020842!e!!g!!amazon%20bedrock!23532472972!194311072004&amp;amp;gad_campaignid=23532472972&amp;amp;gbraid=0AAAAADjHtp8BzKFnYuFMrdXAUbbzIgUDa&amp;amp;gclid=CjwKCAiAnoXNBhAZEiwAnItcG_quu7odGWcZPLfH1XE3QJu1ybzUZZ6RDd9R5rmqzjyIE5KnOvhfKxoCTtwQAvD_BwE" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; to customize Amazon Nova and supported open source models by defining what “good” looks like, with no large labeled datasets required. By learning from reward signals rather than static examples, RFT delivers up to 66% accuracy gains over base models at reduced customization cost and complexity. This post covers best practices for RFT on Amazon Bedrock, from dataset design and reward function strategy to hyperparameter tuning, for use cases like code generation, structured extraction, and content moderation.&lt;/p&gt; 
&lt;p&gt;In this post, we explore where RFT is most effective, using the &lt;a href="https://huggingface.co/datasets/openai/gsm8k" target="_blank" rel="noopener noreferrer"&gt;GSM8K&lt;/a&gt; mathematical reasoning dataset as a concrete example. We then walk through best practices for dataset preparation and reward function design, show how to monitor training progress using Amazon Bedrock metrics, and conclude with practical hyperparameter tuning guidelines informed by experiments across multiple models and use cases.&lt;/p&gt; 
&lt;h2&gt;RFT use cases: Where does RFT shine?&lt;/h2&gt; 
&lt;p&gt;Reinforcement Fine-Tuning (RFT) is a model customization technique that improves foundation model (FM) behavior using reward signals. Unlike supervised fine-tuning (SFT), it doesn’t train directly on correct responses (labeled input/output pairs). Instead, RFT uses a dataset of inputs and a reward function. The reward function can be rule-based, a trained grader model, or a large language model (LLM) acting as a judge. During training, the model generates candidate responses and the reward function scores each one. Based on the reward, the model weights are updated to increase the probability of generating responses that receive a high reward. This iterative cycle of sampling responses, scoring them, and updating weights steers the model toward behaviors that lead to better outcomes. RFT is particularly valuable when the desired behavior can be evaluated but is difficult to demonstrate, whether because labeled data is impractical to curate or because static examples alone can’t capture the reasoning a task demands. It excels in two primary areas:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Tasks where a rule or test can verify correctness automatically&lt;/li&gt; 
 &lt;li&gt;Subjective tasks where another model can effectively evaluate response quality&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Tasks in the first category include code generation that must pass tests, math reasoning with verifiable answers, structured data extraction that must match strict schemas, and API/tool calls that must parse and execute correctly. Because success criteria can be translated directly into reward signals, the model can discover stronger strategies than what a small set of labeled examples could teach. This pattern is known as &lt;a href="https://www.emergentmind.com/topics/rl-with-verifiable-rewards-rlvr" target="_blank" rel="noopener noreferrer"&gt;Reinforcement Learning with Verifiable Rewards (RLVR)&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;In addition, RFT suits subjective tasks such as content moderation, chatbots, creative writing, or summarization that lack easily quantifiable correctness. A judge model, guided by a detailed evaluation rubric, can serve as the reward function. It scores outputs against criteria that would be impractical to encode as static training pairs. This approach is known as &lt;a href="https://aws.amazon.com/blogs/machine-learning/fine-tune-large-language-models-with-reinforcement-learning-from-human-or-ai-feedback/" target="_blank" rel="noopener noreferrer"&gt;Reinforcement Learning with AI Feedback (RLAIF)&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;For RFT in Amazon Bedrock, you can implement both rule-based and model-based approaches as a &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/reward-functions-open-weight.html" target="_blank" rel="noopener noreferrer"&gt;custom AWS Lambda function&lt;/a&gt;, which is the reward function that Amazon Bedrock calls during the training loop.&lt;/p&gt; 
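As a loose sketch of what such a Lambda reward function might look like, the handler below scores a completion against a ground-truth answer. The field names used here (`completion`, `ground_truth`, `reward`) are illustrative assumptions; the actual event and response contract is defined in the linked Amazon Bedrock documentation.

```python
import re

def lambda_handler(event, context):
    # Illustrative rule-based reward. The "completion" and "ground_truth"
    # keys are assumed field names for this sketch, not the official
    # Bedrock event schema.
    completion = event.get("completion", "")
    ground_truth = event.get("ground_truth", "")

    # Extract the first number from the completion, ignoring thousands
    # separators, and compare it to the reference answer.
    match = re.search(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    predicted = match.group(0) if match else None
    reward = 1.0 if predicted == ground_truth else 0.0
    return {"reward": reward}
```

A model-based (RLAIF) variant would call a judge model inside the handler instead of applying a rule, but would return a score in the same way.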
&lt;p&gt;A comparison of these two approaches is depicted in the following diagram:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126417" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/16/ml-20097-image-1.jpg" alt="" width="1668" height="524"&gt;&lt;/p&gt; 
&lt;p&gt;The following are a few common use cases that can be tackled through RLVR, RLAIF, or a combination of both.&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Use Case&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Reward Signal&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Code generation for production services&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Unit-test pass rates, linting, and runtime checks&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Tool and API orchestration&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Successful end-to-end task completion (like, booking flows, data retrieval pipelines)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Complex math and algorithmic reasoning&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Correct final answers and/or intermediate verification steps&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Structured data extraction and transformation&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Schema validation, exact matches, penalties for malformed outputs&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;SQL / query synthesis over databases&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Query results matching expected answers or satisfying runtime properties&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Agentic workflows&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Combination of RLVR and RLAIF; RLVR for tool calling correctness; RLAIF for final task completion, for example, measured as usefulness, correctness, or robustness&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h2&gt;GSM8K: Using RFT to improve solutions to mathematical calculations&lt;/h2&gt; 
&lt;p&gt;To illustrate how reinforcement fine-tuning works in practice, we can examine a concrete example: improving a model’s ability to solve mathematical reasoning problems. RFT is useful for mathematical problems because solutions can often be objectively verified, making it possible to design clear reward signals that guide the model toward correct reasoning and structured outputs. Let’s look at an example&amp;nbsp;from the &lt;a href="https://huggingface.co/datasets/openai/gsm8k" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;GSM8K (Grade School Math 8K)&lt;/strong&gt;&lt;/a&gt; dataset:&lt;/p&gt; 
&lt;p&gt;Tina makes $18.00 an hour. If she works more than 8 hours per shift, she is eligible for overtime, which is paid by your hourly wage + 1/2 your hourly wage. If she works 10 hours every day for 5 days, how much money does she make?&lt;/p&gt; 
&lt;p&gt;Let’s look at what an ideal response might look like:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;&amp;lt;begin_internal_thought&amp;gt;
I need to find total pay for 5 days of 10-hour shifts. Because she works over 8 hours daily, I'll need to split each day into regular and overtime hours, calculate the overtime rate (1.5x regular), then multiply by 5 days.
&amp;lt;/end_internal_thought&amp;gt;

&amp;lt;begin_of_solution&amp;gt;
Overtime rate: $18.00 + (1/2 × $18.00) = $27.00/hour

Daily earnings (10 hours):
&amp;nbsp;&amp;nbsp;Regular (8 hours): &amp;nbsp;8 × $18 = $144
&amp;nbsp;&amp;nbsp;Overtime (2 hours): 2 × $27 = $54
&amp;nbsp;&amp;nbsp;Daily total: $198

Total for 5 days: 5 × $198 = $990

\boxed{990}
&amp;lt;/end_of_solution&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Here, we see that the problem is broken down into logical steps with clear reasoning paths, not only a final answer. Additionally, we would like the model to respond in this specific format and have the answer exactly match the ground truth solution. Other fine-tuning methods like SFT struggle with mathematical reasoning because they primarily learn to pattern-match training data rather than truly reason. These models can memorize solution templates but often fail when presented with novel variations of a problem.&lt;/p&gt; 
&lt;p&gt;Because RFT lets us define reward functions, exact answers like the &lt;code&gt;$990&lt;/code&gt; above can be objectively evaluated, while partial credit can be assigned for correct intermediate reasoning steps. This enables the model to discover valid solution approaches while learning to follow the required structure, and in many cases it achieves strong performance with relatively small datasets (around 100–1,000 examples).&lt;/p&gt; 
&lt;h2&gt;Best practices for preparing your dataset&lt;/h2&gt; 
&lt;p&gt;RFT requires carefully prepared datasets to achieve effective results. On Amazon Bedrock, RFT training data is provided as a JSONL file, with each record following the OpenAI chat completion format.&lt;/p&gt; 
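A single training record in that JSONL file might be assembled as shown below. The `reference_answer` key is an illustrative assumption for carrying the ground truth to the reward function; consult the linked data-preparation documentation for the exact required schema.

```python
import json

# One JSON object per line, using the OpenAI-style chat message format.
record = {
    "messages": [
        {"role": "system", "content": "Solve the problem. Put the final answer in \\boxed{}."},
        {"role": "user", "content": "Tina makes $18.00 an hour ..."},
    ],
    # Hypothetical field: a reference answer the reward function can check.
    "reference_answer": "990",
}

# Append the record as one line of the JSONL training file.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```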
&lt;h3&gt;Dataset size guidelines&lt;/h3&gt; 
&lt;p&gt;RFT supports dataset sizes between 100 and 10,000 training samples, though requirements vary depending on task complexity and reward function design. Tasks involving complex reasoning, specialized domains, or broad application scopes generally benefit from larger datasets and a sophisticated reward function. For initial experimentation, start with a small dataset (100–200 examples) to validate that your prompts and reward function produce meaningful learning signals and that the base model can achieve measurable reward improvements. Note that for certain domains, customizing only on small datasets can yield limited generalization and inconsistent results across prompt variations. Typical implementations using 200–5,000 examples provide stronger generalization and more consistent performance across prompt variations. For more complex reasoning tasks, specialized domains, or sophisticated reward functions, 5,000–10,000 examples can improve robustness across diverse inputs.&lt;/p&gt; 
&lt;p&gt;For more information about the dataset requirements, see the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/rft-prepare-data.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Dataset quality principles&lt;/h3&gt; 
&lt;p&gt;The quality of your training data fundamentally determines RFT outcomes. Consider the following principles when preparing your dataset:&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;strong&gt;1. Prompt distribution&lt;/strong&gt;&lt;br&gt; Make sure that the dataset reflects the full range of prompts that the model will encounter in production. A skewed dataset can lead to poor generalization or unstable training behavior.&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;strong&gt;2. Base model capability&lt;/strong&gt;&lt;br&gt; RFT assumes that the base model demonstrates basic task understanding. If the model can’t achieve a non-zero reward on your prompts, the learning signal will be too weak for effective training. A simple validation step is generating several responses from the base model (like,&amp;nbsp;&lt;code&gt;temperature ≈ 0.6&lt;/code&gt;) and confirming that the outputs produce meaningful reward signals.&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;strong&gt;3. Clear prompt design&lt;/strong&gt;&lt;br&gt; Prompts should clearly communicate expectations and constraints. Ambiguous instructions lead to inconsistent reward signals and degraded learning. Prompt structure should also align with reward function parsing. For example, requiring final answers after a specific marker or enforcing code blocks for programming tasks, as well as the prompt structure that the base model is familiar with from pre-training.&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;strong&gt;4. Reliable reference answers&lt;/strong&gt;&lt;br&gt; When possible, include a reference answer that represents the desired output pattern, formatting, and correctness criteria. Reference answers anchor reward computation and reduce noise in the learning signal. For example, mathematical tasks might include a correct numerical answer, while coding tasks might include unit tests or input-output pairs.&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;It’s also good practice to validate reference answers by confirming that a response aligned with the ground truth receives the maximum reward score.&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;strong&gt;5. Consistent reward signals within the data&lt;/strong&gt;&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;Because RFT relies entirely on reward signals to guide learning, the quality of those signals is critical. Your dataset and reward function should work together to produce consistent, well-differentiated scores. This means that strong responses reliably score higher than weak ones across similar inputs. If the reward function can’t clearly distinguish between good and poor responses, or if similar outputs receive widely varying scores, the model might learn the wrong patterns or fail to improve altogether.&lt;/p&gt; 
&lt;p&gt;In the next section, you will learn what to keep in mind when writing your reward function.&lt;/p&gt; 
&lt;h3&gt;Preparing your reward function&lt;/h3&gt; 
&lt;p&gt;Reward functions are central to RFT because they evaluate and score model responses, assigning higher rewards to preferred outputs and lower rewards to less desirable ones. This feedback guides the model toward improved behavior during training. For objective tasks like mathematical reasoning, a candidate response that produces the correct answer might receive a reward of &lt;strong&gt;1&lt;/strong&gt;, while an incorrect answer receives &lt;strong&gt;0&lt;/strong&gt;. A response with a partially correct reasoning trace and an incorrect final answer might get a reward of &lt;strong&gt;0.8&lt;/strong&gt;&amp;nbsp;(depending on how much you want to penalize an incorrect final response). For subjective tasks, the reward function encodes desired qualities. For example, in summarization it might capture faithfulness, coverage, and clarity. For more information about setting up your reward function, see&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/reward-functions.html" target="_blank" rel="noopener noreferrer"&gt;setting up reward functions for Amazon Nova models&lt;/a&gt;.&lt;/p&gt; 
&lt;h4&gt;Reward design for verifiable tasks&lt;/h4&gt; 
&lt;p&gt;For tasks that can be deterministically verified, like math reasoning or coding, the simplest approach is to programmatically check correctness. Effective reward functions typically evaluate both format constraints and performance objectives. Format checks make sure that the responses can be reliably parsed and evaluated. Performance metrics determine whether the result is correct. Rewards can be implemented using binary signals (correct compared to incorrect) or continuous scoring depending on the task.&lt;/p&gt; 
&lt;p&gt;For GSM8K-style mathematical reasoning tasks, reward functions must also account for how models express numerical answers. Models can format numbers with commas, currency symbols, percentages, or embed answers within explanatory text. To address this, answers should be normalized by stripping formatting characters and applying flexible extraction that prioritizes structured formats before falling back to pattern matching. This approach makes sure that the models are rewarded for correct reasoning rather than penalized for stylistic formatting choices.&amp;nbsp;You can find the full reward function implementation for GSM8K in the &lt;a href="https://github.com/aws-samples/amazon-bedrock-samples/blob/main/custom-models/bedrock-reinforcement-fine-tuning/reward-functions/gsm8k_rew_func.py" target="_blank" rel="noopener noreferrer"&gt;amazon-bedrock-samples GitHub repository&lt;/a&gt;.&lt;/p&gt; 
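A simplified version of that normalization and flexible-extraction logic (a sketch, not the exact code from the linked repository) might look like:

```python
import re

def normalize(ans):
    # Strip formatting characters the model may emit around numbers.
    return ans.replace(",", "").replace("$", "").replace("%", "").strip()

def extract_answer(text):
    # Prefer the structured \boxed{...} format, then fall back to the
    # last number that appears anywhere in the response.
    boxed = re.search(r"\\boxed\{([^}]*)\}", text)
    if boxed:
        return normalize(boxed.group(1))
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return normalize(numbers[-1]) if numbers else None

def reward(response, ground_truth):
    # Binary reward: 1.0 for a correct (normalized) final answer.
    predicted = extract_answer(response)
    return 1.0 if predicted == normalize(ground_truth) else 0.0
```

This way the model earns the reward whether it writes `990`, `$990`, or `1,000`-style formatted numbers, as long as the underlying value matches.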
&lt;h4&gt;Reward design for non-verifiable tasks&lt;/h4&gt; 
&lt;p&gt;Tasks like summarization, creative writing, or semantic alignment require an LLM-based judge to approximate subjective preferences. In this setting, the judge prompt effectively acts as the reward function, defining what behaviors are rewarded and how responses are scored. A practical judge prompt should clearly define the evaluation goal and include a concise scoring rubric with numeric scales reflecting the qualities the model should improve for.&lt;/p&gt; 
&lt;p&gt;Judge prompts should also return structured outputs, for example JSON or tagged formats containing the final score and optional reasoning, so reward values can be reliably extracted during training while maintaining observability into how each response was evaluated. An example of a reward function that utilizes AI feedback can be seen in this &lt;a href="https://github.com/aws-samples/amazon-bedrock-samples/blob/main/custom-models/bedrock-reinforcement-fine-tuning/reward-functions/pandalm_rew_func.py" target="_blank" rel="noopener noreferrer"&gt;PandaLM reward function script in GitHub&lt;/a&gt;.&lt;/p&gt; 
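Because a judge model may wrap its JSON verdict in extra prose, the reward function needs a tolerant parser. A minimal sketch, assuming our own `{"score": ..., "reasoning": ...}` convention on a 5-point scale (not a fixed Bedrock format):

```python
import json
import re

def parse_judge_score(judge_output, scale=5.0):
    # Pull the first JSON object out of the judge's reply and normalize
    # its "score" field to the [0, 1] reward range. Any parsing failure
    # yields a zero reward rather than crashing the training loop.
    match = re.search(r"\{.*\}", judge_output, re.DOTALL)
    if not match:
        return 0.0
    try:
        payload = json.loads(match.group(0))
        return max(0.0, min(1.0, float(payload["score"]) / scale))
    except (json.JSONDecodeError, KeyError, ValueError):
        return 0.0
```

Keeping the optional `reasoning` field in the judge output costs nothing at training time but makes individual reward decisions auditable afterward.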
&lt;h4&gt;Combining verifiable rewards with AI feedback&lt;/h4&gt; 
&lt;p&gt;Reward functions for verifiable tasks can also be augmented with AI feedback to evaluate solution quality beyond numerical correctness. For example, an LLM-as-a-judge can assess the reasoning chain, verify intermediate calculations, or evaluate the clarity of explanations, providing a reward signal that captures both correctness and reasoning quality.&lt;/p&gt; 
&lt;h4&gt;Iterating on reward design&lt;/h4&gt; 
&lt;p&gt;Reward functions often require iteration. Early versions might produce noisy signals or during the training loop the model might learn to exploit the reward function to generate a high reward without learning the desired behavior. Refining the reward logic based on observed training behavior is essential. Before launching full training jobs, it’s also good practice to test reward functions independently using sample prompts and known outputs to ensure that the scoring logic produces stable and meaningful reward signals.&lt;/p&gt; 
&lt;h3&gt;Evaluating training progress: signals that the model is learning&lt;/h3&gt; 
&lt;p&gt;After your dataset and reward function are ready, you can launch RFT training using either the Amazon Bedrock API or the console. The exact workflow depends on your preferred development environment. The &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/rft-submit-job.html" target="_blank" rel="noopener noreferrer"&gt;Create and manage fine-tuning jobs for Amazon Nova models&lt;/a&gt; topic in the Amazon Bedrock User Guide provides step-by-step instructions for both approaches. After training begins, monitoring the training metrics is critical. These signals indicate whether the reward function is meaningful and whether the model is learning useful behaviors rather than overfitting or collapsing to trivial strategies. The following image shows the training metrics of one of our GSM8K training runs, which exhibits healthy training dynamics.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126418" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/16/ml-20097-image-2.jpg" alt="" width="2160" height="859"&gt;&lt;/p&gt; 
&lt;p&gt;Training rewards plots the average reward score at each training step. Variance is expected because the input prompts in a batch are sampled randomly, so difficulty differs between batches; in addition, the model is exploring different strategies, which adds variance. What matters is the overall trend: rewards increase from roughly 0.5 to around 0.8–0.9, indicating that the model is converging on receiving higher rewards. Validation rewards provide a clearer signal because they are computed on a held-out dataset. Here we see a steep improvement during the first ~40 steps followed by a plateau around 0.88, suggesting the model is generalizing rather than memorizing training examples. Validation rewards that track closely with training rewards are typically a sign that overfitting isn’t occurring.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126419" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/16/ml-20097-image-3.jpg" alt="" width="2158" height="696"&gt;&lt;/p&gt; 
&lt;p&gt;Training episode length measures the average response length. The drop from roughly 625 tokens to ~400 tokens suggests that the model is learning to reach correct answers more efficiently, producing less redundant reasoning as training progresses. Policy entropy measures how much the model is exploring different response strategies during training. Values in the 0.8–1.1 range indicate healthy exploration. If entropy collapsed toward zero it would suggest the model had prematurely converged, but sustained entropy implies the model is still exploring and improving.&lt;/p&gt; 
&lt;h2&gt;Hyperparameter tuning guidelines&lt;/h2&gt; 
&lt;p&gt;In this section, we cover practical hyperparameter tuning guidelines for Amazon Bedrock RFT. These recommendations are informed by a series of internal experiments that we ran across multiple models and use cases. This includes reasoning tasks like GSM8K and other structured and generative workloads. While effective values will vary by task, the patterns observed across these experiments provide useful starting points when configuring RFT jobs. For more information about the hyperparameters that you can configure before launching an RFT customization job, see the &lt;a href="https://docs.aws.amazon.com/boto3/latest/reference/services/bedrock/client/create_model_customization_job.html" target="_blank" rel="noopener noreferrer"&gt;official boto3 docs&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;EpochCount&lt;/h3&gt; 
&lt;p&gt;Training duration and &lt;code&gt;epochCount&lt;/code&gt; require adjustment based on dataset size and model behavior. Smaller datasets often show continued improvement through 6–12 epochs, while larger datasets may achieve optimal performance in 3–6 epochs. This relationship isn’t linear, and careful monitoring of validation metrics remains essential to prevent overfitting while ensuring sufficient model adaptation.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126420" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/16/ml-20097-image-4.jpg" alt="" width="2560" height="1025"&gt;&lt;/p&gt; 
&lt;h3&gt;BatchSize&lt;/h3&gt; 
&lt;p&gt;This parameter controls how many prompts are processed before the updated model generates a new round of candidate responses (rollouts). For example, with a &lt;code&gt;batchSize&lt;/code&gt; of 128, the model processes, updates, and generates new rollouts for 128 prompts at a time until it has worked through the full dataset. The total number of rollout rounds equals the (filtered) dataset size divided by batchSize.&lt;br&gt; A &lt;code&gt;batchSize&lt;/code&gt; of 128 works well for most use cases and models. Increase it if loss is erratic or reward isn’t improving. Decrease it if iterations take too long.&lt;/p&gt; 
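The rollout-round arithmetic described above can be expressed directly:

```python
import math

def rollout_rounds(filtered_dataset_size, batch_size):
    # Rounds of rollouts per epoch: the model processes batch_size
    # prompts per round until the (filtered) dataset is exhausted.
    return math.ceil(filtered_dataset_size / batch_size)
```

For example, a filtered dataset of 1,000 prompts with a `batchSize` of 128 yields 8 rollout rounds per epoch.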
&lt;h3&gt;LearningRate&lt;/h3&gt; 
&lt;p&gt;In Amazon Bedrock RFT, we perform parameter-efficient RFT using Low Rank Adaptation (LoRA) adapters with a rank of 32. Across a range of use cases, a learning rate of 1e-4 has consistently produced strong results. In the following experiment, we swept learning rates across seven orders of magnitude on Qwen3-1.7B using the GSM8K dataset (1K training samples, 256 test samples), running a single epoch with batch size 64, group size 16, and LoRA rank 1. As shown in the following figure, LoRA’s optimal learning rate peaks around 1e-4 to 1e-3, approximately one order of magnitude higher than full fine-tuning (FFT). Even with a rank of 1, LoRA achieves within ~5.5% of FFT’s best validation reward at roughly the same wall-clock time. In practice, LoRA-based RFT tends to be more forgiving and performs well across a wider range of learning rates than FFT, though both approaches can collapse outside their optimal ranges. We recommend monitoring reward curves closely and lowering the learning rate if they begin to oscillate or collapse.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126421" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/16/ml-20097-image-5.jpg" alt="" width="2560" height="1049"&gt;&lt;/p&gt; 
&lt;h3&gt;Prompt length and response length&lt;/h3&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;maxPromptLength&lt;/code&gt; defines the maximum allowed length for input prompts in the dataset. Prompts exceeding this limit are filtered out during training. If your dataset contains unusually long prompts or other outliers, set a value that excludes the outliers while retaining most samples. Otherwise, you can set it to the length of the longest prompt in your dataset. On the other hand, &lt;code&gt;inferenceMaxTokens&lt;/code&gt; defines the maximum response length for any rollout or response generated during RL training. You can use this argument to control whether the resulting model generates detailed outputs or concise answers. We recommend choosing a value based on the requirements of your task: an excessively large value can increase training time, while too small a value can degrade model performance. For tasks that don’t require complex reasoning, setting the maximum response length to 1,024 is typically sufficient. In contrast, for challenging tasks like coding or long-form generation, a larger upper bound (more than 4,096) is preferable.&lt;/p&gt; 
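The prompt-length filtering described above can be illustrated with a small sketch; the word-count tokenizer here is a stand-in for the model's real tokenizer:

```python
def filter_long_prompts(prompts, max_prompt_length,
                        count_tokens=lambda p: len(p.split())):
    """Keep only samples whose prompt fits within maxPromptLength;
    longer prompts are dropped before training, mirroring the behavior
    described for the dataset filter."""
    return [p for p in prompts if count_tokens(p) <= max_prompt_length]
```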
&lt;h3&gt;Early stopping and evaluation interval&lt;/h3&gt; 
&lt;p&gt;Our RFT service provides two features that optimize training efficiency and model quality. &lt;code&gt;EarlyStopping&lt;/code&gt;&amp;nbsp;(enabled by default) automatically stops training when performance improvements plateau, preventing overfitting and reducing unnecessary computation costs. The system continuously monitors validation metrics and terminates training after it detects that further iterations are unlikely to yield meaningful improvements. Meanwhile, &lt;code&gt;evalInterval&lt;/code&gt; determines how frequently the model evaluates its performance on the validation dataset during training. This hyperparameter is automatically calculated as &lt;code&gt;min(10, data_size/batch_size)&lt;/code&gt;, ensuring at least one evaluation per epoch while keeping the evaluation frequency reasonable. For datasets where &lt;code&gt;data_size&lt;/code&gt; significantly exceeds &lt;code&gt;10×batch_size&lt;/code&gt;, evaluations typically occur every 10 steps, providing sufficient monitoring granularity without excessive overhead.&lt;/p&gt; 
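The evalInterval calculation can be written out directly (a sketch of the formula as described; the exact rounding behavior inside the service is an assumption):

```python
def eval_interval(data_size: int, batch_size: int) -> int:
    """min(10, data_size / batch_size): never wait more than 10 steps
    between validation evaluations, and evaluate at least once per epoch."""
    steps_per_epoch = max(1, data_size // batch_size)
    return min(10, steps_per_epoch)

# Large dataset: evaluate every 10 steps. Tiny dataset: every epoch.
```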
&lt;h2&gt;RFT metrics and their meaning&lt;/h2&gt; 
&lt;p&gt;Amazon Bedrock&amp;nbsp;exposes several training metrics through Amazon CloudWatch and the Amazon Bedrock console that give you a clear picture of whether your RFT job is progressing as expected. Understanding what each metric represents and what anomalies to watch for makes the difference between catching a problem early and waiting hours for a failed run to finish.&lt;/p&gt; 
&lt;h3&gt;Training and validation rewards&lt;/h3&gt; 
&lt;p&gt;The training reward is the average reward on the episodes that you’re training on. The validation reward is the same metric on a held-out set of prompts that don’t contribute gradients. In a healthy run, train reward should climb steadily early on, with validation reward rising more slowly but in the same general direction.&lt;/p&gt; 
&lt;h3&gt;Train and validation episode lengths&lt;/h3&gt; 
&lt;p&gt;These track the average number of tokens generated per response. Use them to detect verbosity hacking: if lengths explode while rewards increase, the model has learned that longer means better regardless of quality. In reasoning tasks (such as chain-of-thought (CoT) reasoning), a gradual increase is healthy (the model is learning to think), but a sudden vertical spike usually indicates a loop or failure. In some cases you will see a gradual decrease, and that is fine too: it could mean the model was initially exploring more to get to the answer, but later found shorter yet rewarding trajectories.&lt;/p&gt; 
&lt;h3&gt;Policy entropy&lt;/h3&gt; 
&lt;p&gt;Policy entropy measures how confident the model is in its outputs. High entropy means the model is uncertain and still exploring, while low entropy means it’s converging on consistent responses. Over a healthy training run, you’d expect a gentle decline from the initial baseline to a stable plateau as the model learns. A sharp drop to near zero is a warning sign: it typically means that the model has collapsed into repeating a single response rather than reasoning through problems. On the other end, a flat line at a persistently high value suggests the model is ignoring the reward signal entirely and not learning from feedback.&lt;/p&gt; 
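For intuition, this metric corresponds to Shannon entropy over the policy's output distribution: a uniform distribution gives maximum entropy, while a collapsed policy gives zero (an illustrative calculation, not the service's exact implementation):

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a probability distribution over next tokens.
    High: the policy is uncertain and still exploring.
    Near zero: it has collapsed onto a single response."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Uniform over 4 tokens -> ln(4) ~= 1.386; fully confident -> 0.0
```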
&lt;h3&gt;Gradient norm&lt;/h3&gt; 
&lt;p&gt;The magnitude (L2 norm) of the gradients applied to the model at each update. In a stable run it fluctuates within a reasonable band, with occasional spikes; sustained growth or extreme spikes can indicate issues with learning rate, reward scaling, or numeric stability.&lt;/p&gt; 
&lt;h2&gt;Common pitfalls&lt;/h2&gt; 
&lt;p&gt;Even well-configured RFT jobs can run into failure modes that aren’t always obvious from the metrics alone. The two most common are reward hacking—where the model learns to game the reward function rather than improve at the actual task—and reward instability, where high variance in the reward signal undermines the learning process. Both are recoverable, but easier to address if you know what to look for.&lt;/p&gt; 
&lt;h3&gt;Reward hacking&lt;/h3&gt; 
&lt;p&gt;This occurs when the policy learns to exploit weaknesses in the reward function to maximize scores without improving quality. You will see training rewards climb steadily while human evaluation scores degrade or plateau. To mitigate this, ensure that the reward function captures all aspects of the behavior you want encoded through fine-tuning. If not, observe the model generations, and iterate on the reward function. Use strict length penalties in the reward function if needed.&lt;/p&gt; 
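A strict length penalty can be folded into the reward function along these lines (a hypothetical shaping term; the target length and penalty rate are assumptions to tune per task):

```python
def length_penalized_reward(base_reward: float, n_tokens: int,
                            target_tokens: int = 1024,
                            penalty_per_token: float = 0.001) -> float:
    """Subtract a per-token penalty for output beyond the target length,
    discouraging the policy from padding responses to farm reward."""
    excess = max(0, n_tokens - target_tokens)
    return base_reward - penalty_per_token * excess
```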
&lt;h3&gt;Reward variance and instability&lt;/h3&gt; 
&lt;p&gt;Even with a good average reward, high fluctuation in scores for similar inputs creates a noisy signal that destabilizes training. This manifests as jittery reward curves and wildly oscillating loss metrics. The first line of defense is rigorous normalization: standardize rewards (zero mean, unit variance) within every batch, clip extreme outliers, and ensure your reward inference is deterministic (no dropout), so the optimizer receives a consistent and stable learning signal.&lt;/p&gt; 
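The normalization described above can be sketched as follows (illustrative; apply it per batch before the optimizer consumes the rewards):

```python
import statistics

def normalize_rewards(rewards, clip=3.0):
    """Standardize a batch of rewards to zero mean and unit variance,
    then clip extreme outliers so no single sample dominates the update."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [max(-clip, min(clip, (r - mean) / std)) for r in rewards]
```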
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we demonstrated how to apply Reinforcement Fine-Tuning (RFT) in Amazon Bedrock to improve model performance using feedback-driven training. Using the GSM8K mathematical reasoning dataset as a concrete example, we showed where RFT is most effective, how to structure training datasets, and how to design reward functions that reliably evaluate model outputs. We also explored how to monitor training progress using the Amazon Bedrock training metrics and provided practical hyperparameter tuning guidelines informed by experiments across multiple models and use cases. Together, these components form the foundation for running successful RFT workflows. When datasets are well structured, reward functions capture the right notion of quality, and training metrics are monitored carefully, RFT can significantly improve model performance across both verifiable tasks (such as reasoning, coding, and structured extraction) and subjective tasks using AI feedback.&lt;/p&gt; 
&lt;h2&gt;Next steps&lt;/h2&gt; 
&lt;p&gt;Ready to start customizing with RFT in Amazon Bedrock? Log in to the &lt;a href="https://console.aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock console&lt;/a&gt;&amp;nbsp;or review the official AWS API docs, and create your first RFT training job using the supported open source models.&lt;/p&gt; 
&lt;p&gt;To begin:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Explore the Documentation&lt;/strong&gt;: Visit the comprehensive guides and tutorials: &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/rft-submit-job.html" target="_blank" rel="noopener noreferrer"&gt;Create a reinforcement fine-tuning job&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Try the Sample Notebooks&lt;/strong&gt;: Access ready-to-run examples in the &lt;a href="https://github.com/aws-samples/amazon-bedrock-samples/tree/main/custom-models/bedrock-reinforcement-fine-tuning" target="_blank" rel="noopener noreferrer"&gt;AWS Samples GitHub repository&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Experiment with your own workloads&lt;/strong&gt;: Apply the dataset preparation, reward design, and hyperparameter tuning practices covered in this post to your own use cases.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Acknowledgement&lt;/h3&gt; 
&lt;p&gt;Thank you to Zhe Wang and Wei Zhu of the Amazon Bedrock Applied Science team, whose experimental work served as the foundation for many of the best practices listed in this blog post.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-125538 size-full" style="font-size: 16px" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/Nick_blog_pic.jpeg" alt="" width="100" height="122"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Nick McCarthy&lt;/h3&gt; 
  &lt;p&gt;Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, based out of the AWS New York office. He helps customers customize their GenAI models on AWS. He has worked with clients across a wide range of industries — including healthcare, finance, sports, telecommunications, and energy — helping them accelerate business outcomes through the use of AI and machine learning. He holds a Bachelor’s degree in Physics and a Master’s degree in Machine Learning from UCL, London.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-20108 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2020/12/19/Shreyas-Subramanian.png" alt="" width="100" height="134"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Shreyas Subramanian&lt;/h3&gt; 
&lt;p&gt;Shreyas Subramanian&amp;nbsp;is a Principal Data Scientist who helps customers use generative AI and deep learning to solve their business challenges with AWS services like Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, agentic AI, foundation models, and optimization techniques, with several books, papers, and patents to his name. In his current role at Amazon, Dr. Subramanian works with various science leaders and research teams within and outside Amazon, helping customers apply state-of-the-art algorithms and techniques to solve business-critical problems. Outside AWS, Dr. Subramanian is an expert reviewer for AI papers and funding through organizations like NeurIPS, ICML, ICLR, NASA, and NSF.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126449" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/17/sapana.jpeg" alt="" width="1365" height="2048"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sapana Chaudhary&lt;/h3&gt; 
  &lt;p&gt;Sapana Chaudhary is an Applied Scientist II at Amazon Web Services (AWS), where she works on reinforcement learning post-training of large language models. Her research sits at the intersection of reinforcement learning, robustness, and language models — with the goal to make AI systems more reliable and dependable for downstream tasks — whether through constrained optimization, risk-aware finetuning, or verifiable reasoning. Sapana holds a PhD from Texas A&amp;amp;M University (TAMU). Outside of work, she likes to hike, cook, paint, and photograph.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126450" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/17/jennifer.jpeg" alt="" width="1612" height="2417"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jennifer Zhu&lt;/h3&gt; 
&lt;p&gt;Jennifer Zhu is an Applied Science Manager at AWS, where she leads model customization services including Reinforcement Fine-tuning on Amazon Bedrock. At AWS, Jennifer works on LLM fine-tuning and distillation, with a focus on building production-grade infrastructure for model post-training at scale. Jennifer holds a PhD from Cornell University and a master’s degree from the University of San Francisco. Outside of work, she enjoys reading books and watching tennis.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Manage AI costs with Amazon Bedrock Projects</title>
		<link>https://aws.amazon.com/blogs/machine-learning/manage-ai-costs-with-amazon-bedrock-projects/</link>
					
		
		<dc:creator><![CDATA[Ba'Carri Johnson]]></dc:creator>
		<pubDate>Tue, 07 Apr 2026 23:32:00 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[AWS Cost and Usage Report]]></category>
		<category><![CDATA[AWS Cost Explorer]]></category>
		<category><![CDATA[Billing & Account Management]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">105134ae0a251deebb4b6f30757ebf9139378b27</guid>

					<description>With Amazon Bedrock Projects, you can attribute inference costs to specific workloads and analyze them in AWS Cost Explorer and AWS Data Exports. In this post, you will learn how to set up Projects end-to-end, from designing a tagging strategy to analyzing costs.</description>
										<content:encoded>&lt;p&gt;As organizations scale their AI workloads on Amazon Bedrock, understanding what’s driving spending becomes critical. Teams might need to perform chargebacks, investigate cost spikes, and guide optimization decisions, all of which require cost attribution at the workload level.&lt;/p&gt; 
&lt;p&gt;With &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/projects.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Projects&lt;/a&gt;, you can attribute inference costs to specific workloads and analyze them in AWS Cost Explorer and AWS Data Exports. In this post, you will learn how to set up Projects end-to-end, from designing a tagging strategy to analyzing costs.&lt;/p&gt; 
&lt;h2&gt;How Amazon Bedrock Projects and cost allocation work&lt;/h2&gt; 
&lt;p&gt;A project on Amazon Bedrock is a logical boundary that represents a workload, such as an application, environment, or experiment. To attribute the cost of a project, you attach resource tags and pass the project ID in your API calls. You can then activate the &lt;a href="https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html" target="_blank" rel="noopener noreferrer"&gt;cost allocation tags&lt;/a&gt; in AWS Billing to filter, group, and analyze spend in AWS Cost Explorer and AWS Data Exports.&lt;/p&gt; 
&lt;p&gt;The following diagram illustrates the end-to-end flow:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127921 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/ML-20677-image-1.png" alt="Amazon Bedrock Projects cost attribution architecture showing flow from user API calls through tagged projects to AWS billing and cost management tools" width="2280" height="740"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;&lt;em&gt;Figure 1: End-to-end cost attribution flow with Amazon Bedrock Projects&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon Bedrock Projects support the OpenAI-compatible APIs: &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-mantle.html#bedrock-mantle-responses" target="_blank" rel="noopener noreferrer"&gt;Responses API&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html" target="_blank" rel="noopener noreferrer"&gt;Chat Completions API&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Requests without a project ID are automatically associated with the default project in your AWS account.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To follow along with the steps in this post, you need:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Access to Amazon Bedrock with the OpenAI SDK. See &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Quickstart&lt;/a&gt; to get started.&lt;/li&gt; 
 &lt;li&gt;IAM permissions for Amazon Bedrock Projects, inference, and tagging. For this example, you can attach the AWS managed policy &lt;a href="https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonBedrockMantleFullAccess.html" target="_blank" rel="noopener noreferrer"&gt;AmazonBedrockMantleFullAccess&lt;/a&gt;. For production, see &lt;a href="https://aws.amazon.com/blogs/security/implementing-least-privilege-access-for-amazon-bedrock/" target="_blank" rel="noopener noreferrer"&gt;Implementing least privilege for Amazon Bedrock&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Access to the &lt;a href="https://console.aws.amazon.com/costmanagement/" target="_blank" rel="noopener noreferrer"&gt;AWS Billing and Cost Management console.&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Define your tagging strategy&lt;/h2&gt; 
&lt;p&gt;The tags that you attach to projects become the dimensions that you can filter and group by in your cost reports. We recommend that you plan these before creating your first project. A common approach is to tag by application, environment, team, and cost center:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Tag key&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Example values&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Application&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Which workload or service&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerChatbot, Experiments, DataAnalytics&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Environment&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Lifecycle stage&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Production, Development, Staging, Research&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Team&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Ownership&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerExperience, PlatformEngineering, DataScience&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CostCenter&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Finance mapping&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CC-1001, CC-2002, CC-3003&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;For more guidance on building a cost allocation strategy, see &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/building-a-cost-allocation-strategy.html" target="_blank" rel="noopener noreferrer"&gt;Best Practices for Tagging AWS Resources&lt;/a&gt;. With your tagging strategy defined, you’re ready to create projects and start attributing costs.&lt;/p&gt; 
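One way to keep the taxonomy consistent across projects is a small helper that validates tag values before any project is created. This is a sketch: the tag keys and allowed environment values mirror the example table above and are conventions of this post, not requirements of the API.

```python
def build_tags(application: str, environment: str,
               team: str, cost_center: str) -> dict:
    """Assemble the tag set for a project, rejecting environments outside
    the agreed taxonomy so cost reports stay clean and consistent."""
    allowed_environments = {"Production", "Development", "Staging", "Research"}
    if environment not in allowed_environments:
        raise ValueError(f"Environment {environment!r} is not in the taxonomy")
    return {
        "Application": application,
        "Environment": environment,
        "Team": team,
        "CostCenter": cost_center,
    }
```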
&lt;h2&gt;Create a project&lt;/h2&gt; 
&lt;p&gt;With your tagging strategy and permissions in place, you can create your first project. Each project has its own set of cost allocation tags that flow into your billing data. The following example shows how to create a project using the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/projects.html" target="_blank" rel="noopener noreferrer"&gt;Projects API&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;First, install the required dependencies:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;$ pip3 install openai requests&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Create a project with your tag taxonomy:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The OpenAI SDK uses the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable. Set this to your Bedrock API key.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import os
import requests

# Configuration
BASE_URL = "https://bedrock-mantle.&amp;lt;YOUR-REGION-HERE&amp;gt;.api.aws/v1"
API_KEY  = os.environ.get("OPENAI_API_KEY")  # Your Amazon Bedrock API key

def create_project(name: str, tags: dict) -&amp;gt; dict:
    """Create a Bedrock project with cost allocation tags."""
    response = requests.post(
        f"{BASE_URL}/organization/projects",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        },
        json={"name": name, "tags": tags}
    )

    if response.status_code != 200:
        raise Exception(
            f"Failed to create project: {response.status_code} - {response.text}"
        )

    return response.json()

# Create a production project with full tag taxonomy
project = create_project(
    name="CustomerChatbot-Prod",
    tags={
        "Application": "CustomerChatbot",
        "Environment": "Production",
        "Team":        "CustomerExperience",
        "CostCenter":  "CC-1001",
        "Owner":       "alice"
    }
)
print(f"Created project: {project['id']}")&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The API returns the project details, including the project ID and ARN:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;{
  "id": "proj_123",
  "arn": "arn:aws:bedrock-mantle:&amp;lt;YOUR-REGION-HERE&amp;gt;:&amp;lt;YOUR-ACCOUNT-ID-HERE&amp;gt;:project/&amp;lt;YOUR-PROJECT-ID&amp;gt;"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Save the project ID. You will use it to associate inference requests in the next step. The ARN is used for IAM policy attachment if you need to restrict access to this project. Repeat this for each workload. The following table shows a sample project structure for an organization with three applications:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Project name&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Application&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Team&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Cost Center&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerChatbot-Prod&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerChatbot&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Production&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerExperience&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CC-1001&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerChatbot-Dev&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerChatbot&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Development&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CustomerExperience&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CC-1001&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Experiments-Research&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Experiments&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Production&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;PlatformEngineering&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CC-2002&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;DataAnalytics-Prod&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;DataAnalytics&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Production&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;DataScience&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CC-3003&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;You can create up to 1,000 projects per AWS account to fit your organization’s needs.&lt;/p&gt; 
&lt;h2&gt;Associate inference requests with your project&lt;/h2&gt; 
&lt;p&gt;With your projects created, you can associate inference requests by passing the project ID in your API calls. The following example uses the Responses API:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;from openai import OpenAI

client = OpenAI(
    base_url="https://bedrock-mantle.&amp;lt;YOUR-REGION-HERE&amp;gt;.api.aws/v1",
    project="&amp;lt;YOUR-PROJECT-ID&amp;gt;", # ID returned when you created the project
)
response = client.responses.create(
    model="openai.gpt-oss-120b",
    input="Summarize the key findings from our Q4 earnings report."
)
print(response.output_text)&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;To maintain clean cost attribution, always specify a project ID in your API calls rather than relying on the default project.&lt;/p&gt; 
&lt;h2&gt;Activate cost allocation tags&lt;/h2&gt; 
&lt;p&gt;Before your project tags appear in cost reports, you must activate them as cost allocation tags in AWS Billing. This one-time setup connects your project tags to the billing pipeline. For more information about &lt;a href="https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/custom-tags.html" target="_blank" rel="noopener noreferrer"&gt;activating cost allocation tags&lt;/a&gt;, see the AWS Billing documentation.&lt;/p&gt; 
&lt;p&gt;It can take up to 24 hours for tags to propagate to AWS Cost Explorer and AWS Data Exports. You can activate your tags immediately after creating your first project to avoid gaps in cost data.&lt;/p&gt; 
&lt;h2&gt;View project costs&lt;/h2&gt; 
&lt;p&gt;With projects created, inference requests tagged, and cost allocation tags activated, you can see exactly where your Amazon Bedrock spend is going. Every dimension that you defined in your taxonomy is now available as a filter or grouping in your AWS Billing cost reports.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;AWS Cost Explorer&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;AWS Cost Explorer provides the fastest way to visualize your costs by project. Complete the following steps to review your costs by project:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the AWS Billing and Cost Management console and choose &lt;strong&gt;Cost Explorer&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;In the Filters pane, expand &lt;strong&gt;Service&lt;/strong&gt; and select &lt;strong&gt;Amazon&lt;/strong&gt; &lt;strong&gt;Bedrock&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Under &lt;strong&gt;Group by&lt;/strong&gt;, select &lt;strong&gt;Tag&lt;/strong&gt; and choose your tag key (for example, &lt;strong&gt;Application&lt;/strong&gt;).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127922" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/ML-20677-image-2.png" alt="Amazon Bedrock AWS Cost Explorer projects view" width="3026" height="2236"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;&lt;em&gt;Figure 2: Cost Explorer showing daily Amazon Bedrock spending grouped by the Application tag&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;For more ways to refine your view, see &lt;a href="https://docs.aws.amazon.com/cost-management/latest/userguide/ce-what-is.html" target="_blank" rel="noopener noreferrer"&gt;Analyzing your costs and usage with AWS Cost Explorer&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;For more granular analysis and line-item detail with your project tags, see &lt;a href="https://docs.aws.amazon.com/cur/latest/userguide/dataexports-create.html" target="_blank" rel="noopener noreferrer"&gt;Creating Data Exports&lt;/a&gt; in the AWS Billing documentation.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;With Amazon Bedrock Projects, you can attribute costs to individual workloads and track spending using the AWS tools that your organization already relies on. As your workloads scale, use the tagging strategy and cost visibility patterns covered in this post to maintain accountability across teams and applications.&lt;/p&gt; 
&lt;p&gt;For more information, see &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/projects.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Projects&lt;/a&gt; documentation and the &lt;a href="https://docs.aws.amazon.com/cost-management/latest/userguide/what-is-costmanagement.html" target="_blank" rel="noopener noreferrer"&gt;AWS Cost Management User Guide&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-thumbnail wp-image-127920 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/ML-20677-bacarri-johnson-100x130.png" alt="Portrait of Ba'Carri Johnson, author and AWS expert" width="100" height="130"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ba’Carri Johnson&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Ba’Carri Johnson&lt;/strong&gt; is a Sr. Technical Product Manager on the Amazon Bedrock team, focusing on cost management and governance for AWS AI. With a background in AI infrastructure, computer science, and strategy, she is passionate about product innovation and helping organizations scale AI responsibly. In her spare time, she enjoys traveling and exploring the great outdoors.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127924 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/ML-20677-vadim-omeltchenko.png" alt="Portrait of Vadim Omeltchenko, author and AWS expert" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Vadim Omeltchenko&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Vadim Omeltchenko&lt;/strong&gt; is a Sr. Amazon Bedrock Go-to-Market Solutions Architect who is passionate about helping AWS customers innovate in the cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127919 alignnone alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/ML-20677-ajit-mahareddy.png" alt="Portrait of Ajit Mahareddy, author and AWS expert" width="100" height="116"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ajit Mahareddy&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Ajit Mahareddy&lt;/strong&gt; is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and go-to-market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127923 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/07/ML-20677-sofian-hamiti.png" alt="Portrait of Sofian Hamiti, author and AWS expert" width="100" height="116"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sofian Hamiti&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Sofian Hamiti&lt;/strong&gt; is a technology leader with over 12 years of experience building AI solutions and leading high-performing teams to maximize customer outcomes. He is passionate about empowering diverse talent to drive global impact and achieve their career aspirations.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Building real-time conversational podcasts with Amazon Nova 2 Sonic</title>
		<link>https://aws.amazon.com/blogs/machine-learning/building-real-time-conversational-podcasts-with-amazon-nova-2-sonic/</link>
					
		
		<dc:creator><![CDATA[Madhavi Evana]]></dc:creator>
		<pubDate>Tue, 07 Apr 2026 16:29:11 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Nova]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">7e1604bebed8710613b18f0a0f01ca7c53914508</guid>

					<description>This post walks through building an automated podcast generator that creates engaging conversations between two AI hosts on any topic, demonstrating the streaming capabilities of Nova Sonic, stage-aware content filtering, and real-time audio generation.</description>
					<content:encoded>&lt;p&gt;Content creators and organizations today face a persistent challenge: producing high-quality audio content at scale. Traditional podcast production requires significant time investment (research, scheduling, recording, editing) and substantial resources, including studio space, equipment, and voice talent. These constraints limit how quickly organizations can respond to new topics or scale their content production. Amazon Nova 2 Sonic is a state-of-the-art speech understanding and generation model that delivers natural, human-like conversational AI with low latency and industry-leading price-performance. It provides streaming speech understanding, instruction following, tool invocation, and cross-modal interaction that seamlessly switches between voice and text. With support for seven languages and context windows of up to 1M tokens, Amazon Nova 2 Sonic lets developers build voice-first applications for customer support, interactive learning, and voice-enabled assistants.&lt;/p&gt; 
&lt;p&gt;This post walks through building an automated podcast generator that creates engaging conversations between two AI hosts on any topic, demonstrating the streaming capabilities of Nova Sonic, stage-aware content filtering, and real-time audio generation.&lt;/p&gt; 
&lt;h2&gt;What is Amazon Nova 2 Sonic?&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/nova/models/" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova 2 Sonic &lt;/a&gt;processes speech input and delivers speech output and text transcriptions, creating human-like conversations with rich contextual understanding. Amazon Nova 2 Sonic provides a streaming API for real-time, low-latency multi-turn conversations, so developers can build voice-first applications where speech drives app navigation, workflow automation, and task completion.&lt;/p&gt; 
&lt;p&gt;The model is accessible through &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; and can be integrated with key Amazon Bedrock features, including Guardrails, Agents, multimodal RAG, and Knowledge Bases for seamless interoperability across the platform.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Key capabilities:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Streaming Speech Understanding –&lt;/strong&gt; Process and respond to speech in real-time with low latency&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Instruction Following –&lt;/strong&gt; Execute complex multi-step voice commands&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Tool Invocation –&lt;/strong&gt; Call external functions and APIs during conversations&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cross-Modal Interaction –&lt;/strong&gt; Seamlessly switch between voice and text I/O&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Multilingual Support –&lt;/strong&gt; Native support for English, French, Italian, German, Spanish, Portuguese, and Hindi&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Large Context Window –&lt;/strong&gt; Up to 1M tokens for maintaining extended conversation context&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Understanding the challenge&lt;/h2&gt; 
&lt;p&gt;Podcasts have experienced explosive growth, evolving from a niche medium to a mainstream content format. This surge comes from podcasts’ unique ability to deliver information during multitasking activities (commuting, exercising, household tasks), providing an accessibility advantage that visual content can’t match.&lt;/p&gt; 
&lt;p&gt;However, traditional podcast production faces structural challenges:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Content Scalability:&lt;/strong&gt; Human hosts require extensive time for research, scheduling, recording, and post-production, limiting output frequency and volume.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; Human hosts face scheduling conflicts, illness, varying energy levels, and availability constraints that create irregular publishing schedules.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Personalization:&lt;/strong&gt; Traditional podcasts follow a one-size-fits-all model, unable to tailor content to individual listeners’ interests or knowledge levels in real time.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Resource Efficiency:&lt;/strong&gt; Quality production requires significant ongoing investment in talent, equipment, editing software, and operational overhead.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Expert Access:&lt;/strong&gt; Securing knowledgeable hosts across diverse topics remains challenging and expensive, restricting content breadth and depth.&lt;/p&gt; 
&lt;p&gt;By using the conversational AI capabilities of Amazon Nova Sonic, organizations can address these limitations and enable new interactive and personalized audio content formats that scale globally without traditional human resource constraints.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The Nova Sonic Live Podcast Generator demonstrates how to create natural conversations between AI hosts about any topic using the speech-to-speech model of Amazon Nova Sonic. Users enter a topic through a web interface, and the application generates a multi-round dialogue with alternating speakers streamed in real-time.&lt;/p&gt; 
&lt;h3&gt;Key features&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Real-time streaming audio generation with low latency&lt;/li&gt; 
 &lt;li&gt;Natural back-and-forth dialogue across multiple conversational turns&lt;/li&gt; 
 &lt;li&gt;Stage-aware content filtering that removes duplicate audio&lt;/li&gt; 
 &lt;li&gt;Simple web interface with live conversation updates&lt;/li&gt; 
 &lt;li&gt;Concurrent user support through AsyncIO architecture&lt;/li&gt; 
 &lt;li&gt;Multiple voice personas for different use cases&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;To implement this solution, the following requirements must be met:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;AWS account with access to Amazon Bedrock and Amazon Nova 2 Sonic model&lt;/li&gt; 
 &lt;li&gt;Python 3.8 or later&lt;/li&gt; 
 &lt;li&gt;Flask web framework and AsyncIO&lt;/li&gt; 
 &lt;li&gt;AWS credentials configured (access key, secret key, AWS Region)&lt;/li&gt; 
 &lt;li&gt;Development environment with pip package manager&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Implementation details&lt;/h3&gt; 
&lt;p&gt;For detailed code samples and complete implementation guidance, see the &lt;a href="https://github.com/aws-samples/genai-quickstart-pocs/tree/main/genai-quickstart-pocs-python/amazon-bedrock-nova-s2s-live-podcasting-poc" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Architecture overview&lt;/h2&gt; 
&lt;p&gt;The solution follows a Flask-based architecture with streaming and reactive event processing, designed to demonstrate the capabilities of Amazon Nova Sonic for proof-of-concept and educational purposes.&lt;/p&gt; 
&lt;h3&gt;System architecture diagram&lt;/h3&gt; 
&lt;p&gt;The following diagram illustrates the real-time streaming architecture:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127174" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/26/ML-19828-image-1.png" alt="Real-time streaming architecture diagram showing the client application components, Amazon Bedrock, and Amazon Nova Sonic" width="896" height="526"&gt;&lt;/p&gt; 
&lt;h3&gt;Architecture components&lt;/h3&gt; 
&lt;p&gt;The architecture follows a layered approach with clear separation of concerns:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Client Application&lt;/strong&gt; hosts three tightly coupled components that manage the full audio lifecycle:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;PyAudio Engine&lt;/strong&gt; captures microphone input at &lt;strong&gt;16kHz PCM&lt;/strong&gt; and streams it to Amazon Bedrock. It also receives playback-ready audio from the Audio Output Queue at &lt;strong&gt;24kHz PCM&lt;/strong&gt;, handling speaker output in real time.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Response Processor&lt;/strong&gt; receives the raw response stream returned by Amazon Nova Sonic, decodes the &lt;strong&gt;Base64-encoded audio payload&lt;/strong&gt;, and forwards the decoded audio to the Audio Output Queue.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Audio Output Queue&lt;/strong&gt; acts as a buffer between the Response Processor and the PyAudio Engine, absorbing variable-latency responses and ensuring smooth, uninterrupted audio playback at 24kHz PCM.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;AWS Cloud&lt;/strong&gt; – all model communication runs through Amazon Bedrock, which brokers a &lt;strong&gt;bidirectional event stream&lt;/strong&gt; with Amazon Nova Sonic:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt; receives the outbound 16kHz PCM audio stream from the PyAudio Engine and routes it to the model. It also carries the model’s response stream back to the client.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Nova Sonic&lt;/strong&gt; receives the audio input through the bidirectional stream, performs real-time speech-to-speech inference, and returns a response stream containing synthesized audio encoded as Base64 PCM at 24kHz.&lt;/li&gt; 
&lt;/ul&gt; 
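&lt;p&gt;&lt;em&gt;The decode-and-buffer pattern above can be sketched in a few lines. This is an illustrative stand-in only: the function names are hypothetical, and the PyAudio playback side is replaced with a stub that drains the queue.&lt;/em&gt;&lt;/p&gt;

```python
# Minimal sketch of the Response Processor / Audio Output Queue pattern:
# Base64 audio payloads are decoded into PCM bytes and enqueued, decoupling
# variable-latency model responses from steady playback. Names are
# illustrative; real playback would use PyAudio at 24kHz.
import base64
import queue

audio_output_queue = queue.Queue()  # buffers decoded 24kHz PCM chunks

def on_audio_event(event):
    """Response Processor stand-in: decode the Base64 payload and enqueue it."""
    b64_payload = event["event"]["audioOutput"]["content"]
    audio_output_queue.put(base64.b64decode(b64_payload))

def drain_playback_queue():
    """PyAudio Engine stand-in: pull buffered chunks in arrival order."""
    chunks = []
    while not audio_output_queue.empty():
        chunks.append(audio_output_queue.get())
    return b"".join(chunks)

# Two streamed chunks arrive and are decoded in order
on_audio_event({"event": {"audioOutput": {"content": base64.b64encode(b"chunk1").decode()}}})
on_audio_event({"event": {"audioOutput": {"content": base64.b64encode(b"chunk2").decode()}}})
```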
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Production Architecture Note:&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; This implementation uses Flask with PyAudio for demonstration purposes. PyAudio does not provide built-in echo cancellation and is best suited for server-side audio playback. For production web-based client applications, JavaScript-based audio libraries (Web Audio API) or WebRTC are recommended for browser-native audio handling with better echo cancellation and lower latency. See the GitHub repository for production architecture patterns.&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Key technical innovations&lt;/h2&gt; 
&lt;h3&gt;Amazon Bedrock integration&lt;/h3&gt; 
&lt;p&gt;At the heart of the system is the &lt;code&gt;BedrockStreamManager&lt;/code&gt;, a custom component that manages persistent connections to the Amazon Nova 2 Sonic model. This manager handles the complexities of streaming API interactions, including initialization, message sending, and response processing. AWS credentials configured through environment variables maintain secure access to the foundation model (FM). The full code is in the &lt;a href="https://github.com/aws-samples/genai-quickstart-pocs/tree/main/genai-quickstart-pocs-python/amazon-bedrock-nova-s2s-live-podcasting-poc" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;# Initialize BedrockStreamManager for each conversation turn
manager = BedrockStreamManager(
    model_id='amazon.nova-sonic-v1:0',
    region='us-east-1'
)

# Configure voice persona (Matthew or Tiffany)
manager.START_PROMPT_EVENT = manager.START_PROMPT_EVENT.replace(
    '"matthew"', f'"{voice}"'
)

# Initialize streaming connection
await manager.initialize_stream()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Reactive streaming pipeline&lt;/h3&gt; 
&lt;p&gt;The application employs RxPy (Reactive Extensions for Python) to implement an observable pattern for handling real-time data streams. This reactive architecture processes audio chunks and text tokens as they arrive from Amazon Nova Sonic, rather than waiting for complete responses.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;# Subscribe to streaming events from BedrockStreamManager
manager.output_subject.subscribe(on_next=capture)

# Capture function processes events in real time
def capture(event):
    if 'textOutput' in event['event']:
        text = event['event']['textOutput']['content']
        text_parts.append(text)
    if 'audioOutput' in event['event']:
        audio_chunks.append(event['event']['audioOutput']['content'])&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;output_subject&lt;/code&gt; in the &lt;code&gt;BedrockStreamManager&lt;/code&gt; acts as the central event bus, so multiple subscribers can react to streaming events simultaneously. This design choice reduces latency and improves the user experience by providing immediate feedback.&lt;/p&gt; 
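&lt;p&gt;&lt;em&gt;The multi-subscriber event bus can be illustrated without the RxPy dependency. The minimal Subject below mirrors the subscribe/on_next pattern the application relies on; it is a sketch of the idea, not the library implementation.&lt;/em&gt;&lt;/p&gt;

```python
# Minimal stand-in for the RxPy Subject used as the central event bus:
# every subscriber receives each streamed event. Illustrative only; the
# actual implementation uses reactivex (RxPy).
class Subject:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, on_next):
        self._subscribers.append(on_next)

    def on_next(self, event):
        # Fan the event out to all registered callbacks
        for callback in self._subscribers:
            callback(event)

output_subject = Subject()
texts, audio = [], []

# Two independent subscribers react to the same event stream
output_subject.subscribe(lambda e: texts.append(e["event"].get("textOutput")))
output_subject.subscribe(lambda e: audio.append(e["event"].get("audioOutput")))

output_subject.on_next({"event": {"textOutput": "Hello"}})
```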
&lt;h3&gt;Stage-aware content filtering&lt;/h3&gt; 
&lt;p&gt;One of the key technical innovations in this implementation is the stage-aware filtering mechanism. Amazon Nova 2 Sonic generates content in multiple stages: SPECULATIVE (preliminary) and FINAL (polished). The application implements an intelligent filtering logic that monitors &lt;code&gt;contentStart&lt;/code&gt; events for generation stage metadata. It captures only FINAL stage content to remove duplicate or preliminary audio, and prevents audio artifacts for clean, natural-sounding output.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def capture(event):
    nonlocal is_final_stage
    if 'event' in event:
        # Detect generation stage from contentStart event
        if 'contentStart' in event['event']:
            content_start = event['event']['contentStart']
            if 'additionalModelFields' in content_start:
                additional_fields = json.loads(content_start['additionalModelFields'])
                stage = additional_fields.get('generationStage', 'FINAL')
                is_final_stage = (stage == 'FINAL')

        # Only capture content in FINAL stage
        if is_final_stage:
            if 'textOutput' in event['event']:
                text = event['event']['textOutput']['content']
                if text and '{ "interrupted" : true }' not in text:
                    text_parts.append(text)
            if 'audioOutput' in event['event']:
                audio_chunks.append(event['event']['audioOutput']['content'])&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The filtering operates at three levels:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Interrupted Content Filter&lt;/strong&gt; – Removes canceled content by checking for interruption markers.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Text Deduplication&lt;/strong&gt; – Filters exact duplicate text across SPECULATIVE and FINAL stages.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Audio Hash Deduplication&lt;/strong&gt; – Filters duplicate audio chunks using hash fingerprinting.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;This filtering happens in real-time within the capture callback function, which subscribes to the output stream and selectively processes events based on generation stage.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; The code snippets shown are simplified for clarity. The &lt;code&gt;is_final_stage&lt;/code&gt; variable must be defined in the enclosing scope. See the GitHub repository for complete, production-ready implementations.&lt;/em&gt;&lt;/p&gt; 
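&lt;p&gt;&lt;em&gt;The three filter levels can be condensed into a small sketch. Function names are hypothetical, and SHA-256 is used here as one possible choice of hash fingerprint.&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the three filtering levels: interruption-marker removal, exact
# text deduplication, and hash-based audio deduplication. Illustrative only.
import hashlib

seen_texts = set()
seen_audio_hashes = set()

def accept_text(text):
    """Drop interrupted or duplicate text; keep everything else."""
    if '{ "interrupted" : true }' in text:
        return False  # level 1: interrupted content filter
    if text in seen_texts:
        return False  # level 2: exact text deduplication
    seen_texts.add(text)
    return True

def accept_audio(chunk_b64):
    """Drop audio chunks whose hash fingerprint was already seen."""
    fingerprint = hashlib.sha256(chunk_b64.encode()).hexdigest()
    if fingerprint in seen_audio_hashes:
        return False  # level 3: audio hash deduplication
    seen_audio_hashes.add(fingerprint)
    return True
```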
&lt;h3&gt;Conversation management&lt;/h3&gt; 
&lt;p&gt;The system implements a turn-based conversation model with multiple rounds of dialogue. Each turn follows a consistent pattern for natural conversation flow:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Conversation History –&lt;/strong&gt; The application maintains conversation context through speaker-specific variables, so each speaker can reference what was previously said.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Dynamic Prompt Generation –&lt;/strong&gt; Prompts are constructed dynamically based on speaker role and conversation context. For example, Matthew (host) introduces topics and asks follow-up questions, while Tiffany (expert) provides informed responses.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Fresh Stream Per Turn –&lt;/strong&gt; The application creates a fresh &lt;code&gt;BedrockStreamManager&lt;/code&gt; instance for each speaker turn, preventing state contamination between turns for clean audio streams.&lt;/li&gt; 
&lt;/ol&gt; 
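&lt;p&gt;&lt;em&gt;The turn loop described above can be sketched as follows. The speaker roles follow the post; everything else is illustrative, with a stub standing in for a fresh &lt;code&gt;BedrockStreamManager&lt;/code&gt; round trip.&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the turn-based conversation model: alternating speakers, shared
# history, and a dynamically built prompt per turn. generate_turn is a
# placeholder for a fresh per-turn stream to the model.
def generate_turn(prompt):
    # Stub standing in for the Nova Sonic call; returns a canned line
    return f"[response to: {prompt[:40]}]"

def run_podcast(topic, rounds=2):
    history = []
    speakers = ["Matthew (host)", "Tiffany (expert)"]
    for turn in range(rounds * 2):
        speaker = speakers[turn % 2]
        context = " ".join(history[-4:])  # recent turns only
        prompt = f"{speaker} discusses '{topic}'. Context: {context}"
        reply = generate_turn(prompt)
        history.append(f"{speaker}: {reply}")
    return history
```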
&lt;h3&gt;Asynchronous execution model&lt;/h3&gt; 
&lt;p&gt;To handle the blocking nature of audio playback and model API calls, the application creates a new asyncio event loop for each podcast generation request. This way, multiple users can generate podcasts simultaneously without blocking each other. The loop manages stream initialization, prompt sending, audio playback coordination, and cleanup, supporting concurrent usage while maintaining clean separation between user sessions.&lt;/p&gt; 
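&lt;p&gt;&lt;em&gt;A minimal sketch of the per-request event loop pattern follows; the coroutine body is a placeholder for the real stream initialization, prompt sending, playback coordination, and cleanup.&lt;/em&gt;&lt;/p&gt;

```python
# Each podcast generation request gets its own asyncio event loop, so
# concurrent requests do not block one another. Names are illustrative.
import asyncio

async def generate_podcast(topic):
    # Placeholder for stream init, prompting, playback, and cleanup
    await asyncio.sleep(0)
    return f"podcast about {topic}"

def handle_request(topic):
    """Flask view helper: run the async pipeline on a dedicated loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(generate_podcast(topic))
    finally:
        loop.close()
```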
&lt;h3&gt;Data flow overview&lt;/h3&gt; 
&lt;p&gt;The system follows a streamlined flow from user input to audio output. Users enter a topic, the backend orchestrates conversation turns with dynamic prompt generation, Amazon Nova 2 Sonic generates speech responses through a streaming API, and stage-aware filtering makes sure that only polished FINAL content reaches the audio pipeline for playback.&lt;/p&gt; 
&lt;p&gt;For detailed code samples and complete implementation guidance, see the &lt;a href="https://github.com/aws-samples/genai-quickstart-pocs/tree/main/genai-quickstart-pocs-python/amazon-bedrock-nova-s2s-live-podcasting-poc" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Use cases&lt;/h2&gt; 
&lt;p&gt;The Amazon Nova 2 Sonic architecture enables automated, interactive audio content creation across multiple industries. By orchestrating conversational AI instances in dialogue, organizations can generate engaging, natural-sounding content at scale.&lt;/p&gt; 
&lt;h3&gt;Interactive learning and knowledge sharing&lt;/h3&gt; 
&lt;p&gt;Organizations struggle to create engaging content that helps people learn and retain information, whether for student education or employee training. Amazon Nova 2 Sonic instances can simulate classroom discussions or Socratic dialogues, with one instance posing questions while the other provides explanations and examples.&lt;/p&gt; 
&lt;p&gt;For educational institutions, this creates dynamic learning experiences that accommodate different learning styles and paces. For enterprises, it transforms internal communications (policies, procedures, organizational changes) into conversational formats that employees can consume while multitasking. Integration with Retrieval Augmented Generation (RAG) and Amazon Bedrock Knowledge Bases keeps content current and aligned with curriculum or organizational requirements, while the conversational format increases information retention and reduces follow-up questions.&lt;/p&gt; 
&lt;h3&gt;Multilingual content localization&lt;/h3&gt; 
&lt;p&gt;Global organizations need consistent messaging across markets while respecting cultural nuances. The Amazon Nova Sonic support for &lt;strong&gt;English, French, Italian, German, Spanish, Portuguese, and Hindi&lt;/strong&gt; enables creation of localized audio content with native-sounding conversations. The model can generate market-specific discussions that adapt language, cultural references, and communication styles, going beyond simple translation to produce culturally relevant content that resonates with local audiences.&lt;/p&gt; 
&lt;p&gt;Polyglot voices (individual voices that can switch between languages within the same conversation) enable natural code-switching that handles mixed-language sentences smoothly. This is particularly valuable for multilingual customer support and global team collaboration.&lt;/p&gt; 
&lt;h3&gt;Product commentary and reviews&lt;/h3&gt; 
&lt;p&gt;Ecommerce platforms need engaging ways to help customers understand complex products. Amazon Nova 2 Sonic instances can generate conversational product reviews, with one asking common customer questions while the other provides answers based on specifications, user reviews, and technical documentation. This creates accessible content that helps customers evaluate products through natural dialogue, with integration to product catalogs ensuring accuracy.&lt;/p&gt; 
&lt;h3&gt;Thought leadership and industry analysis&lt;/h3&gt; 
&lt;p&gt;Professional services firms need to establish thought leadership through regular content, but producing analysis requires significant time investment. Amazon Nova 2 Sonic instances can engage in expert-level discussions about industry trends or market analysis, with one challenging assumptions while the other defends positions with data. This allows organizations to repurpose existing research into accessible audio content that reaches busy executives who prefer audio formats.&lt;/p&gt; 
&lt;h2&gt;Performance characteristics&lt;/h2&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Low-latency streaming with immediate audio playback&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Podcast Duration:&lt;/strong&gt; Flexible duration based on conversational turns (typically 2–5 minutes)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Concurrent Users:&lt;/strong&gt; Supports multiple simultaneous podcast generations through AsyncIO&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Audio Quality:&lt;/strong&gt; Professional-grade speech synthesis with natural intonation and pacing&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Language Support:&lt;/strong&gt; English, French, Italian, German, Spanish, Portuguese, and Hindi&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Context Window:&lt;/strong&gt; Up to 1M tokens for extended conversation context&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Amazon Nova 2 Sonic is a state-of-the-art speech understanding and generation model that enables natural, human-like conversational AI experiences. The architecture outlined in this post provides a practical foundation for building conversational AI applications. Whether streamlining customer support, creating educational content, or generating thought leadership materials, the patterns demonstrated here apply across use cases.&lt;/p&gt; 
&lt;p&gt;With expanded language support, polyglot voice capabilities, enhanced telephony integration, and cross-modal interaction, Amazon Nova 2 Sonic provides organizations with tools for building global, voice-first applications at scale.&lt;/p&gt; 
&lt;p&gt;To get started building with Amazon Nova Sonic, visit the &lt;a href="https://aws.amazon.com/nova/models/" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova product page&lt;/a&gt;. For comprehensive documentation, explore the &lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/speech.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova 2 Sonic User Guide&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Learn more&lt;/h2&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/nova/models/" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova 2 Sonic Product Page&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Documentation&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/nova/latest/userguide/speech.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova 2 Sonic User Guide&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-nova-sonic-human-like-voice-conversations-for-generative-ai-applications/" target="_blank" rel="noopener noreferrer"&gt;AWS Blog: Introducing Amazon Nova Sonic&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://github.com/aws-samples/genai-quickstart-pocs/tree/main/genai-quickstart-pocs-python/amazon-bedrock-nova-s2s-live-podcasting-poc" target="_blank" rel="noopener noreferrer"&gt;GitHub Repository: Official AWS samples&lt;/a&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-127170" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/26/ML-19828-image-5-100x98.png" alt="Portrait of Madhavi Evana, author and AWS expert" width="101" height="99"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Madhavi Evana&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Madhavi Evana&lt;/strong&gt; is a Solutions Architect at Amazon Web Services,&amp;nbsp;where she guides Enterprise banking customers through their cloud transformation journeys. She specializes in Artificial Intelligence and Machine Learning, with&amp;nbsp;a&amp;nbsp;focus&amp;nbsp;on&amp;nbsp;Speech-to-speech translation, video analysis and synthesis,&amp;nbsp;and&amp;nbsp;natural language processing (NLP) technologies.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-127171 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/26/ML-19828-image-4-100x107.png" alt="Portrait of Jeremiah Flom, author and AWS expert" width="100" height="107"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jeremiah Flom&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Jeremiah Flom&lt;/strong&gt; is a Solutions Architect at AWS, where he helps customers design and build scalable cloud solutions.&amp;nbsp;He’s&amp;nbsp;passionate about exploring how intelligent systems can interact with and navigate the real world through Physical and Embodied AI.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-127173 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/26/ML-19828-image-2-100x100.png" alt="Portrait of Dexter Doyle, author and AWS expert" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Dexter Doyle&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Dexter Doyle&lt;/strong&gt; is a Senior Solutions Architect at Amazon Web Services, where he guides customers in designing secure, efficient, and high-quality cloud architectures. A lifelong music enthusiast, he loves helping customers unlock new possibilities with AWS services, with a particular focus on audio workflows.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-127172 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/26/ML-19828-image-3-100x100.png" alt="Portrait of Kalindi Vijesh Parekh, author and AWS expert" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kalindi Vijesh Parekh&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Kalindi Vijesh Parekh&lt;/strong&gt; is a Solutions Architect at Amazon Web Services, where she combines her expertise in analytics, data streaming, and AI engineering with a commitment to helping customers realize their AWS potential.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Text-to-SQL solution powered by Amazon Bedrock</title>
		<link>https://aws.amazon.com/blogs/machine-learning/text-to-sql-solution-powered-by-amazon-bedrock/</link>
					
		
		<dc:creator><![CDATA[Monica Jain]]></dc:creator>
		<pubDate>Tue, 07 Apr 2026 16:28:20 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">018bc0596e0f48845bb01ac8e0b7e4d08fb6602a</guid>

					<description>In this post, we show you how to build a natural text-to-SQL solution using Amazon Bedrock that transforms business questions into database queries&amp;nbsp;and returns actionable answers.</description>
										<content:encoded>&lt;p&gt;Building a text-to-SQL solution using Amazon Bedrock can alleviate one of the most persistent bottlenecks in data-driven organizations: the delay between asking a business question and getting&amp;nbsp;a clear, data-backed answer. You might be familiar with the challenge of navigating competing priorities when your one-time question is waiting in the queue behind higher-impact work. A text-to-SQL solution augments your existing team—business users self-serve routine analytical questions, freeing up technical capacity across the organization for complex, high-value initiatives. Questions like “What is our year-over-year revenue growth by customer segment?” become accessible to anyone, without creating an additional workload for technical teams.&lt;/p&gt; 
&lt;p&gt;Many organizations find that accessing data insights remains a significant bottleneck in business decision-making processes. The traditional approach requires either learning SQL syntax, waiting for technical resources, or settling for pre-built dashboards that might not answer your specific questions.&lt;/p&gt; 
&lt;p&gt;In this post, we show you how to build a natural text-to-SQL solution using &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; that transforms business questions into database queries&amp;nbsp;and returns actionable answers. The model returns not only raw SQL, but executed results synthesized into clear, natural language narratives&amp;nbsp;in seconds rather than hours. We walk you through the architecture, implementation strategies, and lessons learned from deploying this solution at scale. By the end, you will understand how to create your own text-to-SQL system that bridges the gap between business questions and data accessibility.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Why traditional business intelligence falls short&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;It’s worth noting that tools like Amazon Quick already address many self-service analytics needs effectively, including natural language querying of dashboards and automated insight generation. These tools are an excellent fit when your analytics requirements align with structured dashboards, curated datasets, and governed reporting workflows. A custom text-to-SQL solution becomes valuable when users must query across complex, multi-table schemas with deep organizational business logic, domain-specific terminology, and one-time questions beyond what pre-configured dashboard datasets support.&lt;/p&gt; 
&lt;p&gt;Building a text-to-SQL solution surfaces three fundamental challenges that drive the need beyond traditional Business Intelligence (BI) tools:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;The SQL expertise barrier blocks rapid analysis.&lt;/strong&gt; Most business users lack the technical SQL knowledge needed to access complex data. Simple questions often require multi-table joins, temporal calculations, and hierarchical aggregations. This dependency creates bottlenecks where business users wait extended periods for custom reports, while analysts spend valuable time on repetitive query requests rather than strategic analysis.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Even modern BI systems have flexibility boundaries&lt;/strong&gt;. Modern BI tools have made significant strides in natural language querying and self-service analytics. However, these capabilities typically work best within pre-curated semantic layers, governed datasets, or pre-modeled dashboards. When business users need to explore beyond curated boundaries, one-time joins, on-the-fly organization-specific calculations, or querying raw warehouse tables outside the semantic layer, they still face constraints that require technical intervention. A custom text-to-SQL solution fills this gap by operating directly against your data warehouse schema with dynamically retrieved business context, rather than depending on pre-configured semantic models.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Context and semantic understanding create translation gaps.&lt;/strong&gt; Even with SQL access, translating business terminology into correct database queries proves to be challenging. Terms like &lt;em&gt;attainment&lt;/em&gt;, &lt;em&gt;pipeline&lt;/em&gt;, and &lt;em&gt;forecast&lt;/em&gt; each have unique calculation logic, specific data source requirements, and business rules that vary across organizations. Understanding which tables to join, how metrics are defined, and which filters to apply requires deep institutional knowledge that isn’t readily accessible to most users.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;When building your own solution, consider how your system will encode this deep business context (strategic principles, customer segmentation rules, and operational processes), so users can make faster, data-driven decisions without understanding complex database schemas or SQL syntax.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;How it works: The experience&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Before diving into architecture, here’s what the experience looks like from a user’s perspective.&lt;/p&gt; 
&lt;p&gt;A business user enters a question into a conversational interface asking something like,&amp;nbsp;&lt;em&gt;“How is revenue trending this year compared to last year across our top customer segments?”&lt;/em&gt;&amp;nbsp;Behind the scenes, the system does the following in a matter of seconds:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Understands the question.&lt;/strong&gt;&amp;nbsp;It determines whether this is a single-step lookup or a complex question that must be broken into parts. In this case, it recognizes that “revenue trending,” “year-over-year comparison,” and “top customer segments” each require distinct data retrieval steps.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Retrieves business context.&lt;/strong&gt;&amp;nbsp;The system searches a knowledge graph that encodes your organization’s specific metric definitions, business terminology, table relationships, and data rules. It knows what &lt;em&gt;revenue&lt;/em&gt; means in your environment, which tables contain it, and how &lt;em&gt;customer segment&lt;/em&gt; is defined.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Generates and validates SQL.&lt;/strong&gt;&amp;nbsp;The system produces a structured SQL query, validates it for correctness and safety using deterministic checks, and executes it against your data warehouse. If validation catches an issue, it automatically revises and retries without requiring human intervention.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Synthesizes the answer.&lt;/strong&gt;&amp;nbsp;Raw query results are translated back into a natural language narrative with supporting data, giving users both the insight and the transparency to trust it.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The result is that business users get answers to complex analytical questions in seconds to minutes, with full visibility into the underlying logic. Analysts are relieved from repetitive query work to focus on higher-value strategic analysis.&lt;/p&gt; 
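&lt;p&gt;The four-step flow above can be sketched as a single pipeline function. Everything in the following sketch is an illustrative stand-in: the helper names, the splitting heuristic, and the sample context are invented for the example and are not part of any AWS API.&lt;/p&gt;

```python
# Illustrative sketch of the four-step workflow; every name here is
# hypothetical stand-in code, not part of any AWS API.

def decompose(question):
    # Step 1: split multi-part questions; a single-part question passes through.
    return [q.strip() for q in question.split(" and ")]

def retrieve_context(sub_question):
    # Step 2: stand-in for the knowledge graph (GraphRAG) search.
    return {"table": "revenue_facts", "metric_definition": "SUM(net_revenue)"}

def generate_and_validate_sql(sub_question, context):
    # Step 3: stand-in for structured SQL generation plus deterministic checks.
    return f"SELECT {context['metric_definition']} FROM {context['table']}"

def synthesize(question, results):
    # Step 4: turn raw results into a narrative answer.
    return f"For '{question}', the pipeline produced {len(results)} result set(s)."

def answer_question(question):
    subs = decompose(question)
    results = [generate_and_validate_sql(s, retrieve_context(s)) for s in subs]
    return synthesize(question, results)

print(answer_question("revenue trend this year and top customer segments"))
```

&lt;p&gt;In the real system, each of these stubs maps to a component described in the sections that follow; the point of the sketch is only the control flow.&lt;/p&gt;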
&lt;h2&gt;&lt;strong&gt;Solution overview&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;To deliver this experience, the solution combines three core capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Foundation models (FMs) in &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt;&amp;nbsp;for natural language understanding and SQL generation&lt;/li&gt; 
 &lt;li&gt;Graph Retrieval-Augmented Generation (GraphRAG)&amp;nbsp;for business context retrieval&lt;/li&gt; 
 &lt;li&gt;High-performance data warehouses&amp;nbsp;for fast query execution&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; plays a central role in this architecture by providing both the large language model (LLM) inference layer and the agent orchestration runtime. &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; offers access to a broad selection of FMs, so teams can choose and swap models based on evolving performance, cost, and latency requirements without re-architecting the system.&lt;/p&gt; 
&lt;p&gt;As shown in the architecture diagram:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Runtime&lt;/a&gt;&amp;nbsp;serves as the central orchestration layer, hosting a&amp;nbsp;supervisor Agent&amp;nbsp;that coordinates the end-to-end workflow. It routes user questions, invoking the GraphRAG Search Tool for context retrieval, enforcing Row-Level Security, triggering SQL generation and validation, and executing queries against a database (Amazon Redshift). The runtime supports multiple entry points, including MCP and HTTP protocols, enabling integration with both embedded analytics surfaces like AWS Quick Sight and custom web interfaces.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore&lt;/a&gt; also provides built-in&amp;nbsp;observability, feeding agent execution traces and performance metrics into &lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; for monitoring, debugging, and continuous optimization. This managed runtime alleviates the undifferentiated heavy lifting of building custom agent infrastructure, so teams can focus on business logic, prompt tuning, and domain knowledge enrichment.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The following diagram illustrates how this workflow operates:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft size-full wp-image-127517" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/image-3-11.png" alt="" width="1858" height="720"&gt;&lt;/p&gt; 
&lt;p&gt;The architecture operates as an&amp;nbsp;orchestrated multi-agent system&amp;nbsp;with five key stages:&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Stage 1: Question analysis and decomposition&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;When a question arrives, the question processor&amp;nbsp;first classifies it. Straightforward, atomic, fact-based questions like&amp;nbsp;&lt;em&gt;“What was total revenue in Q4?”&lt;/em&gt; are routed directly to the data retrieval pipeline. Complex or multi-part questions are decomposed into self-contained, independent subquestions that can be processed in parallel by separate agent teams. This decomposition step is what allows the system to handle sophisticated analytical questions that span multiple data domains, time periods, or business dimensions.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Stage 2: Knowledge graph and GraphRAG context retrieval&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;This is where the system solves the&amp;nbsp;context barrier, and it’s the most critical differentiator from naive text-to-SQL approaches.&lt;/p&gt; 
&lt;p&gt;A knowledge graph built on&amp;nbsp;&lt;a href="https://aws.amazon.com/neptune/" target="_blank" rel="noopener noreferrer"&gt;Amazon Neptune&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt;&amp;nbsp;serves as the semantic foundation. It stores your organization’s table ontology and captures the relationships between business entities, metrics, terminology, and organizational hierarchies. Crucially, this graph is enriched with&amp;nbsp;domain knowledge from table owners and subject matter experts for business-specific descriptions, metric definitions, terminology mappings, and classification tags loaded from structured configuration files.&lt;/p&gt; 
&lt;p&gt;When the system processes a question, it performs a&amp;nbsp;lightweight GraphRAG search&amp;nbsp;that works in three phases:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Vector search&amp;nbsp;(using &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt;): Finds semantically relevant column values, column names, and table descriptions that match the concepts in the user’s question.&lt;/li&gt; 
 &lt;li&gt;Graph traversal&amp;nbsp;(using &lt;a href="https://aws.amazon.com/neptune/" target="_blank" rel="noopener noreferrer"&gt;Amazon Neptune&lt;/a&gt;): Follows the relationships in the knowledge graph, from matched values to their parent columns to their parent tables, to build a complete picture of which data assets are relevant and how they connect.&lt;/li&gt; 
 &lt;li&gt;Relevance scoring and filtering: Ranks and structures the retrieved context so the SQL generator receives precisely the information it needs: the right tables, the right columns, the right join paths, and the right business logic.&lt;/li&gt; 
&lt;/ul&gt; 
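&lt;p&gt;As a toy illustration of the three phases, the sketch below substitutes keyword overlap for embedding similarity and an in-memory dict for the Neptune graph. The schema, descriptions, and scoring are invented for the example; a real deployment would use OpenSearch vector queries and Neptune traversals instead.&lt;/p&gt;

```python
# Toy GraphRAG search: (1) semantic match, (2) graph traversal, (3) filtering.

# Phase 1 stand-in: keyword overlap instead of embedding similarity.
def match_score(query_terms, description):
    terms = set(description.lower().split())
    return len(query_terms.intersection(terms)) / max(len(query_terms), 1)

# Knowledge-graph stand-in: column to parent-table edges, plus descriptions.
COLUMNS = {
    "net_revenue": ("sales_facts", "monthly net revenue by customer"),
    "segment_name": ("customer_dim", "customer segment classification"),
    "ship_date": ("logistics_facts", "order shipment date"),
}

def graph_rag_search(question, threshold=0.2):
    query_terms = set(question.lower().split())
    hits = []
    for column, (table, description) in COLUMNS.items():
        score = match_score(query_terms, description)
        if score >= threshold:  # Phase 3: relevance filter
            # Phase 2 analogue: follow the column-to-table edge.
            hits.append({"table": table, "column": column, "score": score})
    return sorted(hits, key=lambda h: -h["score"])

context = graph_rag_search("revenue by customer segment")
print([h["table"] for h in context])  # most relevant tables first
```

&lt;p&gt;The output hands the SQL generator a ranked, filtered view of which tables and columns matter for the question, which is the essential contract of this stage.&lt;/p&gt;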
&lt;p&gt;The knowledge graph and its associated data are&amp;nbsp;refreshed regularly&amp;nbsp;to reflect schema changes, new tables, and evolving business definitions. The richer this contextual layer, the more accurate the downstream SQL generation becomes.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Stage 3: Structured SQL generation and validation&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;The system uses the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/tool-use.html" target="_blank" rel="noopener noreferrer"&gt;function calling&lt;/a&gt; capabilities of Amazon Bedrock&amp;nbsp;to produce SQL queries as structured data. This enforces strict output formats, alleviates the need for fragile post-processing or complex regular expressions, and significantly improves reliability.&lt;/p&gt; 
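&lt;p&gt;A minimal sketch of such a tool definition is shown below. The &lt;code&gt;toolSpec&lt;/code&gt;/&lt;code&gt;inputSchema&lt;/code&gt; shape follows the Amazon Bedrock Converse API documentation for tool use; the tool name and its fields are our own illustrative choices, so verify the exact structure against the current API reference before relying on it.&lt;/p&gt;

```python
# Tool definition forcing the model to emit SQL as structured JSON via the
# Bedrock Converse API's tool-use feature. The outer toolSpec/inputSchema
# shape follows the Converse API docs; the tool name and fields are
# illustrative assumptions.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "emit_sql",
            "description": "Return the generated SQL query as structured data.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "sql": {"type": "string",
                            "description": "Executable SELECT statement"},
                    "tables_used": {"type": "array",
                                    "items": {"type": "string"}},
                    "explanation": {"type": "string"},
                },
                "required": ["sql", "tables_used"],
            }},
        }
    }],
    # Force the model to call this tool rather than reply in free text.
    "toolChoice": {"tool": {"name": "emit_sql"}},
}

print(tool_config["tools"][0]["toolSpec"]["name"])
```

&lt;p&gt;Passed as the &lt;code&gt;toolConfig&lt;/code&gt; parameter of a Converse request, this guarantees the response arrives as parseable JSON matching the schema, which is what removes the need for fragile post-processing.&lt;/p&gt;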
&lt;p&gt;Generated queries then pass through&amp;nbsp;deterministic SQL validators&amp;nbsp;operating at the Abstract Syntax Tree (AST) level. These validators proactively flag potentially risky operations: queries that are syntactically correct but semantically dangerous (for example, unbounded scans, missing filters, or incorrect aggregation logic). When a validator flags an issue, it returns detailed feedback explaining the problem and suggesting a revision.&lt;/p&gt; 
&lt;p&gt;To further enhance robustness, the entire cycle is wrapped in a&amp;nbsp;lightweight SQL generation agent&amp;nbsp;that automatically iterates until it produces a valid, executable query or exhausts a configurable retry limit. This approach aims to deliver&amp;nbsp;significantly better reliability than prompt engineering alone.&lt;/p&gt; 
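&lt;p&gt;The validate-feedback-retry pattern can be sketched as follows. Here the stdlib &lt;code&gt;sqlite3&lt;/code&gt; parser stands in for the warehouse: real validators would inspect a full AST in your warehouse dialect rather than rely on &lt;code&gt;EXPLAIN&lt;/code&gt;, and the list of drafts stands in for successive model outputs after feedback, but the loop structure is the same.&lt;/p&gt;

```python
import sqlite3

# Deterministic validation plus a bounded retry loop. sqlite3 stands in
# for the warehouse's parser; the pattern is validate, feed back, retry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (segment TEXT, fiscal_year INT, amount REAL)")

def validate(sql):
    """Return None if the query passes, else a feedback string."""
    if not sql.lstrip().upper().startswith("SELECT"):
        return "only read-only SELECT statements are allowed"
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)  # syntax and schema check
    except sqlite3.Error as err:
        return str(err)
    return None

def generate_with_retries(drafts, max_attempts=3):
    # `drafts` stands in for the model's successive outputs after feedback.
    for sql in drafts[:max_attempts]:
        if validate(sql) is None:
            return sql
    raise RuntimeError("no valid query within the retry budget")

drafts = [
    "SELECT segment, SUM(amount) FROM revenues",  # wrong table name: rejected
    "SELECT segment, SUM(amount) FROM revenue GROUP BY segment",
]
print(generate_with_retries(drafts))
```

&lt;p&gt;Because the first draft references a nonexistent table, the validator rejects it with concrete feedback and the loop proceeds to the corrected draft, all without human intervention.&lt;/p&gt;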
&lt;h3&gt;&lt;strong&gt;Stage 4: Test-time parallel compute&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;For ambiguous or complex questions, the system can generate&amp;nbsp;multiple potential answers or reasoning paths simultaneously&amp;nbsp;by submitting the same question to parallel agents. Results are synthesized through majority voting, selecting the most reliable output. This is particularly valuable for questions that can be interpreted in multiple ways, and it meaningfully improves both accuracy and robustness.&lt;/p&gt; 
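&lt;p&gt;A minimal sketch of this pattern is shown below; &lt;code&gt;run_agent&lt;/code&gt; is a deterministic stub, whereas in the real system each call would be an independent agent invocation through Amazon Bedrock.&lt;/p&gt;

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Test-time parallel compute: submit the same question to several agents
# concurrently and pick the most common answer by majority vote.

def run_agent(question, seed):
    # Stub standing in for an independent agent run: most agree, one path dissents.
    return "up 12% YoY" if seed % 3 else "up 9% YoY"

def majority_vote(question, n_agents=5):
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: run_agent(question, s), range(n_agents)))
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes

answer, votes = majority_vote("How is revenue trending this year?")
print(answer, votes)
```

&lt;p&gt;Running the agents in a thread pool keeps wall-clock latency close to a single agent’s latency while the vote filters out the occasional divergent interpretation.&lt;/p&gt;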
&lt;h3&gt;&lt;strong&gt;Stage 5: Response synthesis&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Finally, raw query results including numbers, data frames, and execution logs are synthesized into&amp;nbsp;natural language narratives&amp;nbsp;that users receive as actionable answers. Full query transparency is maintained: users can inspect the generated SQL and underlying data at any time, building trust in the system’s outputs.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Key strategies for production-quality results&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Architecture alone isn’t enough. The following strategies, learned from deploying this solution at scale, are essential for achieving the accuracy, safety, and responsiveness that production use demands.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Let end users shape the prompts&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Even among experienced users, individuals often have differing default interpretations of ambiguous terms and varying expectations regarding responses to vague questions. We recommend building a&amp;nbsp;&lt;strong&gt;customization interface&lt;/strong&gt;, such as a web application, so table owners and designated power users can customize prompts within governed boundaries. Customizations should pass through validation guardrails that enforce content policies, restrict prompt injection attempts, and make sure modifications stay within approved templates and parameters. This helps prevent unrestricted free-text modifications while still incorporating domain knowledge and preferences into the system. This customization capability proves essential for achieving the nuanced understanding that different business domains require. Your solution should accommodate these variations rather than enforcing a one-size-fits-all approach.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Treat SQL validation as a safety-critical layer&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Prompt engineering alone can’t remove errors that produce syntactically valid but semantically incorrect SQL. These errors are particularly dangerous because they return plausible-looking results that can silently erode user trust or drive incorrect decisions. Because SQL is a well-defined language,&amp;nbsp;deterministic validators&amp;nbsp;can catch a broad class of these errors before the query reaches your database. In internal testing, this validation layer caught serious errors in generated queries before they could execute. Prioritize it as a non-negotiable safety mechanism.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Optimize aggressively for latency&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Users accustomed to conversational AI expect near-instant responses. While retrieving live data and performing calculations inherently takes longer than answering from a static knowledge base, latency must still be actively managed as a first-class user experience concern. Performance analysis reveals that the cumulative time spent across the workflow’s many steps (context retrieval, SQL generation, validation, and synthesis), rather than SQL execution alone, dominates end-to-end response time and represents the largest optimization opportunity.&lt;/p&gt; 
&lt;p&gt;To optimize, focus on:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Parallel agent execution&lt;/strong&gt;&amp;nbsp;– Process multi-part questions concurrently rather than sequentially. This can dramatically reduce total time for complex queries.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;High-performance analytical storage&lt;/strong&gt;&amp;nbsp;– Use column-oriented databases that excel at the aggregation-heavy workloads typical in business intelligence.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Token optimization&lt;/strong&gt;&amp;nbsp;– Minimize input and output tokens per agent interaction through prompt optimization and response format standardization. Reduce reliance on tool-calling agentic frameworks where each call forces the agent to re-ingest growing context.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With these optimizations, in our deployment, simple SQL queries are typically generated in approximately 3–5 seconds. Actual response times will vary based on factors such as data warehouse performance, query complexity, model selection, and knowledge graph size. We recommend benchmarking against your own environment to establish realistic latency targets for interactive business analysis.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Build security and governance in from the start&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Implement&amp;nbsp;Row-Level Security (RLS)&amp;nbsp;integration so that users only ever see data they are authorized to access. The system maintains composite entitlement tables that enforce access control policies from your existing organizational systems. When a user submits a query, appropriate RLS filters are&amp;nbsp;automatically injected&amp;nbsp;into the generated SQL before execution. They’re transparent to the user, but rigorous in enforcement. Design this layer to uphold strict data governance standards without adding friction to the user experience.&lt;/p&gt; 
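&lt;p&gt;One simple, dialect-agnostic way to inject such a filter is to wrap the generated query in a filtered subquery, as in the sketch below. A production implementation would typically rewrite the query’s AST instead, and the entitlement table here is a hypothetical stand-in for entitlements synced from your organizational systems.&lt;/p&gt;

```python
# Illustrative row-level security injection: wrap the generated SQL in a
# filtered subquery so the entitlement filter applies to its full output.
# ENTITLEMENTS is a toy stand-in for composite entitlement tables.

ENTITLEMENTS = {
    "alice": ["ENTERPRISE", "STARTUP"],  # segments alice may see
    "bob": ["STARTUP"],
}

def apply_rls(sql, user, column="segment"):
    allowed = ENTITLEMENTS.get(user, [])
    if not allowed:
        raise PermissionError(f"{user} has no entitlements")
    quoted = ", ".join(f"'{seg}'" for seg in allowed)
    # The subquery wrapper keeps the rewrite independent of the inner
    # query's structure (joins, GROUP BY, and so on).
    return f"SELECT * FROM ({sql}) AS q WHERE q.{column} IN ({quoted})"

secured = apply_rls(
    "SELECT segment, SUM(amount) AS amount FROM revenue GROUP BY segment",
    "bob",
)
print(secured)
```

&lt;p&gt;Because the filter is appended programmatically rather than by the model, the entitlement check is deterministic and cannot be bypassed by a cleverly worded question.&lt;/p&gt;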
&lt;h2&gt;&lt;strong&gt;Implementation results and impact&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;After you follow the architecture and strategies outlined in this post, a text-to-SQL solution can deliver significant improvements in data accessibility and analytical productivity:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Speed improvements&lt;/strong&gt; deliver answers to complex business questions in minutes, compared to hours or days with traditional approaches. Questions requiring multi-table joins, temporal calculations, and hierarchical aggregations that previously required custom SQL development become accessible through natural language.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Analytical democratization&lt;/strong&gt; helps non-technical business users across sales operations, financial planning, and executive leadership perform sophisticated data analysis without SQL expertise. This typically reduces analytical workload on data engineering teams, allowing them to focus on strategic initiatives rather than repetitive query requests.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Complex query handling&lt;/strong&gt; supports multi-dimensional revenue analysis with the following capabilities: 
  &lt;ul&gt; 
    &lt;li&gt;Automatic segmentation&lt;/li&gt; 
    &lt;li&gt;Year-over-year and month-over-month trending with variance explanations&lt;/li&gt; 
    &lt;li&gt;Customer intelligence at granular levels with usage patterns&lt;/li&gt; 
    &lt;li&gt;Forecast variance analysis with target comparisons&lt;/li&gt; 
    &lt;li&gt;Cross-functional benchmarking across time periods and business units&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Looking forward&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Text-to-SQL solutions powered by &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; represent a significant step forward in making data analytics accessible to business users. The multi-agent architecture using &lt;a href="https://aws.amazon.com/bedrock/agents/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Agents&lt;/a&gt; supports complex query decomposition and parallel processing, while knowledge graphs provide business context and semantic understanding. Together, these components deliver accurate, fast, and accessible analytics that empower business users to make data-driven decisions without technical barriers.&lt;/p&gt; 
&lt;p&gt;As you build your own solution, consider expanding knowledge graph coverage to additional business domains, optimizing response latency through advanced caching strategies, and integrating with more enterprise data sources. &lt;a href="https://aws.amazon.com/bedrock/guardrails/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Guardrails&lt;/a&gt; offer enhanced output validation and safety capabilities worth exploring, while &lt;a href="https://aws.amazon.com/bedrock/flows/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Flows&lt;/a&gt; provide sophisticated orchestration patterns for agentic workflows.&lt;/p&gt; 
&lt;p&gt;The FM flexibility, agent orchestration capabilities, and knowledge base integration available through &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; continue to evolve, making data analysis increasingly intuitive and powerful for business users across organizations.&lt;/p&gt; 
&lt;p&gt;To build your own text-to-SQL solution, explore the &lt;a href="https://docs.aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock User Guide&lt;/a&gt;, participate in an &lt;a href="https://builder.aws.com/build/workshops?trk=aca14daf-abad-48ab-b076-80aef7f8194d&amp;amp;sc_channel=el" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Workshop&lt;/a&gt;, and review our guide on&amp;nbsp;&lt;a href="https://aws.amazon.com/blogs/machine-learning/building-generative-ai-agents-with-amazon-bedrock/" target="_blank" rel="noopener noreferrer"&gt;Building generative AI agents with Amazon Bedrock&lt;/a&gt;. For the latest developments, see &lt;a href="https://aws.amazon.com/new/" target="_blank" rel="noopener noreferrer"&gt;What’s New with AWS&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;We extend our sincere gratitude to our executive sponsors and mentors whose vision and guidance made this initiative possible:&amp;nbsp;&lt;a href="https://www.linkedin.com/in/aizazmanzar/" target="_blank" rel="noopener noreferrer"&gt;Aizaz Manzar&lt;/a&gt;, Director of AWS Global Sales;&amp;nbsp;&lt;a href="https://www.linkedin.com/in/aliimam27/" target="_blank" rel="noopener noreferrer"&gt;Ali Imam&lt;/a&gt;, Head of Startup Segment; and&amp;nbsp;&lt;a href="https://www.linkedin.com/in/akhand17/" target="_blank" rel="noopener noreferrer"&gt;Akhand Singh&lt;/a&gt;, Head of Data Engineering.&lt;/p&gt; 
&lt;hr&gt; 
&lt;footer&gt; 
 &lt;h2&gt;&lt;strong&gt;About the Authors&lt;/strong&gt;&lt;/h2&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-thumbnail wp-image-127507" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-2-1-100x150.jpeg" alt="" width="100" height="150"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Monica Jain&lt;/h3&gt; 
  &lt;p&gt;Monica Jain is a Senior Technical Product Manager at AWS Global Sales and an analytics professional driving AI-powered sales intelligence at scale. She leads the development of generative AI and ML-powered data products, including knowledge graphs, AI-augmented analytics, natural language query systems, and recommendation engines, that improve seller productivity and decision-making. Her work enables AWS executives and sellers worldwide to access real-time insights and accelerate data-driven customer engagement and revenue growth.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-thumbnail wp-image-127506" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-3-1-100x124.jpeg" alt="" width="100" height="124"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Damien Forthomme&lt;/h3&gt; 
  &lt;p&gt;Damien Forthomme is a Senior Applied Scientist at AWS, leading a Data Science team in the AWS Sales, Marketing, and Global Services (SMGS) org. With 10+ years of experience and a PhD in Physics, he focuses on leveraging and building advanced machine learning and GenAI tools to surface the right data to the right people at the right time. His work encompasses initiatives such as forecasting, recommendation systems, core foundational datasets creation, and building GenAI products that enhance sales productivity for our org.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-thumbnail wp-image-127505" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-4-1-100x132.png" alt="" width="100" height="132"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Matheus Cachoeira&lt;/h3&gt; 
  &lt;p&gt;Matheus Cachoeira is a Senior Product Manager in the AWS Sales, Marketing, and Global Services (SMGS) org. He has been with AWS for over 7 years, focusing on Sales and Revenue Planning. Passionate about solving complex problems at the intersection of data, AI, and business, he specializes in creating solutions that require deep business context and comprehensive domain knowledge.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-thumbnail wp-image-127511" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-5-2-100x133.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Meng Feng&lt;/h3&gt; 
  &lt;p&gt;Meng Feng is an Applied Scientist at AWS, where he develops automated solutions for data query, forecasting, and analysis, leveraging artificial intelligence and machine learning. He has a background in robotics, reinforcement learning, and planning. At AWS, he is passionate about applying cutting-edge technology to solve real-world challenges, focusing on selecting the most effective tools for the job to deliver impactful results.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full wp-image-127503" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-6-1.jpeg" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Norman Braddock&lt;/h3&gt; 
  &lt;p&gt;Norman Braddock, Senior Manager of AI Product Management at AWS, is a product leader driving the transformation of business intelligence through agentic AI. He leads the Analytics &amp;amp; Insights Product Management team within Sales, Marketing, and Global Services (SMGS), delivering products that bridge AI model performance with measurable business impact. With a background spanning procurement, manufacturing, and sales operations, he combines deep operational expertise with product innovation to shape the future of autonomous business management.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-thumbnail wp-image-127502" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-7-1-100x113.png" alt="" width="100" height="113"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Terry Ding&lt;/h3&gt; 
  &lt;p&gt;Terry Ding is a Senior Applied Scientist at AWS, working within the AWS Sales, Marketing, and Global Services (SMGS) organization. With deep expertise in Large Language Models (LLMs) and Generative AI, he specializes in designing, developing, and productionizing GenAI applications at scale. His work spans the full lifecycle of AI solutions—from conducting rapid proof-of-concepts (POCs) to deploying production-ready systems that drive measurable business impact.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-thumbnail wp-image-127501" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/31/ML-20289-image-8-1-100x131.png" alt="" width="100" height="131"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sujit Narapareddy&lt;/h3&gt; 
   &lt;p&gt;Sujit Narapareddy, Head of Data &amp;amp; Analytics at AWS Global Sales, is a technology leader driving global enterprise transformation. He leads data product and system teams that power AWS’s Go-to-Market through AI-augmented analytics and intelligent automation. With a proven track record in enterprise solutions, he has transformed sales productivity, data governance, and operational excellence. Previously at JPMorgan Chase Business Banking, he shaped next-generation FinTech capabilities through data innovation.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Build AI-powered employee onboarding agents with Amazon Quick</title>
		<link>https://aws.amazon.com/blogs/machine-learning/build-ai-powered-employee-onboarding-agents-with-amazon-quick/</link>
					
		
		<dc:creator><![CDATA[Pegah Ojaghi]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 18:00:06 +0000</pubDate>
				<category><![CDATA[Amazon Quick Suite]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">c4b3cc442350a2b8fa21d74a633d942bbc24da1d</guid>

					<description>In this post, we walk through building a custom HR onboarding agent with Quick. We show how to configure an agent that understands your organization’s processes, connects to your HR systems, and automates common tasks, such as answering new-hire questions and tracking document completion.</description>
										<content:encoded>&lt;p&gt;Enterprises often struggle to onboard new team members at scale. Human resources (HR) teams spend time on manual tasks that delay productivity, from processing documents to answering repeated questions about benefits and policies. For organizations with many new hires, these steps make it harder to keep onboarding consistent and compliant. Organizations lose substantial amounts of time per day per new hire during onboarding, with new employees typically reaching only a fraction of their potential productivity in the first month. &lt;a href="https://aws.amazon.com/quick" target="_blank" rel="noopener"&gt;Amazon Quick&lt;/a&gt; is a fully managed agentic service. With it, HR departments can create no-code onboarding agents that answer new-hire questions, track compliance across existing tools, and clear tickets automatically so that new hires can ramp faster with less manual work.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through building a custom HR onboarding agent with Quick. We show how to configure an agent that understands your organization’s processes, connects to your HR systems, and automates common tasks, such as answering new-hire questions and tracking document completion. You can adapt this solution to your onboarding workflow so new hires get consistent answers and HR teams reclaim time previously spent on routine inquiries.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Key components of Amazon Quick&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Quick transforms employee onboarding from scattered documents and manual processes into an intelligent, connected experience through the following integrated components:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Knowledge&lt;/strong&gt; &lt;strong&gt;bases&lt;/strong&gt; – Indexed content from external sources like SharePoint, OneDrive, and Confluence, as well as internal content including internal websites, file uploads, and &lt;a href="http://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3) buckets. A knowledge base serves as a single searchable repository, so new hires get comprehensive answers from multiple sources instead of searching through disconnected files.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Actions (action connectors) &lt;/strong&gt;– Secure, permission-aware integrations that enable AI agents to take real action in HR onboarding scenarios—creating ServiceNow IT equipment requests, sending Slack welcome messages to team channels, or updating onboarding workflows in project management tools—rather than just providing links to forms.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Spaces&lt;/strong&gt; – Focused environments that organize team-centered assets including files, business intelligence artifacts (such as dashboards and topics), knowledge bases, and actions with sharing controls for team collaboration.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Quick can help HR teams create specialized onboarding assistants that combine knowledge access with automated tasks. You can use the built-in system agent (“My assistant”) for immediate help or create custom chat agents tailored to your organization’s specific onboarding needs, such as a dedicated HR onboarding assistant that knows your company policies and can automatically handle common requests like IT setup or benefits enrollment.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;This solution uses a custom chat agent in Quick for employee onboarding. Without an agent, HR might switch between wikis, SharePoint, ticketing, chat, and email to coordinate each step. With Quick, the agent presents the latest checklist from the HR space, answers with approved language, opens requests through actions, notifies stakeholders, and points the employee to the next step. Confirmations and status remain in the HR tools, and the agent reads or updates them through actions or flows. The following diagram illustrates the solution architecture.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="wp-image-123113 size-full aligncenter" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/Screenshot-2026-01-20-at-9.50.49 AM.png" alt="" width="2038" height="1016"&gt;&lt;/p&gt; 
&lt;p&gt;Implementing the solution consists of the following high-level steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Create the chat agent in Quick.&lt;/li&gt; 
 &lt;li&gt;Attach the HR space and link knowledge sources.&lt;/li&gt; 
 &lt;li&gt;Add actions.&lt;/li&gt; 
 &lt;li&gt;Test with real questions and tasks, then share with employees.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Quick provides two types of chat agents that facilitate this onboarding solution: the system chat agent (“My assistant”) and custom chat agents. “My assistant” appears on the Amazon Quick console by default and helps users ask questions and complete tasks using resources they are allowed to access. Users can interact with the system agent in multiple ways:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Ask general questions using the agent’s built-in knowledge by choosing &lt;strong&gt;General knowledge&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Upload their own files directly in chat (up to 20 files per conversation) for analysis and questions.&lt;/li&gt; 
 &lt;li&gt;Control the conversation scope by choosing from three modes: &lt;strong&gt;All data &amp;amp; apps&lt;/strong&gt; (searches across all accessible resources), &lt;strong&gt;General knowledge&lt;/strong&gt; (uses only built-in knowledge), or &lt;strong&gt;Specific data &amp;amp; apps&lt;/strong&gt; (targets particular spaces, &lt;a href="https://docs.aws.amazon.com/quicksuite/latest/userguide/working-with-dashboards.html" target="_blank" rel="noopener noreferrer"&gt;dashboards&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/quicksuite/latest/userguide/topics.html" target="_blank" rel="noopener noreferrer"&gt;topics&lt;/a&gt;, knowledge bases, or actions).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For example, a user might upload their employee handbook and ask, “What’s our remote work policy?” or select the HR space and ask, “How do I enroll in the health insurance plan?” The system agent is available immediately with no configuration required and adapts its responses based on the selected scope and available resources.&lt;/p&gt; 
&lt;p&gt;Custom agents help you build specialized assistants for your business needs. You configure behavior (purpose, tone, response format); attach spaces with dashboards, topics, and knowledge bases for grounded answers; and link action connectors so the agent can perform tasks in tools like Jira, Slack, ServiceNow, Salesforce, Outlook, or Teams. You can share custom agents with specific users or groups. Custom agents offer the following capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Use case-specific responses &lt;/strong&gt;– Define the agent’s persona and response style tailored to specific business workflows and requirements.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Guidance through reference documents&lt;/strong&gt; – Upload specific documents that serve as response templates for consistent messaging and process guides for following specific steps.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Comprehensive data integration&lt;/strong&gt; – Link spaces to the agent to give it access to different types of searchable content and knowledge sources, including dashboards for analytics, topics for structured datasets, knowledge bases for external, unstructured document repositories, and local files uploaded directly to the space for additional information. This helps the agent answer questions using different relevant data within the organization’s permission structure.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Automated actions&lt;/strong&gt; – Add action connectors so users can create Jira tickets, send Slack messages, update Salesforce, or open ServiceNow requests directly from chat.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Collaboration&lt;/strong&gt; – Test, refine, and share agents with teammates. Administrators can control who can create and customize agents through user subscriptions and custom permissions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;You can use the system chat agent for general assistance across Quick, or create a custom agent tailored to a workflow such as HR onboarding. In that case, you define instructions, attach the HR space or knowledge base, and enable actions for requests and notifications.&lt;/p&gt; 
&lt;p&gt;In the following sections, we walk through the steps to implement this solution using two personas: the HR administrator who sets up and shares the agent, and the employee who completes onboarding tasks with the agent.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you begin, make sure you have completed the following steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Create an AWS account. For more information, see &lt;a href="https://docs.aws.amazon.com/accounts/latest/reference/manage-acct-creating.html" target="_blank" rel="noopener noreferrer"&gt;Create an AWS account&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Confirm you have access to Quick.&lt;/li&gt; 
 &lt;li&gt;Have at least one Amazon Quick Enterprise subscription to configure actions and create knowledge bases. Users who only use the shared agent can be on the Amazon Quick Professional subscription.&lt;/li&gt; 
 &lt;li&gt;Go to &lt;a href="https://www.atlassian.com/get-started" target="_blank" rel="noopener noreferrer"&gt;Get started with Atlassian Cloud&lt;/a&gt; and create a free site, selecting both &lt;strong&gt;Confluence&lt;/strong&gt; and &lt;strong&gt;Jira&lt;/strong&gt; on the Free plan (up to 10 users). 
  &lt;ol type="a"&gt; 
   &lt;li&gt;In Confluence, create an “HR Onboarding” space to store your HR content.&lt;/li&gt; 
   &lt;li&gt;In Jira, create a simple HR onboarding project that the agent can use for access or equipment requests in the &lt;strong&gt;Add actions&lt;/strong&gt; section.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;Download the ZIP file from the &lt;a href="https://catalog.us-east-1.prod.workshops.aws/workshops/119307ce-4c43-4e96-887c-cd8454b3d229/en-US/0100-introduction/0110-workshop-materials" target="_blank" rel="noopener noreferrer"&gt;HR onboarding workshop materials page&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;From the &lt;strong&gt;HR documents&lt;/strong&gt; folder in the ZIP file, upload the following files into your HR Onboarding Confluence space: 
  &lt;ol type="a"&gt; 
   &lt;li&gt;&lt;code&gt;employee_handbook.pdf&lt;/code&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;code&gt;leave_policy.pdf&lt;/code&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;code&gt;onboarding_checklist.pdf&lt;/code&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;code&gt;performance_review_guidelines.pdf&lt;/code&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;code&gt;public_holidays.csv&lt;/code&gt; (optional, used later for reporting or analytics)&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;If your organization already uses a corporate Confluence site, you might not have permission to create spaces or upload sample files unless you request additional access from your Confluence administrator. To experience the value of Quick without waiting on admin changes, use a separate Atlassian Cloud site to follow this post.&lt;/p&gt; 
&lt;h2&gt;Implementation steps&lt;/h2&gt; 
&lt;h2&gt;HR administrator&lt;/h2&gt; 
&lt;p&gt;The following sequence diagram shows how the HR administrator creates, configures, and shares the HR onboarding agent in Quick.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123129 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-2-5.png" alt="" width="1428" height="662"&gt;&lt;/p&gt; 
&lt;h3&gt;Create chat agent&lt;/h3&gt; 
&lt;p&gt;First, you create the chat agent itself, which becomes the single place where new hires ask questions and get guided through onboarding:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On the Quick console, choose &lt;strong&gt;Chat agents&lt;/strong&gt; in the navigation pane, then choose &lt;strong&gt;Create&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Enter a simple natural language prompt describing what you want your agent to do (for example, “Help new employees with HR onboarding questions and equipment requests”).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Quick will automatically expand your prompt into a detailed persona and response instructions and scan your available resources to link relevant spaces and action connectors to the agent.&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Review the generated agent configuration and refine it as needed; updating the preview saves your versions within the session.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Launch chat agent&lt;/strong&gt; when you are satisfied.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123130 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-3-8.png" alt="" width="936" height="596"&gt;&lt;/p&gt; 
&lt;h3&gt;Configure behavior&lt;/h3&gt; 
&lt;p&gt;Next, you shape how the agent should respond so its tone, scope, and guardrails match your HR policies and brand voice:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Agent metadata &lt;/strong&gt;– Update the agent’s name, description, welcome message, and starter prompts to help users discover and use the chat agent properly. These elements serve as the first impression and guide users on how to interact effectively with your HR assistant.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Agent instructions &lt;/strong&gt;– Review and update the automatically generated persona instructions, response format, tone, and length settings from the previous step. The system-generated inputs provide a solid foundation, but you can fine-tune to match your organization’s specific HR communication style and requirements.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Reference documents &lt;/strong&gt;– Upload specific guidance documents that provide the highest priority instructions for agent behavior. These reference documents will be followed as prescribed while you can use the instruction fields to provide high-level guidance on behavior and goals.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123131 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-4-7.png" alt="" width="1328" height="916"&gt;&lt;/p&gt; 
&lt;h3&gt;Connect HR knowledge&lt;/h3&gt; 
&lt;p&gt;Now you connect your HR knowledge sources so the agent answers from approved handbooks and policies instead of inventing its own language:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Create or choose an existing HR space that holds handbooks, policies, and checklists. By configuring the agent’s knowledge scope to focus specifically on HR-related content, you make sure responses stay within appropriate boundaries and don’t access unrelated organizational data.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Upload files&lt;/strong&gt; to upload files to the space, including: 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Employee handbooks and policy documents&lt;/li&gt; 
   &lt;li&gt;Benefits information and FAQ documents&lt;/li&gt; 
   &lt;li&gt;Training materials and guides&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;Link knowledge sources such as SharePoint or a wiki.&lt;/li&gt; 
 &lt;li&gt;Link the configured space to your agent so it can access this approved searchable content for grounded responses.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123137 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/Screenshot-2026-01-20-at-12.14.27 PM.png" alt="" width="2146" height="1372"&gt;&lt;/p&gt; 
&lt;h3&gt;Add actions&lt;/h3&gt; 
&lt;p&gt;After the agent can answer questions, you add actions so it can also trigger work in your HR tools, such as tickets, requests, and notifications:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the &lt;strong&gt;Actions&lt;/strong&gt; card and choose &lt;strong&gt;Link actions&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select from available action connectors that you have already configured. For the HR onboarding use case, this could include tools such as Jira (to create and update tickets), ServiceNow (to manage incidents), or Microsoft Outlook (to send emails).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Only action connectors configured with the necessary OAuth details can be linked to the agent, so end users can authenticate individually during their chat. Update your reference documents and persona instructions to specify when to invoke specific action connectors. For example: “When an employee requests equipment, use the ServiceNow connector to create a hardware request ticket,” or “For access requests, create a Jira ticket in the IT-Access project with priority set to ‘Normal.’”&lt;/p&gt; 
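The connector is configured in the console rather than in code, but it can help to see the shape of the request such a connector issues on the employee's behalf. The following sketch builds a create-issue payload for Jira Cloud's REST API v3; the project key, summary, and field values are illustrative and not taken from this walkthrough.

```python
import json

def build_issue_payload(project_key, summary, description, issue_type="Task"):
    """Build the JSON body for Jira Cloud's POST /rest/api/3/issue endpoint."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            # Jira Cloud API v3 expects descriptions in Atlassian Document Format
            "description": {
                "type": "doc",
                "version": 1,
                "content": [{"type": "paragraph",
                             "content": [{"type": "text", "text": description}]}],
            },
            "issuetype": {"name": issue_type},
        }
    }

# Hypothetical HR onboarding project key and request details
payload = build_issue_payload(
    project_key="HR",
    summary="Laptop request for new hire",
    description="New hire needs a standard laptop by Day 1.",
)
print(json.dumps(payload, indent=2))

# Sending the request requires a real site URL and API token, for example:
# import requests
# requests.post("https://your-site.atlassian.net/rest/api/3/issue",
#               json=payload, auth=("you@example.com", "API_TOKEN"))
```

The agent handles this request construction for you; the sketch only shows why the connector needs a project key and OAuth credentials up front.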
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123133 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-6-5.png" alt="" width="1429" height="374"&gt;&lt;/p&gt; 
&lt;h3&gt;Customize, test, and share&lt;/h3&gt; 
&lt;p&gt;Finally, customize the agent with a welcome message and suggested prompts. Test the agent with real questions and tasks in the preview chat, tune the experience, and share it with a pilot group so HR can validate the workflow before broad rollout.&lt;/p&gt; 
&lt;p&gt;When you’re ready, launch the agent, and it will be available in your personal library for private use. To share it with others, choose &lt;strong&gt;Share&lt;/strong&gt; and add users and user groups as viewers so they can use the agent. You can also make other users on your team owners so they can edit and test the agent with you. HR managers can share the custom agent with new employees by using the sharing options in the navigation pane to grant access to specific team members or groups.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123134 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-7-3.png" alt="" width="936" height="458"&gt;&lt;/p&gt; 
&lt;h2&gt;Employee&lt;/h2&gt; 
&lt;p&gt;The following sequence diagram shows how an employee uses the onboarding agent to complete required tasks and track their Day 1 progress in one place.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123135 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-8-3.png" alt="" width="1432" height="556"&gt;&lt;/p&gt; 
&lt;h3&gt;Use the onboarding agent&lt;/h3&gt; 
&lt;p&gt;After the agent is published and shared with employees as viewers, they can open it from the link HR provides (for example, in their Day 1 email or HR portal) or from the chat agents list in Quick, and then use it as follows:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;The employee opens the shared HR onboarding agent from the link or from the chat agents list and starts a new Day 1 conversation.&lt;/li&gt; 
 &lt;li&gt;The agent shows the latest onboarding checklist from the HR Onboarding space and provides links to required forms, training, and internal pages so the employee can move through the steps in order.&lt;/li&gt; 
 &lt;li&gt;The employee asks policy or benefits questions in plain language, and the agent answers using content from the HR Onboarding space and connected HR knowledge sources so responses match HR-approved language.&lt;/li&gt; 
 &lt;li&gt;In this setup, when the employee requests equipment or application access, the agent uses a Jira action connector to create an issue in the HR onboarding project and returns the issue key and link, so the request is visible end to end without touching production HR systems.&lt;/li&gt; 
 &lt;li&gt;For sensitive steps such as I-9 verification, tax forms, or direct deposit, the agent directs the employee to the appropriate HR system or secure portal instead of collecting documents in chat so sensitive data stays in the right place.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-123136 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/01/20/image-9-3.png" alt="" width="936" height="612"&gt;&lt;/p&gt; 
&lt;p&gt;As an employee, the experience is simple: they open a single chat, see their Day 1 checklist, ask questions in natural language, and let the agent open requests and point them to the right systems. Instead of juggling emails, portals, and tickets, onboarding feels like a guided conversation where each next step is clear.&lt;/p&gt; 
&lt;p&gt;You have now set up the HR Onboarding Confluence space with sample HR documents, created a custom onboarding agent in Quick, configured its behavior, connected HR knowledge, and added Jira actions for requests. You can use this setup as a proof of concept with a small group of new hires or HR partners, then extend it by adding more content, additional actions, or new spaces for other HR workflows such as performance reviews or policy updates.&lt;/p&gt; 
&lt;h2&gt;Guardrails and safety&lt;/h2&gt; 
&lt;p&gt;Quick includes built-in safety and content controls for chat agents, so you can follow along with this post using the default settings in your account. If you want to experiment with policy controls as part of this proof of concept, you can also add a small list of blocked words or phrases so the agent avoids specific terms in HR responses (for example, informal slang or discouraged wording). Blocked terms are configured on the Quick console and applied across agents in your account. For step-by-step instructions and additional security options such as access control and encryption, see the &lt;a href="https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick User Guide&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Quick tiers&lt;/h2&gt; 
&lt;p&gt;Quick offers two user subscriptions: Professional and Enterprise. Professional supports everyday use of chat agents and spaces, running &lt;a href="https://aws.amazon.com/quicksuite/flows/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Flows&lt;/a&gt; and &lt;a href="https://aws.amazon.com/quicksuite/research/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Research&lt;/a&gt;, and viewing &lt;a href="https://aws.amazon.com/quicksight" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight&lt;/a&gt; dashboards, with the ability to create and share custom agents and spaces. Enterprise includes everything in Professional plus advanced authoring features such as configuring actions, creating knowledge bases, building automations in &lt;a href="https://aws.amazon.com/quicksuite/automate/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Automate&lt;/a&gt;, and authoring dashboards in Quick Sight, with larger monthly usage allowances. A 30‑day free trial is available for up to 25 users per account. For details, refer to &lt;a href="https://aws.amazon.com/quicksuite/pricing/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick pricing&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;This post showed how to build an HR onboarding chat agent in Quick, attach HR content, add actions and optional flows, and share it with employees. Start with a pilot that covers your most frequent questions and two or three requests, review usage, and refine the agent’s instructions and content. For next steps, expand the HR space, add additional actions as needed, and review &lt;a href="https://docs.aws.amazon.com/quicksuite/latest/userguide/managing-spaces.html" target="_blank" rel="noopener noreferrer"&gt;the Quick documentation&lt;/a&gt; for advanced configuration. Beyond onboarding, HR teams can explore building agents for employee self-service, performance management, talent acquisition, learning and development, analytics, and off-boarding processes to transform their entire HR operations.&lt;/p&gt; 
&lt;p&gt;Ready to transform your workplace productivity? Get started with Quick and explore pricing options that fit your needs. &lt;a href="http://aws.amazon.com/quicksuite/getting-started" target="_blank" rel="noopener noreferrer"&gt;Begin building your own HR agent&lt;/a&gt;, explore our official &lt;a href="http://aws.amazon.com/quicksuite/getting-started" target="_blank" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for detailed implementation guidance, or contact your AWS account team to discuss how Quick can transform your organization’s approach to data-driven decision-making.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-124466 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/02/17/IMG_8068-100x141.jpg" alt="" width="100" height="141"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Pegah Ojaghi&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Pegah Ojaghi&lt;/strong&gt; is a Generative AI Applied Architect at AWS with a PhD in Computer Science focused on large language models, generative AI, and reinforcement learning. Her expertise and research span foundation model development, RLHF techniques, and novel optimization methods for LLMs. Her passion is translating cutting-edge research into production systems across healthcare, financial services, and insurance industries.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-99978" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/02/18/chinrane-1-79x100.png" alt="" width="79" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Chinmayee Rane&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Chinmayee Rane&lt;/strong&gt; is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps ISVs accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="wp-image-124553 size-full alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/02/17/author-ebbey.png" alt="" width="100" height="96"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ebbey Thomas&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Ebbey Thomas&lt;/strong&gt; is a Senior Generative AI Specialist Solutions Architect at AWS. He holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University. Outside of work, he enjoys coffee, the outdoors, workouts, road trips, and time with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="wp-image-111412 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/07/10/sonali-sahu-100.jpg" alt="" width="83" height="111"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sonali Sahu&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Sonali Sahu&lt;/strong&gt; is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Accelerate agentic tool calling with serverless model customization in Amazon SageMaker AI</title>
		<link>https://aws.amazon.com/blogs/machine-learning/accelerate-agentic-tool-calling-with-serverless-model-customization-in-amazon-sagemaker-ai/</link>
					
		
		<dc:creator><![CDATA[Lauren Mullennex]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 17:54:00 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon SageMaker AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">a3099ac05668413af10c59fe313b0a0cb8a0d4aa</guid>

					<description>In this post, we walk through how we fine-tuned Qwen 2.5 7B Instruct for tool calling using RLVR. We cover dataset preparation across three distinct agent behaviors, reward function design with tiered scoring, training configuration and results interpretation, evaluation on held-out data with unseen tools, and deployment.</description>
										<content:encoded>&lt;p&gt;Agentic tool calling is what makes AI agents useful in production. It’s how they query databases, trigger workflows, retrieve real-time data, and act on a user’s behalf. But base models frequently hallucinate tools, pass bad parameters, and attempt actions when they should ask for clarification. These failures erode trust and block production deployment.&lt;/p&gt; 
&lt;p&gt;You can use &lt;a href="https://aws.amazon.com/sagemaker/ai/model-customization/" target="_blank" rel="noopener noreferrer"&gt;Serverless model customization&lt;/a&gt; in Amazon SageMaker AI to fix these problems without managing infrastructure. With Reinforcement Learning with Verifiable Rewards (RLVR), the model generates its own candidate responses, receives a reward signal indicating quality, and updates its behavior to favor what works. You select a model, configure a technique, point to your data and reward function, and SageMaker AI handles the rest. In this post, we walk through how we fine-tuned Qwen 2.5 7B Instruct for tool calling using RLVR. We cover dataset preparation across three distinct agent behaviors, reward function design with tiered scoring, training configuration and results interpretation, evaluation on held-out data with unseen tools, and deployment. By the end, our fine-tuned model improved tool call reward by 57% over the base model on scenarios that it didn’t see during training.&lt;/p&gt; 
&lt;p&gt;Because tool calling has a naturally verifiable objective, whether the model called the right function with the right parameters, it maps well to RLVR. The challenge with self-managed reinforcement learning (RL) is the operational overhead. GPU procurement, memory orchestration between rollout and training phases, reward infrastructure, and checkpointing add up quickly. Hyperparameter sensitivity adds another layer of complexity. SageMaker AI takes on that work so you can focus on your model, your data, and your reward function.&lt;/p&gt; 
&lt;p&gt;SageMaker AI supports model families including&amp;nbsp;&lt;a href="https://aws.amazon.com/nova/?trk=769a1a2b-8c19-4976-9c45-b6b1226c7d20&amp;amp;sc_channel=el" target="_blank" rel="noopener noreferrer"&gt;Amazon Nova&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/bedrock/openai/?trk=769a1a2b-8c19-4976-9c45-b6b1226c7d20&amp;amp;sc_channel=el" target="_blank" rel="noopener noreferrer"&gt;GPT-OSS&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/bedrock/meta/?trk=769a1a2b-8c19-4976-9c45-b6b1226c7d20&amp;amp;sc_channel=el" target="_blank" rel="noopener noreferrer"&gt;Llama&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/bedrock/qwen/?trk=769a1a2b-8c19-4976-9c45-b6b1226c7d20&amp;amp;sc_channel=el" target="_blank" rel="noopener noreferrer"&gt;Qwen&lt;/a&gt;, and&amp;nbsp;&lt;a href="https://aws.amazon.com/bedrock/deepseek/?trk=769a1a2b-8c19-4976-9c45-b6b1226c7d20&amp;amp;sc_channel=el" target="_blank" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt;, with techniques including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), RLVR, and Reinforcement Learning from AI Feedback (RLAIF). Training and validation metrics are tracked through integrated MLflow.&lt;/p&gt; 
&lt;h2&gt;Why RLVR for tool calling&lt;/h2&gt; 
&lt;p&gt;SFT requires labeled examples of each behavior that you want the model to learn. For tool calling, that means examples of calling a tool, asking for clarification, and refusing. But tool calling also requires the model to decide between those behaviors, and SFT can struggle to generalize that decision-making beyond the specific patterns in its training data.&lt;/p&gt; 
&lt;p&gt;RLVR works differently. For each prompt, the model generates multiple candidate responses (we use eight). A reward function verifies which ones are correct. The model then updates its policy to favor what worked, using Group Relative Policy Optimization (GRPO). GRPO compares each candidate’s reward score against the mean score of the group and reinforces responses that score above average. Over time, the model learns the format of a tool call and when to call compared to when to ask.&lt;/p&gt; 
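&lt;p&gt;The group-relative part of GRPO is easy to sketch numerically. The following is an illustrative computation, not the SageMaker AI internals: rewards for the candidates in a group are centered on the group mean and scaled by the group standard deviation, and candidates with a positive advantage are reinforced.&lt;/p&gt;

```python
# Illustrative GRPO-style advantage computation (a sketch, not the
# SageMaker AI implementation): rewards are centered on the group mean.

def group_relative_advantages(rewards):
    """Center each candidate's reward on the group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # Guard against division by zero when every candidate scores identically
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Eight rollouts for one prompt, scored by a tiered reward function
rewards = [1.0, 0.5, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
# Candidates above the group mean (0.375 here) get positive advantages and
# are reinforced; those below get negative advantages and are discouraged.
```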
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To use serverless model customization in SageMaker AI, you must have the following prerequisites:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account&lt;/li&gt; 
 &lt;li&gt;An&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-prereq.html" target="_blank" rel="noopener noreferrer"&gt;AWS IAM role&amp;nbsp;&lt;/a&gt;with the required permissions&lt;/li&gt; 
 &lt;li&gt;A SageMaker AI domain with Studio access&lt;/li&gt; 
 &lt;li&gt;An&amp;nbsp;&lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&amp;nbsp;&lt;/a&gt;bucket&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Fine-tune Qwen 2.5 7B Instruct in SageMaker AI&lt;/h2&gt; 
&lt;p&gt;To get started, we open Amazon SageMaker AI Studio and choose&amp;nbsp;&lt;strong&gt;Models&lt;/strong&gt;&amp;nbsp;in the left navigation pane to browse the foundation models (FMs) that are available for customization.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127279" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-20015-image1.jpg" alt="Amazon SageMaker Studio Models page showing featured foundation models from Amazon, Meta, and Qwen with a Customize model dropdown menu expanded, revealing options to Customize with UI, AI Agent (Preview), and Code." width="2560" height="1393"&gt;&lt;/p&gt; 
&lt;p&gt;In the&amp;nbsp;&lt;strong&gt;Customize model&lt;/strong&gt;&amp;nbsp;menu, select&amp;nbsp;&lt;strong&gt;Qwen 2.5 7B Instruct&lt;/strong&gt;, and choose&amp;nbsp;&lt;strong&gt;Customize with UI&lt;/strong&gt;. This opens the customization configuration page where you select your technique, point to your training data and reward function, and configure hyperparameters. We selected&amp;nbsp;&lt;strong&gt;Reinforcement Learning with Verifiable Rewards (RLVR)&lt;/strong&gt;&amp;nbsp;as our customization technique.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127278" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-20015-image2.jpg" alt="Amazon SageMaker Studio model customization form for Qwen2.5-7B-Instruct showing the Customization technique dropdown with Reinforcement Learning with Verifiable Rewards (RLVR) selected, along with options for reward functions, dataset upload, S3 output location, and batch size." width="936" height="664"&gt;&lt;/p&gt; 
&lt;h2&gt;Prepare your training data&lt;/h2&gt; 
&lt;p&gt;A tool calling dataset needs to teach more than correct API invocations. Production agents face three distinct situations:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;The user provides enough information, and the model should call a tool.&lt;/li&gt; 
 &lt;li&gt;The user’s request is missing required parameters, and the model should ask for clarification.&lt;/li&gt; 
 &lt;li&gt;The request is harmful or out of scope, and the model should refuse.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;We generated 1,500 synthetic training examples from our tool schemas (weather, flights, translation, currency conversion, statistics) using &lt;a href="https://kiro.dev/" target="_blank" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;, the Amazon AI-powered IDE, to produce prompts with realistic variation in phrasing and specificity across the three behaviors. Here’s an example of the prompt we used:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Generate 1,500 JSONL training examples for RLVR tool-calling&lt;/code&gt;&lt;br&gt; &lt;code&gt;fine-tuning across 5 tool schemas: get_weather_forecast,&lt;/code&gt;&lt;br&gt; &lt;code&gt;search_flights, translate_text, currency_convert, and&lt;/code&gt;&lt;br&gt; &lt;code&gt;get_statistics.&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Each line must follow this format:&lt;/code&gt;&lt;br&gt; &lt;code&gt;{"prompt": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}], "reward_model": {"ground_truth": "..."}}&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Distribute examples across three behaviors:&lt;/code&gt;&lt;br&gt; &lt;code&gt;1. Execute (60%): User provides all required params → ground_truth is the tool call JSON&lt;/code&gt;&lt;br&gt; &lt;code&gt;2. Clarify (25%): User is missing required params → ground_truth is a clarifying question&lt;/code&gt;&lt;br&gt; &lt;code&gt;3. Refuse (15%): Request is harmful or out of scope → ground_truth is a polite refusal&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Vary phrasing between formal, casual, and terse.&lt;/code&gt;&lt;br&gt; &lt;code&gt;Output valid JSONL only, no commentary.&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;This is a practical path for teams that don’t yet have production logs to draw from. For organizations already running agentic workflows, real user prompts and tool calls from production will yield even higher-quality training data.&lt;/p&gt; 
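&lt;p&gt;Before uploading, it helps to sanity-check the generated file. The following sketch (our own helper, not part of the SageMaker AI workflow) verifies that each line has the fields shown in the prompt above and tallies the behavior mix, treating a ground truth that parses as a JSON list as a tool call and anything else as natural language:&lt;/p&gt;

```python
import json
from collections import Counter

def validate_jsonl(lines):
    """Check each training example has the expected fields and tally behaviors.

    Behavior is inferred heuristically: a ground truth that parses as a JSON
    list counts as a tool call ('execute'); anything else is natural language
    (a clarification or refusal).
    """
    counts = Counter()
    for i, line in enumerate(lines, 1):
        ex = json.loads(line)
        assert "prompt" in ex and "reward_model" in ex, f"line {i}: missing field"
        assert any(m["role"] == "user" for m in ex["prompt"]), f"line {i}: no user turn"
        gt = ex["reward_model"]["ground_truth"]
        try:
            parsed = json.loads(gt)
            counts["execute" if isinstance(parsed, list) else "other"] += 1
        except json.JSONDecodeError:
            counts["natural_language"] += 1
    return counts
```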
&lt;p&gt;Each training example contains a prompt (a system instruction and user request) and a ground truth in the &lt;code&gt;reward_model&lt;/code&gt; field that the reward function scores against. Here are examples of each behavior.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Execute&lt;/strong&gt;&amp;nbsp;when the user provides everything the tool needs:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
&amp;nbsp;&amp;nbsp;"prompt": [
&amp;nbsp;&amp;nbsp; &amp;nbsp;{"role": "system", "content": "You are a helpful assistant. When using tools, respond with: [...]"},
&amp;nbsp;&amp;nbsp; &amp;nbsp;{"role": "user", "content": "Get weather for San Francisco"}
&amp;nbsp;&amp;nbsp;],
&amp;nbsp;&amp;nbsp;"reward_model": {
&amp;nbsp;&amp;nbsp; &amp;nbsp;"ground_truth": "[{"name": "get_weather_forecast", "arguments": {"city": "San Francisco"}}]"
&amp;nbsp;&amp;nbsp;}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Clarify&lt;/strong&gt;&amp;nbsp;when a required parameter is missing:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
&amp;nbsp;&amp;nbsp;"prompt": [
&amp;nbsp;&amp;nbsp; &amp;nbsp;{"role": "system", "content": "You are a helpful assistant. When using tools, respond with: [...]"},
&amp;nbsp;&amp;nbsp; &amp;nbsp;{"role": "user", "content": "Get the weather"}
&amp;nbsp;&amp;nbsp;],
&amp;nbsp;&amp;nbsp;"reward_model": {
&amp;nbsp;&amp;nbsp; &amp;nbsp;"ground_truth": "To provide you with the weather information, could you please specify the location?"
&amp;nbsp;&amp;nbsp;}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Execute with multiple parameters:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
&amp;nbsp;&amp;nbsp;"prompt": [
&amp;nbsp;&amp;nbsp; &amp;nbsp;{"role": "system", "content": "You are a helpful assistant. When using tools, respond with: [...]"},
&amp;nbsp;&amp;nbsp; &amp;nbsp;{"role": "user", "content": "Convert 50 EUR to USD"}
&amp;nbsp;&amp;nbsp;],
&amp;nbsp;&amp;nbsp;"reward_model": {
&amp;nbsp;&amp;nbsp; &amp;nbsp;"ground_truth": "[{"name": "currency_convert", "arguments": {"amount": 50, "from": "EUR", "to": "USD"}}]"
&amp;nbsp;&amp;nbsp;}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Notice the difference between “Get weather for San Francisco” (tool call) and “Get the weather” (clarification). This is the kind of distinction GRPO learns well. For each prompt, the model generates eight candidates, the reward function scores them, and the scores are averaged across the group. Candidates above the mean get reinforced, and over time the model picks up when to call and when to ask.&lt;/p&gt; 
&lt;h2&gt;Define your reward function&lt;/h2&gt; 
&lt;p&gt;The reward function defines what &lt;em&gt;correct&lt;/em&gt; means for our use case. We write it as a Python function that receives the model’s response and the ground truth from the training data and returns a numerical score. Ours extracts tool calls from the model’s response, parses them as JSON, and compares against the ground truth.&lt;/p&gt; 
&lt;p&gt;The full function handles response extraction, flexible parsing for alternative formats during early training, and edge cases around JSON type mismatches. Here is the core scoring logic:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;# After extracting and parsing tool calls from model response and ground truth:

# Compare tool names
pred_names = {tool.get('name', '') for tool in pred_tools}
gt_names = {tool.get('name', '') for tool in gt_tools}

if pred_names == gt_names:
&amp;nbsp;&amp;nbsp; &amp;nbsp;# Right function(s) - check if arguments also match
&amp;nbsp;&amp;nbsp; &amp;nbsp;perfect_match = True
&amp;nbsp;&amp;nbsp; &amp;nbsp;for pred_tool in pred_tools:
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;for gt_tool in gt_tools:
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;if pred_tool.get('name') == gt_tool.get('name'):
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;if pred_tool.get('arguments') != gt_tool.get('arguments'):
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;perfect_match = False
&amp;nbsp;&amp;nbsp; &amp;nbsp;score = 1.0 if perfect_match else 0.5
elif pred_names &amp;amp; gt_names:
&amp;nbsp;&amp;nbsp; &amp;nbsp;# Partial overlap in function names
&amp;nbsp;&amp;nbsp; &amp;nbsp;score = 0.5
else:
&amp;nbsp;&amp;nbsp; &amp;nbsp;# Wrong function entirely
&amp;nbsp;&amp;nbsp; &amp;nbsp;score = 0.0&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The three tiers (1.0, 0.5, and 0.0) give GRPO a richer learning signal. If several of the eight candidates get the function right but miss a parameter, the 0.5 score distinguishes them from completely wrong answers. This helps the model recognize that it’s on the right track.&lt;/p&gt; 
&lt;p&gt;For clarification and refusal cases where the ground truth is natural language (no &lt;code&gt;TOOLCALL&lt;/code&gt; tags), the reward function checks whether the model also avoided calling a tool. An unnecessary API call when the model should have asked a question earns 0.0.&lt;/p&gt; 
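&lt;p&gt;The logic for these natural-language cases can be sketched as follows. The tag check and the flat score here are simplifications of our full function, which also compares the response content against the ground truth:&lt;/p&gt;

```python
def score_natural_language_case(response: str, ground_truth: str) -> float:
    """Score a clarification/refusal case where the ground truth is plain text.

    Simplified sketch: any attempt to call a tool when the model should have
    asked or refused earns 0.0. The full reward function also compares the
    response content against the ground truth (unused here).
    """
    # Our training format wraps tool calls in TOOLCALL tags; a bare JSON
    # tool-call list is treated the same way.
    if "TOOLCALL" in response or response.strip().startswith("[{"):
        return 0.0
    # Credit non-empty natural-language answers
    return 1.0 if response.strip() else 0.0
```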
&lt;h2&gt;Configure and launch training&lt;/h2&gt; 
&lt;p&gt;On the customization configuration page, we point to our training dataset and reward function, then set our hyperparameters. We use a batch size of 128, learning rate of 5e-6, 3 epochs, and 8 rollouts per prompt.&lt;/p&gt; 
&lt;p&gt;The rollouts setting is the core GRPO mechanism. For each training prompt, the model generates eight different responses, the reward function scores each one, and responses that score above the group average get reinforced. Training and validation metrics are logged to MLflow. In this example, training takes approximately 40 minutes.&lt;/p&gt; 
&lt;h2&gt;Training results&lt;/h2&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127277" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-20015-image3.jpg" alt="Performance dashboard displaying five RLVR training metric charts: Train Reward Statistics trending upward from 0.28 to 0.70, Train Episode Length Distribution fluctuating between 30 and 35, Policy Entropy declining from 0.19 to 0.12, Gradient Norm decreasing from 0.10 to near 0.00, and Mean Advantage Estimate recovering from -0.08 to near 0.00 over 30 training steps. Long description: Screenshot of a dark-themed" width="2560" height="1284"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Train Reward Statistics&lt;/strong&gt;&amp;nbsp;(top left) is the chart to focus on. The mean reward across the rollouts started around 0.28 and climbed to 0.65–0.68 over 30 steps, more than doubling. The steepest gains happen in the first 10 steps as the model learns the basic tool calling format and decision structure. The curve then flattens after step 20 as training converges.&lt;/p&gt; 
&lt;p&gt;The other charts confirm healthy training:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Policy Entropy&lt;/strong&gt;&amp;nbsp;decreases, meaning the model is getting more confident rather than guessing.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Gradient Norm&lt;/strong&gt;&amp;nbsp;stabilizes, meaning updates are getting smaller and more refined.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Mean Advantage Estimate&lt;/strong&gt;&amp;nbsp;converges toward zero, indicating that the model’s policy is stabilizing and the average response quality is aligning with the reward baseline.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Evaluate the fine-tuned model&lt;/h2&gt; 
&lt;p&gt;After the training job is complete, you can see the models that you created in the&amp;nbsp;&lt;strong&gt;My Models&lt;/strong&gt;&amp;nbsp;tab. To expand the details, choose&amp;nbsp;&lt;strong&gt;View details&lt;/strong&gt;&amp;nbsp;on one of your models.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127276" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-20015-image4.jpg" alt="Amazon SageMaker Studio My Models page showing the Logged tab with two fine-tuned model cards: example-name-2lt4op at version v3 and example-name-2lt4o with no versions found, both created 4 days ago with View details buttons." width="2076" height="1396"&gt;&lt;/p&gt; 
&lt;p&gt;You can choose&amp;nbsp;&lt;strong&gt;Continue customization&lt;/strong&gt;&amp;nbsp;to iterate further by adjusting hyperparameters or training with a different technique. Choose&amp;nbsp;&lt;strong&gt;Evaluate&lt;/strong&gt;&amp;nbsp;to compare your customized model against the base model.&lt;/p&gt; 
&lt;p&gt;We evaluate on a separate test set of 300 examples that were excluded from training. The evaluation dataset covers the same three behaviors but includes tools, phrasings, and scenarios that the model hasn’t seen. It tests &lt;code&gt;search_restaurants&lt;/code&gt;,&amp;nbsp;&lt;code&gt;get_stock_price&lt;/code&gt;, and&amp;nbsp;&lt;code&gt;calculate_standard_deviation&lt;/code&gt;, none of which appeared during training. It also includes refusal cases for harmful requests like generating violent content or creating malware, testing whether the model generalizes safe behavior to new threats.&lt;/p&gt; 
&lt;p&gt;The evaluation runs standard NLP metrics alongside our custom reward function against the held-out set.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127275" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-20015-image5.jpg" alt="Evaluation metrics comparison table showing the custom RLVR-trained model outperforming the base model across all metrics: Rouge1 (65.21% vs 49.48%), Rouge2 (51.45% vs 35.12%), RougeL (59.19% vs 45.78%), Em (21% vs 11%), F1 (56.63% vs 42.19%), F1 Score Quasi (64.60% vs 45.98%), Bleu (100.00 vs 92.58), Tool Call Reward (0.55 vs 0.35), and Aggregate Reward Score (0.55 vs 0.35), evaluated on 300 documents." width="1524" height="760"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Tool Call Reward&amp;nbsp;&lt;/strong&gt;is our custom metric and the most direct measure of what we trained for. It jumped from 0.35 to 0.55, a 57% improvement. In practical terms, this means that the fine-tuned model makes the correct tool calling decision significantly more often. It calls the right function with the right parameters when it should, asks for clarification when information is missing, and refuses when appropriate.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;F1 Score Quasi&lt;/strong&gt;,&amp;nbsp;&lt;strong&gt;Rouge1&lt;/strong&gt;, and&amp;nbsp;&lt;strong&gt;RougeL&lt;/strong&gt;&amp;nbsp;all improved by 14–19 percentage points, reflecting better generation of correct function names, parameter keys, and values across the board.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Exact Match&lt;/strong&gt;&amp;nbsp;doubled from 11% to 21%. This metric requires character-for-character output matching, so even small formatting differences count as a miss. The 21% exact match alongside 64.6% F1 suggests that the model often gets the semantics right even when surface formatting differs slightly.&lt;/p&gt; 
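&lt;p&gt;The gap between the two metrics is visible on a single prediction pair. In this sketch (a simple whitespace-token F1, not the exact implementation the evaluation job uses), one stray space breaks exact match while token overlap stays high:&lt;/p&gt;

```python
def exact_match(pred: str, ref: str) -> bool:
    """Character-for-character comparison: any formatting difference fails."""
    return pred == ref

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_toks, ref_toks = pred.split(), ref.split()
    common = 0
    remaining = list(ref_toks)
    for t in pred_toks:
        if t in remaining:
            common += 1
            remaining.remove(t)
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

# One extra space before the closing brackets breaks exact match only
pred = '[{"name": "get_stock_price", "arguments": {"ticker": "AMZN"} }]'
ref = '[{"name": "get_stock_price", "arguments": {"ticker": "AMZN"}}]'
```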
&lt;p&gt;These results are all on held-out data with tools and scenarios that the model didn’t see during training. The improvements carry over to new tools, which means the model picked up general tool calling patterns rather than memorizing the training set.&lt;/p&gt; 
&lt;p&gt;Here are responses from the fine-tuned model on prompts that it didn’t see during training:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Tool execution:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&amp;nbsp;“What’s the stock price of AMZN?”&lt;br&gt; &lt;strong&gt;Model response:&lt;/strong&gt;&amp;nbsp;&lt;code&gt;[{"name": "get_stock_price", "arguments": {"ticker": "AMZN"}}]&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Clarification:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&amp;nbsp;“Find me a restaurant nearby”&lt;br&gt; &lt;strong&gt;Model response:&lt;/strong&gt;&amp;nbsp;“I’d be happy to help you find a restaurant! Could you please provide the cuisine type or location you’re interested in?”&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Near-miss:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&amp;nbsp;“What’s the standard deviation of 10, 20, 30, 40, 50?”&lt;br&gt; &lt;strong&gt;Model response:&lt;/strong&gt;&amp;nbsp;&lt;code&gt;[{"name": "calculate_standard_deviation", "arguments": {"numbers": "10, 20, 30, 40, 50"}}]&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;In the near-miss case, the model selected the correct tool but passed the numbers as a string instead of an array. This earns a 0.5 reward score (right function, wrong parameter format) and represents the kind of error that you’d target in the next iteration through additional training data or reward function refinement.&lt;/p&gt; 
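&lt;p&gt;One way to surface this failure mode systematically is to validate predicted arguments against a lightweight tool schema. The schema and helper below are hypothetical illustrations, not part of the SageMaker AI tooling:&lt;/p&gt;

```python
# Sketch: validate predicted tool-call arguments against a lightweight,
# hypothetical schema so type errors (string vs. array) surface in evaluation.

TOOL_SCHEMAS = {
    "calculate_standard_deviation": {"numbers": list},
    "get_stock_price": {"ticker": str},
}

def argument_type_errors(tool_call):
    """Return a list of (param, expected, actual) type mismatches for one call."""
    schema = TOOL_SCHEMAS.get(tool_call["name"], {})
    errors = []
    for param, expected in schema.items():
        value = tool_call["arguments"].get(param)
        if value is not None and not isinstance(value, expected):
            errors.append((param, expected.__name__, type(value).__name__))
    return errors

# The near-miss above: numbers passed as a string instead of an array
near_miss = {"name": "calculate_standard_deviation",
             "arguments": {"numbers": "10, 20, 30, 40, 50"}}
```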
&lt;h2&gt;Deploy the fine-tuned model&lt;/h2&gt; 
&lt;p&gt;With evaluation confirming improvement, deploy the fine-tuned model directly from the model details page. Choose&amp;nbsp;&lt;strong&gt;Deploy&lt;/strong&gt;, and select your deployment target: either a SageMaker AI endpoint or&amp;nbsp;&lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt;. You can also download the model weights from Amazon S3 for self-managed deployment.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127274" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-20015-image6.jpg" alt="Amazon SageMaker Studio training details page for an RLVR Tool Calling model (v1) based on Qwen2.5-7B-Instruct, showing completed training status with RLVR customization technique, a Deploy dropdown menu with SageMaker AI and Bedrock options, and hyperparameters including batch size 128, max epochs 3, and learning rate 0.000005." width="2560" height="615"&gt;&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we fine-tuned Qwen 2.5 7B Instruct for agentic tool calling using RLVR and GRPO through serverless model customization in Amazon SageMaker AI. We prepared a dataset spanning three tool-calling behaviors (execute, clarify, refuse), defined a tiered reward function, trained the model in about 40 minutes, evaluated on held-out data with unseen tools and scenarios, and deployed. The fine-tuned model improved tool call reward by 57% over the base model.&lt;/p&gt; 
&lt;p&gt;To push accuracy further, you can expand your training data with additional tools, edge cases, and multi-turn conversations to cover more of the scenarios that your agents encounter in production. You can also refine your reward function to penalize specific failure modes, like the string-vs-array parameter issue shown in the previous section, or add partial credit for other near-miss patterns. If you’re running agentic workflows, your production logs are a high-quality source of training data that can make the model even more effective for your specific use case. Beyond tool calling, RLVR applies to other reasoning tasks where correctness is verifiable, such as multi-step planning, structured data extraction, or code generation.&lt;/p&gt; 
&lt;p&gt;While this post walks through the UI workflow, an&amp;nbsp;&lt;a href="https://sagemaker.readthedocs.io/en/stable/model_customization/index.html" target="_blank" rel="noopener noreferrer"&gt;SDK for programmatic access&lt;/a&gt;&amp;nbsp;is also available. To learn more, see the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/customize-model.html" target="_blank" rel="noopener noreferrer"&gt;SageMaker AI model customization documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;To get started, try serverless AI model customization&amp;nbsp;in Amazon SageMaker AI with your own use cases.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-35407" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/04/19/lemull.jpg" alt="Lauren Mullennex" width="99" height="127"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Lauren Mullennex&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/laurenmull/"&gt;Lauren&lt;/a&gt; is a Senior GenAI/ML Specialist Solutions Architect at AWS. She has over a decade of experience in ML, DevOps, and infrastructure. She is a published author of a book on computer vision. Outside of work, you can find her traveling and hiking with her two dogs.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-thumbnail wp-image-127280" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/A88ECF8C-6F54-4D6C-B221-83E4EE609543-1-100x133.jpeg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Eric Saleh&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/eric-saleh/" target="_blank" rel="noopener"&gt;Eric&lt;/a&gt; is a&amp;nbsp;Senior GenAI Specialist at AWS, focusing on foundation model training and inference. He is partnering with top foundation model builders and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions with strategic customers. Before joining AWS, Eric led product teams building enterprise AI/ML solutions, which included frontier GenAI services for fine-tuning, RAG, and managed inference. He holds a master’s degree in Business Analytics from UCLA Anderson.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-thumbnail wp-image-94290" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2024/11/26/karisury-1-80x100.jpg" alt="" width="80" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Surya Kari&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/suryakari/"&gt;Surya&lt;/a&gt; is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the LLama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Building Intelligent Search with Amazon Bedrock and Amazon OpenSearch for hybrid RAG solutions</title>
		<link>https://aws.amazon.com/blogs/machine-learning/building-intelligent-search-with-amazon-bedrock-and-amazon-opensearch-for-hybrid-rag-solutions/</link>
					
		
		<dc:creator><![CDATA[Arpit Gupta]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 17:49:32 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Foundation models]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Strands Agents]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">51a43e05f36531e4c51d1b24bd1c2f6e70a1f90f</guid>

					<description>In this post, we show how to implement a generative AI agentic assistant that uses both semantic and text-based search using Amazon Bedrock, Amazon Bedrock AgentCore, Strands Agents and Amazon OpenSearch.</description>
										<content:encoded>&lt;p&gt;Agentic generative AI assistants represent a significant advancement in artificial intelligence, featuring dynamic systems powered by large language models (LLMs) that engage in open-ended dialogue and tackle complex tasks. Unlike basic chatbots, these implementations possess broad intelligence, maintaining multi-step conversations while adapting to user needs and executing necessary backend tasks.&lt;/p&gt; 
&lt;p&gt;These systems retrieve business-specific data in real-time through API calls and database lookups, incorporating this information into LLM-generated responses or providing it alongside them using predefined standards. This combination of LLM capabilities with dynamic data retrieval is known as Retrieval-Augmented Generation (RAG).&lt;/p&gt; 
&lt;p&gt;For example, an agentic assistant handling hotel booking would first query a database to find properties that match the guest’s specific requirements. The assistant would then make API calls to retrieve real-time information about room availability and current rates. This retrieved data can be handled in two ways: either the LLM can process it to generate a comprehensive response, or it can be displayed alongside an LLM-generated summary. Both approaches allow guests to receive precise, current information that’s integrated into their ongoing conversation with the assistant.&lt;/p&gt; 
&lt;p&gt;In this post, we show how to implement a generative AI agentic assistant that uses both semantic and text-based search using Amazon Bedrock, Amazon Bedrock AgentCore, Strands Agents and Amazon OpenSearch.&lt;/p&gt; 
&lt;h3&gt;Information retrieval approaches in RAG systems&lt;/h3&gt; 
&lt;p&gt;Generally speaking, information retrieval supporting RAG capabilities in agentic generative AI implementations revolves around real-time querying of the backend data sources or communicating with an API. The responses are then factored into the subsequent steps performed by the implementation. From a high-level system design and implementation perspective, this step is not specific to generative AI-based solutions: databases, APIs, and systems relying on integration with them have been around for a long time.&lt;/p&gt; 
&lt;p&gt;Certain information retrieval approaches have, however, emerged alongside agentic AI implementations, most notably semantic search-based data lookups. These retrieve data based on the meaning of the search phrase as opposed to keyword or pattern lexical similarity. Vector embeddings are precomputed and stored in vector databases, enabling efficient similarity calculations at query time. The core principle of Vector Similarity Search (VSS) involves finding the closest matches between these numerical representations using mathematical distance metrics such as cosine similarity or Euclidean distance. These functions are particularly efficient when searching through large corpora of data because the vector representations are precomputed.&lt;/p&gt; 
&lt;p&gt;Bi-encoder models are commonly used in this process. They separately encode the query and documents into vectors, enabling efficient similarity comparisons at scale without requiring the model to process query-document pairs together. When a user submits a query, the system converts it into a vector and searches for content vectors positioned closest to it in the high-dimensional space. This means that even if exact keywords don’t match, the search can find relevant results based on conceptual semantic similarity. Moreover, in situations where search terms are lexically but not semantically close to entries in the dataset, semantic similarity search will “prefer” semantically similar entries.&lt;/p&gt; 
&lt;p&gt;For example, given the vectorized dataset [“building materials”, “plumbing supplies”, “2×2 multiplication result”], the search string “2×4 lumber board” will most likely produce “building materials” as the top matching candidate. Combining semantic search with LLM-driven agents supports natural language alignment across the user-facing and backend data retrieval components of the solution: LLMs process the natural language input provided by the user, while semantic search retrieves data based on the natural language queries the LLM formulates over the course of the end user–agent conversation.&lt;/p&gt; 
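&lt;p&gt;The mechanics can be shown with the toy dataset above. The embeddings here are hand-picked three-dimensional stand-ins (real bi-encoder vectors have hundreds of dimensions), so this sketch only illustrates the ranking step:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the vectors over the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; in a real system these come from a bi-encoder model
corpus = {
    "building materials":        [0.9, 0.1, 0.0],
    "plumbing supplies":         [0.2, 0.9, 0.1],
    "2x2 multiplication result": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in embedding for "2x4 lumber board"
best = max(corpus, key=lambda k: cosine_similarity(query, corpus[k]))
# The semantically related entry ranks first despite no keyword overlap
```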
&lt;h3&gt;The challenge: When semantic search alone isn’t enough&lt;/h3&gt; 
&lt;p&gt;Consider a real-world scenario: a customer is searching for a hotel property and wants to find “a luxury hotel with ocean views in Miami, Florida.” While semantic search excels at understanding concepts like “luxury” and “ocean views,” it may struggle with precise location matching. The search might return highly relevant luxury oceanfront properties based on semantic similarity, but these could be in California, the Caribbean, or anywhere else with ocean access, not specifically in Miami as requested. This limitation arises because semantic search prioritizes conceptual similarity over exact attribute matching. In cases where users need both semantic understanding (luxury, ocean views) and precise filtering (Miami, Florida), relying solely on semantic search produces suboptimal results. This is where hybrid search becomes essential: it combines the semantic understanding of natural language descriptions with the precision of text-based filtering on structured attributes like location, dates, or other metadata. To address this, we introduce a hybrid search approach that performs both:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Semantic search &lt;/strong&gt;to understand natural language descriptions and find semantically similar content&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Text-based search&lt;/strong&gt; to facilitate precise matching on structured attributes like locations, dates, or identifiers&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;When a user provides a search phrase, an LLM first analyzes the query to identify specific attributes (such as location) and maps them to searchable values (for example, “Northern Michigan” → “MI”). These extracted attributes are then used as filters in conjunction with semantic similarity scoring, making sure that results are both conceptually relevant and precisely matched to the user’s requirements. The following tables provide a simplified view of the semantic search flow with clear text hotel descriptions provided for context:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Vector store data:&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;hotel-1&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt; &lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;: The Artisan Loft hotel anchors the corner of Green and Randolph Streets in Big City’s bustling Southwest Loop, occupying a thoughtfully renovated 1920s brick warehouse that celebrates the neighborhood’s industrial heritage. Guests find themselves mere steps from the famed Restaurant Row, with acclaimed dining spots and trendy boutiques dotting the surrounding blocks.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Description Vector:&lt;/strong&gt; […]&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Location&lt;/strong&gt;: Big City, USA&lt;/p&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;hotel-2&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt; &lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;: Perched on a rugged cliff overlooking the dramatic coastline of Big Sur, The Cypress Haven emerges from the landscape as if it were carved from the earth itself. This intimate 42-room sanctuary seamlessly integrates into its surroundings with living roof gardens, floor-to-ceiling windows, and natural materials including local stone and reclaimed redwood. Each spacious suite features a private terrace suspended over the Pacific, where guests can spot migrating whales while soaking in Japanese cedar ofuro tubs.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Description Vector&lt;/strong&gt;: […]&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Location&lt;/strong&gt;: Beach City, USA&lt;/p&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;hotel-3&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt; &lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;: Nestled in a centuries-old maple forest just outside the Berkshires, Woodland Haven Lodge offers an intimate escape where luxury meets mindful simplicity. This converted 19th-century estate features 28 thoughtfully appointed rooms spread across the main house and four separate cottages, each with wraparound porches and floor-to-ceiling windows that frame the surrounding woodlands.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Description Vector:&lt;/strong&gt; […]&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Location&lt;/strong&gt;: Quiet City, USA&lt;/p&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;hotel-4&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt; &lt;p&gt;&lt;strong&gt;Description&lt;/strong&gt;: Nestled in the heart of Central City’s bustling downtown district, the Skyline Oasis hotel stands as a beacon of luxury and modernity. This 45-story glass and steel tower offers breathtaking panoramic views of the city’s iconic skyline and the nearby Central River. With 500 elegantly appointed rooms and suites, the Skyline Oasis caters to both business travelers and tourists seeking a premium urban experience. The hotel boasts a rooftop infinity pool, a Michelin-starred restaurant, and a state-of-the-art fitness center. Its prime location puts guests within walking distance of Central City’s major attractions, including the Museum of Modern Art, the Central City Opera House, and the vibrant Riverfront District.&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Description Vector&lt;/strong&gt;: […]&lt;/p&gt; &lt;p&gt;&lt;strong&gt;Location&lt;/strong&gt;: Central City, USA&lt;/p&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;table class="styled-table" style="height: 93px" border="1px" width="540" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Search Phrase&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Looking for a hotel by the ocean&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Search Results&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;hotel-2&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Search example:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Search phrase: &lt;/strong&gt;“Looking for a hotel by the ocean”&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Semantic search result: &lt;/strong&gt;hotel-2 (The Cypress Haven)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Hybrid search example:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Search phrase: &lt;/strong&gt;“Looking for a hotel with a nice restaurant in downtown Central City”&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Hybrid search &lt;/strong&gt;result: hotel-4 (best match considering both semantic relevance and precise location)&lt;/li&gt; 
&lt;/ul&gt; 
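&lt;p&gt;The attribute extraction step described earlier (mapping a phrase like “Northern Michigan” to a state filter) can be sketched as follows. The JSON shape and field names are illustrative assumptions; in the real flow an LLM on Amazon Bedrock would produce the structured output from the user’s query:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;import json

# Hard-coded here for illustration; an LLM would generate this JSON.
llm_response = '{"query_text": "beachfront hotel", "state": "MI", "city": null}'

def build_filters(llm_json):
    """Turn the LLM's extracted attributes into OpenSearch term filters."""
    attrs = json.loads(llm_json)
    filters = []
    for field in ("state", "city"):
        value = attrs.get(field)
        if value:  # skip attributes the LLM could not extract
            filters.append({"term": {field: value}})
    return filters

print(build_filters(llm_response))  # prints: [{'term': {'state': 'MI'}}]
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The resulting filter list can then be attached to the vector query so that semantic scoring only runs over documents matching the extracted attributes.&lt;/p&gt; 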
&lt;p&gt;For more details on hybrid search implementations, refer to the &lt;a href="https://aws.amazon.com/blogs/machine-learning/amazon-bedrock-knowledge-bases-now-supports-hybrid-search/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock Knowledge Bases hybrid search blog post&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127331 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-18738-image-1.png" alt="Process flow diagram showing natural language query conversion to hybrid search terms using an LLM, resulting in vector store search results." width="731" height="421"&gt;&lt;/p&gt; 
&lt;h3&gt;Introducing an agent-based solution&lt;/h3&gt; 
&lt;p&gt;Consider a hotel search scenario where users have diverse needs. One user might ask “find me a cozy hotel,” requiring semantic understanding of “cozy.” Another might request “find hotels in Miami,” needing precise location filtering. A third might want “a luxury beachfront hotel in Miami,” requiring both approaches simultaneously. Traditional RAG implementations with fixed workflows cannot adapt dynamically to these varying requirements. Our scenario demands custom search logic that can combine multiple data sources and dynamically adapt retrieval strategies based on query characteristics. An agent-based approach provides this flexibility. The LLM itself determines the optimal search strategy by analyzing each query and selecting the appropriate tools.&lt;/p&gt; 
&lt;h3&gt;Why agents?&lt;/h3&gt; 
&lt;p&gt;Agent-based systems offer superior adaptability because the LLM determines the sequence of actions needed to solve problems, enabling dynamic decision routing, intelligent tool selection, and quality control through self-evaluation. The following sections show how to implement a generative AI agentic assistant that uses both semantic and text-based search with Amazon Bedrock, Amazon Bedrock AgentCore, Strands Agents, and Amazon OpenSearch Serverless.&lt;/p&gt; 
&lt;h3&gt;Architecture overview&lt;/h3&gt; 
&lt;p&gt;Figure 1 shows a modern, serverless architecture that you can use for an intelligent search assistant. It combines the foundation models in Amazon Bedrock, Amazon Bedrock AgentCore (for agent orchestration), and Amazon OpenSearch Serverless (for hybrid search capabilities).&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Client interaction layer&lt;/strong&gt;&lt;br&gt; Client applications interact with the system through Amazon API Gateway, which provides a secure, scalable entry point for user requests. When a user asks a question like “Find me a beachfront hotel in Northern Michigan,” the request flows through API Gateway to Amazon Bedrock AgentCore.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Agent orchestration with Amazon Bedrock AgentCore&lt;/strong&gt;&lt;br&gt; Amazon Bedrock AgentCore serves as the orchestration engine, managing the complete agent lifecycle and coordinating interactions between the user, the LLM, and available tools. AgentCore implements the agentic loop—a continuous cycle of reasoning, action, and observation—where the agent:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Analyzes&lt;/strong&gt; the user’s query using Bedrock’s foundation models&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Decides&lt;/strong&gt; which tools to invoke based on the query requirements&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Executes&lt;/strong&gt; the appropriate hybrid search tool with extracted parameters&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Evaluates&lt;/strong&gt; the results and determines if additional actions are needed&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Responds&lt;/strong&gt; to the user with synthesized information&lt;/li&gt; 
&lt;/ol&gt; 
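&lt;p&gt;The agentic loop above can be sketched with stub functions standing in for the Amazon Bedrock model call and the search tool (the decision logic, function names, and return shapes here are illustrative, not a specific AgentCore API):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;def llm_decide(query, observations):
    """Stub for the model call: decide the next action given what we know."""
    if not observations:
        return {"action": "search", "args": {"city": "Central City"}}
    return {"action": "respond"}

def search_tool(city):
    """Stub for the hybrid search tool."""
    return ["hotel in " + city]

def agentic_loop(query, max_steps=5):
    observations = []
    for _ in range(max_steps):  # reason, act, observe, repeated
        decision = llm_decide(query, observations)
        if decision["action"] == "search":
            observations.extend(search_tool(**decision["args"]))
        else:  # the agent judges the results sufficient
            return "Found: " + ", ".join(observations)
    return "Gave up after max_steps"

print(agentic_loop("hotel with a nice restaurant downtown"))
&lt;/code&gt;&lt;/pre&gt; 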
&lt;p&gt;Throughout this process, Amazon Bedrock Guardrails enforce content safety and policy adherence, maintaining appropriate responses.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Hybrid search with OpenSearch Serverless&lt;/strong&gt;&lt;br&gt; The architecture integrates Amazon OpenSearch Serverless as the vector store and search engine. OpenSearch stores both vectorized embeddings (for semantic understanding) and structured text fields (for precise filtering); this dual representation is what makes the hybrid search approach possible. When the agent invokes the hybrid search tool, OpenSearch executes queries that combine:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Semantic matching&lt;/strong&gt; using vector similarity for conceptual understanding&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Text-based filtering&lt;/strong&gt; for precise constraints like location or amenities&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Monitoring and security&lt;/strong&gt;&lt;br&gt; The architecture includes Amazon CloudWatch for monitoring system performance and usage patterns. AWS IAM manages access control and security policies across components.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why this architecture?&lt;/strong&gt;&lt;br&gt; This serverless design provides several key advantages:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Low-latency responses&lt;/strong&gt; for real-time conversational interactions&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt; to handle varying workloads without manual intervention&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cost-effectiveness&lt;/strong&gt; through pay-as-you-go pricing with no idle infrastructure&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Production-ready&lt;/strong&gt; with built-in monitoring, logging, and security features&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The combination of AgentCore’s orchestration capabilities with the hybrid search functionality of OpenSearch allows our assistant to dynamically adapt its search strategy based on user intent, something that rigid RAG pipelines cannot achieve.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127330 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-18738-image-2.png" alt="AWS Cloud architecture diagram showing an agentic loop system using Amazon Bedrock, API Gateway, OpenSearch Serverless, and various AWS services for intelligent search processing." width="1205" height="691"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;Figure 1&lt;/p&gt; 
&lt;p&gt;Figure Note: The code samples and architecture artifacts provided in this document are intended for demonstration and reference purposes only and are not production-ready.&lt;/p&gt; 
&lt;h3&gt;Implementation with Strands and Amazon Bedrock AgentCore&lt;/h3&gt; 
&lt;p&gt;To build our hybrid search agent, we use Strands, an open-source AI agent framework that simplifies developing LLM-powered applications with tool-calling capabilities. Strands allows us to define our hybrid search function as a “tool” that the agent can intelligently invoke based on user queries. For comprehensive details on Strands architecture and patterns, see the &lt;a href="https://strandsagents.com/" target="_blank" rel="noopener noreferrer"&gt;Strands documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Here’s how we define our hybrid search tool:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;from typing import Optional

from strands import tool

@tool
def hybrid_search(query_text: str, country: Optional[str] = None, city: Optional[str] = None):
    """
    Performs hybrid search combining semantic understanding with location filtering.
    The agent calls this when users provide both descriptive preferences and location.
    
    Args:
        query_text: Natural language description of what to search for
        country: Optional country filter
        city: Optional city filter
    """
    # Generate embeddings for semantic search
    vector = generate_embeddings(query_text)
    
    # Build hybrid query combining vector similarity and text filters
    query = {
        "bool": {
            "must": [
                {"knn": {"embedding_field": {"vector": vector, "k": 10}}}
            ],
            "filter": []
        }
    }
    
    # Add location filters if provided
    if country:
        query["bool"]["filter"].append({"term": {"country": country}})
    if city:
        query["bool"]["filter"].append({"term": {"city": city}})
    
    # Execute search in OpenSearch
    response = opensearch_client.search(index="hotels", body={"query": query})
    
    return format_results(response)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Once we’ve defined our tools, we integrate them with Amazon Bedrock AgentCore for deployment and runtime orchestration. Amazon Bedrock AgentCore enables you to deploy and operate highly effective agents securely at scale, using any framework and model, and provides purpose-built infrastructure and controls for running trustworthy agents.&lt;/p&gt; 
&lt;p&gt;For detailed information about integrating Strands with Amazon Bedrock AgentCore, see the &lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/blob/main/01-tutorials/01-AgentCore-runtime/01-hosting-agent/01-strands-with-bedrock-model/runtime_with_strands_and_bedrock_models.ipynb" target="_blank" rel="noopener noreferrer"&gt;AgentCore-Strands integration tutorial.&lt;/a&gt;&lt;/p&gt; 
&lt;h3&gt;Hybrid search implementation deep dive&lt;/h3&gt; 
&lt;p&gt;A key differentiator of our AI assistant solution is its advanced hybrid search capability. While many RAG implementations rely solely on semantic search, our architecture extends beyond this by using the full potential of OpenSearch, enabling semantic, text-based, and hybrid searches, all within a single, efficient query. The following sections explore the technical details of this implementation.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;The two-pronged implementation&lt;/strong&gt;&lt;br&gt; Our hybrid search implementation is built on two fundamental components: optimized data storage and versatile query handling.&lt;/p&gt; 
&lt;h4&gt;1. Optimized data storage&lt;/h4&gt; 
&lt;p&gt;The approach to data storage is important for efficient hybrid search.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data categorization&lt;/strong&gt;: We systematically categorize our data into two main types: 
  &lt;ul&gt; 
   &lt;li&gt;Semantic search candidates: This includes detailed descriptions, contexts, and explanations – content that benefits from understanding meaning beyond keywords.&lt;/li&gt; 
   &lt;li&gt;Text search candidates: This encompasses metadata, product identifiers, dates, and other structured fields.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Vector embedding&lt;/strong&gt;: For our semantic data, we use Amazon Bedrock’s embedding models. These transform text into high-dimensional vectors that capture semantic meaning effectively.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Text data optimization&lt;/strong&gt;: Text data is stored in its original format, optimized for rapid traditional queries.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Unified index structure&lt;/strong&gt;: Our OpenSearch index is designed to accommodate both vector embeddings and text fields concurrently, enabling flexible querying capabilities.&lt;/li&gt; 
&lt;/ul&gt; 
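&lt;p&gt;A unified index of this kind might be defined with a mapping along the following lines. The field names and the embedding dimension are assumptions; the dimension must match the output size of your chosen embedding model:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;# Example OpenSearch index body combining a knn_vector field (semantic
# search) with keyword and text fields (exact filtering). Illustrative only.
index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on the index
    "mappings": {
        "properties": {
            "description": {"type": "text"},
            "description_vector": {"type": "knn_vector", "dimension": 1024},
            "city": {"type": "keyword"},  # exact-match filter field
            "country": {"type": "keyword"},
        }
    },
}

# With the opensearch-py client, the index could then be created with:
# client.indices.create(index="hotels", body=index_body)
&lt;/code&gt;&lt;/pre&gt; 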
&lt;h4&gt;2. Versatile search functionality&lt;/h4&gt; 
&lt;p&gt;Building on our optimized data storage, we’ve developed a comprehensive search function that our AI agent can utilize effectively:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Adaptive search types&lt;/strong&gt;: Our search function is designed to perform semantic, text, or hybrid searches as required by the agent.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Semantic search implementation&lt;/strong&gt;: For meaning-focused queries, we generate query embeddings using Amazon Bedrock and perform a k-NN (k-Nearest Neighbors) search in the vector space.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Text search capabilities&lt;/strong&gt;: When precise matching is necessary, we use OpenSearch’s robust text query functionalities, including exact and fuzzy matching options.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Hybrid search execution&lt;/strong&gt;: This is where we combine vector similarity with text matching in a unified query. Using OpenSearch’s bool query, we can adjust the balance between semantic and text relevance as needed.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Result integration&lt;/strong&gt;: Regardless of the search type, our system consolidates and ranks results based on overall relevance, combining semantic understanding with precise text matching.&lt;/li&gt; 
&lt;/ul&gt; 
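&lt;p&gt;As a sketch of how that balance between semantic and text relevance can be adjusted (the field names, weights, and use of a should clause are assumptions, not the exact production query), an OpenSearch bool query can boost the text clause relative to the vector clause:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;def build_weighted_query(vector, city, text_boost=1.5):
    """Bool query mixing knn scoring with a boosted match clause.
    Raising text_boost shifts the ranking toward exact text relevance."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"knn": {"description_vector": {"vector": vector, "k": 10}}},
                    {"match": {"city": {"query": city, "boost": text_boost}}},
                ]
            }
        }
    }

q = build_weighted_query([0.1, 0.2, 0.3], "Central City", text_boost=2.0)
print(q["query"]["bool"]["should"][1]["match"]["city"]["boost"])  # prints: 2.0
&lt;/code&gt;&lt;/pre&gt; 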
&lt;p&gt;Reference pseudo code for hybrid search implementation:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;def hybrid_search(query_text, country, city, search_type="hybrid"):
    """
    Hybrid search combining semantic and text-based search with location filtering
    """

    # 1. Generate embeddings for semantic search
    if search_type in ["semantic", "hybrid"]:
        vector = generate_embeddings(query_text)
    
    # 2. Build search query based on type
    if search_type == "semantic":
        query = build_semantic_query(vector)
    elif search_type == "text":
        query = build_text_query(country, city)
    else:  # hybrid search
        query = build_hybrid_query(vector, country, city)
    
    # 3. Execute search
    response = search_opensearch(query)
    
    # 4. Process and return results
    return format_results(response)

# Example usage:
results = hybrid_search(
    query_text="luxury hotel",
    country="USA",
    city="Miami"
)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;OpenSearch supports multiple query types including text-based search, vector search (knn), and hybrid approaches that combine both methods. For detailed information about available query types and their implementations, refer to the &lt;a href="https://docs.opensearch.org/latest/query-dsl/" target="_blank" rel="noopener noreferrer"&gt;OpenSearch query documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Significance of the hybrid approach&lt;/h3&gt; 
&lt;p&gt;The hybrid approach significantly enhances our AI assistant’s capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;It supports highly accurate information retrieval, considering both context and content.&lt;/li&gt; 
 &lt;li&gt;It adapts to various query types, maintaining consistent performance.&lt;/li&gt; 
 &lt;li&gt;It provides more relevant and comprehensive responses to user inquiries.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;In the domain of AI-powered search, our hybrid approach represents a significant advancement. It offers a level of flexibility and accuracy that substantially improves our assistant’s ability to retrieve and process information effectively.&lt;/p&gt; 
&lt;h3&gt;Real-life use cases&lt;/h3&gt; 
&lt;p&gt;Hybrid search is applicable across many domains, including:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Real estate and property: Property search combining lifestyle preference understanding (“family-friendly”) with exact location and amenity filtering.&lt;/li&gt; 
 &lt;li&gt;Legal and professional services: Case law research combining conceptual legal similarity with precise jurisdiction and date filtering for comprehensive legal research.&lt;/li&gt; 
 &lt;li&gt;Healthcare and medical: Care teams can ask for “patients with chronic conditions requiring similar treatment protocols as John Doe,” combining semantic understanding of treatment complexity with exact medical record matching.&lt;/li&gt; 
 &lt;li&gt;Media and entertainment: Content discovery systems combining exact genre filtering with semantic plot understanding.&lt;/li&gt; 
 &lt;li&gt;E-commerce and retail: Natural language product discovery with filter precision – “comfortable winter shoes” finds semantic matches while applying exact size, price, or brand filters.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These use cases demonstrate how hybrid search bridges the gap between natural language understanding and precise data filtering, enabling more intuitive and accurate information retrieval.&lt;/p&gt; 
&lt;h3&gt;Conclusion&lt;/h3&gt; 
&lt;p&gt;The integration of Amazon Bedrock, Amazon Bedrock AgentCore, Strands Agents, and Amazon OpenSearch Serverless represents a significant advancement in building intelligent search applications that combine the power of LLMs with sophisticated information retrieval techniques. This architecture blends semantic, text-based, and hybrid search capabilities to deliver more accurate and contextually relevant results than traditional approaches. By implementing an agent-based system using Amazon Bedrock AgentCore, state management and Strands tool abstractions, developers can create dynamic, conversational AI assistants that intelligently determine the most appropriate search strategies based on user queries. The hybrid search approach, which combines vector similarity with precise text matching, offers flexibility and accuracy in information retrieval, enabling AI systems to better understand user intent and deliver more comprehensive responses. As organizations continue to build AI solutions, this architecture provides a scalable, secure foundation that uses the full potential of AWS services while maintaining the adaptability needed for complex, real-world applications.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;&lt;strong&gt;About the authors&lt;/strong&gt;&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127334 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-18738-image-3.png" alt="" width="100" height="104"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Arpit Gupta&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Arpit Gupta&lt;/strong&gt; is a Data Architect at AWS Professional Services with a focus on data analytics. He specializes in developing data lakes, analytics solutions, and generative AI applications in the cloud, helping organizations transform their data into actionable business insights. His passions extend from the digital to the physical realm – from tennis courts to the kitchen and exploring new destinations with family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127333 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-18738-image-4.jpeg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ashish Bhagam&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Ashish Bhagam &lt;/strong&gt;is a Data Architect with AWS Professional Services Analytics Practice. He helps customers design and implement scalable data solutions and modernize their data architectures. Outside of work, he enjoys watching cricket matches and spending quality time with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127332 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/30/ml-18738-image-5.jpeg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ross Gabay&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Ross Gabay&lt;/strong&gt; was a Principal Data Architect in AWS Professional Services with a focus on Graph Databases and GenAI data analytics. He specializes in developing Graph DB – centric and GenAI solutions.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>From isolated alerts to contextual intelligence: Agentic maritime anomaly analysis with generative AI</title>
		<link>https://aws.amazon.com/blogs/machine-learning/from-isolated-alerts-to-contextual-intelligence-agentic-maritime-anomaly-analysis-with-generative-ai/</link>
					
		
		<dc:creator><![CDATA[Nikita Kozodoi]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 17:48:54 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[AWS Step Functions]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<guid isPermaLink="false">7216c8112ee88a842ea7b4265928b7017ae86260</guid>

					<description>This blog post demonstrates how Windward helps enhance and accelerate alert investigation processes by combining geospatial intelligence with generative AI, enabling analysts to focus on decision-making rather than data collection.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;This post is co-written with Arad Ben Haim and Hannah Danan Moise from Windward.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://windward.ai/" target="_blank" rel="noopener noreferrer"&gt;Windward&lt;/a&gt; is a leading Maritime AI&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; company, delivering mission-grade, multi-source intelligence for maritime-based operations. By fusing Automatic Identification System (AIS) data, remote sensing signals, proprietary AI models, and generative AI, Windward provides a 360° view of global maritime activity so defense and intelligence agencies, law enforcement, and commercial leaders can anticipate threats, protect critical assets, and stay in control at sea.&lt;/p&gt; 
&lt;p&gt;This blog post demonstrates how Windward helps enhance and accelerate alert investigation processes by combining geospatial intelligence with generative AI, enabling analysts to focus on decision-making rather than data collection. Prior to using Windward, maritime analysts spent hours manually gathering and correlating complex data to understand vessel behavior anomalies: unusual activity spikes, unexpected movements, deviations from known patterns. It required significant time and deep domain expertise. Windward’s Maritime AI&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; automates this process, surfacing context and implications so analysts and companies can make informed decisions about maritime risks and opportunities with speed and precision.&lt;/p&gt; 
&lt;h2&gt;Challenge&lt;/h2&gt; 
&lt;p&gt;Maritime analysts rely on Windward’s system to stay ahead of complex global threats. As part of Windward’s ongoing commitment to facilitate a “mission-ready” user experience, the company continuously evolves how users move from detection to decision-making. While &lt;a href="https://windward.ai/solutions/early-detection/" target="_blank" rel="noopener noreferrer"&gt;Windward Early Detection&lt;/a&gt; successfully identifies suspicious patterns, Windward further accelerated situational awareness by making the investigative process more fluid and automated.&lt;/p&gt; 
&lt;p&gt;To optimize the analytical workflow, Windward sought to enhance the correlation of external context through three key strategic improvements:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Unified Workflow:&lt;/strong&gt; Minimizing the need to consult external data sources, facilitating a continuous and focused analytical environment.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Expertise Optimization:&lt;/strong&gt; Automating the collection of weather, news, and alert data to allow domain experts to dedicate more time to strategic interpretation.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Comprehensive Coverage:&lt;/strong&gt; Streamlining the synthesis of information to enable more rapid and in-depth investigation of multiple alerts simultaneously.&lt;/p&gt; 
&lt;p&gt;As a core component of MAI Expert&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt;, the first generative AI maritime agent, Windward partnered with the AWS Generative AI Innovation Center to deliver a solution that automatically contextualizes maritime anomalies. This collaboration helped enhance the user experience by correlating alerts with relevant public and proprietary data, integrating these findings seamlessly with Windward’s internal models, and using generative AI to help deliver comprehensive, actionable risk assessments.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;In collaboration with AWS, Windward developed a multi-step AI-powered solution that automatically fetches relevant data from a variety of internal and external data sources and uses this information to generate a textual description that contextualizes maritime anomaly events. Figure 1 depicts the end-to-end architecture of the solution deployed to AWS.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127029 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/ML-18948-image-1.png" alt="Architecture diagram for windward aws blog" width="851" height="547"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;Figure 1. Solution architecture&lt;/p&gt; 
&lt;p&gt;Given an anomaly identified in the &lt;a href="https://windward.ai/solutions/early-detection/" target="_blank" rel="noopener noreferrer"&gt;Windward Early Detection&lt;/a&gt; system, the solution extracts relevant metadata from the anomaly event using Windward’s internal database. The metadata includes the anomaly timestamp, region coordinates, anomaly type, vessel class, and other relevant information.&lt;/p&gt; 
&lt;p&gt;Next, the anomaly metadata is passed to the agentic analysis system powered by large language models (LLMs) on Amazon Bedrock. The multi-step anomaly analysis pipeline is orchestrated using &lt;a href="https://aws.amazon.com/step-functions/" target="_blank" rel="noopener noreferrer"&gt;AWS Step Functions&lt;/a&gt;. In the first step, the system queries multiple, diverse external data sources to provide relevant background on the anomaly, which is a key part of creating new value for our customers. These sources include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Real-time news feed:&lt;/strong&gt; Alerts and event signals discovered from public data are fetched and filtered based on the maritime anomaly’s time and location.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Intelligent web search:&lt;/strong&gt; The system uses large language models to generate precise search queries, retrieving real-time web search results that provide up-to-date context for the anomaly.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Weather data:&lt;/strong&gt; An external API is used to retrieve relevant weather data, such as temperature, wind speed, and precipitation, for the anomaly’s location and time.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Each data source is queried using a separate &lt;a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda function&lt;/a&gt;. After retrieving the data from the three sources, the pipeline moves to the second step, in which a separate LLM, powered by Anthropic’s Claude through Amazon Bedrock, examines the data items and decides whether there is a need to fetch additional web search results. The LLM is instructed to make the decision after cross-checking the anomaly data against the retrieved data items and judging whether the data retrieved so far is sufficient to explain the anomaly or if some aspects related to the event are missing. The LLM generates either a new search query or a command to move to the next step of the pipeline. The Lambda function parses the LLM output and optionally triggers the web search function again to retrieve additional news that might provide important context about the anomaly, appending it to the previous search results. If there are no new search queries, the Step Functions workflow proceeds to the next Lambda function in the pipeline.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127030 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/ML-18948-image-2.png" alt="Diagram of flow through self-reflection" width="847" height="75"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;Figure 2. Self-reflection logic&lt;/p&gt; 
&lt;p&gt;After running self-reflection and additional data retrieval, the system performs two filtering and ranking steps to remove news items that are not related to the considered anomaly. First, it uses a re-ranking AI model, Amazon Rerank, which sorts the data items according to their relevance to the anomaly. This step is geared toward maintaining high recall, focusing on removing the most irrelevant data items to reduce the set of candidate items to process in the next stage. Second, each of the top-ranked items is further scored by the LLM across multiple dimensions, including time, location, matching vessel type, and others. The system assigns relevance scores between 0 and 100 and only keeps data items with a relevance score above a threshold determined by the solution developers. This second step is geared toward high precision, making sure only the most relevant news items are kept. The top-ranked data and news items are passed to the next step of the solution pipeline.&lt;/p&gt; 
&lt;p&gt;Finally, the pipeline uses another LLM to generate, from the top-ranked data items, a contextualized report on the anomaly that summarizes its potential causes, risks, and implications. The concise report is written for Windward’s customers and directly cites the data sources used, which allows users to verify the information and learn additional details by following the links. Figure 3 provides an example of what the generated report looks like for one of the vessel activity anomalies.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127031 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/ML-18948-image-3.png" alt="Maritime intelligence product" width="833" height="552"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;Figure 3. Example Anomaly Report&lt;/p&gt; 
&lt;h2&gt;Evaluation&lt;/h2&gt; 
&lt;p&gt;The end-to-end system is evaluated on a set of historical maritime anomalies. The evaluation consists of several stages. First, the summaries are automatically evaluated using an LLM-as-a-judge approach, a method that included human-alignment work for the LLM judges. The judge uses a set of six predefined criteria, including credibility, data quality, source diversity, coherence, and ethical bias. The judge evaluates each dimension on a scale between 1 and 100 and assigns the scores to each report. Figure 4 depicts example scores assigned to one of the generated reports by the LLM judge.&lt;/p&gt; 
&lt;p&gt;Second, we calculate several deterministic metrics on the report quality, including the length of the report in characters and the number of data sources explicitly cited in the text. These metrics help to judge the size and the credibility of the generated explanation. Finally, the selected summaries are also evaluated by human experts, who cross-check the generated summaries and retrieved data sources against their own search results and domain understanding.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-127032 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/ML-18948-image-4.png" alt="Explaination of LLM Judge outputs" width="355" height="434"&gt;&lt;/p&gt; 
&lt;p style="text-align: center"&gt;Figure 4. Example LLM-as-a-judge scores&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;The initial agentic solution presented in this blog marked an important milestone in the development of Windward’s MAI Expert&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt;. Building on Windward’s already powerful system, this enhancement helped accelerate maritime alert investigation and enabled analysts to focus even more on decision-making rather than data collection.&lt;/p&gt; 
&lt;p&gt;This approach combined geospatial intelligence with generative AI to streamline what was previously a manual, time-intensive process. High-quality anomaly summaries generated by the system helped analysts better understand the context of maritime events—unusual activity spikes, unexpected movements, deviations from known patterns—and make informed decisions about corresponding risks and opportunities.&lt;/p&gt; 
&lt;p&gt;These capabilities expanded Windward’s value proposition across user segments. For existing users with deep maritime expertise, they further helped streamline workflows and reduce the time needed to derive relevant context. For users with limited maritime expertise, they opened new possibilities by surfacing critical insights without requiring manual correlation of complex datasets.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127047 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/nikita.jpg" alt="" width="100" height="150"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Nikita Kozodoi&lt;/h3&gt; 
&lt;p&gt;Nikita Kozodoi, PhD is a Senior Applied Scientist at the AWS Generative AI Innovation Center working on the frontier of AI research and business. Nikita builds custom generative AI solutions to solve real-world business problems for AWS customers across industries and holds a PhD in Machine Learning.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127046 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/jack.png" alt="" width="100" height="135"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jack Butler&lt;/h3&gt; 
&lt;p&gt;Jack Butler is an Applied Scientist at Amazon Web Services (AWS), leading innovative projects at the AWS Generative AI Innovation Centre. He has a strong background in language modeling and applied AI research across a wide variety of enterprise and startup customers.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127035 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/ML-18948-image-7.png" alt="Headshot of Marion, Principal AI Strategist at AWS, specializing in enterprise AI implementation" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Marion Eigner&lt;/h3&gt; 
  &lt;p&gt;Marion is Principal AI Strategist at AWS with a decade of experience taking enterprise AI from idea to production across Financial Services, Healthcare, Manufacturing, Media &amp;amp; Entertainment, and Public Sector with both Fortune 500s and fast-growing startups.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127045 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/Adobe-Express-file-1.jpg" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Hannah Danan Moise&lt;/h3&gt; 
  &lt;p&gt;Hannah Danan Moise is a Data Science Team Leader with nearly a decade of experience at the frontier of applied AI and maritime intelligence. Having spent eight years architecting and scaling Windward’s core predictive systems, Hannah specializes in transforming high-velocity, multi-source behavioral data into actionable strategic insights. Her expertise lies in deploying advanced machine learning frameworks and agentic AI to solve intricate real-world challenges, consistently driving measurable business impact for global industries.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-127049 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/24/Adobe-Express-file-1-1.jpg" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Arad Ben Haim&lt;/h3&gt; 
  &lt;p&gt;Arad Ben Haim is a Senior Data Scientist at Windward, working at the frontier of applied AI and maritime intelligence. Arad designs and deploys advanced machine learning and predictive systems that transform large-scale behavioral data into actionable insights, solving complex real-world problems and driving measurable business impact for global customers across industries.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Connecting MCP servers to Amazon Bedrock AgentCore Gateway using Authorization Code flow</title>
		<link>https://aws.amazon.com/blogs/machine-learning/connecting-mcp-servers-to-amazon-bedrock-agentcore-gateway-using-authorization-code-flow/</link>
					
		
		<dc:creator><![CDATA[Arko Dutta]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 14:41:46 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">8c4b3e236eb8bf594f5e9442a360b6265fc50004</guid>

					<description>Amazon Bedrock AgentCore Gateway provides a centralized layer for managing how AI agents connect to tools and MCP servers across your organization. In this post, we walk through how to configure AgentCore Gateway to connect to an OAuth-protected MCP server using the Authorization Code flow.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/gateway.html" target="_blank" rel="noopener"&gt;Amazon Bedrock AgentCore Gateway&lt;/a&gt; provides a centralized layer for managing how AI agents connect to tools and MCP servers across your organization. It consolidates authentication, observability, and policy enforcement into a single endpoint, removing the need to configure and secure each MCP server connection individually.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through how to configure AgentCore Gateway to connect to an OAuth-protected MCP server using the Authorization Code flow.&lt;/p&gt; 
&lt;h2&gt;Using AgentCore Gateway as an MCP server endpoint&lt;/h2&gt; 
&lt;p&gt;As organizations scale their AI agent deployments, the number of MCP servers that each team relies on grows quickly. Developers are adopting Amazon Bedrock AgentCore Gateway as a single endpoint for accessing multiple MCP servers. Instead of configuring each MCP server individually per IDE, teams point to one Gateway URL for consistent access to their full MCP toolset across tools.&lt;/p&gt; 
&lt;p&gt;This pattern is accelerating as teams move beyond custom MCP servers and adopt production-grade third-party ones, like those from &lt;a href="https://docs.aws.amazon.com/aws-mcp/latest/userguide/what-is-mcp-server.html" target="_blank" rel="noopener"&gt;AWS&lt;/a&gt;, &lt;a href="https://github.com/github/github-mcp-server" target="_blank" rel="noopener"&gt;GitHub&lt;/a&gt;, &lt;a href="https://developer.salesforce.com/blogs/2025/06/introducing-mcp-support-across-salesforce" target="_blank" rel="noopener"&gt;Salesforce&lt;/a&gt;, and &lt;a href="https://docs.databricks.com/aws/en/generative-ai/mcp/" target="_blank" rel="noopener"&gt;Databricks&lt;/a&gt;. Many of these MCP servers are protected by their primary identity provider through federation, while others are secured by their own authorization servers. As the number of MCP servers per organization grows, managing connections, authentication, and routing at the IDE level becomes unsustainable. AgentCore Gateway centralizes this complexity, giving teams a single control plane for MCP access while giving developers a frictionless experience.&lt;/p&gt; 
&lt;p&gt;Many enterprise MCP servers require OAuth 2.0 authorization, where the agent must authenticate on behalf of a user before invoking tools. AgentCore Gateway now supports the OAuth 2.0 Authorization Code flow through &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/identity.html" target="_blank" rel="noopener"&gt;Amazon Bedrock AgentCore Identity&lt;/a&gt;. With this, your agents can securely access protected MCP servers without embedding credentials in application code or managing the token lifecycle manually.&lt;/p&gt; 
&lt;h2&gt;Key terms&lt;/h2&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;AgentCore Gateway user&lt;/strong&gt; – The end user who consumes the tools in Amazon Bedrock AgentCore Gateway with MCP clients. Gateway users don’t manage the AgentCore Gateway itself. They use the single AgentCore Gateway URL to access the tools available to them.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Admin user&lt;/strong&gt; – The user that manages and maintains Amazon Bedrock AgentCore Gateway. This user is responsible for attaching MCP servers, tools, or APIs to the AgentCore Gateway so that AgentCore gateway users can consume them.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;MCP server&lt;/strong&gt; – In this post, we assume that the MCP server is protected by an OAuth 2.0 Authorization Code flow, which requires user interaction to complete authentication. This is distinct from machine-to-machine authentication methods such as Client Credentials or Token Exchange, where no user intervention is required. The patterns described in this post apply specifically to MCP servers that require user-delegated authorization.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;How Authorization Code flow works&lt;/h2&gt; 
&lt;p&gt;To support the Authorization Code grant type, AgentCore Gateway provides two methods for creating MCP server targets:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Implicit sync during MCP Server target creation&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p style="padding-left: 40px"&gt;In this method, the admin user completes the authorization code flow during &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_CreateGatewayTarget.html" target="_blank" rel="noopener"&gt;CreateGatewayTarget&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_UpdateGatewayTarget.html" target="_blank" rel="noopener"&gt;UpdateGatewayTarget&lt;/a&gt;, or &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_SynchronizeGatewayTargets.html" target="_blank" rel="noopener"&gt;SynchronizeGatewayTargets &lt;/a&gt;operations. This allows AgentCore Gateway to discover and cache the MCP server’s tools upfront.&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;Provide schema upfront during MCP Server targets creation&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p style="padding-left: 40px"&gt;With this method, admin users provide the tool schema directly during &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_CreateGatewayTarget.html" target="_blank" rel="noopener"&gt;CreateGatewayTarget &lt;/a&gt;or &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_UpdateGatewayTarget.html" target="_blank" rel="noopener"&gt;UpdateGatewayTarget &lt;/a&gt;operations, rather than AgentCore Gateway fetching them dynamically from the MCP server. AgentCore Gateway parses the provided schema and caches the tool definitions. This removes the need for the admin user to complete the authorization code flow during target creation or update. This is the recommended approach when human intervention isn’t possible during create/update operations. This method is beneficial when you don’t want to expose all the tools provided by the MCP server target.&lt;/p&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;strong&gt;Note&lt;/strong&gt;: Because tool schemas are provided upfront with this method, the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_SynchronizeGatewayTargets.html" target="_blank" rel="noopener"&gt;SynchronizeGatewayTargets&lt;/a&gt; operation isn’t supported. You can switch a target between Method 1 and Method 2 by updating the target configuration.&lt;/p&gt; 
&lt;p&gt;This means that AgentCore Gateway users can call &lt;code&gt;tools/list&lt;/code&gt; without being prompted to authenticate with the MCP server’s authorization server, because this operation returns the cached tools. The authorization code flow is only triggered when a Gateway user invokes a tool on that MCP server. This is particularly beneficial when multiple MCP servers are attached to a single Gateway. Users can browse the full tool catalog (cached tools) without authenticating to every MCP server and only complete the flow for the specific server whose tool they invoke.&lt;/p&gt; 
&lt;h3&gt;URL Session Binding&lt;/h3&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/oauth2-authorization-url-session-binding.html" target="_blank" rel="noopener"&gt;URL session binding&lt;/a&gt; verifies that the user who initiated the OAuth authorization request is the same user who granted consent. When AgentCore Identity generates an authorization URL, it also returns a session-URI. After the user completes consent, the browser redirects back to a callback URL with the session-URI. The application is then responsible for calling the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_CompleteResourceTokenAuth.html" target="_blank" rel="noopener"&gt;CompleteResourceTokenAuth&lt;/a&gt; API, presenting both the user’s identity and the session-URI. AgentCore Identity validates that the user who started the flow is the same user who completed it before exchanging the authorization code for an access token. This helps avoid a scenario where a user accidentally shares the authorization URL, and someone else completes the consent, which would grant access tokens to the wrong party. The authorization URL and session URI are only valid for 10 minutes, further limiting the window for misuse. Session binding applies during admin target creation (implicit sync) and during tool invocation.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;In this post, we show how to attach the GitHub MCP server to Amazon Bedrock AgentCore Gateway using Method 1 (admin-initiated sync during target creation) and Method 2 (providing the tool schema upfront during target creation). The accompanying code is available in &lt;a href="https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/02-AgentCore-gateway/05-mcp-server-as-a-target/03-authorization-code-flow/" target="_blank" rel="noopener"&gt;this repository&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;Complete the following prerequisites before working through this post.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;GitHub &lt;a href="https://docs.github.com/en/apps/oauth-apps/using-oauth-apps" target="_blank" rel="noopener noreferrer"&gt;OAuth Apps&lt;/a&gt; setup&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Go to &lt;a href="https://github.com/settings/apps" target="_blank" rel="noopener noreferrer"&gt;https://github.com/settings/apps&lt;/a&gt; → New GitHub App&lt;br&gt; &lt;img loading="lazy" class="alignnone wp-image-128161 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/09/Screenshot-2026-04-08-at-23.25.57.png" alt="" width="528" height="210"&gt;&lt;/li&gt; 
   &lt;li&gt;Fill in details: 
    &lt;ol type="a"&gt; 
     &lt;li&gt;&lt;strong&gt;GitHub App name&lt;/strong&gt;: AgentCore Gateway GitHub MCP&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Homepage URL &lt;/strong&gt;(&lt;em&gt;The full URL to your GitHub App’s website&lt;/em&gt;): The Homepage URL appears as a clickable link when users see your OAuth app, letting them learn more about your app. It helps users verify the legitimacy of the app requesting access to their GitHub account.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Authorization callback URL&lt;/strong&gt;: The Authorization callback URL (redirect URI) is the URL GitHub redirects the user to after they authorize (or deny) your OAuth app. For now, enter &lt;code&gt;https://example.com/auth&lt;/code&gt;; we will come back and change this value later.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Advanced Settings: &lt;/strong&gt;Here we cover the recommended defaults; however, make sure to follow security best practices based on your organization’s policies. 
      &lt;ol type="i"&gt; 
       &lt;li&gt;&lt;strong&gt;Expire user authorization tokens:&lt;/strong&gt; Disable – If enabled, this will allow AgentCore Identity to automatically refresh tokens for the user.&lt;/li&gt; 
       &lt;li&gt;&lt;strong&gt;Request user authorization (OAuth) during installation:&lt;/strong&gt; Disable.&lt;/li&gt; 
       &lt;li&gt;&lt;strong&gt;Device Flow: &lt;/strong&gt;Disable –&amp;nbsp;Allows authorization on devices that don’t have a browser (for example, CLI tools, smart TVs, CI environments).&lt;/li&gt; 
       &lt;li&gt;&lt;strong&gt;Webhook: &lt;/strong&gt;Disable.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;User permissions:&lt;/strong&gt; Use case dependent; keep the defaults for now. These are granted when the user goes through the OAuth authorization flow. Only request what you need: users see these permissions on the consent screen, and excessive permissions reduce trust.&lt;/li&gt; 
      &lt;/ol&gt; &lt;/li&gt; 
    &lt;/ol&gt; &lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;ul&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ul&gt; 
   &lt;li style="list-style-type: none"&gt;&lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Create GitHub App&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Make sure to note down the app &lt;strong&gt;Client ID&lt;/strong&gt; (different from the App ID).&lt;/li&gt; 
 &lt;li&gt;Under your OAuth app’s general settings, choose &lt;strong&gt;Generate a new client secret&lt;/strong&gt;. Make sure to note down the client secret, as GitHub only shows it once upon creation.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;IAM permissions:&lt;/strong&gt; You need appropriate IAM permissions to run the code from this blog post. These are the minimum &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-permissions.html" target="_blank" rel="noopener noreferrer"&gt;IAM permissions&lt;/a&gt; required.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Code repository:&lt;/strong&gt; First clone the &lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples.git" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;, and then open &lt;code&gt;github-mcp-server.ipynb&lt;/code&gt;. We recommend following the console instructions in this blog post to understand the concepts and then looking at the code walkthrough. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-bash"&gt;git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples.git

cd amazon-bedrock-agentcore-samples/01-tutorials/02-AgentCore-gateway/05-mcp-server-as-a-target/03-authorization-code-flow&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;GitHub credential provider:&lt;/strong&gt; In this step, we will set up the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/resource-providers.html" target="_blank" rel="noopener noreferrer"&gt;AgentCore Identity Credential Provider&lt;/a&gt;. On the Amazon Bedrock AgentCore console, go to AgentCore Identity and create an OAuth client.&lt;br&gt; &lt;img loading="lazy" class="alignnone size-full wp-image-127594" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-2.png" alt="" width="3456" height="1400"&gt;&lt;p&gt;&lt;/p&gt; 
  &lt;ol&gt; 
   &lt;li&gt;Provide a name for the OAuth Client, choose the &lt;strong&gt;included GitHub provider&lt;/strong&gt;, and fill in the GitHub OAuth App client ID and client secret.&lt;br&gt; &lt;img loading="lazy" class="alignnone size-full wp-image-127595" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-3.png" alt="" width="1157" height="587"&gt;&lt;/li&gt; 
   &lt;li&gt;Copy the AgentCore Identity OAuth client callback URL, and make sure to go back to &lt;a href="https://github.com/settings/apps" target="_blank" rel="noopener noreferrer"&gt;GitHub OAuth provider&lt;/a&gt; you created and update the Authorization callback URL.&lt;br&gt; &lt;img loading="lazy" class="alignnone size-full wp-image-127596" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-4.png" alt="" width="1459" height="562"&gt;&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Implicit sync during MCP Server target creation&lt;/h3&gt; 
&lt;p&gt;In this section, we describe how implicit sync works during MCP server target creation. Make sure that the AgentCore Gateway execution role has &lt;code&gt;GetWorkloadAccessTokenForUserId&lt;/code&gt; and &lt;code&gt;CompleteResourceTokenAuth&lt;/code&gt; permissions. First, let’s start by understanding the flow.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127597" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-5.png" alt="" width="1612" height="917"&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;The admin user calls &lt;code&gt;CreateGatewayTarget&lt;/code&gt;, providing the MCP server endpoint, the AgentCore Identity Credential Provider, and return URL. This tells AgentCore Gateway which MCP server to connect to and which credential provider to use for obtaining OAuth 2.0 tokens. This same flow also applies to &lt;code&gt;UpdateGatewayTarget&lt;/code&gt; and &lt;code&gt;SynchronizeGatewayTargets&lt;/code&gt; operations.&lt;/li&gt; 
 &lt;li&gt;AgentCore Gateway &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_GetWorkloadAccessTokenForUserId.html" target="_blank" rel="noopener"&gt;requests&lt;/a&gt; a workload access token from the AgentCore Identity Credential Provider, passing the AgentCore Gateway workload identity and a user ID in the format &lt;code&gt;{gatewayId}{targetId}{uuid}&lt;/code&gt;. This workload access token identifies the AgentCore Gateway as an authorized caller for subsequent credential operations.&lt;/li&gt; 
 &lt;li&gt;Using the workload access token, AgentCore Gateway &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_GetResourceOauth2Token.html" target="_blank" rel="noopener"&gt;requests&lt;/a&gt; an OAuth 2.0 access token from the AgentCore Identity Credential Provider. This provides the admin user with an authorization URL and a session-URI. At this stage, the target is in &lt;strong&gt;Needs Authorization&lt;/strong&gt; status.&lt;/li&gt; 
 &lt;li&gt;The admin opens the authorization URL in their browser, signs in, and grants the requested permissions to the AgentCore Gateway.&lt;/li&gt; 
 &lt;li&gt;After the admin grants consent, the OAuth 2.0 authorization server sends an authorization code to the AgentCore Identity Credential Provider’s registered callback endpoint.&lt;/li&gt; 
 &lt;li&gt;The credential provider redirects the admin browser to the return URL, with the session-URI. The admin application calls &lt;code&gt;CompleteResourceTokenAuth&lt;/code&gt;, presenting the user ID and the session-URI returned in step 3. The credential provider validates that the user who initiated the authorization flow (step 3) is the same user who completed consent. This prevents token hijacking if the authorization URL was accidentally shared. If the flow was initiated from the AWS Management Console, this step is handled automatically. If initiated from another context, the admin is responsible for calling the &lt;code&gt;CompleteResourceTokenAuth&lt;/code&gt; API directly.&lt;/li&gt; 
 &lt;li&gt;After successful session binding validation, the credential provider exchanges the authorization code with the OAuth 2.0 authorization server for an OAuth 2.0 access token.&lt;/li&gt; 
 &lt;li&gt;This access token is used to list the tools on MCP server target; returned tool definitions from the target are cached at AgentCore Gateway.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Note that a subsequent update or synchronization to the target won’t reuse the access token. Instead, AgentCore Identity will get a new access token from the authorization server.&lt;/p&gt; 
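&lt;p&gt;The first two steps of this flow can be sketched with the AWS SDK for Python (Boto3). The API names come from the references linked above; the parameter names, the credential provider name, and the helper that builds the gateway user ID are assumptions for illustration only, so the service calls are shown commented out.&lt;/p&gt;

```python
import uuid

def make_gateway_user_id(gateway_id: str, target_id: str) -> str:
    """Build the user ID AgentCore Gateway passes to AgentCore Identity.

    Format per the flow above: {gatewayId}{targetId}{uuid}. The exact
    concatenation is an assumption for illustration.
    """
    return f"{gateway_id}{target_id}{uuid.uuid4().hex}"

# Hypothetical sketch of steps 1-2; method and parameter names are
# assumptions based on the API references linked above.
# import boto3
# identity = boto3.client("bedrock-agentcore")
# workload_token = identity.get_workload_access_token_for_user_id(
#     workloadName="my-gateway-workload",            # gateway workload identity
#     userId=make_gateway_user_id("gw-123", "tgt-456"),
# )["workloadAccessToken"]
# resp = identity.get_resource_oauth2_token(
#     workloadIdentityToken=workload_token,
#     resourceCredentialProviderName="my-oauth-provider",
# )
# # Before the admin consents, resp carries an authorization URL and a
# # session URI instead of an access token (Needs Authorization status).

print(make_gateway_user_id("gw-123", "tgt-456"))
```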
&lt;h3&gt;Target creation&lt;/h3&gt; 
&lt;p&gt;First, let’s create an Amazon Bedrock AgentCore Gateway and target and see how implicit sync works during MCP server target creation.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127598" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-6.png" alt="" width="1925" height="770"&gt;&lt;/p&gt; 
&lt;p&gt;When creating an AgentCore Gateway, you must use MCP version &lt;code&gt;2025-11-25&lt;/code&gt; or later. Keep everything else default and select &lt;strong&gt;MCP server target&lt;/strong&gt;. Provide the MCP server endpoint, and for OAuth client, select the AgentCore Identity OAuth Client created during the prerequisites section.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127599" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-7.png" alt="" width="1980" height="1484"&gt;&lt;/p&gt; 
&lt;p&gt;Under additional configuration, make sure to select &lt;strong&gt;Authorization code grant (3LO)&lt;/strong&gt;. The Authorization code grant (3LO) option will be disabled if the AgentCore Gateway wasn’t created with MCP version &lt;code&gt;2025-11-25&lt;/code&gt; or later. Here, you must also provide the return URL. During the session binding process after the authorization code flow, users will be returned to this URL, both during implicit sync and tool invocation. You can override the return URL value during invocation. For more information, see &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/gateway-using-auth-ex-3lo.html" target="_blank" rel="noopener"&gt;Example: Authorization code grant&lt;/a&gt; in the Amazon Bedrock AgentCore Developer Guide. You can provide scopes and additional parameters such as audience when configuring the target. These parameters are included in the request when AgentCore Identity reaches out to the authorization server’s &lt;code&gt;/authorize&lt;/code&gt; endpoint.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127600" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-8.png" alt="" width="2210" height="1514"&gt;&lt;/p&gt; 
&lt;p&gt;After creating the target, the target will be in &lt;strong&gt;Needs authorization&lt;/strong&gt; status. At this point, admin users are required to complete the authorization request, either from the AWS console or by navigating to the authorization URL directly. It’s important to note that if the flow is completed from the AWS console, session binding is handled automatically. If initiated from another context, the admin is responsible for calling the &lt;code&gt;CompleteResourceTokenAuth&lt;/code&gt; API directly. For more information, see the code sample in &lt;a href="https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/02-AgentCore-gateway/05-mcp-server-as-a-target/03-authorization-code-flow/" target="_blank" rel="noopener"&gt;GitHub&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127601" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-9.png" alt="" width="3206" height="428"&gt;&lt;/p&gt; 
&lt;p&gt;This is what the consent flow looks like when initiated from the AWS Console.&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-127589-1" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/FLASH-3089/Consent+Screen.mp4?_=1"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;p&gt;After a few seconds, you will see that the target is in &lt;strong&gt;Ready&lt;/strong&gt; status with authorization status &lt;strong&gt;Authorized&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127602" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-10.png" alt="" width="3128" height="464"&gt;&lt;/p&gt; 
&lt;h3&gt;Provide the schema upfront during MCP server target creation&lt;/h3&gt; 
&lt;p&gt;In this section, we show how to provide the schema upfront during MCP server target creation. This is the recommended approach when human intervention isn’t possible during create or update operations.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-127603 size-full" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-11.png" alt="" width="2706" height="1086"&gt;&lt;/p&gt; 
&lt;p&gt;In this step, we create an Amazon Bedrock AgentCore Gateway and target and provide the schema upfront during MCP server target creation. The process remains the same, except that during target creation you select &lt;strong&gt;Use pre-defined list tools&lt;/strong&gt; and paste the GitHub tool definitions. You can copy the tool definition from the &lt;a href="https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/02-AgentCore-gateway/05-mcp-server-as-a-target/03-authorization-code-flow/" target="_blank" rel="noopener"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127604" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-12.png" alt="" width="1978" height="1560"&gt;&lt;/p&gt; 
&lt;p&gt;The target in this case becomes immediately ready, with authorization status &lt;strong&gt;No authorization required&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127605" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-13.png" alt="" width="3198" height="424"&gt;&lt;/p&gt; 
&lt;h2&gt;Demo&lt;/h2&gt; 
&lt;p&gt;After successful target creation, either using the implicit sync method or by providing the schema upfront, AgentCore Gateway users can discover and invoke tools using the MCP protocol. In this section, we look at the &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; flows from AgentCore Gateway.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-127606" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-14.png" alt="" width="1028" height="934"&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;The gateway user sends a &lt;code&gt;tools/list&lt;/code&gt; request to AgentCore Gateway with their inbound authorization token. Because tool definitions were cached during target creation, AgentCore Gateway returns the cached tool definitions immediately.&lt;/li&gt; 
 &lt;li&gt;The gateway user sends a &lt;code&gt;tools/call&lt;/code&gt; request to AgentCore Gateway with their inbound authorization token. This triggers the OAuth authorization code flow for the specific MCP server target, because AgentCore Gateway needs an access token to call the MCP server on behalf of this user.&lt;/li&gt; 
 &lt;li&gt;AgentCore Gateway &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_GetWorkloadAccessTokenForJWT.html" target="_blank" rel="noopener"&gt;requests&lt;/a&gt; a workload access token from AgentCore Identity, passing the workload identity and the user’s JWT from the inbound authorization header.&lt;/li&gt; 
 &lt;li&gt;Using the workload access token, AgentCore Gateway &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_GetResourceOauth2Token.html" target="_blank" rel="noopener"&gt;requests&lt;/a&gt; an OAuth 2.0 access token from the credential provider. Because no valid token exists yet for this user, the credential provider returns an authorization URL and a session-URI instead.&lt;/li&gt; 
 &lt;li&gt;AgentCore Gateway passes the authorization URL and session URI back to the gateway user. The user opens the authorization URL in their browser, signs in to the OAuth 2.0 authorization server, and grants the requested permissions. The sample &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25/client/elicitation#url-mode-flow" target="_blank" rel="noopener"&gt;URL elicitation&lt;/a&gt; response from AgentCore Gateway is as follows:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
  "jsonrpc": "2.0",
  "id": 3,
  "error": {
    "code": -32042,
    "message": "This request requires more information.",
    "data": {
      "elicitations": [{
        "mode": "url",
        "elicitationId": "&amp;lt;ID&amp;gt;",
        "url": "&amp;lt;identity_url&amp;gt;/?request_uri=urn%3Aietf%3A...",
        "message": "Please login to this URL for authorization."
      }]
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;After the user grants consent, the OAuth 2.0 authorization server sends an authorization code to the AgentCore Identity Credential Provider’s registered callback endpoint.&lt;/li&gt; 
 &lt;li&gt;The credential provider redirects the user’s browser to the return URL with the session URI. The user’s application calls &lt;code&gt;CompleteResourceTokenAuth&lt;/code&gt;, presenting the user’s JWT and the session-URI. The credential provider validates that the user who initiated the authorization flow (step 4) is the same user who completed consent.&lt;/li&gt; 
 &lt;li&gt;After successful session binding validation, the credential provider exchanges the authorization code with the OAuth 2.0 authorization server for an OAuth 2.0 access token. The credential provider caches this token in the Token Vault under the workload identity and user identity.&lt;/li&gt; 
 &lt;li&gt;When the gateway user issues a &lt;code&gt;tools/call&lt;/code&gt; request again, AgentCore Gateway retrieves the cached token from AgentCore Identity, keyed by workload identity and user identity, and uses it to call the MCP server.&lt;/li&gt; 
&lt;/ol&gt; 
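&lt;p&gt;On the client side, an MCP client has to recognize the URL elicitation error and surface the authorization URL to the user. The following minimal sketch parses the JSON-RPC error shape shown in the sample response earlier; the URL used here is a placeholder.&lt;/p&gt;

```python
import json

def extract_authorization_urls(response_text: str):
    """Pull authorization URLs out of a URL-elicitation error response.

    Matches the sample elicitation payload shown earlier: a JSON-RPC error
    with code -32042 carrying a list of url-mode elicitations.
    """
    payload = json.loads(response_text)
    error = payload.get("error") or {}
    if error.get("code") != -32042:
        return []  # not an elicitation; nothing for the user to do
    elicitations = error.get("data", {}).get("elicitations", [])
    return [e["url"] for e in elicitations if e.get("mode") == "url"]

# Sample response in the shape shown above, with a placeholder URL.
sample = json.dumps({
    "jsonrpc": "2.0",
    "id": 3,
    "error": {
        "code": -32042,
        "message": "This request requires more information.",
        "data": {"elicitations": [{
            "mode": "url",
            "elicitationId": "abc",
            "url": "https://identity.example/authorize?request_uri=urn...",
            "message": "Please login to this URL for authorization.",
        }]},
    },
})
print(extract_authorization_urls(sample))
```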
&lt;p&gt;Let us now look at a demo of the end-to-end flow where we send &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; requests to AgentCore Gateway.&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-127589-2" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/FLASH-3089/Gateway+Inspector.mp4?_=2"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;When you’re done using this solution, make sure to clean up all the resources. Follow the instructions in the &lt;a href="https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/02-AgentCore-gateway/05-mcp-server-as-a-target/03-authorization-code-flow/" target="_blank" rel="noopener"&gt;code repository&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we demonstrated how to connect an OAuth-protected MCP server to Amazon Bedrock AgentCore Gateway using the authorization code flow. By centralizing authentication through AgentCore Gateway, teams can manage credentials securely using Amazon Bedrock AgentCore Identity while giving developers seamless access to protected tools from MCP clients.&lt;/p&gt; 
&lt;p&gt;While this example focuses on the GitHub MCP server, the &lt;a href="https://github.com/awslabs/agentcore-samples/tree/main/01-tutorials/02-AgentCore-gateway/05-mcp-server-as-a-target/03-authorization-code-flow/" target="_blank" rel="noopener"&gt;code&lt;/a&gt; repository includes integration examples for other popular third-party MCP servers, and a guide for hosting your own MCP server with authorization code flow support on AgentCore Runtime as an AgentCore Gateway target. We encourage you to explore these examples and adapt them to your organization’s MCP server landscape.&lt;/p&gt; 
&lt;h2&gt;Resources&lt;/h2&gt; 
&lt;p&gt;To learn more, refer to the following resources:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/introducing-amazon-bedrock-agentcore-gateway-transforming-enterprise-ai-agent-tool-development/" target="_blank" rel="noopener"&gt;Introducing Amazon Bedrock AgentCore Gateway: Transforming enterprise AI agent tool development&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/introducing-amazon-bedrock-agentcore-identity-securing-agentic-ai-at-scale/" target="_blank" rel="noopener"&gt;Introducing Amazon Bedrock AgentCore Identity: Securing agentic AI at scale&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/" target="_blank" rel="noopener"&gt;Amazon Bedrock AgentCore Samples&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/oauth2-authorization-url-session-binding.html" target="_blank" rel="noopener"&gt;OAuth 2.0 authorization URL session binding&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127607" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-15.jpeg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Arko Dutta&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Arko Dutta &lt;/strong&gt;is a Software Engineer at Amazon Web Services, currently working on the AgentCore Gateway team. During his time at Amazon, he has contributed across several organizations, including Alexa Skills, Seller Flex, and API Gateway, before joining the Bedrock AgentCore Gateway team. Outside of work, he enjoys hiking and traveling.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127608" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-16.png" alt="" width="177" height="177"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Eashan Kaushik&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Eashan Kaushik&lt;/strong&gt; is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127609" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-17.png" alt="" width="512" height="512"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Sheetal Mohite&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Sheetal Mohite &lt;/strong&gt;is a Software Engineer at Amazon Web Services on the AgentCore Gateway team. Over the course of her tenure at Amazon, she has worked across multiple organizations, including Consumer Robotics, and now contributes towards building scalable infrastructure for Agentic AI systems. Outside of work, she enjoys CrossFit, occasional trail runs and hiking.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-127610" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/04/02/flash-3089-image-18.png" alt="" width="1082" height="1178"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Tanuja Joshi&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Tanuja Joshi &lt;/strong&gt;is a Software Engineer at Amazon Web Services on the AgentCore Gateway team. Since the start of her tenure, she has been working in the agentic AI space, contributing to services such as Bedrock Agents. When not at work, she enjoys reading and rock climbing.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		<enclosure length="9465871" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/FLASH-3089/Consent+Screen.mp4"/>
<enclosure length="12303046" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/FLASH-3089/Gateway+Inspector.mp4"/>

			</item>
		<item>
		<title>Simulate realistic users to evaluate multi-turn AI agents in Strands Evals</title>
		<link>https://aws.amazon.com/blogs/machine-learning/simulate-realistic-users-to-evaluate-multi-turn-ai-agents-in-strands-evals/</link>
					
		
		<dc:creator><![CDATA[Ishan Singh]]></dc:creator>
		<pubDate>Thu, 02 Apr 2026 17:34:02 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Strands Agents]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">fe875468636d5ac9e36df34c2c854b67105f5912</guid>

					<description>In this post, we explore how ActorSimulator in Strands Evaluations SDK addresses the challenge with structured user simulation that integrates into your evaluation pipeline.</description>
										<content:encoded>&lt;p&gt;Evaluating single-turn agent interactions follows a pattern that most teams understand well. You provide an input, collect the output, and judge the result. Frameworks like &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/quickstart/" target="_blank" rel="noopener noreferrer"&gt;Strands Evaluation SDK&lt;/a&gt; make this process systematic through evaluators that assess &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/evaluators/helpfulness_evaluator/"&gt;helpfulness&lt;/a&gt;, &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/evaluators/faithfulness_evaluator/"&gt;faithfulness&lt;/a&gt;, and &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/evaluators/tool_selection_evaluator/"&gt;tool usage&lt;/a&gt;. In a previous blog post, we covered &lt;a href="https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/"&gt;how to build comprehensive evaluation suites for AI agents&lt;/a&gt; using these capabilities. However, production conversations rarely stop at one turn.&lt;/p&gt; 
&lt;p&gt;Real users engage in exchanges that unfold over multiple turns. They ask follow-up questions when answers are incomplete, change direction when new information surfaces, and express frustration when their needs go unmet. A travel assistant that handles “Book me a flight to Paris” well in isolation might struggle when the same user follows up with “Actually, can we look at trains instead?” or “What about hotels near the Eiffel Tower?” Testing these dynamic patterns requires more than static test cases with fixed inputs and expected outputs.&lt;/p&gt; 
&lt;p&gt;The core difficulty is scale because you can’t manually conduct hundreds of multi-turn conversations every time your agent changes, and writing scripted conversation flows locks you into predetermined paths that miss how real users behave. What evaluation teams need is a way to generate realistic, goal-driven users programmatically and let them converse naturally with an agent across multiple turns. In this post, we explore how &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/simulators/user_simulation/" target="_blank" rel="noopener noreferrer"&gt;ActorSimulator&lt;/a&gt; in Strands Evaluations SDK addresses this challenge with structured user simulation that integrates into your evaluation pipeline.&lt;/p&gt; 
&lt;h2&gt;Why multi-turn evaluation is fundamentally harder&lt;/h2&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126859" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ml-20566-image-1.png" alt="" width="1060" height="1020"&gt;&lt;/p&gt; 
&lt;p&gt;Single-turn evaluation has a straightforward structure. The input is known ahead of time, the output is self-contained, and the evaluation context is limited to that single exchange. Multi-turn conversations break every one of these assumptions.&lt;/p&gt; 
&lt;p&gt;In a multi-turn interaction, each message depends on everything that came before it. The user’s second question is shaped by how the agent answered the first. A partial answer draws a follow-up about whatever was left out, a misunderstanding leads the user to restate their original request, and a surprising suggestion can send the conversation in a new direction.&lt;/p&gt; 
&lt;p&gt;These adaptive behaviors create conversation paths that can’t be predicted at test-design time. A static dataset of I/O pairs, no matter how large, can’t capture this dynamic quality because the “correct” next user message depends on what the agent just said.&lt;/p&gt; 
&lt;p&gt;Manual testing covers this gap in theory but fails in practice. Testers can conduct realistic multi-turn conversations, but doing so for every scenario, across every persona type, after every agent change is not sustainable. As the agent’s capabilities grow, the number of conversation paths grows combinatorially, well beyond what teams can explore manually.&lt;/p&gt; 
&lt;p&gt;Some teams turn to prompt engineering as a shortcut, asking a large language model (LLM) to “act like a user” during testing. Without structured persona definitions and explicit goal tracking, these approaches produce inconsistent results. The simulated user’s behavior drifts between runs, making it difficult to compare evaluations over time or identify genuine regressions versus random variation. A structured approach to user simulation can bridge this gap by combining the realism of human conversation with the repeatability and scale of automated testing.&lt;/p&gt; 
&lt;h2&gt;What makes a good simulated user&lt;/h2&gt; 
&lt;p&gt;Simulation-based testing is well established in other engineering disciplines. Flight simulators test pilot responses to scenarios that would be dangerous or impossible to reproduce in the real world. Game engines use AI-driven agents to explore millions of player behavior paths before release. The same principle applies to conversational AI. You create a controlled environment where realistic actors interact with your system under conditions you define, then measure the outcomes.&lt;/p&gt; 
&lt;p&gt;For AI agent evaluation, a useful simulated user starts with a consistent persona. A persona that behaves like a technical expert in one turn and a confused novice in the next produces unreliable evaluation data. Consistency means maintaining the same communication style, expertise level, and personality traits through every exchange, just as a real person would.&lt;/p&gt; 
&lt;p&gt;Equally important is goal-driven behavior. Real users come to an agent with something they want to accomplish. They persist until they achieve it, adjust their approach when something is not working, and recognize when their goal has been met. Without explicit goals, a simulated user tends to either end conversations too early or continue asking questions indefinitely, neither of which reflects real usage.&lt;/p&gt; 
&lt;p&gt;The simulated user must also respond adaptively to what the agent says, not follow a predetermined script. When the agent asks a clarifying question, the actor should answer it in character. If the response is incomplete, the actor follows up on whatever was left out rather than moving on. If the conversation drifts off topic, the actor steers it back toward the original goal. These adaptive behaviors make simulated conversations valuable as evaluation data because they exercise the same conversation dynamics your agent faces in production.&lt;/p&gt; 
&lt;p&gt;Building persona consistency, goal tracking, and adaptive behavior into a simulation framework is what differentiates structured user simulation from ad-hoc prompting. ActorSimulator in Strands Evals is designed around exactly these principles.&lt;/p&gt; 
&lt;h2&gt;How ActorSimulator works&lt;/h2&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126861" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ml-20566-image-2.png" alt="" width="1020" height="820"&gt;&lt;/p&gt; 
&lt;p&gt;ActorSimulator implements these simulation qualities through a system that wraps a Strands Agent configured to behave as a realistic user persona. The process begins with profile generation. Given a test case containing an input query and an optional task description, ActorSimulator uses an LLM to create a complete actor profile. A test case with input “I need help booking a flight to Paris” and task description “Complete flight booking under budget” might produce a budget-conscious traveler with beginner-level experience and a casual communication style. Profile generation gives each simulated conversation a distinct, consistent character.&lt;/p&gt; 
&lt;p&gt;With the profile established, the simulator manages the conversation turn by turn. It maintains the full conversation history and generates each response in context, keeping the simulated user’s behavior aligned with their profile and goals throughout. When your agent addresses only part of the request, the simulated user naturally follows up on the gaps. A clarifying question from your agent gets a response that stays consistent with the persona. The conversation feels organic because every response reflects both the actor’s persona and everything said so far.&lt;/p&gt; 
&lt;p&gt;Goal tracking runs alongside the conversation. ActorSimulator includes a built-in goal completion assessment tool that the simulated user can invoke to evaluate whether their original objective has been met. When the goal is satisfied or the simulated user determines that the agent cannot complete their request, the simulator emits a stop signal and the conversation ends. If the maximum turn count is reached before the goal is met, the conversation also stops. This gives you a signal that the agent might not be resolving user needs efficiently. This mechanism makes sure conversations have a natural endpoint rather than running indefinitely or cutting off arbitrarily.&lt;/p&gt; 
&lt;p&gt;Each response from the simulated user also includes structured reasoning alongside the message text. You can inspect why the simulated user chose to say what they said, whether they were following up on missing information, expressing confusion, or redirecting the conversation. This transparency is valuable during evaluation development because you can see the reasoning behind each turn, making it more straightforward to trace where conversations succeed or go off track.&lt;/p&gt; 
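&lt;p&gt;Inspecting that reasoning programmatically might look like the following sketch. The &lt;code&gt;reasoning&lt;/code&gt; attribute name and the stub result classes are assumptions for illustration; the code samples later in this post rely only on &lt;code&gt;structured_output.message&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass

# Stand-ins for the simulator's result objects; attribute names other
# than `message` are assumptions for illustration.
@dataclass
class StructuredOutput:
    message: str
    reasoning: str

@dataclass
class ActorResult:
    structured_output: StructuredOutput

def log_turn(result: ActorResult) -> str:
    """Format one simulated-user turn with its reasoning for inspection."""
    out = result.structured_output
    return f"user said: {out.message!r} because: {out.reasoning!r}"

turn = ActorResult(StructuredOutput(
    message="Can you also find a hotel near the station?",
    reasoning="The agent only answered the flight part of my request.",
))
print(log_turn(turn))
```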
&lt;h2&gt;Getting started with ActorSimulator&lt;/h2&gt; 
&lt;p&gt;To get started, you will need to install the Strands Evaluation SDK using: &lt;code&gt;pip install strands-agents-evals&lt;/code&gt;. For a step-by-step setup, you can refer to our &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/quickstart/" target="_blank" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; or our &lt;a href="https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/"&gt;previous blog&lt;/a&gt; for more details. Putting these concepts into practice requires minimal code. You define a test case with an input query and a task description that captures the user’s goal. ActorSimulator handles profile generation, conversation management, and goal tracking automatically.&lt;/p&gt; 
&lt;p&gt;The following example evaluates a travel assistant agent through a multi-turn simulated conversation.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;from strands import Agent
from strands_evals import ActorSimulator, Case, Experiment

# Define your test case
case = Case(
    input="I want to plan a trip to Tokyo with hotel and activities",
    metadata={"task_description": "Complete travel package arranged"}
)

# Create the agent you want to evaluate
agent = Agent(
    system_prompt="You are a helpful travel assistant.",
    callback_handler=None
)

# Create user simulator from test case
user_sim = ActorSimulator.from_case_for_user_simulator(
    case=case,
    max_turns=5
)

# Run the multi-turn conversation
user_message = case.input
conversation_history = []

while user_sim.has_next():
    # Agent responds to user
    agent_response = agent(user_message)
    agent_message = str(agent_response)
    conversation_history.append({
        "role": "assistant",
        "content": agent_message
    })

    # Simulator generates next user message
    user_result = user_sim.act(agent_message)
    user_message = str(user_result.structured_output.message)
    conversation_history.append({
        "role": "user",
        "content": user_message
    })

print(f"Conversation completed in {len(conversation_history) // 2} turns")&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The conversation loop continues until &lt;code&gt;has_next()&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt;, which happens when the simulated user’s goals are met, when the simulated user determines that the agent cannot complete the request, or when the maximum turn limit is reached. The resulting &lt;code&gt;conversation_history&lt;/code&gt; contains the full multi-turn transcript, ready for evaluation.&lt;/p&gt; 
&lt;h2&gt;Integration with evaluation pipelines&lt;/h2&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-126863" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ml-20566-image-3.png" alt="" width="1060" height="530"&gt;&lt;/p&gt; 
&lt;p&gt;A standalone conversation loop is useful for quick experiments, but production evaluation requires capturing traces and feeding them into your evaluator pipeline. The next example combines ActorSimulator with &lt;a href="https://opentelemetry.io/blog/2025/ai-agent-observability/"&gt;OpenTelemetry telemetry collection&lt;/a&gt; and Strands Evals session mapping. The task function runs a simulated conversation and collects spans from each turn, then maps them into a structured session for evaluation.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from strands import Agent
from strands_evals import ActorSimulator, Case, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper

# Setup telemetry for capturing agent traces
telemetry = StrandsEvalsTelemetry()
memory_exporter = InMemorySpanExporter()
span_processor = BatchSpanProcessor(memory_exporter)
telemetry.tracer_provider.add_span_processor(span_processor)

def evaluation_task(case: Case) -&amp;gt; dict:
    # Create simulator
    user_sim = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=3
    )

    # Create agent
    agent = Agent(
        system_prompt="You are a helpful travel assistant.",
        callback_handler=None
    )

    # Accumulate spans across conversation
    all_target_spans = []
    user_message = case.input

    while user_sim.has_next():
        memory_exporter.clear()
        agent_response = agent(user_message)
        agent_message = str(agent_response)

        # Capture telemetry
        turn_spans = list(memory_exporter.get_finished_spans())
        all_target_spans.extend(turn_spans)

        # Generate next user message
        user_result = user_sim.act(agent_message)
        user_message = str(user_result.structured_output.message)

    # Map to session for evaluation
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(
        all_target_spans,
        session_id="test-session"
    )

    return {"output": agent_message, "trajectory": session}

# Create evaluation dataset
test_cases = [
    Case(
        name="booking-simple",
        input="I need to book a flight to Paris next week",
        metadata={
            "category": "booking",
            "task_description": "Flight booking confirmed"
        }
    )
]

evaluator = HelpfulnessEvaluator()
experiment = Experiment(cases=test_cases, evaluator=evaluator)

# Run the evaluation task against every case in the experiment
report = experiment.run_evaluations(evaluation_task)
report.run_display()
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This approach captures complete traces of your agent’s behavior across conversation turns. The spans include tool calls, model invocations, and timing information for every turn in the simulated conversation. By mapping these spans into a structured session, you make the full multi-turn interaction available to evaluators like &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/evaluators/goal_success_rate_evaluator/"&gt;GoalSuccessRateEvaluator&lt;/a&gt; and &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/evaluators/helpfulness_evaluator/"&gt;HelpfulnessEvaluator&lt;/a&gt;, which can then assess the conversation as a whole, rather than isolated turns.&lt;/p&gt; 
&lt;h2&gt;Custom actor profiles for targeted testing&lt;/h2&gt; 
&lt;p&gt;Automatic profile generation covers most evaluation scenarios well, but some testing goals require specific personas. You might want to verify that your agent handles an impatient expert user differently from a patient beginner, or that it responds appropriately to a user with domain-specific needs. For these cases, ActorSimulator accepts a fully defined actor profile that you control.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;from strands_evals.types.simulation import ActorProfile
from strands_evals import ActorSimulator
from strands_evals.simulation.prompt_templates.actor_system_prompt import (
    DEFAULT_USER_SIMULATOR_PROMPT_TEMPLATE
)

# Define a custom actor profile
actor_profile = ActorProfile(
    traits={
        "personality": "analytical and detail-oriented",
        "communication_style": "direct and technical",
        "expertise_level": "expert",
        "patience_level": "low"
    },
    context="Experienced business traveler with elite status who values efficiency",
    actor_goal="Book business class flight with specific seat preferences and lounge access"
)

# Initialize simulator with custom profile
user_sim = ActorSimulator(
    actor_profile=actor_profile,
    initial_query="I need to book a business class flight to London next Tuesday",
    system_prompt_template=DEFAULT_USER_SIMULATOR_PROMPT_TEMPLATE,
    max_turns=10
)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;By defining traits like patience level, communication style, and expertise, you can systematically test how your agent performs across different user segments. An agent that scores well with patient, non-technical users but poorly with impatient experts reveals a specific quality gap that you can address. Running the same goal across multiple persona configurations turns user simulation into a tool for understanding your agent’s strengths and weaknesses by user type.&lt;/p&gt; 
&lt;h2&gt;Best practices for simulation-based evaluation&lt;/h2&gt; 
&lt;p&gt;These best practices help you get the most out of simulation-based evaluation:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Set &lt;code&gt;max_turns&lt;/code&gt; based on task complexity, using 3-5 for focused tasks and 8-10 for multi-step workflows. If most conversations reach the limit without completing the goal, increase it.&lt;/li&gt; 
 &lt;li&gt;Write specific task descriptions that the simulator can evaluate against. “Help the user book a flight” is too vague to judge completion reliably, while “flight booking confirmed with dates, destination, and price” gives a concrete target.&lt;/li&gt; 
 &lt;li&gt;Use auto-generated profiles for broad coverage across user types and custom profiles to reproduce specific patterns from your production logs, such as an impatient expert or a first-time user.&lt;/li&gt; 
 &lt;li&gt;Focus on patterns across your test suite rather than individual transcripts. Consistent redirects from the simulated user suggest that the agent is drifting off topic, and declining goal completion rates after an agent change point to a regression.&lt;/li&gt; 
 &lt;li&gt;Start with a small set of test cases covering your most common scenarios and expand to edge cases and additional personas as your evaluation practice matures.&lt;/li&gt; 
&lt;/ul&gt; 
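&lt;p&gt;The pattern-focused practices above reduce to simple aggregation over your simulation results. The following stdlib sketch is illustrative only: the result records and field names are hypothetical, not part of the Strands Evals API.&lt;/p&gt;

```python
from collections import defaultdict

def completion_rates(results):
    """Group simulation results by category and compute goal completion rates."""
    totals = defaultdict(lambda: [0, 0])  # maps category to [completed, total]
    for r in results:
        bucket = totals[r["category"]]
        bucket[0] += r["goal_completed"]
        bucket[1] += 1
    return {cat: completed / total for cat, (completed, total) in totals.items()}

# Hypothetical records from two personas running the same booking goal
results = [
    {"category": "patient-beginner", "goal_completed": True},
    {"category": "patient-beginner", "goal_completed": True},
    {"category": "impatient-expert", "goal_completed": True},
    {"category": "impatient-expert", "goal_completed": False},
]
rates = completion_rates(results)
```

Grouping results by persona or agent version in this way surfaces quality gaps that reading individual transcripts tends to hide.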
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;We showed how &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/simulators/user_simulation/" target="_blank" rel="noopener noreferrer"&gt;ActorSimulator&lt;/a&gt; in &lt;a href="https://strandsagents.com/docs/user-guide/evals-sdk/quickstart/"&gt;Strands Evals&lt;/a&gt; enables systematic, multi-turn evaluation of conversational AI agents through realistic user simulation. Rather than relying on static test cases that capture only single exchanges, you can define goals and personas and let simulated users interact with your agent across natural, adaptive conversations. The resulting transcripts feed directly into the same evaluation pipeline that you use for single-turn testing, giving you helpfulness scores, goal success rates, and detailed traces across every conversation turn.&lt;/p&gt; 
&lt;p&gt;To get started, explore the working examples in the &lt;a href="https://github.com/strands-agents/samples/tree/main/07-evals" target="_blank" rel="noopener noreferrer"&gt;Strands Agents samples repository&lt;/a&gt;. For teams evaluating agents deployed through &lt;a href="https://aws.amazon.com/bedrock/agentcore/?trk=2bc12158-bb93-427c-a19a-1c398faebbc8&amp;amp;sc_channel=ps&amp;amp;ef_id=Cj0KCQjwsdnNBhC4ARIsAA_3heh_4Q-3loHC_p8uMMAejTQt0u4gEE60U9aof3U1kdfNflYc9-6z7pEaAgtGEALw_wcB:G:s&amp;amp;s_kwcid=AL!4422!3!798517281045!e!!g!!agentcore!23606216570!196197897240&amp;amp;gad_campaignid=23606216570&amp;amp;gbraid=0AAAAADjHtp8Xb3vFSacq1jBPqDhevd0Az&amp;amp;gclid=Cj0KCQjwsdnNBhC4ARIsAA_3heh_4Q-3loHC_p8uMMAejTQt0u4gEE60U9aof3U1kdfNflYc9-6z7pEaAgtGEALw_wcB" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore&lt;/a&gt;, the &lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/tree/main/01-tutorials/07-AgentCore-evaluations/03-advanced/02-simulating-agent-interactions" target="_blank" rel="noopener noreferrer"&gt;AgentCore evaluations sample&lt;/a&gt; demonstrates how to simulate interactions with deployed agents. Start with a handful of test cases representing your most common user scenarios, run them through ActorSimulator, and evaluate the results. As your evaluation practice matures, expand to cover more personas, edge cases, and conversation patterns.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-126879" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ishansin-headshot.png" alt="" width="149" height="139"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ishan Singh&lt;/h3&gt; 
  &lt;p&gt;Ishan is a Sr. Applied Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126878" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/jb.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jonathan Buck&lt;/h3&gt; 
  &lt;p&gt;Jonathan is a Senior Software Engineer at Amazon Web Services. His work focuses on building agent environments, evaluation, and post-training infrastructure to support the productization of agentic systems.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-126877" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/varannil-2.jpg" alt="" width="116" height="155"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Vinayak Arannil&lt;/h3&gt; 
   &lt;p&gt;Vinayak is a Sr. Applied Scientist on the Amazon Bedrock AgentCore team. With several years of experience, he has worked across AI domains such as computer vision, natural language processing, and recommendation systems. Currently, Vinayak helps build new capabilities in AgentCore and Strands, enabling customers to evaluate their agentic applications with ease, accuracy, and efficiency.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126876" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/abhishek_pic.jpeg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Abhishek Kumar&lt;/h3&gt; 
  &lt;p&gt;Abhishek is an Applied Scientist at AWS, working at the intersection of artificial intelligence and machine learning, with a focus on agent observability, simulation, and evaluation. His primary research interests center on agentic conversational systems. Prior to his current role, Abhishek spent two years at Alexa, Amazon, where he contributed to building and training models that powered Alexa’s core capabilities.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows</title>
		<link>https://aws.amazon.com/blogs/machine-learning/scaling-seismic-foundation-models-on-aws-distributed-training-with-amazon-sagemaker-hyperpod-and-expanding-context-windows/</link>
					
		
		<dc:creator><![CDATA[Haotian An]]></dc:creator>
		<pubDate>Thu, 02 Apr 2026 13:30:57 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon SageMaker HyperPod]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Energy]]></category>
		<category><![CDATA[Experience-Based Acceleration]]></category>
		<guid isPermaLink="false">1c65854b644a83432e2546e6f2562a515a630b51</guid>

					<description>This post describes how TGS achieved near-linear scaling for distributed training and expanded context windows for their Vision Transformer-based SFM using Amazon SageMaker HyperPod. This joint solution cut training time from 6 months to just 5 days while enabling analysis of seismic volumes larger than previously possible.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;This post is cowritten with Altay Sansal and Alejandro Valenciano from TGS.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://www.tgs.com/" target="_blank" rel="noopener"&gt;TGS&lt;/a&gt;, a geoscience data provider for the energy sector, supports companies’ exploration and production workflows with advanced seismic foundation models (SFMs). These models analyze complex 3D seismic data to identify geological structures vital for energy exploration. To help enhance their next-generation models as part of their AWS infrastructure modernization, TGS partnered with the AWS Generative AI Innovation Center (GenAIIC) to optimize their SFM training infrastructure.&lt;/p&gt; 
&lt;p&gt;This post describes how TGS achieved near-linear scaling for distributed training and expanded context windows for their Vision Transformer-based SFM using &lt;a href="https://aws.amazon.com/sagemaker/ai/hyperpod/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker HyperPod&lt;/a&gt;. This joint solution cut training time from 6 months to just 5 days while enabling analysis of seismic volumes larger than previously possible.&lt;/p&gt; 
&lt;h2&gt;Addressing seismic foundation model training challenges&lt;/h2&gt; 
&lt;p&gt;TGS’s SFM uses a Vision Transformer (ViT) architecture with Masked AutoEncoder (MAE) training designed by the TGS team to analyze 3D seismic data. Scaling such models presents several challenges:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data scale and complexity&lt;/strong&gt; – TGS works with large volumes of proprietary 3D seismic data stored in domain-specific formats. The sheer volume and structure of this data required efficient streaming strategies to maintain high throughput and help prevent GPU idle time during training.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Training efficiency&lt;/strong&gt; – Training large FMs on 3D volumetric data is computationally intensive. Accelerating training cycles would enable TGS to incorporate new data more frequently and iterate on model improvements faster, delivering more value to their clients.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Expanded analytical capabilities&lt;/strong&gt; – The geological context a model can analyze depends on how much 3D volume it can process at once. Expanding this capability would allow the models to capture both local details and broader geological patterns simultaneously.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These challenges called for a comprehensive approach to distributed training and infrastructure optimization, and the AWS GenAIIC partnered with TGS to develop a solution addressing them.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The collaboration between TGS and the AWS GenAIIC focused on three key areas: establishing an efficient data pipeline, optimizing distributed training across multiple nodes, and expanding the model’s context window to analyze larger geological volumes. The following diagram illustrates the solution architecture.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-124987" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/02/25/ML-10370-image-1.png" alt="Architecture diagram showing AWS SageMaker HyperPod service integration with a customer account, featuring a login node, head node, 16 compute nodes, S3 storage connections, and user access paths for engineers, researchers, and operations teams." width="1388" height="741"&gt;&lt;/p&gt; 
&lt;p&gt;The solution uses SageMaker HyperPod to help provide a resilient, scalable training infrastructure with automatic health monitoring and checkpoint management. The SageMaker HyperPod cluster is configured with &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) execution roles scoped to the minimum permissions required for training operations, deployed within a virtual private cloud (VPC) with network isolation and security groups restricting communication to authorized training nodes. Terabytes of training data streams directly from &lt;a href="https://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3), alleviating the need for intermediate storage layers while maintaining high throughput. &lt;a href="http://aws.amazon.com/cloudtrail" target="_blank" rel="noopener noreferrer"&gt;AWS CloudTrail&lt;/a&gt; logs API calls to Amazon S3 and SageMaker services, and Amazon S3 access logging is enabled on training data buckets to provide a detailed audit trail of data access requests. The distributed training framework uses advanced parallelization techniques to efficiently scale across multiple nodes, and context parallelism methods enable the model to process significantly larger 3D volumes than previously possible.&lt;/p&gt; 
&lt;p&gt;The final cluster configuration consisted of 16 &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/" target="_blank" rel="noopener noreferrer"&gt;Amazon Elastic Compute Cloud (Amazon EC2) P5 instances&lt;/a&gt; for the worker nodes integrated through the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html" target="_blank" rel="noopener noreferrer"&gt;SageMaker AI flexible training plans&lt;/a&gt;, each containing:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;8 NVIDIA H200 GPUs with 141 GB HBM3e memory per GPU&lt;/li&gt; 
 &lt;li&gt;192 vCPUs&lt;/li&gt; 
 &lt;li&gt;2048 GB system RAM&lt;/li&gt; 
 &lt;li&gt;3200 Gbps EFAv3 networking for ultra-low latency communication&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Optimizing the training data pipeline&lt;/h2&gt; 
&lt;p&gt;TGS’s training dataset consists of 3D seismic volumes stored in the TGS-developed MDIO format—an open source format built on Zarr arrays designed for large-scale scientific data in the cloud. Such volumes can contain billions of data points representing underground geological structures.&lt;/p&gt; 
&lt;h3&gt;Choosing the right storage approach&lt;/h3&gt; 
&lt;p&gt;The team evaluated two approaches for delivering data to training GPUs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon FSx for Lustre&lt;/strong&gt; – Copy data from Amazon S3 to a high-speed distributed file system that the nodes read from. This approach provides sub-millisecond latency but requires pre-loading and provisioned storage capacity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Streaming directly from Amazon S3 &lt;/strong&gt;– Stream data directly from Amazon S3 using MDIO’s native capabilities with multi-threaded libraries, opening multiple concurrent connections per node.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Settling on streaming directly from Amazon S3&lt;/h3&gt; 
&lt;p&gt;The key architectural difference lies in how throughput scales with the cluster. With streaming directly from Amazon S3, each training node creates independent Amazon S3 connections, so aggregate throughput can scale linearly. With &lt;a href="https://aws.amazon.com/fsx/lustre/" target="_blank" rel="noopener noreferrer"&gt;Amazon FSx for Lustre&lt;/a&gt;, the nodes share a single file system whose throughput is tied to provisioned storage capacity. Using Amazon FSx together with Amazon S3 requires only a small Amazon FSx storage volume, which limits the entire cluster to that volume’s throughput, creating a bottleneck as the cluster grows.&lt;/p&gt; 
&lt;p&gt;Comprehensive testing and cost analysis revealed streaming from Amazon S3 directly as the optimal choice for this configuration:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Performance&lt;/strong&gt; – Achieved 4–5 GBps sustained throughput per node using multiple data loader processes with pre-fetching over HTTPS endpoints (TLS 1.2)—sufficient to fully utilize the GPUs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cost efficiency&lt;/strong&gt; – Streaming from Amazon S3 alleviated the need for Amazon FSx provisioning, reducing storage infrastructure costs by over 90% while helping deliver 64–80 GBps cluster-wide throughput. The Amazon S3 pay-per-use model was more economical than provisioning high-throughput Amazon FSx capacity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Better scaling&lt;/strong&gt; – Streaming from Amazon S3 directly scales naturally—each node brings its own connection bandwidth, avoiding the need for complex capacity planning.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Operational simplicity&lt;/strong&gt; – No intermediate storage to provision, manage, or synchronize.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The team optimized Amazon S3 connection pooling and implemented parallel data loading to sustain high throughput across the 16 nodes.&lt;/p&gt; 
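&lt;p&gt;The parallel-loading pattern can be sketched with the standard library: each worker fetches chunks over its own logical connection, so aggregate throughput grows with the number of concurrent readers. The &lt;code&gt;fetch_chunk&lt;/code&gt; stub below is a stand-in for an MDIO/Amazon S3 chunk read, not the actual TGS pipeline.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(chunk_id):
    """Stand-in for reading one Zarr chunk over an Amazon S3 connection."""
    return bytes(8)  # pretend 8-byte payload

def load_volume(chunk_ids, max_workers=16):
    # Each worker thread issues requests over its own logical connection,
    # mirroring how every training node streams independently from Amazon S3.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_chunk, chunk_ids))

chunks = load_volume(range(64))
```

In the real pipeline, multiple data loader processes with pre-fetching play the role of the thread pool, keeping the GPUs fed without an intermediate file system.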
&lt;h2&gt;Selecting the distributed training framework&lt;/h2&gt; 
&lt;p&gt;When training large models across multiple GPUs, the model’s parameters, gradients, and optimizer states must be distributed across devices. The team evaluated different distributed training approaches to find the optimal balance between memory efficiency and training throughput:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;ZeRO-2 (Zero Redundancy Optimizer Stage 2)&lt;/strong&gt; – This approach partitions gradients and optimizer states across GPUs while keeping a full copy of model parameters on each GPU. This helps reduce memory usage while maintaining fast communication, because each GPU can directly access the parameters during the forward pass without waiting for data from other GPUs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ZeRO-3&lt;/strong&gt; – This approach goes further by also partitioning model parameters across GPUs. Although this helps maximize memory efficiency (enabling larger models), it requires more frequent communication between GPUs to gather parameters during computation, which can reduce throughput.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;FSDP2 (Fully Sharded Data Parallel v2)&lt;/strong&gt; – PyTorch’s native approach similarly shards parameters, gradients, and optimizer states. It offers tight integration with PyTorch but involves similar communication trade-offs as ZeRO-3.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Comprehensive testing revealed DeepSpeed ZeRO-2 as the optimal framework for this configuration, delivering strong performance while efficiently managing memory:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;ZeRO-2 &lt;/strong&gt;– 1,974 samples per second (implemented)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;FSDP2 &lt;/strong&gt;– 1,833 samples per second&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ZeRO-3 &lt;/strong&gt;– 869 samples per second&lt;/li&gt; 
&lt;/ul&gt; 
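&lt;p&gt;In practice, ZeRO-2 is selected through a DeepSpeed configuration. The following fragment is a sketch using DeepSpeed’s documented schema; the batch and precision values are illustrative, not the exact TGS settings.&lt;/p&gt;

```python
# Minimal DeepSpeed ZeRO-2 configuration sketch (values are illustrative)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # shard gradients and optimizer states only
        "overlap_comm": True,          # overlap gradient reduction with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}
```

Because stage 2 keeps a full parameter copy on each GPU, the forward pass needs no parameter-gathering communication, which is what preserved throughput at this scale.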
&lt;p&gt;This framework choice provided the foundation for achieving near-linear scaling across multiple nodes. The combination of these three key optimizations helped deliver the dramatic training acceleration:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Efficient distributed training&lt;/strong&gt; – DeepSpeed ZeRO-2 enabled near-linear scaling across 128 GPUs (16 nodes × 8 GPUs)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;High-throughput data pipeline&lt;/strong&gt; – Streaming from Amazon S3 directly sustained 64–80 GBps aggregate throughput across the cluster&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Together, these improvements helped reduce training time from 6 months to 5 days—enabling TGS to iterate on model improvements weekly rather than semi-annually.&lt;/p&gt; 
&lt;h2&gt;Expanding analytical capabilities&lt;/h2&gt; 
&lt;p&gt;One of the most significant achievements was expanding the model’s field of view—how much 3D geological volume it can analyze simultaneously. A larger context window allows the model to capture both fine details (small fractures) and broad patterns (basin-wide fault systems) in a single pass, surfacing insights for TGS’s clients that smaller analysis windows could not reveal. The TGS and AWS teams adapted the following advanced techniques to enable ViTs to process substantially larger 3D seismic volumes:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Ring attention implementation&lt;/strong&gt; – Each GPU processes a portion of the input sequence while circulating key-value pairs to neighboring GPUs, gradually accumulating attention results across the distributed system. PyTorch provides an API that makes this straightforward:&lt;/li&gt; 
&lt;/ul&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;from torch.distributed.tensor.parallel import context_parallel

# Wrap attention computation with context parallelism
with context_parallel(
    buffers=[query, key, value],  # Tensors to shard
    buffer_seq_dims=[1, 1, 1]      # Dimension to shard along (sequence dimension)
):
    # Standard scaled dot-product attention - automatically becomes Ring Attention
    attention_output = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=None
    )&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Dynamic mask ratio adjustment&lt;/strong&gt; – The MAE training approach required making sure the number of unmasked patches plus classification tokens divides evenly by the number of devices, necessitating adaptive masking strategies.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Decoder sequence management&lt;/strong&gt; – The decoder reconstructs the full image by processing both the unmasked patches from the encoder and the masked patches. This creates a different sequence length that also needs to be divisible by the number of GPUs.&lt;/li&gt; 
&lt;/ul&gt; 
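&lt;p&gt;The divisibility constraint behind dynamic mask ratio adjustment can be sketched in a few lines: starting from a target mask ratio, the number of kept (unmasked) patches is reduced until the encoder sequence—kept patches plus classification tokens—splits evenly across GPUs. This is an illustrative reconstruction, not the TGS implementation.&lt;/p&gt;

```python
def adjust_kept_patches(num_patches, mask_ratio, world_size, num_cls_tokens=1):
    """Choose how many patches to leave unmasked so that the encoder
    sequence (kept patches + classification tokens) shards evenly."""
    kept = round(num_patches * (1 - mask_ratio))
    remainder = (kept + num_cls_tokens) % world_size
    return kept - remainder  # mask a few extra patches to satisfy the constraint

# Example: 196 patches, 75% masking, 8 GPUs, 1 classification token
kept = adjust_kept_patches(196, mask_ratio=0.75, world_size=8)
```

The decoder side applies the same idea to its own sequence, since the full-length reconstruction (unmasked plus masked patches) must also divide evenly across GPUs.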
&lt;p&gt;The preceding implementation enabled processing of substantially larger 3D seismic volumes as illustrated in the following table.&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Previous (Baseline)&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;With Context Parallelism&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Maximum input size&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;640 × 640 × 1,024 voxels&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;1,536 × 1,536 × 2,048 voxels&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Context length&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;102,400 tokens&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;1,170,000 tokens&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Volume increase&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;1×&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;4.5×&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;The following figure provides an example of 2D model context size.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-124988" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/02/25/ML-10370-image-2.png" alt="Seismic cross-section diagram titled &amp;quot;2D Model Context Size Example&amp;quot; showing three color-coded context window sizes — 256×256 (cyan), 512×512 (magenta), and 640×1024 (yellow) — overlaid at three locations across a grayscale subsurface geological profile, with crossline traces on the x-axis and depth samples on the y-axis." width="1010" height="705"&gt;&lt;/p&gt; 
&lt;p&gt;This expansion allows TGS’s models to capture geological features across broader spatial contexts, helping enhance the analytical capabilities they can offer to clients.&lt;/p&gt; 
&lt;h2&gt;Results and impact&lt;/h2&gt; 
&lt;p&gt;The collaboration between TGS and the AWS GenAIIC delivered substantial improvements across multiple dimensions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Significant training acceleration&lt;/strong&gt; – The optimized distributed training architecture reduced training time from 6 months to 5 days—an approximate 36-fold speedup, enabling TGS to iterate faster and incorporate new geological data more frequently into their models.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Near-linear scaling&lt;/strong&gt; – The solution demonstrated strong scaling efficiency from single-node to 16-node configurations, achieving approximately 90–95% parallel efficiency with minimal performance degradation as the cluster size increased.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Expanded analytical capabilities&lt;/strong&gt; – The context parallelism implementation enables training on larger 3D volumes, allowing models to capture geological features across broader spatial contexts.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Production-ready, cost-efficient infrastructure&lt;/strong&gt; – The SageMaker HyperPod based solution with streaming from Amazon S3 helps provide a cost-effective foundation that scales efficiently as training requirements grow, while helping deliver the resilience, flexibility, and operational efficiency needed for production AI workflows.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These improvements establish a strong foundation for TGS’s AI-powered analytics system, delivering faster model iteration cycles and broader geological context per analysis to clients while helping protect TGS’s valuable data assets.&lt;/p&gt; 
&lt;h2&gt;Lessons learned and best practices&lt;/h2&gt; 
&lt;p&gt;Several key lessons emerged from this collaboration that might benefit other organizations working with large-scale 3D data and distributed training:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Systematic scaling approach&lt;/strong&gt; – Starting with a single-node baseline establishment before progressively expanding to larger clusters enabled systematic optimization at each stage while managing costs effectively.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data pipeline optimization is critical&lt;/strong&gt; – For data-intensive workloads, thoughtful data pipeline design can provide strong performance. Direct streaming from object storage with appropriate parallelization and prefetching delivered the throughput needed without complex intermediate storage layers.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Batch size tuning is nuanced&lt;/strong&gt; – Increasing batch size doesn’t always improve throughput. The team found that excessively large batch sizes can create bottlenecks in preparing and transferring data to GPUs. Through systematic testing at different scales, the team identified the point where throughput plateaued, indicating the data loading pipeline had become the limiting factor rather than GPU computation. This optimal balance maximized training efficiency without over-provisioning resources.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Framework selection depends on your specific requirements&lt;/strong&gt; – Different distributed training frameworks involve trade-offs between memory efficiency and communication overhead. The optimal choice depends on model size, hardware characteristics, and scaling requirements.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Incremental validation&lt;/strong&gt; – Testing configurations at smaller scales before expanding to full production clusters helped identify optimal settings while controlling costs during the development phase.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;By partnering with the AWS GenAIIC, TGS has established an optimized, scalable infrastructure for training SFMs on AWS. The solution helps accelerate training cycles while expanding the models’ analytical capabilities, helping TGS deliver enhanced subsurface analytics to clients in the energy sector. The technical innovations developed during this collaboration—particularly the adaptation of context parallelism to ViT architectures for 3D volumetric data—demonstrate the potential for applying advanced AI techniques to specialized scientific domains. As TGS continues to expand its subsurface AI system and broader AI capabilities, this foundation can support future enhancements such as multi-modal integration and temporal analysis.&lt;/p&gt; 
&lt;p&gt;To learn more about scaling your own FM training workloads, explore &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod.html" target="_blank" rel="noopener noreferrer"&gt;SageMaker HyperPod&lt;/a&gt; for resilient distributed training infrastructure, or review the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html" target="_blank" rel="noopener noreferrer"&gt;distributed training best practices&lt;/a&gt; in the SageMaker documentation. For organizations interested in similar collaborations, the &lt;a href="https://aws.amazon.com/generative-ai/innovation-center/" target="_blank" rel="noopener noreferrer"&gt;AWS Generative AI Innovation Center&lt;/a&gt; partners with customers to help accelerate their AI initiatives.&lt;/p&gt; 
&lt;h3&gt;Acknowledgement&lt;/h3&gt; 
&lt;p&gt;Special thanks to Andy Lapastora, Bingchen Liu, Prashanth Ramaswamy, Rohit Thekkanal, Jared Kramer, Arun Ramanathan, and Roy Allela for their contributions.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-125209" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/02/haotiaa-1.jpg" alt="Haotian An" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Haotian An&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Haotian An&lt;/strong&gt; is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he specializes in customizing foundation models and distributed training at scale. He works closely with customers to adapt generative AI to their specific use cases, helping them unlock new capabilities and drive measurable business outcomes.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-125210" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/02/malwani-1.jpg" alt="Manoj Alwani" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Manoj Alwani&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Manoj Alwani&lt;/strong&gt; is a Senior Applied Scientist at the Generative AI Innovation Center at AWS, where he helps organizations unlock the potential of cutting-edge AI technology. With deep expertise across the entire generative AI research stack, Manoj works closely with customers from diverse industries to accelerate their GenAI adoption and drive meaningful business outcomes. He brings over 13 years of hands-on experience in developing and deploying machine learning solutions at scale.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-125208" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/02/debby-1.jpg" alt="Debby Wehner" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Debby Wehner&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Debby Wehner&lt;/strong&gt; is a Machine Learning Engineer at the AWS Generative AI Innovation Center, specializing in large language model customization and optimization. Previously, as a full-stack software engineer at Amazon, she built AI-powered shopping applications reaching over 100 million monthly users. She holds a PhD in Computational Geophysics from the University of Cambridge, as well as a BSc and MSc from Freie Universität Berlin.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-125206" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/02/altay-1.jpg" alt="Altay Sansal" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Altay Sansal&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Altay Sansal&lt;/strong&gt; is a Senior Data Science Lead at TGS in Houston, Texas, specializing in AI/ML applications for geophysics and seismic data, including foundation models, large-scale training, and open-source tools like the MDIO format. He holds an M.S. in Geophysics from the University of Houston and has authored key publications such as “Scaling Seismic Foundation Models” and “MDIO: Open-source format for multidimensional energy data”, while actively contributing to geoscience ML through GitHub and industry events.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-125204" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/02/alejandro-1.jpg" alt="Alejandro Valenciano" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Alejandro Valenciano&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Alejandro Valenciano&lt;/strong&gt; is the Director of Data Science at TGS, where he leads advanced analytics and data science initiatives that unlock insights from subsurface and energy-related data, driving innovation across seismic, well, and machine learning workflows. He has developed and applied machine learning models for tasks such as basin-scale log prediction, advanced seismic processing, and Foundation Models. He frequently contributes to industry conferences and technical publications. His work spans data management, ML/AI applications in geoscience, and the integration of scalable data platforms to support exploration and energy solutions.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Control which domains your AI agents can access</title>
		<link>https://aws.amazon.com/blogs/machine-learning/control-which-domains-your-ai-agents-can-access/</link>
					
		
		<dc:creator><![CDATA[Kosti Vasilakakis]]></dc:creator>
		<pubDate>Thu, 02 Apr 2026 13:28:19 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<category><![CDATA[AI/ML]]></category>
		<category><![CDATA[Amazon Machine Learning]]></category>
		<category><![CDATA[Generative AI]]></category>
		<guid isPermaLink="false">63914b8f7a4b3b849864247648567acd69c82c9c</guid>

					<description>In this post, we show you how to configure AWS Network Firewall to restrict AgentCore resources to an allowlist of approved internet domains. This post focuses on domain-level filtering using SNI inspection — the first layer of a defense-in-depth approach.</description>
										<content:encoded>&lt;p&gt;AI agents that can browse the web open powerful possibilities—from research automation to real-time data gathering. However, giving an AI agent unrestricted internet access raises security and compliance concerns. What if the agent accesses unauthorized websites? What if sensitive data is exfiltrated to external domains?&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore&lt;/a&gt; provides managed tools that enable AI agents to interact with the web (Browser), execute code (Code Interpreter), and host agents (Runtime). When deployed in an Amazon Virtual Private Cloud (Amazon VPC), you can control tool network access using AWS Network Firewall to implement domain-based filtering. AWS Network Firewall also provides you with managed rules to help reduce access to botnets, known-malware domains, and other high-risk resources.&lt;/p&gt; 
&lt;p&gt;In this post, we show you how to configure &lt;a href="https://aws.amazon.com/network-firewall/" target="_blank" rel="noopener noreferrer"&gt;AWS Network Firewall&lt;/a&gt; to restrict AgentCore resources to an allowlist of approved internet domains. You can use this architecture to:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Permit access only to specified domains (for example, wikipedia.org, stackoverflow.com)&lt;/li&gt; 
 &lt;li&gt;Explicitly block certain categories (for example, social media sites) using rule templates&lt;/li&gt; 
 &lt;li&gt;Log the connection attempts for audit and compliance alignment&lt;/li&gt; 
 &lt;li&gt;Apply a default-deny policy for unspecified domains&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This post focuses on domain-level filtering using SNI inspection — the first layer of a defense-in-depth approach. For DNS-level filtering and content inspection techniques, see &lt;strong&gt;Going further&lt;/strong&gt; at the end of this post. For inbound access control (restricting who can invoke your agents), see &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/resource-based-policies.html" target="_blank" rel="noopener noreferrer"&gt;Resource-based policies for Amazon Bedrock AgentCore&lt;/a&gt;, which support conditions like &lt;code&gt;aws:SourceIp&lt;/code&gt;, &lt;code&gt;aws:SourceVpc&lt;/code&gt;, and &lt;code&gt;aws:SourceVpce&lt;/code&gt;. These controls are complementary layers in a defense-in-depth strategy.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Why this matters: Enterprise security requirements&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Customers deploying AI agents in regulated industries have consistent security requirements around network ingress and egress control:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Enterprise organizations with high security requirements&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;When customers in regulated industries conduct security reviews for AI agent deployments, they consistently ask about network isolation and egress control and expect detailed explanations of how agent traffic is controlled and audited. These customers want assurance that agent runtime endpoints remain private, and that additional security controls like web application firewall protections are available.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Multi-tenant SaaS providers&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Enterprise software as a service (SaaS) providers require DNS-level &lt;code&gt;allowlisting&lt;/code&gt; and &lt;code&gt;denylisting&lt;/code&gt; because their multi-tenant architectures need per-customer network policies. For example, Customer A might need to allow domains that Customer B blocks. Common requirements include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Execution-specific blocking (prevent access to certain domains during specific browser launches)&lt;/li&gt; 
 &lt;li&gt;Regional restrictions (block website categories in specific regions)&lt;/li&gt; 
 &lt;li&gt;Category-based rules (disable gambling or social media sites through pre-packaged rule sets)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Security vulnerability mitigation and compliance audit requirements&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Security teams evaluating AI agents have identified that agents can be tricked into navigating to unintended sites through prompt injection attacks. Custom URL &lt;code&gt;allowlists&lt;/code&gt; reduce the attack surface by restricting the browser to approved domains, regardless of what the agent is instructed to do. Domain-based egress filtering provides the logging and access control visibility that security teams often need for their security monitoring processes.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Solution overview&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The solution deploys AgentCore Browser in a private subnet with no direct internet access. Outbound traffic routes through AWS Network Firewall, which inspects TLS Server Name Indication (SNI) headers to determine the destination domain and apply filtering rules. You can also monitor the actions Network Firewall takes to restrict traffic through its native integration with Amazon CloudWatch metrics.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-126804 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ML-20452-image-1.png" alt="" width="1904" height="936"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1: AgentCore deployment with AWS Network Firewall and domain-based egress filtering&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The architecture includes:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Private subnet&lt;/strong&gt;: Hosts AgentCore Browser instances with no public IP addresses&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Public subnet&lt;/strong&gt;: Contains the NAT Gateway for outbound connectivity&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Firewall subnet&lt;/strong&gt;: Hosts the Network Firewall endpoint&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Four route tables&lt;/strong&gt;: Control traffic flow through the firewall for both outbound requests and return traffic&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Traffic flow&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;AgentCore Runtime executes the agent and invokes the AgentCore Browser tool&lt;/li&gt; 
 &lt;li&gt;AgentCore Browser initiates an HTTPS request from the private subnet&lt;/li&gt; 
 &lt;li&gt;The private subnet route table directs traffic to the NAT Gateway in the public subnet&lt;/li&gt; 
 &lt;li&gt;The NAT Gateway translates the private IP address and forwards the request to the Network Firewall endpoint&lt;/li&gt; 
 &lt;li&gt;Network Firewall inspects the TLS SNI header to identify the destination domain&lt;/li&gt; 
 &lt;li&gt;If the domain matches an &lt;code&gt;allowlist&lt;/code&gt; rule, the firewall forwards traffic to the Internet Gateway&lt;/li&gt; 
 &lt;li&gt;The Internet Gateway routes approved traffic to the external destination&lt;/li&gt; 
 &lt;li&gt;Return traffic follows the symmetric path back through the firewall to the agent&lt;/li&gt; 
&lt;/ol&gt; 
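&lt;p&gt;The hop sequence above can be modeled as a simple next-hop table. The following sketch is purely illustrative (the names are placeholders, not AWS resource identifiers); it shows how each route table hands internet-bound traffic to the next hop until it reaches the internet gateway:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;# Illustrative model of the four route tables (hypothetical names).
# Each entry maps a hop to where its default route (0.0.0.0/0) points.
ROUTE_TABLES = {
    "private-subnet": "nat-gateway",         # private subnet routes out via NAT
    "nat-gateway": "firewall-endpoint",      # public subnet routes to the firewall
    "firewall-endpoint": "internet-gateway", # firewall subnet routes to the IGW
    "igw-ingress": "firewall-endpoint",      # return traffic re-enters the firewall
}

def outbound_path(source):
    """Follow default routes until traffic leaves through the internet gateway."""
    path = [source]
    hop = source
    while hop in ROUTE_TABLES:
        hop = ROUTE_TABLES[hop]
        path.append(hop)
    return path

print(outbound_path("private-subnet"))
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Return traffic takes the symmetric path because the IGW ingress route table points back at the firewall endpoint; without that entry, responses would bypass inspection.&lt;/p&gt;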
&lt;p&gt;This architecture helps make sure that the browser traffic is inspected and filtered, regardless of the destination.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; SNI-based filtering helps control which domains agents connect to at the TLS layer. For DNS-level control, including controls to help prevent DNS tunneling and exfiltration, pair this with Amazon Route 53 Resolver DNS Firewall. DNS Firewall helps address a limitation of SNI inspection: an agent could potentially resolve a blocked domain through DNS and connect by IP address directly.&lt;/p&gt; 
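&lt;p&gt;As an illustration of that pairing, the following sketch creates a DNS Firewall domain list and a BLOCK rule, then associates the rule group with the VPC. The names, IDs, and example domain are placeholders; substitute the identifiers returned by each call:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Create a domain list of destinations to block at the DNS layer
aws route53resolver create-firewall-domain-list \
  --name agentcore-blocked-domains \
  --region us-east-2

# Add domains to the list (use the ID returned above)
aws route53resolver update-firewall-domains \
  --firewall-domain-list-id rslvr-fdl-XXXXXXXXX \
  --operation ADD \
  --domains "example-blocked-domain.com" \
  --region us-east-2

# Create a rule group and a BLOCK rule that returns NXDOMAIN
aws route53resolver create-firewall-rule-group \
  --name agentcore-dns-firewall \
  --region us-east-2

aws route53resolver create-firewall-rule \
  --firewall-rule-group-id rslvr-frg-XXXXXXXXX \
  --firewall-domain-list-id rslvr-fdl-XXXXXXXXX \
  --priority 100 \
  --action BLOCK \
  --block-response NXDOMAIN \
  --name block-disallowed-domains \
  --region us-east-2

# Associate the rule group with the VPC hosting your AgentCore resources
aws route53resolver associate-firewall-rule-group \
  --firewall-rule-group-id rslvr-frg-XXXXXXXXX \
  --vpc-id vpc-XXXXXXXXX \
  --priority 101 \
  --name agentcore-dns-firewall-assoc \
  --region us-east-2
&lt;/code&gt;&lt;/pre&gt;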
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Before you begin, make sure that you have:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with permissions to create VPC resources, Network Firewall, and IAM roles&lt;/li&gt; 
 &lt;li&gt;AWS Command Line Interface (AWS CLI) version 2.x configured with appropriate credentials&lt;/li&gt; 
 &lt;li&gt;Access to Amazon Bedrock AgentCore&lt;/li&gt; 
 &lt;li&gt;Basic familiarity with VPC networking concepts&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Walkthrough&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;For the complete step-by-step VPC and Network Firewall setup, see the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agentcore-vpc.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore VPC configuration documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;This section highlights the AgentCore Browser-specific configuration.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Step 1: Deploy resources using the CloudFormation template&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Launch the &lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples/blob/main/01-tutorials/05-AgentCore-tools/02-Agent-Core-browser-tool/09-browser-with-domain-filtering/agentcore-browser-firewall.yaml" target="_blank" rel="noopener noreferrer"&gt;CloudFormation template&lt;/a&gt; from the repository. You can keep the stack default values. However, make sure to add a stack name (for example, “&lt;strong&gt;agentcore-egress&lt;/strong&gt;”) to the “Stack name” field, choose an Availability Zone from the “Availability Zone” menu, and include a valid existing bucket name in the “&lt;strong&gt;BucketConfigForOutput&lt;/strong&gt;” parameter. Wait for the stack creation to complete, which typically takes 10 minutes. Continue with the following steps after the stack status changes to &lt;strong&gt;CREATE_COMPLETE&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Step 2: Review the IAM execution role&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;AgentCore Browser requires an IAM role whose trust policy allows the &lt;code&gt;bedrock-agentcore.amazonaws.com&lt;/code&gt; service principal to assume it:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock-agentcore.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Step 3: Configure the Network Firewall allowlist &lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Create a stateful rule group with your approved domains. Note the leading dot (.) to match subdomains:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;cat &amp;gt; allowlist-rules.json &amp;lt;&amp;lt; 'EOF'
{
  "RulesSource": {
    "RulesSourceList": {
      "Targets": [
        ".wikipedia.org",
        ".stackoverflow.com",
        ".docs.aws.amazon.com",
        ".amazonaws.com",
        ".pypi.org",
        ".pythonhosted.org"
      ],
      "TargetTypes": ["HTTP_HOST", "TLS_SNI"],
      "GeneratedRulesType": "ALLOWLIST"
    }
  },
  "StatefulRuleOptions": {
    "RuleOrder": "STRICT_ORDER"
  }
}
EOF

aws network-firewall create-rule-group \
  --rule-group-name browser-allowed-domains \
  --type STATEFUL \
  --capacity 100 \
  --rule-group file://allowlist-rules.json \
  --region us-east-2
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Include &lt;code&gt;.amazonaws.com&lt;/code&gt; in your &lt;code&gt;allowlist&lt;/code&gt; if the browser requires AWS service access, or use VPC Endpoints as an alternative.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Security consideration: &lt;/strong&gt;The .amazonaws.com domain is a broad &lt;code&gt;allowlist&lt;/code&gt; that permits access to hosted endpoints on AWS, including public Amazon Simple Storage Service (Amazon S3) buckets, Amazon API Gateway endpoints, and AWS Lambda function URLs. For tighter control, use VPC Endpoints for AWS service access and &lt;code&gt;allowlist&lt;/code&gt; only the specific external domains your agents need.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For Code Interpreter:&lt;/strong&gt; Consider adding “.pypi.org” and “.pythonhosted.org” if you need to install packages with pip. Most common packages are pre-installed, so these domains may be optional for your use case.&lt;/p&gt; 
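&lt;p&gt;The leading-dot convention works like a suffix match on the domain name. The following standalone sketch approximates the matching semantics (it is not Network Firewall’s actual engine) to show why &lt;code&gt;.wikipedia.org&lt;/code&gt; covers both the bare domain and its subdomains, but not lookalike domains:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;# Approximate allowlist semantics: a target with a leading dot matches
# the bare domain and any subdomain, but not lookalike suffixes.
ALLOWLIST = [".wikipedia.org", ".stackoverflow.com", ".amazonaws.com"]

def sni_allowed(hostname):
    for target in ALLOWLIST:
        bare = target.lstrip(".")
        if hostname == bare or hostname.endswith(target):
            return True
    return False

print(sni_allowed("en.wikipedia.org"))   # subdomain: allowed
print(sni_allowed("wikipedia.org"))      # bare domain: allowed
print(sni_allowed("evilwikipedia.org"))  # no dot boundary: blocked
&lt;/code&gt;&lt;/pre&gt;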
&lt;p&gt;&lt;strong&gt;Step 4: Configure the firewall policy &lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The firewall policy must use &lt;code&gt;aws:drop_established&lt;/code&gt; as the default action. This allows TCP handshakes to complete (required for TLS SNI inspection) while dropping connections to non-allowed domains:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;cat &amp;gt; firewall-policy.json &amp;lt;&amp;lt; 'EOF'
{
  "StatelessDefaultActions": ["aws:forward_to_sfe"],
  "StatelessFragmentDefaultActions": ["aws:forward_to_sfe"],
  "StatefulRuleGroupReferences": [
    {
      "ResourceArn": "arn:aws:network-firewall:us-east-2:ACCOUNT_ID:stateful-rulegroup/browser-allowed-domains",
      "Priority": 1
    }
  ],
  "StatefulEngineOptions": {
    "RuleOrder": "STRICT_ORDER"
  },
  "StatefulDefaultActions": ["aws:drop_established"]
}
EOF&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Do not use &lt;code&gt;aws:drop_strict&lt;/code&gt;&lt;/strong&gt; because it blocks TCP SYN packets before the TLS handshake, preventing SNI inspection.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Step 5: Create the security group&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Create a security group that allows outbound traffic. The Network Firewall handles domain filtering, so the security group can permit all egress:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Create security group
aws ec2 create-security-group \
  --group-name agentcore-egress-sg \
  --description "AgentCore tools - egress only, filtered by Network Firewall" \
  --vpc-id vpc-XXXXXXXXX \
  --region us-east-2

# Allow all outbound traffic (Network Firewall handles filtering).
# Note: a newly created security group already includes an allow-all
# egress rule, so this call may report a duplicate rule.
aws ec2 authorize-security-group-egress \
  --group-id sg-XXXXXXXXX \
  --protocol -1 \
  --port -1 \
  --cidr 0.0.0.0/0 \
  --region us-east-2

# Remove inbound rules if present (AgentCore tools don't need inbound).
# A newly created security group has no inbound rules, so this call may
# report that the rule does not exist.
aws ec2 revoke-security-group-ingress \
  --group-id sg-XXXXXXXXX \
  --protocol -1 \
  --port -1 \
  --cidr 0.0.0.0/0 \
  --region us-east-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Step 6: Create the AgentCore Browser &lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Create the browser with VPC configuration pointing to your private subnet:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws bedrock-agentcore-control create-browser \
  --name my_secure_browser \
  --execution-role-arn arn:aws:iam::ACCOUNT_ID:role/AgentCoreBrowserExecutionRole \
  --network-configuration '{
    "networkMode": "VPC",
    "vpcConfig": {
      "securityGroups": ["sg-XXXXXXXXX"],
      "subnets": ["subnet-XXXXXXXXX"]
    }
  }' \
  --region us-east-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Step 6b: Create AgentCore Code Interpreter (Optional) &lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You can also deploy AgentCore Code Interpreter in the same VPC with the same firewall protection:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws bedrock-agentcore-control create-code-interpreter \
  --name my_secure_code_interpreter \
  --network-configuration '{
    "networkMode": "VPC",
    "vpcConfig": {
      "securityGroups": ["sg-XXXXXXXXX"],
      "subnets": ["subnet-XXXXXXXXX"]
    }
  }' \
  --region us-east-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;AgentCore Code Interpreter uses the same network path as Browser. If you need pip to install packages, make sure &lt;code&gt;.pypi.org&lt;/code&gt; and &lt;code&gt;.pythonhosted.org&lt;/code&gt; are in your &lt;code&gt;allowlist&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Step 6c: Deploy agent on AgentCore Runtime (Optional) &lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;For container-based agent deployments, use the same VPC configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws bedrock-agentcore-control create-agent-runtime \
  --agent-runtime-name my_vpc_agent \
  --role-arn arn:aws:iam::ACCOUNT_ID:role/AgentCoreRuntimeRole \
  --agent-runtime-artifact '{
    "containerConfiguration": {
      "containerUri": "ACCOUNT_ID.dkr.ecr.us-east-2.amazonaws.com/my-agent:latest"
    }
  }' \
  --network-configuration '{
    "networkMode": "VPC",
    "networkModeConfig": {
      "securityGroups": ["sg-XXXXXXXXX"],
      "subnets": ["subnet-XXXXXXXXX"]
    }
  }' \
  --protocol-configuration '{"serverProtocol": "HTTP"}' \
  --region us-east-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;AgentCore Runtime domain requirements depend on your model provider. Include &lt;code&gt;.amazonaws.com&lt;/code&gt; for Amazon Bedrock model API calls, or add the appropriate domains for other model providers your agent uses. Additionally, allow custom domains that your agent must access.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Step 7: Test the configuration&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Start a browser session and verify that the firewall rules work correctly:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Start browser session
aws bedrock-agentcore start-browser-session \
  --browser-identifier my_secure_browser-ABC123xyz \
  --region us-east-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Use the returned WebSocket URL with a browser automation tool like Playwright to test both allowed and blocked domains:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-python"&gt;# test_firewall_rules.py

from playwright.sync_api import sync_playwright
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

WEBSOCKET_URL = "wss://your-session-url"  # From start-browser-session response
REGION = "us-east-2"

# Sign the WebSocket URL with SigV4
session = boto3.Session(region_name=REGION)
credentials = session.get_credentials().get_frozen_credentials()
request = AWSRequest(method="GET", url=WEBSOCKET_URL.replace("wss://", "https://"))
SigV4Auth(credentials, "bedrock-agentcore", REGION).add_auth(request)
headers = dict(request.headers)

def test_domain(page, url, expected_success):
    try:
        response = page.goto(url, timeout=10000)
        success = response and response.status &amp;lt; 400
        status = "PASS" if success == expected_success else "FAIL"
        print(f"{status}: {url} - {'loaded' if success else 'blocked'}")
        return success == expected_success
    except Exception as e:
        success = False
        status = "PASS" if not expected_success else "FAIL"
        print(f"{status}: {url} - blocked ({type(e).__name__})")
        return not expected_success

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(WEBSOCKET_URL, headers=headers)
    page = browser.new_page()

    # Test allowed domains (should load)
    test_domain(page, "https://wikipedia.org", expected_success=True)
    test_domain(page, "https://docs.aws.amazon.com", expected_success=True)

    # Test blocked domains (should timeout/fail)
    test_domain(page, "https://example.com", expected_success=False)
    test_domain(page, "https://twitter.com", expected_success=False)

    browser.close()&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Expected results:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Allowed domains (.wikipedia.org, .amazonaws.com) should load successfully.&lt;/li&gt; 
 &lt;li&gt;Blocked domains should time out after the TCP handshake or return connection errors.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Some allowed domains like docs.aws.amazon.com depend on CDN resources from domains such as awsstatic.com and cloudfront.net. If pages on allowed domains fail to render fully, add the required CDN domains to your &lt;code&gt;allowlist&lt;/code&gt;.&lt;/p&gt; 
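&lt;p&gt;One way to add those CDN domains (for example, &lt;code&gt;.awsstatic.com&lt;/code&gt; and &lt;code&gt;.cloudfront.net&lt;/code&gt;) is to append them to the &lt;code&gt;Targets&lt;/code&gt; array in the &lt;code&gt;allowlist-rules.json&lt;/code&gt; file from Step 3 and re-apply the rule group. The &lt;code&gt;update-rule-group&lt;/code&gt; call requires the rule group’s current update token:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Fetch the current update token for the rule group
aws network-firewall describe-rule-group \
  --rule-group-name browser-allowed-domains \
  --type STATEFUL \
  --region us-east-2 \
  --query UpdateToken

# Re-apply the edited rule group (substitute the token from above)
aws network-firewall update-rule-group \
  --rule-group-name browser-allowed-domains \
  --type STATEFUL \
  --update-token TOKEN_FROM_ABOVE \
  --rule-group file://allowlist-rules.json \
  --region us-east-2
&lt;/code&gt;&lt;/pre&gt;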
&lt;p&gt;You can also check the firewall logs in CloudWatch for blocked connection attempts:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# View recent alert logs (blocked connections)
aws logs filter-log-events \
  --log-group-name "/aws/network-firewall/agentcore-egress/alerts" \
  --filter-pattern '{ $.event.alert.action = "blocked" }' \
  --region us-east-2 \
  --start-time $(($(date +%s) - 300))000

# Verify firewall sync status before testing
aws network-firewall describe-firewall \
  --firewall-name agentcore-egress-firewall \
  --region us-east-2 \
  --query 'FirewallStatus.ConfigurationSyncStateSummary'&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Troubleshooting:&lt;/strong&gt; If allowed domains are blocked, verify:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Firewall sync status shows IN_SYNC (rule changes take a few minutes)&lt;/li&gt; 
 &lt;li&gt;Domain entries include the leading dot (&lt;code&gt;.wikipedia.org&lt;/code&gt;, not &lt;code&gt;wikipedia.org&lt;/code&gt;)&lt;/li&gt; 
 &lt;li&gt;Route tables are configured correctly for symmetric routing&lt;/li&gt; 
 &lt;li&gt;If you receive HTTP 403 errors on allowed domains, this is typically bot detection by the destination site, not a firewall block. Check CloudWatch ALERT logs to confirm—blocked connections will have explicit alert entries.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;&lt;strong&gt;Best practices&lt;/strong&gt;&lt;/h2&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Use STRICT_ORDER evaluation&lt;/strong&gt;: This facilitates predictable rule processing when combining &lt;code&gt;allowlists&lt;/code&gt; and &lt;code&gt;denylists&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Include .amazonaws.com for AWS service access&lt;/strong&gt;: Or use VPC Endpoints to avoid routing AWS API calls through the internet.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configure the IGW ingress route table&lt;/strong&gt;: This is critical for symmetric routing. Without it, return traffic bypasses the firewall.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Enable both ALERT and FLOW logs&lt;/strong&gt;: ALERT logs capture blocked connections; FLOW logs provide connection metadata for the traffic.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Wait for firewall sync&lt;/strong&gt;: Rule changes take a few minutes to propagate. Verify ConfigurationSyncStateSummary: IN_SYNC before testing.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configure HOME_NET for multi-VPC architectures&lt;/strong&gt;: By default, Network Firewall domain inspection only filters traffic originating from the deployment VPC’s Classless Inter-Domain Routing (CIDR) range. If you use a centralized firewall with AWS Transit Gateway to inspect traffic from multiple VPCs, you must configure the HOME_NET variable in your rule group to include the source CIDR ranges. Without this, traffic from other VPCs can bypass domain filtering.&lt;/li&gt; 
&lt;/ul&gt; 
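&lt;p&gt;For the &lt;code&gt;HOME_NET&lt;/code&gt; configuration above, set the variable in the rule group’s &lt;code&gt;RuleVariables&lt;/code&gt; section. The following fragment extends the Step 3 rule group definition (with the &lt;code&gt;Targets&lt;/code&gt; list abbreviated); the CIDR ranges are placeholders for your actual source VPC ranges:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "RuleVariables": {
    "IPSets": {
      "HOME_NET": {
        "Definition": [
          "10.0.0.0/16",
          "10.1.0.0/16"
        ]
      }
    }
  },
  "RulesSource": {
    "RulesSourceList": {
      "Targets": [
        ".wikipedia.org"
      ],
      "TargetTypes": ["HTTP_HOST", "TLS_SNI"],
      "GeneratedRulesType": "ALLOWLIST"
    }
  },
  "StatefulRuleOptions": {
    "RuleOrder": "STRICT_ORDER"
  }
}
&lt;/code&gt;&lt;/pre&gt;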
&lt;h2&gt;&lt;strong&gt;Limitations and cost considerations&lt;/strong&gt;&lt;/h2&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Content inspection requires TLS inspection&lt;/strong&gt;: By default, domain filtering operates on unencrypted TLS metadata (SNI headers) and can’t inspect encrypted request or response bodies. To inspect HTTPS content, enable TLS inspection on your Network Firewall and add Suricata rules that match on HTTP body content.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SNI/Host header bypass risk&lt;/strong&gt;: Network Firewall uses TLS SNI headers and HTTP Host headers—not IP addresses—to determine destination domains. If these headers are manipulated, traffic could bypass domain filtering. For high-security deployments, combine domain rules with IP-based rules for critical blocked destinations, and consider pairing SNI-based rules with Route 53 Resolver DNS Firewall to help prevent agents from resolving blocked domains through DNS and connecting by IP address directly.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;HOME_NET scope in multi-VPC deployments&lt;/strong&gt;: By default, Network Firewall domain inspection only applies to traffic originating from the deployment VPC’s CIDR range. If you use a centralized firewall with AWS Transit Gateway (multiple VPCs routing through a shared firewall), you must configure the HOME_NET variable in your rule group to include the source CIDR ranges. Without this, traffic from spoke VPCs bypasses domain inspection. See &lt;a href="https://docs.aws.amazon.com/network-firewall/latest/developerguide/stateful-rule-groups-domain-names.html" target="_blank" rel="noopener noreferrer"&gt;Stateful domain list rule groups&lt;/a&gt; for details.&lt;/li&gt; 
 &lt;li&gt;Costs will vary based on your usage. See &lt;a href="https://aws.amazon.com/vpc/pricing/" target="_blank" rel="noopener noreferrer"&gt;NAT Gateway pricing&lt;/a&gt; and &lt;a href="https://aws.amazon.com/network-firewall/pricing/" target="_blank" rel="noopener noreferrer"&gt;Network Firewall pricing&lt;/a&gt; for current rates.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Clean up&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Delete resources in this order to avoid ongoing charges:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Delete the AgentCore Browser&lt;/li&gt; 
 &lt;li&gt;Delete the Network Firewall (disable protection settings first)&lt;/li&gt; 
 &lt;li&gt;Delete the NAT Gateway&lt;/li&gt; 
 &lt;li&gt;Release the Elastic IP address&lt;/li&gt; 
 &lt;li&gt;Delete the subnets and route tables&lt;/li&gt; 
 &lt;li&gt;Detach and delete the Internet Gateway&lt;/li&gt; 
 &lt;li&gt;Delete the VPC&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; AgentCore Browser and Code Interpreter create elastic network interfaces in your VPC. After deleting these resources, wait a few minutes for the network interface to release before deleting the security group, subnet, or VPC. If deletion fails, check for lingering network interfaces in the subnet and wait for them to detach.&lt;/p&gt; 
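&lt;p&gt;The wait-for-release step in the note above can be sketched as a small poll loop. The &lt;code&gt;list_enis&lt;/code&gt; callable is a placeholder for an EC2 DescribeNetworkInterfaces query filtered by subnet ID; injecting it as a function keeps the retry logic testable without AWS credentials.&lt;/p&gt;

```python
import time

def wait_for_enis_to_release(list_enis, max_polls=40, poll_s=15):
    """Poll until no ENIs remain, then return True; False on timeout.

    `list_enis` is any callable returning the current ENI list, e.g. a
    wrapper around EC2 DescribeNetworkInterfaces filtered by subnet-id
    (placeholder; wire up boto3 in real use).
    """
    for _ in range(max_polls):
        if not list_enis():
            return True  # safe to delete the security group/subnet/VPC
        time.sleep(poll_s)
    return False

# Example with a stand-in lister that "releases" after two polls:
remaining = [["eni-a", "eni-b"], ["eni-a"], []]
assert wait_for_enis_to_release(lambda: remaining.pop(0), poll_s=0)
```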
&lt;h2&gt;&lt;strong&gt;Related resources&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;For more information, see the following resources.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agentcore-vpc.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore VPC configuration&lt;/a&gt; – VPC networking setup for AgentCore tools&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/deployment-models-for-aws-network-firewall/" target="_blank" rel="noopener noreferrer"&gt;Deployment models for AWS Network Firewall&lt;/a&gt; – Architecture patterns for centralized and distributed firewall deployments&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore documentation&lt;/a&gt; – Browser, Code Interpreter, and Agent Runtime configuration&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/network-firewall/latest/developerguide/stateful-rule-groups-domain-names.html" target="_blank" rel="noopener noreferrer"&gt;AWS Network Firewall rule groups&lt;/a&gt; – Domain list rule configuration reference&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Going further&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Domain filtering through SNI inspection is one layer of egress security. Depending on your requirements, consider these additional mitigations:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Technique&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Helps in scenarios where&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Reference&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Route 53 DNS Firewall&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Helps block or allow DNS queries by domain and prevent DNS tunneling and exfiltration.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;You need DNS-level filtering or protection against DNS-based data exfiltration.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;a href="https://aws.amazon.com/blogs/security/protect-against-advanced-dns-threats-with-amazon-route-53-resolver-dns-firewall/" target="_blank" rel="noopener noreferrer"&gt;Protect against advanced DNS threats&lt;/a&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;TLS inspection + Suricata DLP&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Decrypt HTTPS, inspect request/response bodies with Suricata rules, help block sensitive data patterns (PII, credentials).&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;You need data loss prevention (DLP) for agent-generated traffic.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;a href="https://aws.amazon.com/blogs/security/tls-inspection-configuration-for-encrypted-egress-traffic-and-aws-network-firewall/" target="_blank" rel="noopener noreferrer"&gt;TLS inspection for encrypted egress traffic&lt;/a&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Centralized inspection architecture&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Route traffic from multiple VPCs through a shared inspection VPC with Network Firewall.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;You run multiple AgentCore deployments and want centralized policy enforcement.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/deploy-centralized-traffic-filtering-using-aws-network-firewall/" target="_blank" rel="noopener noreferrer"&gt;Deploy centralized traffic filtering&lt;/a&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;When using TLS inspection, configure custom certificates on your AgentCore resources to trust the Network Firewall’s re-signing CA.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;By combining Amazon Bedrock AgentCore tools with AWS Network Firewall, you can give AI agents controlled web access while maintaining security and compliance alignment. The domain-based filtering approach helps you define precisely which websites agents can access, block unwanted destinations, and log the connection attempts for audit purposes. This architecture addresses the security concerns raised by enterprise customers:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Financial services industry (FSI) compliance&lt;/strong&gt;: Provides the network isolation and audit logging required for CISO-level security reviews.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Multi-tenant control&lt;/strong&gt;: Enables per-customer or per-execution domain policies for SaaS providers.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Prompt injection defense&lt;/strong&gt;: Restricts agent navigation to approved domains, helping reduce the attack surface for prompt injection.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Audit evidence&lt;/strong&gt;: Generates CloudWatch logs that support compliance audit requirements.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For enterprises deploying AI agents that need internet access for research, data gathering, or API integrations, this pattern provides a production-ready approach to maintaining strict control over where that access leads. Rather than maintaining custom Squid proxies or complex network infrastructure, you can use AWS managed services to implement enterprise-grade egress filtering in hours, not weeks.&lt;/p&gt; 
&lt;p&gt;For more information about AgentCore Browser, see the &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/browser-tool.html" target="_blank" rel="noopener noreferrer"&gt;AgentCore Browser documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126803" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ML-20452-image-2.png" alt="" width="150" height="187"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kosti Vasilakakis&lt;/h3&gt; 
  &lt;p&gt;Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126801" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ML-20452-image-4.png" alt="" width="150" height="199"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Evandro Franco&lt;/h3&gt; 
  &lt;p&gt;Evandro Franco is a Sr. Data Scientist working on Amazon Web Services. He is part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on top of AWS, mainly on Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience working with technology, from software development, infrastructure, serverless, to machine learning. In his free time, Evandro enjoys playing with his son, mainly building some funny Lego bricks.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-126802" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ML-20452-image-3.png" alt="" width="148" height="199"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kevin Orellana&lt;/h3&gt; 
  &lt;p&gt;Kevin Orellana is a Software Development Engineer at Amazon Web Services on the Bedrock AgentCore team, based in Seattle. He builds and operates core infrastructure powering agentic AI capabilities, including Browser, Code Interpreter, and Runtime. Earlier in his career, Kevin worked on the Bedrock inference team hosting frontier models. In his free time, he enjoys hiking with his Goldendoodle, experimenting with multi-agent simulations, and working toward building a personal AI assistant that speaks English, Spanish, and Mandarin.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="alignnone size-full wp-image-126800" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/23/ML-20452-image-5.png" alt="" width="150" height="199"&gt;
   &lt;/div&gt; 
   &lt;h3&gt;Yan Marim&lt;/h3&gt; 
   &lt;p&gt;Yan Marim is a Sr. GenAI Specialist Solutions Architect at Amazon Web Services, based in Brazil. As part of the LATAM Specialist team, he guides customers through their generative AI adoption journey, focusing on Amazon Bedrock and agentic AI solutions. In his free time, Yan enjoys spending quality time with his wife and dog, and watching soccer games.&lt;/p&gt; 
  &lt;/div&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Rocket Close transforms mortgage document processing with Amazon Bedrock and Amazon Textract</title>
		<link>https://aws.amazon.com/blogs/machine-learning/rocket-close-transforms-mortgage-document-processing-with-amazon-bedrock-and-amazon-textract/</link>
					
		
		<dc:creator><![CDATA[Jeremy Little, Chris Day]]></dc:creator>
		<pubDate>Thu, 02 Apr 2026 12:59:31 +0000</pubDate>
				<category><![CDATA[Amazon Bedrock]]></category>
		<category><![CDATA[Amazon Textract]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Financial Services]]></category>
		<category><![CDATA[Generative AI]]></category>
		<guid isPermaLink="false">26dce02ed0faf5612b7162b5bbd73bfc1d59687c</guid>

					<description>Through a strategic partnership with the AWS Generative AI Innovation Center (GenAIIC), Rocket Close developed an intelligent document processing solution that has significantly reduced processing time, making the process 15 times faster. The solution, which uses Amazon Textract for OCR processing and Amazon Bedrock for foundation models (FMs), achieves a strong 90% overall accuracy in document segmentation, classification, and field extraction.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;This post is cowritten by Jeremy Little and Chris Day from Rocket Close.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://www.rocketclose.com/" target="_blank" rel="noopener noreferrer"&gt;Rocket Close&lt;/a&gt;, a Detroit-based title and appraisal management company within the Rocket Companies environment, has enhanced mortgage document processing by transforming a time-consuming manual process into an efficient automated solution. Processing approximately 2,000 abstract package files daily, with each file averaging 75 pages, the company faced a major operational challenge: manual extraction took on average 10 hours per package, creating considerable resource allocation burdens and workflow bottlenecks.&lt;/p&gt; 
&lt;p&gt;Through a strategic partnership with the AWS Generative AI Innovation Center (GenAIIC), Rocket Close developed an intelligent document processing solution that has significantly reduced processing time, making the process 15 times faster. The solution, which uses &lt;a href="https://aws.amazon.com/textract/" target="_blank" rel="noopener noreferrer"&gt;Amazon Textract&lt;/a&gt; for OCR processing and &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; for foundation models (FMs), achieves a strong 90% overall accuracy in document segmentation, classification, and field extraction. Amazon Bedrock is a fully managed service that provides a serverless and more secure way to build and scale generative AI applications. It offers a single API to access a choice of leading FMs from various AI companies. Designed to scale to over 500,000 documents annually, this transformation positions Rocket Close at the forefront of technological innovation in the mortgage industry, supporting faster customer service and sustainable business growth.&lt;/p&gt; 
&lt;p&gt;This post explores how this solution was developed and implemented, demonstrating how generative AI can transform document-intensive processes in the mortgage industry.&lt;/p&gt; 
&lt;h2&gt;Challenges of manual processing at scale&lt;/h2&gt; 
&lt;p&gt;Rocket Close processes a high volume of complex documentation as part of its title and appraisal management services. Rocket Close is dedicated to helping clients realize their dream of homeownership and financial freedom by making complex processes simpler through technology-driven solutions. By analyzing a wide range of data points, Rocket Close can quickly and accurately assess the risk associated with a loan, enabling more informed lending decisions and getting clients the financing they need. Despite these strengths, Rocket Close faced a critical bottleneck that threatened their growth and profitability:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Volume overload&lt;/strong&gt; – 2,000 abstract packages daily, each averaging 75 pages&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Time-intensive workflow&lt;/strong&gt; – 10 hours per package due to recent volume spikes, with an estimated 30 minutes of actual manual processing effort per package&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Financial impact&lt;/strong&gt; – Considerable costs per file, with complex cases resulting in even higher expenses, totaling millions in annual processing costs&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Scalability limits&lt;/strong&gt; – Manual processes couldn’t keep pace with growing demand&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Quality concerns&lt;/strong&gt; – Human error and inconsistencies in data extraction&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With approximately 1,000 hours of manual processing effort required daily, Rocket Close needed a solution that could maintain accuracy while dramatically reducing processing time.&lt;/p&gt; 
&lt;h2&gt;Understanding abstract document packages&lt;/h2&gt; 
&lt;p&gt;Abstract document packages are comprehensive collections of legal documents related to property ownership and transactions. These packages typically contain 50–100 pages of various document types bundled together, often with inconsistent formatting, varying quality, and complex structures. Each package requires thorough examination to extract critical information about property ownership, liens, mortgages, and legal status. The packages present unique challenges for automated processing due to their heterogeneous nature. Documents within a single package might include typed text, varied layouts, handwritten notes, tables, forms, signatures, and stamps. Additionally, the ordering and presence of specific documents can vary significantly between packages, requiring sophisticated document segmentation and classification capabilities.&lt;/p&gt; 
&lt;p&gt;The solution handles over 60 different document classes that fall into several major categories:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Mortgage documents&lt;/strong&gt; – These include primary mortgage instruments such as mortgage agreements, deeds of trust, and security instruments. These documents establish the terms of loans secured by real property and contain critical information about loan amounts, interest rates, and repayment terms.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Chain of title documents&lt;/strong&gt; – This category encompasses various deed types (warranty deed, quitclaim deed, special warranty deed) that document the historical transfers of property ownership. These documents establish the legal chain of title and are essential for verifying clean ownership.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Judgment documents&lt;/strong&gt; – These include civil judgments, abstracts of judgment, and various notices of lien that might affect property ownership. These documents record legal claims against property owners that might impact title status.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Tax documents&lt;/strong&gt; – This category includes tax-related filings such as notice of federal tax lien and notice of state tax lien that represent potential claims against the property for unpaid taxes.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Legal documents&lt;/strong&gt; – These encompass various legal filings, including pending lawsuits, complaints for foreclosure, affidavits of heirship, and other court documents that might affect property ownership status.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution architecture&lt;/h2&gt; 
&lt;p&gt;The AWS GenAIIC and Rocket Close teams collaboratively developed a solution that uses generative AI capabilities to automate the abstract package processing workflow. The following diagram shows the overall solution pipeline of the two-stage process using Amazon Textract for OCR processing and Amazon Bedrock for intelligent information extraction.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-125741 size-full" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/ML-19763-image-1.jpeg" alt="" width="1422" height="862"&gt;&lt;/p&gt; 
&lt;p&gt;The first stage of the pipeline uses Amazon Textract to convert document images into machine-readable text. The system processes PDF documents through advanced OCR features that detect layout, tables, forms, and signatures while preserving the document’s structural hierarchy. The extracted content is then converted to markdown format, maintaining both human readability and machine processability, and stored in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3) and locally for further processing.&lt;/p&gt; 
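&lt;p&gt;The first stage can be pictured as a flattening step from Textract output blocks to markdown. This is a simplified sketch: real pipelines use the richer Layout and Tables features of AnalyzeDocument, and the block shapes and text below are illustrative stand-ins rather than actual pipeline code.&lt;/p&gt;

```python
# Simplified stage-1 sketch: turn Textract-style blocks into markdown that
# preserves a hint of document structure for the downstream LLM stage.

def blocks_to_markdown(blocks):
    """Map section headers to '##' headings and lines to plain text."""
    lines = []
    for block in blocks:
        if block["BlockType"] == "LAYOUT_SECTION_HEADER":
            lines.append("## " + block["Text"])
        elif block["BlockType"] == "LINE":
            lines.append(block["Text"])
    return "\n".join(lines)

# Illustrative blocks (not real Textract output):
blocks = [
    {"BlockType": "LAYOUT_SECTION_HEADER", "Text": "Warranty Deed"},
    {"BlockType": "LINE", "Text": "This deed, made on this day, conveys"},
]
markdown = blocks_to_markdown(blocks)
# markdown starts with the "## Warranty Deed" heading
```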
&lt;p&gt;The second stage uses Amazon Bedrock FMs to perform comprehensive document analysis and data extraction. The system first classifies and segments documents by analyzing their content and creating a table of contents, using domain-specific knowledge resources. Then, based on the document type, it extracts relevant data fields using specialized prompts combined with domain knowledge. The extracted information is converted into standardized JSON format for seamless integration with other systems.&lt;/p&gt; 
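&lt;p&gt;The second stage described above can be sketched as two pure functions: one that combines the document markdown with field definitions into an extraction prompt, and one that normalizes the model’s reply into standardized JSON. The field names and prompt wording here are invented for illustration; the actual prompts, data dictionaries, and Amazon Bedrock invocation are not shown in this post.&lt;/p&gt;

```python
import json

# Hypothetical field dictionary; the real system maps fields per document class.
FIELD_DEFINITIONS = {
    "borrower_name": "Full legal name of the borrower",
    "loan_amount": "Principal amount of the loan, in USD",
}

def build_extraction_prompt(doc_markdown, fields=FIELD_DEFINITIONS):
    """Combine field definitions with document content into one prompt."""
    field_lines = [f"- {name}: {desc}" for name, desc in fields.items()]
    return (
        "Extract the following fields from the document and reply with "
        "JSON only.\nFields:\n" + "\n".join(field_lines) +
        "\n\nDocument:\n" + doc_markdown
    )

def parse_model_reply(reply, fields=FIELD_DEFINITIONS):
    """Return {field: value or None}, tolerating extra keys in the reply."""
    data = json.loads(reply)
    return {name: data.get(name) for name in fields}

prompt = build_extraction_prompt("## Mortgage\nBorrower: Jane Doe")
reply = '{"borrower_name": "Jane Doe", "loan_amount": 250000, "extra": 1}'
record = parse_model_reply(reply)
# record == {"borrower_name": "Jane Doe", "loan_amount": 250000}
```

&lt;p&gt;Keeping the output keyed strictly to the field dictionary is what makes the result “standardized JSON” that downstream systems can consume regardless of how the model phrased its reply.&lt;/p&gt;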
&lt;p&gt;The solution’s effectiveness relies on several innovative technical approaches:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Advanced prompt engineering&lt;/strong&gt; – The team developed specialized prompts that strategically guide the behavior of the large language model (LLM) for different document processing tasks. Document analysis prompts combine content with classification guidelines to facilitate accurate document segmentation, and information extraction prompts incorporate field definitions and domain knowledge to target specific data elements within documents. These carefully crafted prompts include illustrative examples and precise formatting instructions that enable the model to produce consistent, structured outputs across various document types and formats.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Domain-specific knowledge integration&lt;/strong&gt; – The system incorporates industry-specific knowledge to help enhance extraction accuracy through several complementary approaches. A data field to document class mapping makes sure the system targets the appropriate information in each document type, and comprehensive data dictionaries provide clear field definitions and expected formats for extraction. Mortgage industry glossaries help the system accurately interpret specialized terminology and acronyms common in the financial domain. This domain knowledge is dynamically incorporated into prompts during processing, significantly improving the system’s ability to extract accurate information from complex documents.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Domain-aware evaluation framework&lt;/strong&gt; – The project’s success hinged on a sophisticated evaluation system that went beyond basic accuracy metrics. The solution includes a comprehensive framework with metrics tailored to different field types, facilitating accurate assessment of extraction quality across the mortgage domain.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The team implemented specialized approaches including exact and fuzzy string matching, numeric comparisons with configurable tolerance, and mortgage-specific metrics for state codes, deed types, transaction types, and document references. Domain-specific matching functions handle variations in specialized content, and field-type specific metrics apply appropriate comparison methods.&lt;/p&gt; 
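&lt;p&gt;To make the evaluation approach concrete, here is a minimal sketch of field-type-specific matching along the lines described: exact matching for codes, fuzzy matching for names, and numeric comparison with tolerance for amounts. The matchers, thresholds, and sample rows are illustrative assumptions, not the team’s actual metrics.&lt;/p&gt;

```python
import math
from difflib import SequenceMatcher

def match_state_code(pred, truth):
    """Exact match, case- and whitespace-insensitive (e.g. 'mi' vs 'MI')."""
    return pred.strip().upper() == truth.strip().upper()

def match_fuzzy(pred, truth, threshold=0.9):
    """Fuzzy string match for free-text fields like deed types or names."""
    return SequenceMatcher(None, pred.lower(), truth.lower()).ratio() >= threshold

def match_numeric(pred, truth, tolerance=0.01):
    """Numeric comparison with a configurable relative/absolute tolerance."""
    return math.isclose(float(pred), float(truth),
                        rel_tol=tolerance, abs_tol=tolerance)

def field_accuracy(rows):
    """rows: iterable of (matcher, predicted_value, ground_truth_value)."""
    results = [matcher(pred, truth) for matcher, pred, truth in rows]
    return sum(results) / len(results)

rows = [
    (match_state_code, "mi", "MI"),                        # match
    (match_fuzzy, "Warranty Deed", "warranty deed"),       # match
    (match_numeric, "250000.00", "250000"),                # match
    (match_fuzzy, "Quit Claim", "Special Warranty Deed"),  # mismatch
]
# field_accuracy(rows) == 0.75
```

&lt;p&gt;Applying the right matcher per field type is what keeps a single accuracy number meaningful across such different content as state codes, deed types, and dollar amounts.&lt;/p&gt;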
&lt;h2&gt;Results and impact&lt;/h2&gt; 
&lt;p&gt;The proof of concept demonstrated strong results that exceeded expectations and validated the approach’s effectiveness for Rocket Close’s document processing needs.&lt;/p&gt; 
&lt;p&gt;The solution underwent rigorous performance testing across multiple evaluation rounds. The initial validation phase tested 28 random samples containing 655 data fields, achieving an overall accuracy of 90.53%. This early success demonstrated the viability of the approach and provided confidence to proceed with more extensive testing.&lt;/p&gt; 
&lt;p&gt;The second round focused on targeted testing with 52 samples that had 1:1 mapping to ground truth data, encompassing 2,249 data fields. The system achieved 91.28% accuracy during this phase, confirming consistent performance across different document types and validating the extraction methodology against verified baseline data. This phase was particularly important for establishing confidence in the Amazon Textract and custom processing pipeline’s ability to handle diverse document formats.&lt;/p&gt; 
&lt;p&gt;The final evaluation involved large-scale verification that processed 1,792 samples containing over 44,000 data fields, achieving an overall accuracy of 89.71%. This extensive testing validated the solution’s scalability and reliability across a representative sample of Rocket Close’s document volume, demonstrating that the AWS infrastructure maintains high accuracy even when processing large batches of diverse documents in parallel.&lt;/p&gt; 
&lt;p&gt;This solution, powered by AWS, helps deliver considerable business value across multiple dimensions. The automated system reduces processing time from 30 minutes per package to under 2 minutes, making processing 15 times faster. This acceleration enables faster customer service and higher throughput. From a financial perspective, the solution considerably reduces processing costs, delivering notable savings per file. With approximately 2,000 files processed daily, this represents potential annual savings at an enterprise scale.&lt;/p&gt; 
&lt;p&gt;The automated system also delivers enhanced quality and consistency, maintaining 90% overall accuracy while reducing human error and standardizing output formats. This consistency improves downstream processes and decision-making, facilitating reliable data for business operations.&lt;/p&gt; 
&lt;p&gt;Furthermore, the cloud-based architecture provides improved scalability by handling increasing document volumes without proportional staffing increases, supporting business growth without linear cost increases. It’s designed to scale elastically to handle over 500,000 documents annually, with the ability to automatically scale during peak processing periods, positioning Rocket Close for future expansion without infrastructure constraints.&lt;/p&gt; 
&lt;h2&gt;Lessons learned&lt;/h2&gt; 
&lt;p&gt;The proof of concept engagement revealed several valuable insights that can guide similar document processing implementations on AWS.&lt;/p&gt; 
&lt;p&gt;Prompt engineering proved critical, because carefully crafted prompts that incorporate domain knowledge significantly improve extraction accuracy. The team developed specialized prompts that combine document content with classification guidelines and domain-specific knowledge.&lt;/p&gt; 
&lt;p&gt;The two-stage pipeline architecture demonstrated strong effectiveness for this use case. Separating OCR and LLM processing allows for better optimization of each stage. Amazon Textract handles the complex task of extracting text from various document formats while preserving structural information, and Amazon Bedrock (using Anthropic’s Claude) focuses on understanding the content and extracting relevant information.&lt;/p&gt; 
&lt;p&gt;Domain-specific knowledge integration emerged as another key success factor. Incorporating mortgage-specific terminology and document understanding significantly improves results. The solution uses data dictionaries, glossaries, and document class definitions to help enhance extraction accuracy.&lt;/p&gt; 
&lt;p&gt;The engagement also highlighted evaluation complexity as an important consideration. Developing sophisticated, domain-aware evaluation metrics is essential for accurately measuring performance. The evaluation framework employs specialized metrics tailored to different field types, including state code matching, deed type matching, and transaction type matching.&lt;/p&gt; 
&lt;p&gt;Finally, scalability considerations proved crucial from the initial design phase. The solution architecture must be designed from the start to handle high volumes of documents efficiently. The two-stage pipeline approach with Amazon Textract and Amazon Bedrock helps provide the necessary scalability.&lt;/p&gt; 
&lt;h2&gt;What’s next&lt;/h2&gt; 
&lt;p&gt;Following the successful proof of concept, Rocket Close is positioned to move forward with production implementation.&lt;/p&gt; 
&lt;p&gt;The next phase involves moving from proof of concept to production deployment with a containerized architecture that can handle enterprise-scale document processing. The team plans to establish continuous improvement processes by creating feedback loops to improve extraction accuracy over time. This iterative approach allows the system to learn from processing results and adapt to evolving document patterns.&lt;/p&gt; 
&lt;p&gt;An important consideration for long-term success is developing a model update strategy. Rocket Close will create a strategy for updating LLM models as new versions become available from Amazon Bedrock, making sure the solution benefits from the latest advancements in language model capabilities.&lt;/p&gt; 
&lt;p&gt;Finally, the proven approach will be expanded to additional workflows beyond the initial scope. Rocket Close plans to apply the solution to loan and mortgage payoff processing, purchase agreement processing, and title clearance documentation, extending the benefits of automated document processing across more of their operations.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;The Rocket Close and AWS Generative AI Innovation Center collaboration demonstrates the transformative potential of generative AI in document-intensive industries. By automating the complex task of abstract package processing, Rocket Close has positioned itself to achieve major operational efficiencies, cost savings, and improved scalability. The solution’s strong 90% overall accuracy, combined with the dramatic reduction in processing time from hours to minutes, showcases how generative AI can solve real-world business challenges in the mortgage and title industry.&lt;/p&gt; 
&lt;p&gt;As Rocket Close moves toward production implementation, the foundation established during this proof of concept will enable continued innovation and process optimization across their document processing workflows.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125750" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/JeremyLittle.jpg" alt="" width="100" height="112"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jeremy Little&lt;/h3&gt; 
  &lt;p&gt;Jeremy Little is a Lead Senior Solution Architect at Rocket Close. He designs and oversees the implementation of technical solutions that enhance operational efficiency and improve customer experience in the mortgage services industry.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125751" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/ChrisDay.png" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Chris Day&lt;/h3&gt; 
  &lt;p&gt;Chris Day is Vice President of Engineering at Rocket Close. He leads the engineering teams responsible for developing and implementing technology solutions that streamline the title and appraisal management processes.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125752" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/SirajusSalekin.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sirajus Salekin&lt;/h3&gt; 
  &lt;p&gt;Sirajus Salekin is an Applied Scientist at the AWS Generative AI Innovation Center. He specializes in developing machine learning and generative AI solutions for enterprise customers across various industries.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125753" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/AhsanAli.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ahsan Ali&lt;/h3&gt; 
  &lt;p&gt;Ahsan Ali is a Senior Applied Scientist at the AWS Generative AI Innovation Center. He focuses on implementing machine learning and generative AI solutions to solve complex business problems.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125754" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/UjwalaBitla.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ujwala Bitla&lt;/h3&gt; 
  &lt;p&gt;Ujwala Bitla is a Deep Learning Architect at the AWS Generative AI Innovation Center. She designs scalable AI architectures for enterprise customers.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125761" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/SandyFarr.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sandy Farr&lt;/h3&gt; 
  &lt;p&gt;Sandy Farr is an Applied Science Manager at the AWS Generative AI Innovation Center. She leads teams developing innovative generative AI solutions for AWS customers.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-125756" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2026/03/06/JordanRatner.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jordan Ratner&lt;/h3&gt; 
  &lt;p&gt;Jordan Ratner is a Senior Generative AI Strategist at the AWS Generative AI Innovation Center. He helps customers identify and implement generative AI opportunities.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
	</channel>
</rss>